Auto Scaling Group with Rolling Deployment Module
This Terraform module creates an Auto Scaling Group (ASG) that can do a zero-downtime rolling deployment. That means
every time you update your app (e.g. publish a new AMI), all you have to do is run terraform apply and the new
version of your app will automatically roll out across your Auto Scaling Group. Note that this module only
creates the ASG; it's up to you to create all the other related resources, such as the launch template, ELB,
and security groups.
Note: This module used to use launch configurations but has been updated to use launch templates. AWS has recommended launch templates for some time, and launch configurations will finally be deprecated entirely on Dec 31st 2023.
What's an Auto Scaling Group?
An Auto Scaling Group (ASG) is used to manage a cluster of EC2 Instances. It can enforce pre-defined rules about how many instances to run in the cluster, scale the number of instances up or down depending on traffic, and automatically restart instances if they go down.
How does rolling deployment work?
Since Terraform does not have rolling deployment built in (see https://github.com/hashicorp/terraform/issues/1552), we
are faking it using the create_before_destroy lifecycle property. This approach is based on the rolling deploy
strategy used by HashiCorp itself, as described by Paul Hinze here. As a result, every time you
update your launch template (e.g. by specifying a new AMI to deploy), Terraform will:
- Create a new ASG with the new launch template.
- Wait for the new ASG to deploy successfully and for the instances to register with the load balancer (if you associated an ELB or ALB with this ASG).
- Destroy the old ASG.
Since the old ASG is only removed once the new ASG's instances are registered with the ELB and serving traffic, there will be no downtime. Moreover, if anything goes wrong while rolling out the new ASG, it will be marked as tainted (i.e. marked for deletion next time) and the original ASG will be left unchanged, so again, there is no downtime.
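The pattern described above can be sketched in plain Terraform roughly as follows. This is a minimal illustration, not this module's actual source: the resource names, the variables, the instance type, and the choice to embed the launch template version in the ASG name (so that a new version forces a replacement) are all assumptions made for the example.

```hcl
# Hypothetical inputs for this sketch.
variable "ami_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "target_group_arns" {
  type    = list(string)
  default = []
}

resource "aws_launch_template" "example" {
  name_prefix   = "example-"
  image_id      = var.ami_id
  instance_type = "t3.micro"
}

resource "aws_autoscaling_group" "example" {
  # Assumption: the ASG name includes the launch template version, so each
  # new version changes the name and forces Terraform to replace the ASG.
  name = "example-v${aws_launch_template.example.latest_version}"

  min_size            = 2
  max_size            = 4
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = var.target_group_arns

  launch_template {
    id      = aws_launch_template.example.id
    version = aws_launch_template.example.latest_version
  }

  # Wait for this many instances to pass load balancer health checks before
  # the new ASG counts as created (only relevant if a load balancer is attached).
  min_elb_capacity          = 2
  wait_for_capacity_timeout = "10m"

  lifecycle {
    # Create the replacement ASG first, then destroy the old one.
    create_before_destroy = true
  }
}
```

With this shape, a failed rollout leaves the old ASG in place because Terraform never reaches the destroy step for it.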
Note that if all we did was use create_before_destroy, on each redeploy, our ASG would reset to its hard-coded
desired_capacity, losing the capacity changes from auto scaling policies. We solve this problem by using an
external data source that runs the Python script get-desired-capacity.py to fetch the latest value of the
desired_capacity parameter:
- If the script finds a value from an already-existing ASG, we use it, to ensure that the changes from auto scaling events are not lost.
- If the script doesn't find an already-existing ASG, that means this is the first deploy, and we fall back to the hard-coded desired_capacity value.
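The fallback logic above might look something like the following sketch, which uses the external data source from the hashicorp/external provider. The interface of get-desired-capacity.py is not documented here, so the query keys, the desired_capacity result key, and the python3 interpreter name are hypothetical; only the data source's program, query, and result attributes are real.

```hcl
variable "desired_capacity" {
  type    = number
  default = null
}

variable "tag_asg_id_key" {
  type    = string
  default = "AsgId"
}

data "external" "desired_capacity" {
  # The external data source passes `query` to the program as JSON on stdin
  # and expects a JSON object of string values on stdout.
  program = ["python3", "${path.module}/get-desired-capacity.py"]

  # Hypothetical query keys; the real script may expect different inputs.
  query = {
    asg_id_tag_key = var.tag_asg_id_key
  }
}

locals {
  # Hypothetical result key. An empty string is treated as "no existing ASG".
  existing_capacity = lookup(data.external.desired_capacity.result, "desired_capacity", "")

  # Prefer the live value from an existing ASG; otherwise fall back to the
  # hard-coded input, which applies on the first deploy.
  desired_capacity = (
    local.existing_capacity != ""
    ? tonumber(local.existing_capacity)
    : var.desired_capacity
  )
}
```

The ASG resource would then read local.desired_capacity instead of var.desired_capacity.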
Reference
- Inputs
- Outputs
Required
desired_capacity
number
The desired number of EC2 Instances to run in the ASG initially. Note that auto scaling policies may change this value. If you're using auto scaling policies to dynamically resize the cluster, you should leave this value as null.
launch_template
object(…)
The ID and version of the Launch Template to use for each EC2 Instance in this ASG. The version value MUST be an output of the Launch Template resource itself. This ensures that a new ASG is created every time a new Launch Template version is created.
object({
id = string
name = string
version = string
})
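As an illustration of wiring the launch_template input to a Launch Template resource, here is a minimal sketch. The module source path, resource names, variables, and instance type are placeholders chosen for the example; only the input names come from this reference.

```hcl
# Hypothetical inputs for this sketch.
variable "ami_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_launch_template" "app" {
  name_prefix   = "app-"
  image_id      = var.ami_id
  instance_type = "t3.micro"
}

module "asg" {
  # Placeholder path: point this at wherever the module lives in your setup.
  source = "./modules/asg-rolling-deploy"

  launch_template = {
    id   = aws_launch_template.app.id
    name = aws_launch_template.app.name
    # Taken from the resource's own output, so a new template version
    # changes this value and triggers a rolling deployment.
    version = aws_launch_template.app.latest_version
  }

  min_size         = 2
  max_size         = 4
  desired_capacity = 2
  vpc_subnet_ids   = var.subnet_ids
}
```

Hard-coding the version (for example "1") would defeat the purpose, since Terraform would see no change when a new template version is published.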
max_size
number
The maximum number of EC2 Instances to run in the ASG.
min_size
number
The minimum number of EC2 Instances to run in the ASG.
vpc_subnet_ids
list(string)
A list of subnet IDs in the VPC where the EC2 Instances should be deployed.
Optional
custom_tags
list(object(…))
A list of custom tags to apply to the EC2 Instances in this ASG. Each item in this list should be a map with the parameters key, value, and propagate_at_launch.
list(object({
key = string
value = string
propagate_at_launch = bool
}))
[]
Example
default = [
{
key = "foo"
value = "bar"
propagate_at_launch = true
},
{
key = "baz"
value = "blah"
propagate_at_launch = true
}
]
deletion_timeout
string
Timeout value for deletion operations on Auto Scaling Groups.
"10m"
enabled_metrics
list(string)
A list of metrics the ASG should enable for monitoring all instances in a group. The allowed values are GroupMinSize, GroupMaxSize, GroupDesiredCapacity, GroupInServiceInstances, GroupPendingInstances, GroupStandbyInstances, GroupTerminatingInstances, GroupTotalInstances.
[]
Example
enabled_metrics = [
"GroupDesiredCapacity",
"GroupInServiceInstances",
"GroupMaxSize",
"GroupMinSize",
"GroupPendingInstances",
"GroupStandbyInstances",
"GroupTerminatingInstances",
"GroupTotalInstances"
]
Time, in seconds, after an EC2 Instance comes into service before checking health.
300
load_balancers
list(string)
A list of Elastic Load Balancer (ELB) names to associate with this ASG. If you're using the Application Load Balancer (ALB), see target_group_arns.
[]
max_instance_lifetime
number
The maximum amount of time, in seconds, that an instance inside an ASG can be in service. Values must be either equal to 0 or between 604800 and 31536000 seconds.
null
min_elb_capacity
number
Wait for this number of EC2 Instances to show up healthy in the load balancer on creation.
0
tag_asg_id_key
string
The key for the tag that will be used to associate a unique identifier with this ASG. This identifier will persist between redeploys of the ASG, even though the underlying ASG is being deleted and replaced with a different one.
"AsgId"
target_group_arns
list(string)
A list of Application Load Balancer (ALB) target group ARNs to associate with this ASG. If you're using the Elastic Load Balancer (ELB), see load_balancers.
[]
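For an ALB setup, the target_group_arns and min_elb_capacity inputs could be combined along these lines. This is a sketch under assumptions: the module source path, the target group settings (port, protocol), and all variable names are illustrative and not taken from this module.

```hcl
# Hypothetical inputs for this sketch.
variable "vpc_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "launch_template" {
  type = object({
    id      = string
    name    = string
    version = string
  })
}

resource "aws_lb_target_group" "app" {
  name_prefix = "app-"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
}

module "asg_with_alb" {
  # Placeholder path: point this at wherever the module lives in your setup.
  source = "./modules/asg-rolling-deploy"

  launch_template  = var.launch_template
  min_size         = 2
  max_size         = 4
  desired_capacity = 2
  vpc_subnet_ids   = var.subnet_ids

  # Register instances with the ALB target group rather than a classic ELB.
  target_group_arns = [aws_lb_target_group.app.arn]

  # Do not consider the new ASG created until two instances are healthy
  # in the load balancer.
  min_elb_capacity = 2
}
```

Setting min_elb_capacity to at least min_size is one way to make sure the old ASG is not destroyed before the new one is actually serving traffic.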
termination_policies
list(string)
A list of policies to decide how the instances in the Auto Scaling Group should be terminated. The allowed values are OldestInstance, NewestInstance, OldestLaunchTemplate, AllocationStrategy, ClosestToNextInstanceHour, Default.
[]
Whether or not ELB or ALB health checks should be enabled. If set to true, the load_balancers or target_group_arns variable should be set, depending on the load balancer type you are using. Setting this to false can be useful for testing connectivity before health check endpoints are available.
true
A maximum duration that Terraform should wait for the EC2 Instances to be healthy before timing out.
"10m"