Deep dive on ECS desired count and circuit breaker rollback
This post explores the available options for DesiredCount and how it affects the time it takes for the deployment circuit breaker to trigger a CloudFormation rollback. We will show how over scaling the DesiredCount on deployment for small/inactive services greatly reduces the time it takes for the circuit breaker to roll back on failure.
Desired count in CloudFormation
The DesiredCount parameter was made optional on 12 November 2020, as announced in the GitHub issue. There was no real announcement for this; it just got updated in the docs, and I think many are still using hacks to work around it.
We are making the following improvements to ECS integration with CloudFormation:
-- DesiredCount becomes an optional field in CFN CreateService and DesiredCount=1 would be used as the default value if it is missing
-- UpdateService will also omit DesiredCount when it is missing in CFN template
Customers expecting the current behavior (i.e. UpdateService will use the set DesiredCount in the CFN template) can just add DesiredCount in their CFN templates. Existing customers wanting the new behavior can get it by removing DesiredCount from their CFN templates. The changes will be released soon.
Let's take the following scenario to understand why this update is so important. CloudFormation forcefully sets the DesiredCount when it is specified in a CloudFormation template. It will ignore the value previously set by auto-scaling.
- A deployed service has DesiredCount set to 1.
- An auto-scaling policy scales it out by setting the DesiredCount to 10.
- You deploy a new version of your app and CloudFormation sets the DesiredCount back to 1.
In this scenario, CloudFormation forcefully sets the DesiredCount back to 1. It starts to deactivate and remove 9 tasks without taking into consideration the existing DesiredCount as set by auto-scaling. Your service is now overloaded: the demand exceeds the available capacity, and this directly affects your users in the form of timeouts or failed requests in the case of an API service.
This behaviour forced people to create a lot of custom hacks, like looking up the DesiredCount with the AWS API/SDK and then setting it in CloudFormation before deploying.
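Such a hack, sketched with the AWS SDK v3 for JavaScript (the cluster and service names are hypothetical), could look roughly like this:
import { ECSClient, DescribeServicesCommand } from "@aws-sdk/client-ecs";

const ecsClient = new ECSClient({});

// Look up the DesiredCount that auto-scaling is currently managing for the service.
async function currentDesiredCount(cluster: string, service: string): Promise<number> {
  const resp = await ecsClient.send(new DescribeServicesCommand({ cluster, services: [service] }));
  // Fall back to 1 if the service does not exist yet (first deployment).
  return resp.services?.[0]?.desiredCount ?? 1;
}

// The value would then be fed into the CloudFormation deployment, for example as a
// template parameter, before calling deploy.
currentDesiredCount("my-cluster", "my-service").then(count =>
  console.log(`Set DesiredCount to ${count} before deploying`)
);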
We can now omit the DesiredCount as per the announcement, because ECS then takes into consideration the current DesiredCount that is set at 10. It spins up 10 new tasks, because the service minimum healthy percent is set to 100% and the maximum to 200%, and then it deactivates the old tasks. This is the intended behaviour and does not impair our service like before, when the DesiredCount was set explicitly.
So going forward, you should not specify the DesiredCount parameter if you have auto-scaling enabled.
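A minimal CDK sketch of that setup (the stack name, sample image and scaling policy are illustrative only):
import * as cdk from "aws-cdk-lib";
import * as ecs from "aws-cdk-lib/aws-ecs";
import { Construct } from "constructs";

// Minimal sketch: a Fargate service with auto-scaling and no desiredCount.
export class ServiceStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const cluster = new ecs.Cluster(this, "Cluster");
    const taskDefinition = new ecs.FargateTaskDefinition(this, "TaskDef");
    taskDefinition.addContainer("app", {
      image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"),
    });

    new ecs.FargateService(this, "Service", {
      cluster,
      taskDefinition,
      // No desiredCount here: CloudFormation omits DesiredCount on updates,
      // so the value currently managed by auto-scaling is left untouched.
    }).autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 })
      .scaleOnCpuUtilization("Cpu", { targetUtilizationPercent: 60 });
  }
}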
Circuit breaker
AWS announced the ECS deployment circuit breaker on 30 Nov 2020, which allows for automatic rollback of unhealthy service deployments without the need for manual intervention.
The circuit breaker is only used for failed deployments; it communicates the error from ECS back to CloudFormation in a timely manner so that a CloudFormation rollback can occur. This deploys the previous version of your stack and allows you to calmly fix the new version without having an impaired service.
It is enabled by specifying the DeploymentConfiguration.DeploymentCircuitBreaker property on the service.
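In CDK this surfaces as the circuitBreaker property on the service construct, which renders that CloudFormation property in the generated template. A fragment, assuming cluster and taskDefinition already exist in the stack:
// Fragment only: `cluster` and `taskDefinition` are assumed to be defined elsewhere in the stack.
new ecs.FargateService(this, "Service", {
  cluster,
  taskDefinition,
  // Maps to DeploymentConfiguration.DeploymentCircuitBreaker in the generated template.
  circuitBreaker: { rollback: true },
});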
Previously, if the service was unable to reach a healthy state, the scheduler would retry deployments in perpetuity using the service throttling logic. Okay, technically it does not retry forever, but 3 hours sure feels like forever when you have to do a hotfix in production and now need to wait for the rollback to complete before being able to deploy again.
The new failure threshold (the number of failed task launches before rolling back) is a function of the service's DesiredCount and is described in the docs here.
The deployment circuit breaker has a minimum threshold of 10 and a maximum threshold of 200, and it uses the following formula to determine the failure threshold for a deployment:
failed task threshold = minimum threshold <= 0.5 * desired task count <= maximum threshold
failed task threshold = 10 <= 0.5 * desired task count <= 200
When the result of the calculation is less than the minimum of 10, the failure threshold is set to 10. When the result of the calculation is greater than the maximum of 200, the failure threshold is set to 200.
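To make the clamping concrete, here is a small TypeScript sketch of the documented formula (not the actual ECS implementation); the rounding up matches the table below:
// Failed task threshold per the documented formula:
// 0.5 * desired task count, rounded up, clamped between 10 and 200.
function failedTaskThreshold(desiredCount: number): number {
  const raw = Math.ceil(0.5 * desiredCount);
  return Math.min(200, Math.max(10, raw));
}

console.log(failedTaskThreshold(1));   // 10 (clamped up to the minimum)
console.log(failedTaskThreshold(25));  // 13 (12.5 rounded up)
console.log(failedTaskThreshold(500)); // 200 (clamped down to the maximum)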
We can visualize this in a table, varying the DesiredCount to determine how many failed tasks are required before the circuit breaker is triggered for rollback.
DesiredCount | Formula | Failed task count threshold |
---|---|---|
1 | 10 <= 1 * 0.5 <= 200 | 0.5, so 10 tasks (minimum) |
10 | 10 <= 10 * 0.5 <= 200 | 5, so 10 tasks (minimum) |
25 | 10 <= 25 * 0.5 <= 200 | 12.5, rounded up to 13 tasks |
100 | 10 <= 100 * 0.5 <= 200 | 50 tasks |
500 | 10 <= 500 * 0.5 <= 200 | 250, so 200 tasks (maximum) |
As expected, the DesiredCount is the only variable that determines how many failed tasks are required for the circuit breaker to trip and roll back. Next, we set the DesiredCount and observe the time it takes before the circuit breaker trips.
Desired count 1 ± 40 mins
When the DesiredCount is set to 1, ECS deploys one task after the other sequentially until it registers 10 failures and then rolls back. In the image below the DesiredCount before deployment is 1 and it is kept at 1, but we deploy a broken container so that we can trigger the circuit breaker.
The CloudFormation deployment took about 40 minutes to complete.
Desired count 10 ± 10 mins
When the DesiredCount is set to 10, ECS deploys 10 tasks in parallel until it registers 10 failures (which should happen at about the same time) and then rolls back. In the image below the DesiredCount before deployment is 1; it is then changed to 10 and we also deploy a broken container so that we can trigger the circuit breaker.
The CloudFormation deployment took about 10 minutes to complete.
Controlling the time for rollback
Given that the failed task threshold and the time it takes for a container to become unhealthy are constant, the time for the circuit breaker to kick in is inversely proportional to the DesiredCount at deploy time.
circuit breaker time = (failed task threshold / desired count) * time to container healthcheck failed
Let's take an example and define the time a container takes before reporting it is unhealthy as the sum of:
- The health check grace period: 1 minute.
- The health check failure count of 3, with a timeout value of 30 seconds, which adds 1 minute 30 seconds.
This totals 2 minutes 30 seconds before a container reports as unhealthy for a new deployment.
We can clearly see the effect that the DesiredCount has on the circuit breaker timeout when we compare the timeouts for the two DesiredCount values used in the example.
circuit breaker time = (10 / 1) * 2 min 30 sec = 10 * 2 min 30 sec = 25 min
circuit breaker time = (10 / 10) * 2 min 30 sec = 1 * 2 min 30 sec = 2 min 30 sec
It reduces the deployment time from 25 minutes to 2 minutes and 30 seconds, which is 10 times faster!
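Putting the estimate into code, a rough sketch (the 2 min 30 sec per-task failure time is the assumption from the example above, and real deployments add CloudFormation overhead on top):
// Rough estimate of how long the circuit breaker needs before it trips, assuming the broken
// tasks are launched in waves of `desiredCount` and each wave takes `minutesToFail` to fail.
function estimatedTripTimeMinutes(desiredCount: number, minutesToFail = 2.5): number {
  const threshold = Math.min(200, Math.max(10, Math.ceil(0.5 * desiredCount)));
  const waves = Math.ceil(threshold / desiredCount);
  return waves * minutesToFail;
}

console.log(estimatedTripTimeMinutes(1));  // 25  -> ten sequential failures of 2.5 min each
console.log(estimatedTripTimeMinutes(10)); // 2.5 -> ten parallel failures in a single wave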
You can further optimize this by reducing the time it takes for a task to be marked as healthy or unhealthy. Consider playing around with options like health checks, retries, timeouts and grace periods.
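For example, using the same constructs as the CDK code at the end of this post, those settings could be tightened roughly like this (the values are illustrative, not recommendations):
import * as cdk from "aws-cdk-lib";

// Fragment only: `loadBalancedFargateService` is the ApplicationLoadBalancedFargateService
// defined in the full CDK code further down.
loadBalancedFargateService.targetGroup.configureHealthCheck({
  path: "/",
  interval: cdk.Duration.seconds(10),  // check more often (must be greater than the timeout)
  timeout: cdk.Duration.seconds(5),    // fail an individual check faster
  healthyThresholdCount: 2,            // mark healthy sooner
  unhealthyThresholdCount: 2,          // mark unhealthy sooner
});

// The grace period is set on the service itself, e.g. via the
// healthCheckGracePeriod prop on ApplicationLoadBalancedFargateService:
// healthCheckGracePeriod: cdk.Duration.seconds(60),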
Over scaling on deployment
Over scaling on deployment is the only way to reduce the circuit breaker timeout and is only effective if the current task DesiredCount is less than 10.
Over scaling on deployment happens when we explicitly set the DesiredCount the same as the maximum of the auto-scaling policy, so:
desired count = maximum scaling policy
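With the constructs from the CDK code at the end of this post, that amounts to pointing desiredCount and the auto-scaling maximum at the same value. A fragment (MAX_TASKS is my own name; cluster, ecs and ecsPatterns are assumed to be in scope):
// Hypothetical constant that ties the deploy-time DesiredCount to the auto-scaling maximum.
const MAX_TASKS = 10;

const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, "Service", {
  cluster,
  taskImageOptions: { image: ecs.ContainerImage.fromAsset("./src") },
  desiredCount: MAX_TASKS,            // over scale on every deployment
  circuitBreaker: { rollback: true }, // roll back quickly if the deployment fails
});

service.service.autoScaleTaskCount({
  minCapacity: 1,
  maxCapacity: MAX_TASKS, // the scale-in policy (see the full code below) brings the count back down
});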
By doing this we ensure:
- That the circuit breaker rolls back as soon as possible by registering and deregistering multiple containers in parallel.
- That we do not have an impaired service, as seen before, because the DesiredCount will never be set lower than what auto-scaling set.
This increases the tasks to their maximum after every successful deployment and might take a while to scale in depending on the number of containers and your scale-in policy.
In the image below the DesiredCount was 1 before deployment; it was then set to 10 in the deployment with a new task definition. This time the deployment succeeds and we can see how the DesiredCount scales up to 10 and then gradually scales back to 1 task as a result of auto-scaling.
We also observe that CloudFormation completed successfully.
Something worth pointing out is how this affects your AWS bill. You should barely see a change in pricing given that:
- You don't deploy a thousand times a day; otherwise, your average DesiredCount will lean towards your maximum task count.
- Scaling in does not take too long, so you quickly return to the baseline DesiredCount value from before the deployment.
Caveat for non-API containers
Tasks that are fronted with an ELB (Elastic Load Balancer) have settings to prevent a scale-in event from terminating a container while it is serving traffic. When ECS scales in, the ELB stops sending traffic to the tasks and deregisters them in the deactivating phase. Ongoing connections are not closed immediately, but are closed eventually after the deregistration_delay (which defaults to 5 minutes) lapses.
Long-lived background tasks cannot use this connection-draining feature of the ELB. Consider a task that is an SQS consumer: ECS cannot know if your application is in the middle of processing a message when it wants to scale in. This means it will kill your container while it processes the event, and the message will return to the queue to be processed again after the visibility timeout lapses. This can cause problems if your processing is not idempotent.
This was the case up until 10 November 2022, when AWS launched ECS task scale-in protection. This has been a long-awaited feature, as is evident from this GitHub issue.
When using scale-in protection you can now rest assured that tasks will not be killed while they are processing. Application logic can now indicate when it is safe for a task to scale in, making over scaling on deployment viable for all types of tasks.
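From inside the task, protection can be toggled with the ECS UpdateTaskProtection API. A hedged sketch with the AWS SDK v3 (how the cluster and task ARN are obtained, and the processing callback, are placeholders):
import { ECSClient, UpdateTaskProtectionCommand } from "@aws-sdk/client-ecs";

const ecsClient = new ECSClient({});

// Protect this task while a message is being processed, then release the protection.
// `cluster` and `taskArn` would normally come from the ECS task metadata endpoint.
async function processWithProtection(cluster: string, taskArn: string, handle: () => Promise<void>) {
  await ecsClient.send(new UpdateTaskProtectionCommand({
    cluster,
    tasks: [taskArn],
    protectionEnabled: true,
    expiresInMinutes: 15, // safety net in case the protection is never released
  }));
  try {
    await handle(); // placeholder for the actual message-processing work
  } finally {
    await ecsClient.send(new UpdateTaskProtectionCommand({
      cluster,
      tasks: [taskArn],
      protectionEnabled: false,
    }));
  }
}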
CDK code used for this blog
The CDK code used for this blog can be found here: https://github.com/rehanvdm/ecs-scalng
I manually commented out parts to set the right conditions for the CloudWatch screenshots taken. An Artillery load test was run on my local machine to generate traffic and trigger the task auto-scaling. Everything can be found in the repo.
import cdk = require('aws-cdk-lib');
import {Construct} from "constructs";
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import {SubnetType} from "aws-cdk-lib/aws-ec2";
import * as ecsPatterns from "aws-cdk-lib/aws-ecs-patterns";
import {AdjustmentType, MetricAggregationType} from "aws-cdk-lib/aws-applicationautoscaling";
export class Backend extends cdk.Stack {
constructor(scope: Construct, id: string, stackProps: cdk.StackProps) {
super(scope, id, stackProps);
const vpc = new ec2.Vpc(this, 'Vpc', {
ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/21'),
subnetConfiguration: [
{
subnetType: ec2.SubnetType.PUBLIC,
name: 'Public',
cidrMask: 24,
},
],
natGateways: 0,
});
const cluster = new ecs.Cluster(this, 'Cluster', {
vpc,
enableFargateCapacityProviders: true,
containerInsights: true,
});
// https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ecs_patterns-readme.html#application-load-balanced-services
const loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
cluster,
assignPublicIp: true,
taskSubnets: {
subnetType: SubnetType.PUBLIC
},
memoryLimitMiB: 512,
cpu: 256,
taskImageOptions: {
// image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"),
image: ecs.ContainerImage.fromAsset("./src"),
},
minHealthyPercent: 100,
maxHealthyPercent: 200,
circuitBreaker: {
rollback: true,
},
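// Over scale on deployment: desiredCount below is kept equal to the auto-scaling
// maxCapacity (10) so the circuit breaker registers failures in parallel and trips quickly.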
// desiredCount: 1,
desiredCount: 10,
});
const scaling = loadBalancedFargateService.service.autoScaleTaskCount({
minCapacity: 1,
maxCapacity: 10,
});
scaling.scaleOnMetric('AggressiveCpuScaling', {
metric: loadBalancedFargateService.service.metricCpuUtilization({
statistic: 'Maximum',
period: cdk.Duration.seconds(60),
}),
adjustmentType: AdjustmentType.CHANGE_IN_CAPACITY,
scalingSteps: [
{ upper: 15, change: -1 },
{ lower: 40, change: +1 },
{ lower: 60, change: +3 },
],
datapointsToAlarm: 1,
evaluationPeriods: 1,
metricAggregationType: MetricAggregationType.MAXIMUM,
cooldown: cdk.Duration.seconds(60),
});
loadBalancedFargateService.targetGroup.configureHealthCheck({
/* Uncomment to break deployment and test ECS Circuit breaker */
// path: "/does-not-exist",
path: "/",
});
loadBalancedFargateService.targetGroup.setAttribute("deregistration_delay.timeout_seconds", "30");
}
}
export default Backend;
Conclusion
There is only one method to achieve fast rollback times for small/inactive services with an auto-scaling policy that specifies a minimum of less than 10.
Set the DesiredCount the same as the auto-scaling policy's maximum value, which must be equal to or greater than 10, or in other words: desired count = maximum scaling policy.
This over scaling on deployment will ensure the circuit breaker quickly kicks in and rolls back if required. This method increases the tasks to their maximum after every deployment and relies on auto-scaling to reduce the DesiredCount after the deployment.
The circuit breaker will already kick in fairly quickly if you have an active service with 10 or more running tasks. In that case, it is best not to set the desired capacity explicitly and to omit it from your CloudFormation templates.
If you like deep dives, check out this excellent article by Philipp Garbe on ECS deployments.
Have you been frustrated managing multiple AWS accounts and regions?
Have a look at Cloud Glance:
- ✅ Cloud Glance manages your .aws/credentials and securely generates temporary credentials with STS. MFA, SSO and Role are supported.
- ✅ Open multiple AWS account consoles at the same time with Firefox Containers.
- ✅ Securely stores long-lived IAM credentials on disk; this encryption is opt-in.
- ✅ Port forwarding using either SSH or SSM
- ✅ Tracked Security Groups that sync your computer IP with the rules in an AWS Security Group.
- ✅ Unique multi-account & multi-region CloudFormation stack and events view. Events are grouped by deployment and show the first error alongside the final stack status.
Check it out here: https://cloudglance.dev/