For services running on Amazon ECS, auto-scaling based on CPU demand is critical for maintaining performance during traffic spikes while avoiding over-provisioning during quiet periods.
But how can you be certain your auto-scaling rules will work as expected when you need them most? After you configure auto-scaling in the AWS console, you’ll need to validate that it actually works the way you expect.
In this guide, we’ll walk through some basics on configuring auto-scaling so your ECS services will respond to variable CPU demands. Then, we’ll outline how you can validate that your system is working as you expect by running a chaos experiment with a tool like Steadybit.
First, you’ll need to set up the necessary components in AWS to enable your ECS service to scale automatically. That includes creating CloudWatch alarms that monitor CPU usage and linking them to scaling policies on your ECS service.
ECS Service Auto Scaling dynamically adjusts the number of tasks running in your service based on demand. When metrics like CPU or memory utilization cross a certain threshold, it automatically increases or decreases the desired task count.
The key benefits are performance, since capacity is added automatically to meet rising demand and prevent outages, and cost optimization, since resources are scaled down automatically when they are not needed.
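To make the threshold behavior concrete, here is a minimal sketch of the decision the service effectively makes on each evaluation. It is an illustration only, not how AWS implements it: the function name, the one-task step, and the 1 to 10 task bounds are assumptions, and the 75% and 25% thresholds are just example values.

```python
def next_desired_count(
    current: int,
    avg_cpu: float,
    *,
    scale_out_above: float = 75.0,  # example threshold, tune for your service
    scale_in_below: float = 25.0,   # example threshold, tune for your service
    min_tasks: int = 1,             # assumed lower bound on task count
    max_tasks: int = 10,            # assumed upper bound on task count
    step: int = 1,                  # assumed adjustment per scaling action
) -> int:
    """Return the task count an idealized threshold-based scaler would request."""
    if avg_cpu > scale_out_above:
        target = current + step
    elif avg_cpu < scale_in_below:
        target = current - step
    else:
        target = current
    # The desired count never leaves the configured min/max range.
    return max(min_tasks, min(max_tasks, target))


if __name__ == "__main__":
    for cpu in (10.0, 50.0, 90.0):
        print(cpu, "->", next_desired_count(2, cpu))
```

The clamp at the end is the part that is easy to overlook: if the maximum is set too low, a CPU spike can breach the threshold without any new tasks being added.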
CloudWatch alarms are the triggers for your scaling actions. You’ll create an alarm that monitors the average CPU utilization of your ECS service and fires when it exceeds a predefined threshold.
To create an alarm:
By creating two alarms, you can use one as the threshold for scaling up (e.g., CPU > 75%) and the other for scaling down (e.g., CPU < 25%).
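If you prefer to script the alarms instead of clicking through the console, the sketch below builds the arguments for the CloudWatch `put_metric_alarm` call in boto3. The cluster and service names, alarm names, 60-second period, and three evaluation periods are all placeholder assumptions; the thresholds mirror the 75% and 25% examples above.

```python
from typing import Any, Dict, List, Optional


def cpu_alarm_params(
    cluster: str,
    service: str,
    *,
    name: str,
    threshold: float,
    comparison: str,                   # e.g. "GreaterThanThreshold"
    period_seconds: int = 60,          # assumed evaluation window
    evaluation_periods: int = 3,       # assumed number of breaching periods
    alarm_actions: Optional[List[str]] = None,  # scaling policy ARNs, if known
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    params: Dict[str, Any] = {
        "AlarmName": name,
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
    }
    if alarm_actions:
        params["AlarmActions"] = alarm_actions
    return params


def create_alarms(all_params: List[Dict[str, Any]]) -> None:
    """Send the alarms to AWS. Requires boto3 and valid credentials."""
    import boto3  # third-party; imported lazily so the builders work without it

    cloudwatch = boto3.client("cloudwatch")
    for params in all_params:
        cloudwatch.put_metric_alarm(**params)


# Hypothetical names, for illustration only.
high = cpu_alarm_params("demo-cluster", "demo-service", name="demo-cpu-high",
                        threshold=75.0, comparison="GreaterThanThreshold")
low = cpu_alarm_params("demo-cluster", "demo-service", name="demo-cpu-low",
                       threshold=25.0, comparison="LessThanThreshold")
```

Calling `create_alarms([high, low])` would create both alarms. Keeping the parameter builder separate from the AWS call makes it easy to review or unit-test the configuration before anything is applied.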
Once your alarms are ready, you can configure the scaling policies.
Here’s a step-by-step guide:
Once the policies are saved, your ECS service is configured to scale automatically based on CPU load.
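The same configuration can be expressed through the Application Auto Scaling API. The sketch below assumes step scaling policies driven by CloudWatch alarms; the resource names, capacity limits, one-task adjustments, and 60-second cooldown are placeholders rather than recommendations.

```python
from typing import Any, Dict


def scalable_target_params(cluster: str, service: str,
                           min_tasks: int, max_tasks: int) -> Dict[str, Any]:
    """Keyword arguments for register_scalable_target on an ECS service."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }


def step_policy_params(cluster: str, service: str, *, name: str,
                       adjustment: int,
                       cooldown_seconds: int = 60) -> Dict[str, Any]:
    """Keyword arguments for put_scaling_policy with a single step adjustment."""
    if adjustment == 0:
        raise ValueError("adjustment must be non-zero")
    # Scale-out applies to any breach above the alarm threshold (lower bound 0);
    # scale-in applies to any breach below it (upper bound 0).
    bound = ({"MetricIntervalLowerBound": 0.0} if adjustment > 0
             else {"MetricIntervalUpperBound": 0.0})
    return {
        "PolicyName": name,
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "StepAdjustments": [{**bound, "ScalingAdjustment": adjustment}],
            "Cooldown": cooldown_seconds,
            "MetricAggregationType": "Average",
        },
    }


def apply_policy(target: Dict[str, Any], policy: Dict[str, Any]) -> str:
    """Register the target and policy in AWS; returns the policy ARN."""
    import boto3  # third-party; requires valid AWS credentials

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    return client.put_scaling_policy(**policy)["PolicyARN"]


# Hypothetical names, for illustration only.
target = scalable_target_params("demo-cluster", "demo-service", 1, 10)
scale_out = step_policy_params("demo-cluster", "demo-service",
                               name="demo-scale-out", adjustment=1)
scale_in = step_policy_params("demo-cluster", "demo-service",
                              name="demo-scale-in", adjustment=-1)
```

The ARN returned by `apply_policy` is what a CloudWatch alarm references as its alarm action, which is how the alarm and the policy end up linked.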
How can you tell if your policies are set up correctly? Beyond a simple working-or-not check, how can you ensure they are tuned for the performance you need?
You won’t really know until you test them.
This is where chaos engineering with Steadybit comes in. By running an experiment to simulate a CPU spike, you can validate that your auto-scaling policies trigger correctly and that your system responds within your required SLOs. With a proactive experiment, you can answer questions like:
If you wait for these events to occur naturally in production, you subject your end users to untested application performance and lock yourself into a reactive approach to reliability.
To run chaos experiments easily, you can use a platform like Steadybit. Just install one agent per network, plus an open-source extension for each technology you want to target. With the AWS extension installed, Steadybit will automatically discover infrastructure components like ECS services and tasks. If you are using an observability tool like Datadog, Grafana, or Dynatrace, you can connect it to Steadybit for more performance visibility.
Once connected, you can select our pre-built experiment template for validating ECS scaling and customize the following aspects:
With the experiment configured, you are ready to run it. Executing the experiment in Steadybit is as simple as clicking a button.
Steadybit Experiment Template: AWS ECS Service Scaled Up Within Reasonable Time
You’ll see a real-time view of the experiment’s progress, and you can monitor the CPU attack as it begins and abort the action at any time.
Once the experiment is complete, the results view will clearly show whether your scaling validation was successful.
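Conceptually, the pass/fail decision comes down to whether the running task count rose above its starting value within your time limit. The sketch below shows that check over a list of hypothetical `(seconds_elapsed, running_tasks)` samples; it is not Steadybit's implementation, and the sample data and 300-second limit are made up for illustration.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, int]  # (seconds since the CPU attack started, running tasks)


def seconds_until_scaled(samples: Sequence[Sample],
                         baseline: int) -> Optional[float]:
    """Return the elapsed time of the first sample above the baseline, if any."""
    for elapsed, running in samples:
        if running > baseline:
            return elapsed
    return None


def scaled_within(samples: Sequence[Sample], baseline: int,
                  limit_seconds: float) -> bool:
    """True if the service scaled out no later than limit_seconds."""
    elapsed = seconds_until_scaled(samples, baseline)
    return elapsed is not None and elapsed <= limit_seconds


if __name__ == "__main__":
    # Made-up observations: two tasks at the start, a third appears at 180 s.
    observed = [(0, 2), (60, 2), (120, 2), (180, 3)]
    print(seconds_until_scaled(observed, baseline=2))
    print(scaled_within(observed, baseline=2, limit_seconds=300))
```

Recording the elapsed time, not just the pass/fail result, lets you see how changes to alarm periods or cooldowns shift your scaling latency from one run to the next.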
This data-driven approach removes guesswork and allows you to methodically harden your system’s resilience.
And remember, infrastructure and application code change. What worked last month might not work today. We recommend adopting the practice of continuous verification, where you automate and integrate reliability tests like this into your CI/CD pipeline. This helps you catch issues early and maintain a strong reliability posture over time.
Are you interested in hearing more about how you can use proactive experiments to optimize your system performance? With Steadybit, we help teams build confidence in their systems by making it easy to run experiments and get meaningful performance insights.
Ready to put your systems to the test? You can start a 30-day free trial today or schedule a quick demo with our experts. We’d be happy to share why teams across industries are choosing Steadybit as their go-to platform for reliability.