How to do more with fewer servers
March 2020 won’t be easily forgotten. As schools, offices, restaurants and more began to close all around the world due to the COVID-19 pandemic, it seemed like “normal” life was coming to a halt.
While we at Babbel were adjusting our daily office routines to makeshift desks in our living rooms and bedrooms at home, we suddenly began to see a drastic increase in traffic on both our web and mobile applications.
This exhausted our servers’ resource pool for auto scaling, which meant we had to make a decision: Do we increase the maximum number of server instances that can be allocated, or do we optimize the current setup? We chose the latter.
Find out how tuning auto scaling alarms and switching to Puma threads resulted in an application that is 2× faster and runs on only a third of the servers.
Throughout this post we will provide some background on the application, its infrastructure and its auto scaling rules, followed by the optimization plan and all its phases and steps. At the end we will review the improvements across three key performance indicators: resource utilization, budget and application performance.
Table of contents:
This post is about one of our core applications, which handles account and session management and is built with Ruby on Rails.
Originally we were running it on AWS OpsWorks with a fixed number of servers. Some time ago we migrated it to run within Docker on Amazon Elastic Container Service (or Amazon ECS for short) with AWS Fargate instances. This enabled us to use auto scaling, whose configuration we hadn’t changed since. Each instance:
- had 1 vCPU and 2GB of memory
- was running the Puma application server in clustered mode with 6 processes and threading disabled - 1 thread per process (since the service had never been prepared for thread safety)
All of them were running within one ECS Service behind an AWS Application Load Balancer (or ALB for short), in the default round-robin fashion.
AWS auto scaling is composed of three elements:
- Target defines what to scale and the min/max boundaries of the size being adjusted
- Policy defines by how much the size of the target should change, and how often it can be triggered
- CloudWatch Alarm acts as a trigger for a policy whenever a defined threshold on a given metric is reached
We had the target configured to run a minimum of 15 and a maximum of 32 instances for the given application, with the following mix of policies and alarms:
- Slow scale up: change size by +1
- alarm on maximum service CPU being over 60% within two consecutive checks, each check spanning 5 minutes
- alarm on ALB p99 latency above 1 second within one check across 1 minute
- Fast scale up: change size by +3
- alarm on ALB p99 latency above 3 seconds within one check across 1 minute
- Slow scale down
- alarm on average service CPU being below 15% within two consecutive checks, each check spanning 5 minutes
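The decision logic behind such alarms can be pictured with a short sketch. This is an illustrative model of how a CloudWatch-style alarm fires only after N consecutive checks breach a threshold - the method name and sample values are ours, not AWS code:

```ruby
# Illustrative model of a threshold alarm with consecutive checks.
# An alarm fires only when the last N checks all breach the threshold,
# so a single outlier sample does not trigger scaling.
def alarm_triggered?(samples, threshold:, consecutive_checks:, above: true)
  return false if samples.size < consecutive_checks

  samples.last(consecutive_checks).all? do |value|
    above ? value > threshold : value < threshold
  end
end

# Slow scale up: maximum CPU over 60% for two consecutive checks
alarm_triggered?([45.0, 62.0, 64.5], threshold: 60, consecutive_checks: 2)
# => true

# A single latency spike does not satisfy a two-check alarm
alarm_triggered?([0.4, 3.2], threshold: 1, consecutive_checks: 2)
# => false
```

The same shape covers scale-down alarms by passing `above: false`, i.e. firing when the metric stays below the threshold.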
The combination of the above caused fast scale-up early in the morning and scale-down only in the middle of the night. Since the traffic increase, we had been running the maximum of 32 instances for longer periods each day.
Because of this we had no room left to breathe if traffic kept growing, which prompted us to revise the above alarms and policies.
The other effect of the increased traffic and constantly running at maximum capacity was the increased cost of the application - which had almost doubled compared to the baseline from months before.
What made us think we could squeeze more out of the existing configuration:
- Resource underutilization: average CPU utilization was between 15% and 20%
- IO-bound workload: on average 75% of application time was spent waiting on IO, 90% in the case of the most requested endpoint
- Slow third-party APIs: some calls take over 1 second, causing latency spikes that trigger auto scaling
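The IO-wait numbers above hint at why single-threaded workers waste capacity. A back-of-the-envelope calculation (illustrative arithmetic, not a measurement from our systems) shows the theoretical headroom threads could unlock:

```ruby
# Back-of-the-envelope: with an IO-bound request, a single-threaded worker
# occupies the CPU for only a fraction of the request's wall time.
# These shares are taken from the observations above; the math is a rough
# upper bound, not a prediction of real throughput.

io_wait_share = 0.75               # 75% of request time spent waiting on IO
cpu_share     = 1.0 - io_wait_share

# If threads could always fill the IO-wait gaps, one CPU could in theory
# interleave this many requests concurrently:
max_interleaved = (1.0 / cpu_share).round
puts "One worker could, in theory, interleave ~#{max_interleaved} requests"
# => ~4 for the average endpoint, ~10 for the 90%-IO endpoint
```

Real gains are smaller due to GVL contention and scheduling overhead, but the direction is clear: the more time spent on IO, the more a threaded server helps.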
To avoid increasing maximum capacity, we decided to revise our auto scaling and application setup. To achieve better utilization and scaling we planned three phases:
- Throttle scaling: scale up slower and scale down faster, by being more resilient to single latency spikes
- Web server: given the IO-bound nature of the app, test Puma with threads
- Revise auto scaling: utilize each instance as much as possible without losing performance
All the charts focus on 3 key areas:
- total number of instances running, indicating how auto scaling is working
- average CPU utilization across the application
- p95 latency on ALB, multiplied by 10 to increase visibility on the chart, to see whether there’s any degradation in performance
Phase 1: Throttle scaling
Up to this point, a single latency spike could trigger booting three new instances at once. To avoid this, we increased the number of consecutive checks that had to be above the latency threshold from 1 to 3 before either scale-up policy would trigger.
For scale-down we raised the CPU threshold from 15% to 20% and reduced a single check’s duration from 5 minutes to 2 minutes. This should help us scale down sooner.
The results were more or less as expected:
- we didn’t scale up to full 32 instances at the beginning of the day
- we were able to scale down during the first half of the day
- and then scale down a little bit sooner after the main traffic goes away
- no visible change in p95 latency
While checking the latency graphs we noticed that p99 was spiking a lot and was neither stable nor representative. There also wasn’t much we could fix, as the vast majority of the spikes were caused by less frequent calls to OAuth providers.
Because of that we decided to switch our alarms from p99 latency to p95. p95 was far more stable, so when it is high, it means the system is slowing down for too many people. With this we also adjusted the thresholds: from 3 to 2 seconds for the fast scale-up policy, and from 1 to 0.7 seconds for the slow one.
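The difference between the two percentiles is easy to see on synthetic data. The sketch below uses the nearest-rank method and made-up latencies (a mostly fast workload with a few slow OAuth-style outliers), purely to illustrate why p95 reads as more stable:

```ruby
# Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order.
# Data is synthetic, chosen to mimic a fast workload with rare slow outliers.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil - 1
  sorted[rank]
end

# 100 requests: 97 fast ones, plus a few slow external calls at the tail
latencies = Array.new(97) { 0.2 } + [0.5, 4.0, 6.0]

percentile(latencies, 95) # => 0.2  (unaffected by the rare spikes)
percentile(latencies, 99) # => 4.0  (dominated by a handful of slow calls)
```

With only a few slow calls per hundred requests, p99 jumps to the outlier values while p95 stays at the typical latency - which is exactly why alarming on p99 kept triggering on noise.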
As we did not observe any regression in performance after the last change, we decided to also:
- increase the CPU threshold again from 20% to 25% for scale down policy
- reduce the minimum number of instances from 15 to 8, as during the night we were way below 20% CPU
- way better scaling: from 8 to 26 instances, instead of 15 to 32
- a bit better CPU utilization
- a slight latency regression, but nothing significant
With the last two improvements we had a bit more room to think about the CPU patterns we were seeing so far:
- quite low average utilization
- constantly spiking maximum peaks
The CPU utilization showed that some instances were constantly doing more work than others, which were waiting on IO. With the amount of IO operations this application had, that must have meant request routing wasn’t optimal with ALB’s round robin. This could lead to situations where some instances, while still processing long-running IO requests, were assigned even more requests - causing high latency spikes.
It turns out that on November 25th, 2019, AWS had released a new routing algorithm for ALB called “least outstanding requests”:
With this algorithm, as the new request comes in, the load balancer will send it to the target with least number of outstanding requests. Targets processing long-standing requests or having lower processing capabilities aren’t burdened with more requests and the load is evenly spread across targets. This also helps the new targets to effectively take load off of overloaded targets. AWS / What’s New
We had been planning to use it earlier, but it wasn’t available in the AWS Terraform provider until March 6th, 2020. Just in time.
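A toy model makes the difference between the two strategies concrete. The classes and names below are illustrative (our real targets are ECS tasks behind the ALB), sketching how least outstanding requests steers new load away from a busy target while round robin keeps feeding it:

```ruby
# Toy model of the two ALB routing strategies. Illustrative only -
# this is not AWS code, just the routing idea in miniature.
class Target
  attr_reader :name
  attr_accessor :outstanding

  def initialize(name)
    @name = name
    @outstanding = 0 # requests currently in flight on this target
  end
end

# Round robin: cycle through targets regardless of their current load
def round_robin(targets, counter)
  targets[counter % targets.size]
end

# Least outstanding requests: pick the target with the fewest in-flight requests
def least_outstanding(targets)
  targets.min_by(&:outstanding)
end

targets = [Target.new("a"), Target.new("b"), Target.new("c")]
targets[0].outstanding = 5 # "a" is stuck processing slow IO requests

round_robin(targets, 3).name      # => "a" - still gets every third request
least_outstanding(targets).name   # => "b" - busy "a" is skipped
```

This mirrors the failure mode we saw: round robin kept assigning requests to instances that were already tied up in long-running IO.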
After enabling the new algorithm, the results looked good in terms of latency, but worse in terms of auto scaling. As it turned out, the timing of our deployment was unfortunate: we deployed during one of the attacks we were handling at that time.
This made us rethink the alarm for the slow scale-up policy based on maximum CPU. We changed it to average CPU above 40% - instead of maximum above 50%, which was now being hit even more often.
The last thing we adjusted was the minimum number of instances: we increased it from 8 to 10, as auto scaling was behaving unstably.
This yielded good results:
- p95 got a little better than before employing the new ALB algorithm
- even fewer instances running to support the traffic
- better CPU utilization, though it still meant we were scaling on latency spikes instead of CPU utilization
The last point was a perfect indication that we should move on to web server configuration.
Phase 2: Web server
When we switched to ECS, we also switched from Apache with Passenger to Puma. We weren’t running with multiple threads because, given the age of the application, we weren’t sure it was thread-safe.
So now we had 3 options:
- Increase number of processes in Puma cluster
- Switch from processes to threads with Puma
- Use a combination of both
We decided to try Puma’s threaded mode first, for two reasons:
- We were using an instance with a single vCPU, so increasing the process count didn’t make much sense
- We wanted to test app for thread safety anyway 😊
To make sure the accounts application is thread-safe, we:
- Went over the Rack middlewares in use, to check there was nothing suspicious
- Switched from using a Redis connection directly to using a connection pool instead
- Configured Puma to run in threaded mode only, using the default maximum of 16 threads instead of 6 processes
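In Puma terms, the change boils down to a few lines in the server config. This is a sketch of the before/after based on the setup described above - treat the exact numbers as our case, not a universal recommendation:

```ruby
# config/puma.rb - before/after sketch of the switch to threaded mode

# Before: clustered mode, 6 single-threaded worker processes
# workers 6
# threads 1, 1

# After: a single process running Puma's threaded mode,
# with the default maximum of 16 threads
workers 0
threads 0, 16
```

With `workers 0` Puma runs in single mode (no forked cluster), and `threads min, max` lets the thread pool grow up to the maximum as concurrent requests arrive.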
With integration tests doing well in the staging environment, and the application not showing any unexpected errors for a few days, the decision was made to do a test run in production.
The results were more than satisfying:
- huge drop in latency across the board: p99, p95, p90, p75
- slightly fewer instances supporting the same traffic
- no real change in average CPU utilization
- stabilized maximum CPU, with fewer spikes
After this we also tested a combination of 3 processes in cluster mode with 8 threads each, but saw no improvement over plain threaded mode. This led us to conclude that we could safely move on to the last phase.
The accounts application is exposed to other services as a gem. The majority of services already leverage Datadog Application Performance Monitoring with distributed tracing to see how the various layers behave. This allowed us to see the impact on the gem side alone, and thus on all the clients utilizing it.
Phase 3: Revised auto scaling
As the application was now able to handle way more traffic, we decided to better utilize the CPU, mainly by increasing the auto scaling alarm thresholds:
- for scale-up, we increased the average CPU threshold from 40% to 55%
- for scale-down, we increased the average CPU threshold from 25% to 40%
This yielded nice results: we were running with a maximum of 8 instances and a minimum of 5. The only downside was higher p95 latency, so we decreased both CPU thresholds by 5 percentage points. The minimum number of instances was reduced to 2 - so if traffic comes back to the February baseline, even fewer instances will be used to support it.
The initial optimization work brought some much-needed headroom for scaling and better cost efficiency.
The switch to threaded mode improved and stabilized performance, which contributed to the final cost savings - mainly due to the nature of the application, which spends most of its time on IO operations.
Auto Scaling & Resources Utilization ✔️
At the beginning we were running from 15 up to 32 instances; now we run from 4 up to 11, which is roughly 3× fewer.
In other words, we had been overprovisioning and underutilizing resources. Now it looks much better, with average CPU utilization higher by 25%.
At the moment, with traffic still higher than at the beginning of March, the ECS cost is:
- 4 × lower compared to cost from before optimizations
- 2.4 × lower compared to cost from before traffic increase
As for performance, we simply didn’t want it to worsen. This went two ways:
For the application’s clients it looks way better, as we reduced the latency on the ALB:
The performance gained on the ALB level compensates for the slightly worse performance at the application level; in the end, the application is 2× faster.
There’s still a possibility it could be improved further by combining Puma’s threaded and clustered modes, for example 3 processes with 8-16 threads each on an instance with 2 CPUs.