The Auto Scaling functionality provided by AWS makes it easy to scale application clusters using spot or on-demand instances. However, it cannot fall back to an on-demand instance when a spot instance is unavailable, nor can it try to get instances in other availability zones. Another issue is that while AWS provides a spot termination notice, delivery of that notice is not guaranteed, so clean-up actions like moving logs to a persistent store before the instance terminates become tricky. Considering these shortcomings, we decided to implement our own auto scaler framework. Our custom auto scaler allows us to add custom scaling paradigms, build more sophisticated bidding strategies and implement robust fallback logic. We have also added features like spot termination detection and recovery handlers, extensive logging for alerts/metrics and Slack integration.
Scale Up: Spot Bidding and Fallback
When the ScaleX module generates a scale-up event because the configured scaling metric, such as CPU or QPS, has crossed its high threshold, the Orchestrator tries to launch a spot instance in the primary availability zone, with a bid based on past pricing history and a max bid cap. If the spot request fails for any reason, the Orchestrator tries to launch a spot instance in one of the additional configured availability zones, and launches an on-demand instance as a last resort. Instance details such as identifier, hostname, IP address, lifecycle and priority are persisted in a database; the instance is then initialized and added to the production cluster.
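The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the ScaleX implementation: the `Instance` record, `LaunchError`, the `compute_bid` markup rule and the injected `launch_spot` / `launch_on_demand` callables are all hypothetical names chosen for the example, and the real bidding strategy is not described in this post.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


@dataclass
class Instance:
    instance_id: str
    zone: str
    lifecycle: str  # "spot" or "on-demand"


class LaunchError(Exception):
    """Raised by a launcher when a request cannot be fulfilled."""


def compute_bid(price_history: Sequence[float], max_bid: float,
                markup: float = 1.2) -> float:
    """Assumed rule: bid a markup over the recent average, capped at max_bid."""
    if not price_history:
        return max_bid
    return min(mean(price_history) * markup, max_bid)


def launch_with_fallback(
    zones: Sequence[str],                      # primary zone first
    price_history: Dict[str, Sequence[float]],
    max_bid: float,
    launch_spot: Callable[[str, float], Instance],
    launch_on_demand: Callable[[str], Instance],
) -> Instance:
    """Try spot in each zone in order, then on-demand as a last resort."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    for zone in zones:
        bid = compute_bid(price_history.get(zone, []), max_bid)
        try:
            return launch_spot(zone, bid)
        except LaunchError:
            continue  # spot request failed here; try the next zone
    # Every spot attempt failed: fall back to on-demand in the primary zone.
    return launch_on_demand(zones[0])
```

Injecting the launchers keeps the ordering logic testable without touching AWS; in a real deployment they would wrap whatever EC2 client is in use and also persist the instance details to the database.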
Scale Down: Instance Prioritization and Cleanup
On detecting the need for a cluster scale-down, the Orchestrator determines the best instance to terminate. It prefers to terminate an on-demand instance over a spot instance, and instances in secondary availability zones over those in the primary availability zone. The instance is marked as terminated in the database and removed from the production cluster, and clean-up scripts are executed either externally to the instance or as a shutdown script on the terminating instance.
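One way to express this preference is as a sort key. The sketch below is an assumption about how the two rules might be combined: it treats lifecycle as the dominant criterion and zone as the tie-breaker, which the post does not specify, and the `Instance` record and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instance:
    instance_id: str
    zone: str
    lifecycle: str  # "spot" or "on-demand"


def pick_instance_to_terminate(
    instances: Sequence[Instance], primary_zone: str
) -> Optional[Instance]:
    """Return the preferred termination candidate, or None if there is none.

    False sorts before True, so on-demand instances come first, and within
    the same lifecycle, instances outside the primary zone come first.
    """
    if not instances:
        return None
    return min(
        instances,
        key=lambda i: (i.lifecycle != "on-demand", i.zone == primary_zone),
    )
```

For example, given a spot instance in the primary zone, a spot instance in a secondary zone and an on-demand instance in the primary zone, the on-demand instance is chosen; with only the two spot instances, the one in the secondary zone is chosen.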
Spot Termination: Detection and Recovery
Spot instances can be terminated abruptly by AWS in response to a market price surge; sometimes an entire spot fleet can be terminated at once. The scaler needs to replace these instances immediately to avoid service disruption, and it also needs to perform clean-up actions. Every spot instance polls for its own termination using the notice API provided by AWS and triggers clean-up actions when a notice appears. Since we cannot rely solely on the termination notice, the Reaper module checks the state of known spot instances frequently and launches replacements for terminated instances.
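A minimal sketch of both halves, under stated assumptions: the on-instance poller reads the EC2 instance metadata path `spot/termination-time`, which returns 404 until the instance is marked for termination, and the Reaper is modeled as a function over injected `get_state` and `launch_replacement` callables. The function names are hypothetical, and the plain GET assumes the metadata service accepts unauthenticated requests (instances that require session tokens would need an extra token step).

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional

TERMINATION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/termination-time"
)


def fetch_termination_time(url: str = TERMINATION_URL,
                           timeout: float = 1.0) -> Optional[str]:
    """Return the termination timestamp, or None if no notice is posted."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no termination scheduled
            return None
        raise
    except OSError:
        return None  # metadata service unreachable or timed out


def poll_once(fetch: Callable[[], Optional[str]],
              on_notice: Callable[[str], None]) -> bool:
    """Run clean-up if a notice is present; report whether one was seen."""
    when = fetch()
    if when is None:
        return False
    on_notice(when)
    return True


def reap(known_spot_ids: Iterable[str],
         get_state: Callable[[str], str],
         launch_replacement: Callable[[str], None]) -> List[str]:
    """Replace every known spot instance that is gone or going away."""
    replaced = []
    for instance_id in known_spot_ids:
        if get_state(instance_id) in ("shutting-down", "terminated"):
            launch_replacement(instance_id)
            replaced.append(instance_id)
    return replaced
```

On the instance, `poll_once(fetch_termination_time, run_cleanup)` would be called every few seconds, where `run_cleanup` stands for whatever log-shipping script applies. The Reaper runs off-instance, which is what covers the case where the notice never arrives.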
Monitoring: Metrics and Alerts
We use ScaleX to scale some of our core bidder services, so monitoring statistics and alerting on exceptions are immensely important. Metric collection becomes complicated because instances come and go frequently and hostnames/IP addresses change constantly. Each LIVE instance reports metrics like CPU utilization, requests per second and 95th-percentile latency into Graphite; these are then aggregated to report cluster-wide health in Grafana dashboards. Any scaling exceptions like spot unavailability or pre-emptive terminations are reported via Nagios, and daily counts are sent to Slack. These metrics help us optimize our bidding strategies and monitor whether the scaler is able to handle exceptions quickly.
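As an illustration of the per-instance reporting, the sketch below formats metrics in Graphite's plaintext protocol (one `path value timestamp` line per metric, conventionally sent to port 2003) and computes a nearest-rank 95th percentile. The metric naming scheme, which keys on a stable instance identifier rather than a changing hostname, and all function names are assumptions made for this example.

```python
import socket
from typing import Iterable, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def metric_path(cluster: str, instance_id: str, name: str) -> str:
    """Hypothetical naming scheme keyed on the instance identifier."""
    return f"scalex.{cluster}.{instance_id}.{name}"


def format_metric(path: str, value: float, timestamp: float) -> str:
    """One line of Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines: Iterable[str], host: str, port: int = 2003) -> None:
    """Send already formatted lines to a Graphite (Carbon) listener."""
    payload = "".join(lines).encode()
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(payload)
```

With a wildcard over the instance segment of the path, a dashboard query can aggregate across whichever instances happen to be live at the time.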
Not being the types to rest on our laurels, we plan to make some key improvements and add more bells and whistles. Our roadmap includes support for multiple instance types (spot fleet) with lowest price optimization, intelligent zone selection based on price/availability history, spot hibernation to resume from saved instance state and instance termination prediction. Over the next few months, we hope to make ScaleX available as an open source project that other startups can leverage to optimally scale their services in AWS.
RevX is an app retargeting platform that powers growth for mobile businesses through dynamic retargeting. The platform is built on integrated and transparent technology combining four key pillars: audience intelligence, programmatic media, personalized ads, and ROI optimization. Mobile marketers across verticals like e-commerce, travel, lifestyle, hyperlocal and gaming use RevX to enhance user engagement by activating new users, converting existing users and re-activating lapsed users.