The day the cloud stood still. Lessons learned roundup…
The well-publicized outage of EBS across multiple Availability Zones in the US-EAST-1 region of AWS last week kicked off some excellent blog posts from companies that, through robust architectural choices, managed to weather the storm quite well. The outage lasted five days, it has been called the worst cloud computing disaster yet, and Amazon's communications strategy didn't exactly shine, but it also presents an opportunity to learn from the companies whose sites rode it out better than many of their peers.
This is just a round-up of some of these posts, and the advice given. They’ve been edited down, of course, so be sure to read each of these articles for the whole story:
The Cloud is Not a Silver Bullet — Joe Stump, CTO of SimpleGeo
- Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated (a short sketch of scripted provisioning follows this list).
- You must build share-nothing services that span AZs at a minimum. Preferably your services should span regions as well, which is technically more difficult to implement, but will increase your availability by an order of magnitude.
- Avoid relying on ACID services. It’s not that you can’t run MySQL, PostgreSQL, etc. on the cloud, but the ephemeral and distributed nature of the cloud makes this much more difficult to sustain.
- Data must be replicated across multiple types of storage. If you run MySQL on top of RDS, you should be replicating to slaves on EBS, RDS multi-AZ slaves, ephemeral drives, etc. Additionally, snapshots and backups should span regions. This allows entire components to disappear and you to either continue to operate or restore quickly even if a major AWS service is down.
- Application-level replication strategies. To truly go multi-region, or to span across cloud services, you’ll very likely have to build replication strategies into your application rather than relying on those inherent in your storage systems.
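To make the automation point concrete, here is a minimal sketch of scripted provisioning spread across Availability Zones. It uses boto3 for illustration and is not SimpleGeo's actual tooling; the AMI ID, instance type and zone names are placeholders.

```python
# A minimal sketch of scripted provisioning (not SimpleGeo's actual tooling).
# Assumes boto3 with credentials configured; the AMI ID, instance type and
# zone names below are placeholders.
import boto3

AMI_ID = "ami-00000000"            # an image baked by your build pipeline
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Spread the same role across several Availability Zones so that losing
# one zone only removes a fraction of capacity.
for zone in ZONES:
    instance = ec2.create_instances(
        ImageId=AMI_ID,
        InstanceType="m1.small",   # placeholder size
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "web"}],
        }],
    )[0]
    print(zone, instance.id)
```

Once launching a node is a function call rather than a runbook, restoring capacity after a zone failure becomes a loop, not a scramble.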
How SmugMug survived the Amazonpocalypse — Don MacAskill, CEO of SmugMug
- Spread across as many AZs as you can. Use all four.
- If your stuff is truly mission critical (banking, government, health, serious money maker, etc), spread across as many Regions as you can.
- Beyond mission critical? Spread across many providers.
- Since spreading across multiple Regions and providers adds crazy amounts of extra complexity, and complex systems tend to be less stable, you could be shooting yourself in the foot unless you really know what you’re doing.
- Build for failure. Each component (EC2 instance, etc) should be able to die without affecting the whole system as much as possible.
- Understand your components and how they fail. Use any component, such as EBS, only if you fully understand it. For mission-critical data using EBS, that means RAID1/5/6/10/etc locally, and some sort of replication or mirroring across AZs, with some sort of mechanism to get eventually consistent and/or re-instantiate after failure events.
- Try to componentize your system. Why take the entire thing offline if only a small portion is affected?
- Test your components. I regularly kill off stuff on EC2 just to see what’ll happen.
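In the spirit of that last point, here is a hedged sketch of a Chaos-Monkey-style script that terminates one randomly chosen, opted-in instance. It is not SmugMug's tooling, and the chaos=optin tag is an assumed convention; boto3 is assumed.

```python
# A sketch of "kill things on purpose" in the spirit of Chaos Monkey, not
# SmugMug's actual tooling. Assumes boto3 and that sacrificial instances
# carry an (assumed) chaos=optin tag.
import random
import boto3

def terminate_random_instance(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:chaos", "Values": ["optin"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    candidates = [
        i["InstanceId"]
        for reservation in resp["Reservations"]
        for i in reservation["Instances"]
    ]
    if not candidates:
        print("nothing opted in to kill")
        return
    victim = random.choice(candidates)
    print("terminating", victim)
    ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    terminate_random_instance()
```

If the system genuinely tolerates component failure, running this is a non-event; if it isn't, better to find out on a quiet afternoon than during the next outage.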
AWS outage timeline & downtimes by recovery strategy — Eric Kidd, Randomhacks.net
Eric took an interesting look at various potential strategies, and how long a company would have been offline during the EBS outage:
- Rely on a single EBS volume with no snapshots: 3.5 days
- Deploy into a single availability zone, with EBS snapshots: over 12 hours (restoring from a snapshot into a healthy zone is sketched after this list)
- Rely on multi-AZ RDS databases to fail over to another availability zone: longer than 14 hours for some users.
- Run in 3 AZs, at no more than 60% capacity in each: This is the approach taken by Netflix, which sailed through this outage with no known downtime.
- Replicate data to another AWS region or cloud provider: This is still the gold standard for sites which require high uptime guarantees.
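For the snapshot-based strategies above, the mechanical part of recovery looks roughly like this: find the most recent snapshot of the affected volume and recreate it in a healthy zone. The volume ID and zone are placeholders, and boto3 is assumed.

```python
# A sketch of recovering a volume from its latest snapshot into a healthy
# Availability Zone. The volume ID and zone are placeholders; assumes boto3.
import boto3

ORIGINAL_VOLUME = "vol-00000000"   # the volume stuck in the affected zone
HEALTHY_AZ = "us-east-1c"          # a zone unaffected by the outage

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find the most recent snapshot of the affected volume.
snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "volume-id", "Values": [ORIGINAL_VOLUME]}],
)["Snapshots"]
latest = max(snapshots, key=lambda s: s["StartTime"])

# Recreate the data as a fresh volume in the healthy zone; it can then be
# attached to a replacement instance there.
new_volume = ec2.create_volume(
    SnapshotId=latest["SnapshotId"],
    AvailabilityZone=HEALTHY_AZ,
)
print("restored", latest["SnapshotId"], "as", new_volume["VolumeId"])
```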
The AWS Outage: The Cloud’s Shining Moment — George Reese, Founder of Valtira and enStratus
The Amazon model is the “design for failure” model. Under the “design for failure” model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage…
There are several requirements for “design for failure”:
- Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
- Each application component must make no assumptions about the underlying infrastructure—it must be able to adapt to changes in the infrastructure without downtime
- Each application component should be partition tolerant—in other words, it should be able to survive network latency (or loss of communication) among the nodes that support that component
- Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure (full disclosure, I am CTO of a company that sells such automation tools, enStratus)
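A toy illustration of the second and third requirements: the application discovers a healthy replica at runtime instead of assuming any particular backend is up. The endpoints are hypothetical, and this is not enStratus code.

```python
# A toy illustration of not assuming a particular backend is up: the client
# probes redundant replicas and uses whichever answers. Endpoints are
# hypothetical; this is not enStratus code.
import urllib.request

REPLICAS = [
    "http://10.0.1.10:8080/health",   # replica in one AZ (placeholder)
    "http://10.0.2.10:8080/health",   # replica in a second AZ (placeholder)
    "http://10.0.3.10:8080/health",   # replica in a third AZ (placeholder)
]

def first_healthy(endpoints, timeout=2):
    """Return the first endpoint that answers; tolerate slow or dead peers."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue   # unreachable or timed out -- try the next replica
    raise RuntimeError("no healthy replica in any AZ")

if __name__ == "__main__":
    print("routing traffic to", first_healthy(REPLICAS))
```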
Today’s EC2 / EBS Outage: Lessons learned — Stephen Nelson-Smith, Technical Director of Atalanta Systems
- Expect downtime…What matters is how you respond to downtime
- Use Amazon’s built-in availability mechanisms
- Think about your use of EBS:
  - EBS is not a SAN
  - EBS is multi-tenant…Consider using lots of volumes and building up your own RAID 10 or RAID 6 from EBS volumes (sketched at the end of this list).
  - Don’t use EBS snapshots as a backup…Although they are available across different availability zones in a given region, you can’t move them between regions.
  - Consider not using EBS at all
- Consider building towards a vendor-neutral architecture…Cloud abstraction tools like Fog, and configuration management frameworks such as Chef make the task easier.
- Have a DR plan, and practice it
- Infrastructure as code is hugely relevant…one of the great enablers of the infrastructure as code paradigm is the ability to rebuild the business from nothing more than a source code repository, some new compute resource (virtual or physical) and an application data backup.
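As referenced above, building your own RAID set from many small EBS volumes might look roughly like this. The instance ID, zone, size and device names are placeholders, boto3 is assumed, and the array itself is assembled with mdadm on the host.

```python
# A sketch of building a software RAID set from several EBS volumes, as the
# bullet above suggests. Instance ID, zone, size and device names are
# placeholders; assumes boto3. The array itself is assembled on the host.
import boto3

INSTANCE_ID = "i-00000000"     # placeholder
AZ = "us-east-1a"              # must match the instance's zone
DEVICES = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]

ec2 = boto3.client("ec2", region_name="us-east-1")

for device in DEVICES:
    volume = ec2.create_volume(Size=100, AvailabilityZone=AZ)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId=INSTANCE_ID,
        Device=device,
    )
    print("attached", volume["VolumeId"], "as", device)

# On the instance itself, something like:
#   mdadm --create /dev/md0 --level=10 --raid-devices=4 \
#         /dev/sdf /dev/sdg /dev/sdh /dev/sdi
# would stripe and mirror across the four volumes.
```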