An unexpected outage of critical Amazon cloud services disrupted a large number of web sites earlier this week, including big names like Spotify, Slack, Netflix, Pinterest, Trello, Buzzfeed and IFTTT. Ironically, isitdownrightnow, the web site used to check whether other web sites are down, also went down.
Amazon dug into the matter and identified the root cause as a mistyped command. A member of the Amazon S3 team was carrying out routine debugging, trying to find out why the S3 billing system was responding slowly.
The intent was to remove a small number of servers from one of the subsystems used by the S3 billing process, but a typographical error in the command removed far more servers than intended and set off the entire chain of failures.
The index subsystem, which holds the metadata for all S3 objects and tracks them, went down. Complicating matters further, the placement subsystem, which depends on the index subsystem, went down as well.
As a result, Amazon could no longer serve any S3 API requests from clients in the Northern Virginia region, designated US-EAST-1. It took the company around four hours to get the system back up and running.
The company has announced that it is taking steps to prevent such mishaps in the future.
Removing capacity is a routine operational practice, but after this outage Amazon has modified the tool so that capacity is removed more slowly than it was in this case. It has also added a safeguard that blocks any removal that would take a subsystem below its minimum required capacity. This means that even if such a typo occurs in the future, it should not be able to disrupt services in a similar way.
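To make the two safeguards concrete, here is a minimal Python sketch of how a capacity-removal tool might enforce them. It is not Amazon's actual tool: the `Fleet` class, the `plan_removal` function, the 5% per-step cap and the example numbers are all hypothetical, chosen only to illustrate a minimum-capacity floor combined with gradual removal.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    """Hypothetical model of one subsystem's server fleet."""
    name: str
    active: int        # servers currently in service
    min_required: int  # floor below which removal is refused


def plan_removal(fleet: Fleet, requested: int, max_percent_per_step: int = 5) -> list[int]:
    """Split a removal request into small batches, or refuse it outright.

    Raises ValueError if the request would leave the fleet below its floor.
    The 5% default cap is an illustrative assumption, not a published value.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")

    # Safeguard 1: never go below the minimum required capacity.
    if fleet.active - requested < fleet.min_required:
        raise ValueError(
            f"refusing to remove {requested} servers from {fleet.name}: "
            f"would leave {fleet.active - requested}, minimum is {fleet.min_required}"
        )

    # Safeguard 2: remove gradually, in batches capped at a share of the fleet.
    batch_cap = max(1, fleet.active * max_percent_per_step // 100)
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(batch_cap, remaining)
        batches.append(step)
        remaining -= step
    return batches


billing = Fleet(name="billing", active=100, min_required=80)
print(plan_removal(billing, 12))  # [5, 5, 2]
```

With these example numbers, a mistyped request for 50 servers instead of 5 would raise a `ValueError` rather than take half the fleet offline, and even a valid request is carried out in small steps that an operator could halt midway.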
Amazon is also auditing its other operational tools to ensure they have similar safety checks. To improve the recovery time of key S3 subsystems, it will refactor parts of the service into smaller cells, which limits the impact of such disruptions and speeds up recovery. It also plans to further partition the index subsystem later in the year.
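The idea behind cells is that each one serves only a slice of the workload, so a failure in one cell touches only that slice. Amazon has not published how it assigns work to cells; the sketch below simply assumes a stable hash of the object key, and the `cell_for` and `affected_keys` helpers are hypothetical names used for illustration.

```python
import hashlib


def cell_for(key: str, num_cells: int) -> int:
    """Map a key to a cell using a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_keys(keys: list[str], failed_cell: int, num_cells: int) -> list[str]:
    """Return only the keys served by the failed cell."""
    return [k for k in keys if cell_for(k, num_cells) == failed_cell]


keys = [f"object-{i}" for i in range(1000)]
down = affected_keys(keys, failed_cell=3, num_cells=8)
print(f"{len(down)} of {len(keys)} keys affected by the loss of one cell")
```

With eight cells, losing one affects roughly an eighth of the keys instead of all of them, and the smaller cell is also quicker to restart and validate.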
Amazon closed its statement with an apology to its customers: “Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”