Typo causes Amazon Simple Storage Service to go down, impacting numerous websites
Nearly 150,000 websites were inaccessible last Tuesday after Amazon’s Simple Storage Service (S3) began experiencing issues in its US-EAST-1 region. Among them was isitdownrightnow.com, a website designed to tell users whether other websites are working.
Other impacted sites included The Washington Post, Imgur, Giphy and indeed.com. The Daily Targum’s web host, SNWorks, also uses S3 for its file-hosting needs.
Services like GroupMe could not offer their browser versions, while those like Slack and Imgur could not share any files online during the outage.
Amazon S3 is used to support data hosting online and allows its clients to store data, files, backups and other information on their cloud, which is supported by datacenters located around the world.
In a post-mortem statement, Amazon Web Services explained the outage occurred while a team was trying to fix an issue at one of these datacenters on the East Coast. A mistyped command caused servers at this datacenter to cease working properly, which ultimately forced the outage.
“An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that (are) used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” according to the AWS statement.
In other words, an administrator mistyped part of a command, which took far more servers offline at the AWS datacenter than intended.
The extra servers that were removed supported a pair of S3 subsystems, according to the statement.
“One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT and DELETE requests. The second subsystem, the placement subsystem, manages the allocation of new storage and requires the index subsystem to be functioning properly to correctly operate,” AWS said.
Basically, with too many servers removed from active use, sites that used S3 in the US-EAST-1 region lost access to anything they or their users had stored there, which ranged from uploaded files to entire internet portals.
According to the AWS statement, the team had to reboot the systems they had accidentally crippled, which caused a delay in their ability to recover from the outage.
“Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage … were also impacted while the S3 APIs were unavailable,” according to the statement.
In all, about four hours passed between the start of the outage and full recovery.
“By 1:18 p.m. PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally,” according to the statement. “ … The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54 p.m. PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.”
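The staggered recovery follows from how the two subsystems depend on each other, and a small Python sketch can make that concrete. The names below are hypothetical and this is not Amazon's code. It also assumes that PUT requests need the placement subsystem as well as the index, which is an inference from PUT being absent from the list of APIs restored at 1:18 p.m.

```python
# Subsystems each request type needs. The index requirement comes from the
# AWS statement; PUT's reliance on "placement" is an assumption inferred
# from the recovery timeline.
REQUIRES = {
    "GET": {"index"},
    "LIST": {"index"},
    "DELETE": {"index"},
    "PUT": {"index", "placement"},
}


def available_requests(healthy):
    """Return the request types that can be served by the healthy subsystems."""
    up = set(healthy)
    # The placement subsystem cannot operate unless the index is working.
    if "index" not in up:
        up.discard("placement")
    return sorted(op for op, needs in REQUIRES.items() if needs <= up)


print(available_requests(set()))                   # during the restart
print(available_requests({"index"}))               # index recovered
print(available_requests({"index", "placement"}))  # both recovered
```

With only the index healthy, the sketch reports GET, LIST and DELETE as available, which matches the 1:18 p.m. milestone in the statement.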
Put another way, an accidental typo took down a large number of websites that relied on S3’s US-EAST-1 region for about four hours. Amazon had to restart its systems to fix the issue, and recovery took that long because the subsystems had not been restarted in years and took longer than expected to come back online.
Amazon hosts only around 1 percent of the internet, but has been expanding its cloud storage services over the last few years, according to Wired.
The internet giant may actually host more than 1 percent, as that number was an estimate from 2012.
According to Similar Tech, AWS hosts more than 152,000 websites with 124,577 unique domains as of March 6.
In its statement, Amazon announced it will work on changing how its S3 backend works to prevent similar incidents from happening again. The company has already modified its system to prevent necessary capacity from being removed by accident.
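A safeguard of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Amazon's actual tooling: the function name, the server names and the minimum-capacity figure are all hypothetical.

```python
class CapacityError(Exception):
    """Raised when a removal would leave too few servers running."""


def remove_servers(active, to_remove, min_required):
    """Return the servers left after removal, or refuse the request.

    active, to_remove: collections of server names (hypothetical).
    min_required: fewest servers the subsystem needs to keep working.
    """
    remaining = set(active) - set(to_remove)
    if len(remaining) < min_required:
        # Reject the whole request instead of removing part of it.
        raise CapacityError(
            f"refusing removal: {len(remaining)} servers would remain, "
            f"minimum is {min_required}"
        )
    return remaining


servers = [f"server-{i}" for i in range(10)]

# Removing two of ten servers keeps capacity above a minimum of five.
print(len(remove_servers(servers, servers[:2], min_required=5)))

# A mistyped input that selects eight servers is rejected outright.
try:
    remove_servers(servers, servers[:8], min_required=5)
except CapacityError as err:
    print(err)
```

The point of the check is that a typo in the input fails loudly before anything is taken offline, rather than being carried out as typed.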
Nikhilesh De is a correspondent for The Daily Targum. He is a School of Arts and Sciences senior. Follow him on Twitter @nikhileshde for more.