Iron Sysadmin Podcast: Episode 10 – Trouble in the Cloud

Episode 10 – Trouble in the Cloud

Mar 14, 2017

http://ironsysadmin.com/wp-content/uploads/2017/03/IronSysadmin-EP10.mp3

Welcome to Episode 10

News
https://www.bloomberg.com/news/articles/2017-03-08/microsoft-pledges-to-use-arm-server-chips-threatening-intel-s-dominance

Firefox 52 will be the last version of Firefox for Windows XP and Vista

https://www.cnet.com/news/look-out-windows-android-is-catching-up/
https://www.wired.com/2017/03/atari-chip-set-off-bitter-war-among-neuroscientists/?mbid=nl_3817_p2&CNDID=21798766
http://www.nature.com/nature/journal/v543/n7644/full/nature21371.html
NIST’s new password rules – what you need to know
https://xkcd.com/936/

Announcements
Feedback
@Gangrif and @Xenophage make a great pair that will titillate one’s ears! They cover things in the ops and
infosec news categories and topics that are relatable or at least interesting to discuss. It’s not your typical
format of a podcast, but that is what makes it refreshing.

Keep up the great content guys!

Patreon, you guys are awesome
$10 tier.
The face!

YouTube stream for this episode! https://youtu.be/EeD5y34oKNY

Chat

Main topic
Trouble in the cloud: the 2/28/2017 US-EAST-1 S3 outage
https://aws.amazon.com/message/41926/
An Amazon employee was troubleshooting a problem with their S3 billing mechanisms.
A mistake made while running an established playbook took down systems that were not intended to be taken down.
The downtime, which was intended only for billing systems, took down systems essential to both reads and writes through the S3 API.
This required that some systems be rebooted.
Full restarts of the Index and Placement subsystems (two of the systems that were accidentally taken down) had not been performed for years.
Due to the dependencies between these systems, the restarts took quite some time.
The downtime caused a backlog of requests, which had to be processed once the systems were operational again.

Remediation
The core issues here were the number of systems unintentionally taken offline, and the fact that systems that depended on each other were taken down at the same time.
Amazon has made changes to their tools to help prevent systems from dropping below service-affecting thresholds.
They are also working to remove some of the inter-dependencies.
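
As a rough illustration of the kind of tooling safeguard described above, here is a minimal Python sketch of a pre-flight check that refuses to remove hosts when the result would fall below a minimum capacity threshold. This is not Amazon's actual tooling. The names (Subsystem, plan_removal, CapacityGuardError), the host counts, and the small-batch removal behavior are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


class CapacityGuardError(Exception):
    """Raised when a removal would drop a subsystem below its safe minimum."""


@dataclass
class Subsystem:
    # All fields are hypothetical; a real tool would pull these from inventory.
    name: str
    active_hosts: int
    min_hosts: int  # assumed service-affecting threshold


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> List[int]:
    """Return batch sizes for removing `requested` hosts, or raise if unsafe."""
    if requested < 0 or max_batch < 1:
        raise ValueError("requested must be >= 0 and max_batch must be >= 1")

    # Safeguard: check the end state before touching anything.
    remaining = subsystem.active_hosts - requested
    if remaining < subsystem.min_hosts:
        raise CapacityGuardError(
            f"removing {requested} hosts from {subsystem.name} leaves "
            f"{remaining}, below the minimum of {subsystem.min_hosts}"
        )

    # Split the work into small batches so a typo cannot remove everything at once.
    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(max_batch, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    billing = Subsystem(name="billing", active_hosts=20, min_hosts=15)
    print(plan_removal(billing, requested=5, max_batch=2))
    try:
        plan_removal(billing, requested=12)  # a fat-fingered input
    except CapacityGuardError as err:
        print(f"refused: {err}")
```

The point of the sketch is that the threshold check runs before any host is removed, so a mistyped input fails loudly instead of taking a subsystem offline.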

On top of everything, the S3 status page depended on the health of the S3 service in order to operate.
This made it difficult for customers to view the status of S3.

Intro and Outro music credit: Tri Tachyon, Digital MK 2
http://freemusicarchive.org/music/Tri-Tachyon/