Which of the following resulted in an outage for Netflix customers on Christmas Eve 2012?

Amazon’s explanation for the problem that took down Netflix and other sites on Christmas Eve: human error.

The Web giant blamed an unnamed developer who ran a maintenance process against state data used by the company’s Elastic Load Balancers, or ELBs. That mistake cascaded into other areas. At its peak, 6.8 percent of the company’s ELBs were affected. That might not sound like a lot, but each of those ELBs was balancing loads across multiple servers.

Netflix was forced to apologize for the outage, publicly pinning the blame on AWS infrastructure. That was small consolation to anyone seeking to escape from family holiday duties with a streaming marathon of American Horror Story.

Amazon also apologized. “We know how critical our services are to our customers’ businesses, and we know this disruption came at an inopportune time for some of our customers,” it wrote. “We will do everything we can to learn from this event and use it to drive further improvement in the ELB service.”

Amazon’s U.S. East region has been bitten by a number of small outages over the past several months. A June 2012 electrical storm, for example, disrupted its services and knocked high-profile clients such as Instagram and Netflix offline. Amazon’s other U.S. data centers, including ones in Oregon and California, haven’t suffered from widespread outages.

The Problem

The service disruption began at 12:24 PM PST on December 24th, when the aforementioned developer accidentally triggered a maintenance program that erased state data used to manage the region’s load balancers. That generated a high number of API errors. In an odd twist, customers could still create and manage new load balancers, but could not manage the ones they had created earlier.

“During this event, because the ELB control plane lacked some of the necessary ELB state data to successfully make these changes, load balancers that were modified were improperly configured by the control plane,” Amazon wrote. “This resulted in degraded performance and errors for customer applications using these modified load balancers.”

Amazon disabled several ELB control plane workflows at 5:28 PM on Christmas Eve and worked through the night to manually bring back some of the affected ELBs. Amazon also tried to restore the ELBs to their state just before the outage, an automated process that would have solved the problem, but it could not produce a workable snapshot of the data until it found an alternate approach. The service did not return to normal until 12:05 PM PST on Christmas Day.
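To put those timestamps in perspective, the short Python sketch below works out the elapsed times. It assumes all three times quoted above are on the same PST clock (the article gives no time zone for the 5:28 PM figure), and the variable and function names are illustrative only.

```python
from datetime import datetime

# Timestamps as quoted in the article. Treating all three as PST is an
# assumption: the 5:28 PM figure is given without a time zone.
start = datetime(2012, 12, 24, 12, 24)               # disruption begins
workflows_disabled = datetime(2012, 12, 24, 17, 28)  # control plane workflows disabled
recovered = datetime(2012, 12, 25, 12, 5)            # service back to normal


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


time_to_disable = hours_minutes(workflows_disabled - start)
total_outage = hours_minutes(recovered - start)

print("Workflows disabled after %dh %02dm" % time_to_disable)
print("Service restored after %dh %02dm" % total_outage)
```

By this reckoning, roughly five hours passed before the workflows were disabled, and the disruption as a whole lasted just under a full day.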

Lessons Learned

Amazon’s mea culpa highlights two areas in which the company can improve: access to its infrastructure, and disaster recovery (even if that disaster was self-inflicted).

Data center operators running a private cloud will undoubtedly get a bit of a chuckle from Amazon’s woes. Although companies operating a private cloud must bear the costs of infrastructure and deployment, in theory they have the ability to manage access in a way that Amazon does not. Amazon said that is one of the practices it will change: it will limit access to production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Those processes are currently transitioning over to an automated process that can be directly controlled by Amazon.

Amazon also tacitly acknowledged that its recovery strategy could have been better implemented. However, the company said it had learned from its mistake. “We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state,” it said. “This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.”

As arguably the highest-profile public cloud, Amazon’s services are closely scrutinized. But even the smallest data-center provider can take away some key lessons, not the least of which is that disaster-recovery strategies need to be as fine-grained, and as fine-tuned, as possible.

A sign is shown at the headquarters of Netflix in Los Gatos, California September 20, 2011. REUTERS/Robert Galbraith

NEW YORK (Reuters) - An outage at one of Amazon’s web service centers hit users of Netflix Inc’s streaming video service on Christmas Eve and was not fully resolved until Christmas Day, a spokesman for the movie rental company said on Tuesday.

The outage impacted Netflix subscribers across Canada, Latin America and the United States, and affected various devices that enable users to stream movies and television shows from home, Netflix spokesman Joris Evers said. Such devices range from gaming consoles like the Nintendo Wii and PlayStation 3 to Blu-ray DVD players.

Netflix, which is based in Los Gatos, California, has 30 million streaming subscribers worldwide. More than 27 million of them are in the Americas region that was exposed to the outage and could potentially have been affected, Evers said.

Evers said the issue resulted from an outage at an Amazon Web Services cloud computing center in Virginia. It started at about 12:30 p.m. PST (2030 GMT) on Monday, and service was fully restored before 8:00 a.m. PST on Tuesday, although streaming was available for most users by 11:00 p.m. PST on Monday.

The event marks the latest in a series of outages from Amazon Web Services, with one occurring in April of last year that knocked out such sites as Reddit and Foursquare.

“We are investigating exactly what happened and how it could have been prevented,” Evers of Netflix said.

“We are happy that people opening gifts of Netflix or Netflix capable devices can watch TV shows and movies and apologize for any inconvenience caused last night,” he added.

Officials at Amazon Web Services were not available for comment. Evers, the Netflix spokesman, declined to comment on the company’s contracts with Amazon.

Reporting by Sam Forgione; Editing by Leslie Gevirtz and Matt Driskill
