VMware’s attempt to recover from an outage in its brand-new cloud computing service inadvertently caused a second outage the next day, the company said.
VMware’s new Cloud Foundry service — which is still in beta — suffered downtime over the course of two days last week, not long after the more highly publicised outage that hit Amazon’s Elastic Compute Cloud. Cloud Foundry, a Platform-as-a-Service (PaaS) offering for developers to build and host Web applications, was announced April 12 and suffered “service interruptions” on April 25 and April 26.
The first incident was caused by a failure in the power supply for a storage cabinet. Applications remained online, but developers weren’t able to perform basic tasks such as logging in or creating new applications. The outage lasted nearly 10 hours and was fixed by the afternoon.
VMware official Dekel Tankel explained that the April 25 power outage is “something that can and will happen from time to time,” and that VMware has to ensure that its software, monitoring systems and operational practices are robust enough to prevent power outages from taking customer systems offline. With that in mind, VMware began developing “a full operational playbook for early detection, prevention and restoration” the very next day.
Tankel said: “This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.”
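In other words, VMware engineers accidentally caused the second outage while developing a plan to detect and prevent the kind of problem that had hit the service the previous day. The second outage was far more serious: where the first left running applications online, the second cut off all external connectivity to Cloud Foundry.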
VMware’s second-day problem illustrated the role human error plays in cloud networks, just as the root-cause analysis of Amazon’s cloud outage did. In Amazon’s case, a mistake made during a system upgrade led to trouble that took several days to fully correct.
VMware, which is best known for its server virtualisation technology, is a new player in offering a publicly available cloud service. Previously, VMware sold technology to help customers and service providers build their own clouds. Because Cloud Foundry is so new, the customer impact was not as severe as that of Amazon’s outage, which forced offline numerous websites that rely on Amazon infrastructure. But VMware is getting a taste of what it’s like to be a service provider when things go wrong.