CloudHostingTech.com on August 04, 2012: Complete details of last week's public cloud outage were released on Thursday, August 2, 2012, in a post by the general manager of Windows Azure on the company's blog. I reported on July 30, 2012, on this Windows Azure outage, which lasted more than 2.5 hours in the Western European region.
Windows Azure General Manager Mike Neil said that the main cause of the outage in the Western European sub-region was a misconfigured “safety valve”. The safety valve mechanism is designed to protect against potential cascading network failures; it limits the number of connections that the network hardware in the data center will accept.
In his statement, he further said, “New capacity was added to the sub-region in response to increased demand, but the limit in the networking devices was not adjusted to meet the new capacity. The threshold was exceeded with the rapid increase in usage in the cluster, which resulted in a sizeable amount of network management messages. The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100 percent CPU utilization impacting data traffic”.
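The failure mode described in the statement, capacity growing while a configured limit stays fixed, can be illustrated with a minimal sketch. This is not Microsoft's tooling: the `Cluster` class, its fields, the `limit_is_stale` function, and the 20 percent headroom figure are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical model of a cluster; none of these fields come from Azure."""
    name: str
    provisioned_connections: int  # assumed: connections the deployed capacity may open
    safety_valve_limit: int       # assumed: connection cap configured on network devices


def limit_is_stale(cluster: Cluster, headroom_percent: int = 20) -> bool:
    """Return True when the configured limit no longer covers capacity plus headroom.

    Integer arithmetic is used so the comparison has no rounding surprises.
    """
    required_x100 = cluster.provisioned_connections * (100 + headroom_percent)
    return cluster.safety_valve_limit * 100 < required_x100


# Before expansion the limit covers the capacity with headroom to spare.
before = Cluster("west-europe-example", provisioned_connections=1000, safety_valve_limit=1500)

# After capacity is added, the unchanged limit is too low and should be flagged.
after = Cluster("west-europe-example", provisioned_connections=2000, safety_valve_limit=1500)
```

A check of this kind, run whenever capacity changes, would flag `after` but not `before`; it only illustrates the idea of validating limits against capacity and says nothing about how the actual fix was implemented.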
The post further stated that the issue has been resolved by reconfiguring the safety valve to increase the hardware's connection-handling capacity; the management system has also been improved to provide better monitoring and maintenance capabilities.
Mr. Neil also explained in the blog post that the full details are being provided to keep customers informed and to remove any ambiguity about the incident.