Skype Ltd. today blamed last week's Windows security updates for triggering a bug in its software that brought down the Internet telephony service for more than 48 hours ... spokesman Villu Arak [said], "The disruption was triggered by a massive restart of our users' computers across the globe within a very short time frame as they rebooted after receiving a routine set of patches through Windows Update." According to Arak, the large number of restarts started a chain reaction that brought down the service.
...
Although Skype fingered Tuesday's Windows updates for triggering the outage, it said the root cause was "a previously unseen software bug within the network resource-allocation algorithm" that prevented the network from recovering on its own, as it was supposed to do ... The company did not ... explain how this month's updates were different from past rounds of patches.
For example, Microsoft Corp.'s security updates have been on their current schedule of the second Tuesday of each month since October 2003, before Skype left beta testing. And required restarts are the norm for many of Microsoft's security updates. Nor was the quantity of restarts last week ... out of line with previous months. [more]
Here's Villu Arak (for it is he):
Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.
The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk. [more]
Long Zheng tilts:
Am I the only one to find this claim a little bit of a puzzler? ... Think about it. If Windows Update did in fact cause the restart of millions of Skype users worldwide, which it can do without argument, then how come Skype doesn’t crash the second Tuesday of every month when of course Microsoft distributes its Windows patches like they have for the past 3 years and years of unscheduled patches prior to that? As far as I recall, last week wasn’t any different.
Am I missing something? I’m not saying it was not Windows Update, but why only last week did it do what it could have done 36 times already? [more]
This is the time on sprockkets when we post to Slashdot:
Here is why it caused a problem. Sometimes, in their great wisdom, Microsoft deems it necessary to reboot your computer remotely (if the security vulnerability is severe). They've done it at least once before. This week, I just happened to catch it that the automatic updates had a 3 minute countdown until it rebooted, and unless I had been there, it would have, so if all were told to do so, all would have rebooted at once.
Oh, might as well add that my computer is set to only download the updates and not install them. I did not recall telling any of them to install. [more]
And this Anonymous Coward concurs:
Something was different last week ... I had six servers reboot that had autoupdates turned off. My desktop system running 2003R2 and my laptop running XP also rebooted w/o my permission. We have quite a few pissed-off customers because of the updates. It was an unusual situation. [more]
Bruno Giussani has a gun under his bed (probably):
In August 2004, I interviewed Niklas Zennström, one of the two Skype founders. At the time, there were about 550'000 concurrent Skype users on average. I asked him about scaling to serve a larger crowd ... "We won't need to invest in infrastructure. What we will need to do at some point is to make some changes in the technology to be able to scale more. If we didn’t do anything, when we reach 10 million concurrent users (20 times more than now) we believe there will be problems."
...
The 10-million mark has been reached. So one can wonder whether the issue with Skype is not larger than what their blog post says. [more]
Martin MC Brown looks back:
Skype makes great pains to explain that you cannot rely on Skype for calling emergency services. However, people do rely on Skype in the same way as they expect to rely on their phone system, and I cannot remember once in the last 30 years that I've been using the phone system here in the UK that the phone hasn't worked when you picked up the handset.
...
Today we rely on the Internet more and more, and Skype is a major part of a wider puzzle that includes other VoIP services, email, newsgroups, IM, and social networking sites, not to mention access to movies and web sites. Given that, while failures can be tolerated, perhaps we ought to be looking at wider issues on how important the internet and services that it provides are, and how we go about ensuring availability of those services. [more]
Mark Evans says it's worth every penny:
What are the expectations of people using a free service? If, for example, you’ve never paid a penny to use Skype can you really complain too much when it goes down for awhile? Sure you’ve become dependent on it as an everyday communications tool but what do you expect for nothing?
...
Don’t get me wrong, having your service go down is a bad thing because if enough users get frustrated and decide to leave for a rival, it means advertisers could go away too. But the reality is this sense of entitlement among online users is unrealistic because expecting to get everything for nothing is just wrong. [more]
Rick Aristotle Munarriz is a Fool:
Publicized outages can also be merit badges. In its absence, consumers begin to realize how important a service like Skype is in their lives. In that sense, eBay knows outages well. It has suffered through regrettably well-publicized downtime at both PayPal and eBay in the past. Each service emerged stronger than ever.
A sticky telco service like Research In Motion's (Nasdaq: RIMM) BlackBerry suffered an email outage for several hours back in April, and you don't see folks heaving their wireless smartphones into the nearest dumpster. [more]
The failure mode reminds this Anonymous Coward of a 50-year-old outage:
The story is that many years ago an earthquake rattled a California town ... The earthquake had jostled thousands of telephones off hook. The central office switches survived the quake just fine, but crashed due to a bug ... the switch kept a list of phones that were off hook ... the central office only had a certain number of units that could play dial tone and listen for dialing. So the first "n" phones off hook got dial tone; the rest were put into a FIFO list of phones waiting for dial-tone equipment. There were so many phones off hook due to the earthquake that the FIFO list overflowed, crashing the switch.
When the switch rebooted, it had to figure out which phones needed dial-tone ... thus overflowing the list and crashing the switch again. And again. And again. [more]
Buffer overflow:
Around the Net | And finally... Sarcastic Wednesday | Around Computerworld | Previously in IT Blogwatch
- Groklaw: Judge Kimball Sets the Rules of the Road for SCO v. Novell
- DrunkenData: Let the Battle Be Joined
- Peter R Everitt: Will CIO's value a IT knowledge network?
- Joel on Software: Even the Office 2007 box has a learning curve
- Mary Jo Foley: Partnering and competing with Microsoft: There's nothing new under the sun
- Network Security: More data points in the disclosure argument
- Realtime IT Compliance: Social Security Number No Match Rule: Employers Will Need to Prove Compliance
- Blah, Blah! Technology: Google to kill the PageRank?
- Tom Olzak: IBM secures mainframe OS
Richi Jennings is an independent analyst/adviser/consultant, specializing in blogging, email, and spam. A 20-year, cross-functional IT veteran, he is also an analyst at Ferris Research. You too can pretend to be Richi's friend on Facebook, or just use boring old email: blogwatch@richi.co.uk.