Richi Jennings

What's Skype's excuse for huge outage? (and sarcastic Wednesday)

August 21, 2007 6:11 AM EDT
Is it me you're looking for? Tuesday's IT Blogwatch: in which we examine Skype's reasons for going down for so long last week. Not to mention Hoops and Yoyo...

Gregg Keizer managed to get through and say:
Skype Ltd. today blamed last week's Windows security updates for triggering a bug in its software that brought down the Internet telephony service for more than 48 hours ... spokesman Villu Arak [said], "The disruption was triggered by a massive restart of our users' computers across the globe within a very short time frame as they rebooted after receiving a routine set of patches through Windows Update." According to Arak, the large number of restarts started a chain reaction that brought down the service.
Although Skype fingered Tuesday's Windows updates for triggering the outage, it said the root cause was "a previously unseen software bug within the network resource-allocation algorithm" that prevented the network from recovering on its own, as it was supposed to do ... The company did not ... explain how this month's updates were different from past rounds of patches.

For example, Microsoft Corp.'s security updates have been on their current schedule of the second Tuesday of each month since October 2003, before Skype left beta testing. And required restarts are the norm for many of Microsoft's security updates. Nor was the quantity of restarts last week ... out of line with previous months. [more]
Here's Villu Arak (for it is he):
Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.

The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk. [more]
Long Zheng tilts:
Am I the only one to find this claim a little bit of a puzzler? ... Think about it. If Windows Update did in fact cause the restart of millions of Skype users worldwide, which it can do without argument, then how come Skype doesn’t crash the second Tuesday of every month when of course Microsoft distributes its Windows patches like they have for the past 3 years and years of unscheduled patches prior to that? As far as I recall, last week wasn’t any different.

Am I missing something? I’m not saying it was not Windows Update, but why only last week did it do what it could have done 36 times already? [more]
This is the time on sprockkets when we post to Slashdot:
Here is why it caused a problem. Sometimes, in their great wisdom, Microsoft deems it necessary to reboot your computer remotely (if the security vulnerability is severe). They've done it at least once before. This week, I just happened to catch it that the automatic updates had a 3 minute countdown until it rebooted, and unless I had been there, it would have, so if all were told to do so, all would have rebooted at once.

Oh, might as well add that my computer is set to only download the updates and not install them. I did not recall telling any of them to install. [more]
And this Anonymous Coward concurs:
Something was different last week ... I had six servers reboot that had autoupdates turned off. My desktop system running 2003R2 and my laptop running XP also rebooted w/o my permission. We have quite a few pissed-off customers because of the updates. It was an unusual situation. [more]
Bruno Giussani has a gun under his bed (probably):
In August 2004, I interviewed Niklas Zennström, one of the two Skype founders. At the time, there were about 550'000 concurrent Skype users on average. I asked him about scaling to serve a larger crowd ... "We won't need to invest in infrastructure. What we will need to do at some point is to make some changes in the technology to be able to scale more. If we didn’t do anything, when we reach 10 million concurrent users (20 times more than now) we believe there will be problems."
The 10-million mark has been reached. So one can wonder whether the issue with Skype is not larger than what their blog post says. [more]
Martin MC Brown looks back:
Skype makes great pains to explain that you cannot rely on Skype for calling emergency services. However, people do rely on Skype in the same way as they expect to rely on their phone system, and I cannot remember once in the last 30 years that I've been using the phone system here in the UK that the phone hasn't worked when you picked up the handset.
Today we rely on the Internet more and more, and Skype is a major part of a wider puzzle that includes other VoIP services, email, newsgroups, IM, and social networking sites, not to mention access to movies and web sites. Given that, while failures can be tolerated, perhaps we ought to be looking at wider issues on how important the internet and services that it provides are, and how we go about ensuring availability of those services. [more]
Mark Evans says it's worth every penny:
What are the expectations of people using a free service? If, for example, you’ve never paid a penny to use Skype can you really complain too much when it goes down for awhile? Sure you’ve become dependent on it as an everyday communications tool but what do you expect for nothing?
Don’t get me wrong, having your service go down is a bad thing because if enough users get frustrated and decide to leave for a rival, it means advertisers could go away too. But the reality is this sense of entitlement among online users is unrealistic because expecting to get everything for nothing is just wrong. [more]
Rick Aristotle Munarriz is a Fool:
Publicized outages can also be merit badges. In its absence, consumers begin to realize how important a service like Skype is in their lives. In that sense, eBay knows outages well. It has suffered through regrettably well-publicized downtime at both PayPal and eBay in the past. Each service emerged stronger than ever.

A sticky telco service like Research In Motion's (Nasdaq: RIMM) BlackBerry suffered an email outage for several hours back in April, and you don't see folks heaving their wireless smartphones into the nearest dumpster. [more]
The failure mode reminds this Anonymous Coward of a 50-year-old outage:
The story is that many years ago an earthquake rattled a California town ... The earthquake had jostled thousands of telephones off hook. The central office switches survived the quake just fine, but crashed due to a bug ... the switch kept a list of phones that were off hook ... the central office only had a certain number of units that could play dial tone and listen for dialing. So the first "n" phones off hook got dial tone; the rest were put into a FIFO list of phones waiting for dial-tone equipment. There were so many phones off hook due to the earthquake that the FIFO list overflowed, crashing the switch.

When the switch rebooted, it had to figure out which phones needed dial-tone ... thus overflowing the list and crashing the switch again. And again. And again. [more]
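For the curious, here is a rough Python sketch of the crash loop that commenter describes. It is only an illustration of the story as told, not the real switch firmware: the class and function names, the capacities, and the use of an exception to stand in for "the switch crashes" are all assumptions made up for this example.

```python
from collections import deque


class SwitchCrash(Exception):
    """Hypothetical stand-in for the switch crashing on list overflow."""


class Switch:
    def __init__(self, tone_units, queue_capacity):
        self.tone_units = tone_units          # units able to play dial tone
        self.queue_capacity = queue_capacity  # fixed size of the waiting list
        self.served = []                      # phones currently given dial tone
        self.waiting = deque()                # FIFO of phones waiting for a unit

    def off_hook(self, phone):
        if len(self.served) < self.tone_units:
            # One of the first "n" phones: it gets dial tone straight away.
            self.served.append(phone)
        elif len(self.waiting) < self.queue_capacity:
            # No free unit: wait in line, first in, first out.
            self.waiting.append(phone)
        else:
            # The bug in the story: an overflowing list takes the switch down.
            raise SwitchCrash(f"waiting list overflow at phone {phone}")


def boot_and_scan(off_hook_phones, tone_units, queue_capacity):
    """Boot a fresh switch and rediscover every phone that is off hook."""
    switch = Switch(tone_units, queue_capacity)
    for phone in off_hook_phones:
        switch.off_hook(phone)
    return switch


def count_crashes(off_hook_phones, tone_units, queue_capacity, max_boots):
    """Reboot up to max_boots times; return how many boots ended in a crash."""
    crashes = 0
    for _ in range(max_boots):
        try:
            boot_and_scan(off_hook_phones, tone_units, queue_capacity)
            return crashes  # this boot survived, so the loop is broken
        except SwitchCrash:
            crashes += 1    # same phones still off hook, so try again
    return crashes
```

With 10 dial-tone units and room for 100 waiting phones, `count_crashes(range(1000), 10, 100, 5)` crashes on all five boots, because nothing about the input changes between reboots. Drop the off-hook count to 110 or fewer and the very first boot survives. That is the point of the anecdote: rebooting alone cannot fix a failure whose trigger is still present when the system comes back up.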
And finally... Sarcastic Wednesday
Richi Jennings is an independent analyst/adviser/consultant, specializing in blogging, email, and spam. A 20 year, cross-functional IT veteran, he is also an analyst at Ferris Research. You too can pretend to be Richi's friend on Facebook, or just use boring old email: