RIM CEO would make a good Amity mayor
As we all are painfully aware, users of the ubiquitous BlackBerry device felt great pain - approaching heroin withdrawal, in some cases - during a three-plus hour outage on Monday, February 11th. Research in Motion (hereinafter called RIM), the creator of the BlackBerry, had a little "My bad" on a system upgrade. Quoting from an Associated Press (AP) story on the outage:
It was the second major outage for the service in less than a year. In April, a minor software upgrade crashed the system for all users. A smaller disruption in September also was caused by a software glitch.
So what was the cause of this most recent outage? From all appearances, it was bad Change and Release Management. An ITIL faux pas, as in "What is ITIL?" By RIM's own admission, an upgrade designed to increase capacity was to blame. The full story can be found at: http://www.msnbc.msn.com/id/23134432/
I will ask what every CIO was thinking when s/he read this story: Who in the wild wild world of sports does capacity upgrades at 3:30 in the afternoon on a Monday? What fool made the decision to upgrade in the middle of a North American workday? Does RIM even know the words Change Management? Or Release Management? And why does RIM not have more redundancy? Again, from the AP story:
Experts said RIM's system is relatively reliable, but its centralized structure means that when there are problems, they can affect millions of users.
E-mail sent to and from BlackBerry phones in North America all goes through a Network Operations Center. It appears the problem occurred there, when one of two Internet addresses that relay e-mail from corporate servers stopped responding, according to Zenprise, a Fremont, Calif., company that helps companies troubleshoot BlackBerry problems.
"Any time you got a system that's got a NOC, a Network Operations Center, you have the potential for a single point of failure," said Jack Gold, with technology analyst firm J.Gold Associates in Northborough, Mass.
"What's a bit surprising to me is that with all the work they've been doing over time ... that they haven't been able to have enough redundancy in the NOC so that there isn't a single point of failure," said Gold, who has done business with RIM.
Whoever was responsible for the outage, apparently s/he will not be too severely punished. According to a quote from CNet News, the co-CEO of RIM is displaying a lack of concern about the outage so outrageous that he is one of my nominees for the Annual Mayor Larry Vaughn Award. Mayor Larry Vaughn, as you may recall, was the character played by the late actor Murray Hamilton in Jaws. As Amity mayor, he steadfastly denied that a shark was eating his constituents, until the weight of evidence was too overwhelming for even him to deny (a beach covered in bloodied swimmers will do that to you).
Anyway, under the headline "RIM's co-CEO downplays BlackBerry outage," RIM's co-leader Jim Balsillie was quoted as saying:
"It was an intermittent delay, a couple of hours," he said. "It's old news. It happened days ago."
That's it? No apology? No contrition? No acknowledgement that your infrastructure is problematic, not to mention your decision-making when it comes to change and release management?
Mr. Balsillie might want to ring up (or email) Canadian Member of Parliament Garth Turner, who was quoted in a CBC News article describing the impact of the outage with a little more candor and honesty. MP Turner said:
"Everyone's in crisis because they're all picking away at their BlackBerrys and nothing's happening," Turner said. "It's almost like cutting the phone cables or a total collapse in telegraph lines a century ago. It just isolates people in a way that's quite phenomenal."
To read more about the impact of the RIM failure on Canadians, read this story:
http://www.cbc.ca/world/story/2008/02/11/blackberry-outage.html
Three hours might not sound like a long time, and if Brittney or Hannah were not able to text their buddies in high school for an afternoon, we can live (rejoice?) with that. But the customer base for BlackBerrys has expanded to include law enforcement, homeland defense, first responders and major decision-makers at all levels of government. These people cannot afford to be out of touch, because much of their correspondence takes place via email. These people are also out of their offices much of the day, and cannot afford to be tethered to a desk or even a laptop. Duh, that's why they carry BlackBerrys in the first place!
So for a North American outage to take place in the middle of the workday, not because of a failure of equipment but rather because of a failure to make a good decision - and for that to be the second major failure in a year - well, someone needs to wake up at RIM. Here's some free advice: Do your upgrades in the wee hours, like the rest of us have to do. And based on your rate of growth, you can afford to pay the overtime.
Finally, if you make a mistake, say you made a mistake. It plays better with your customers.

