Thursday, October 13, 2011

The Weakest Link Is A High Stakes Game Show

A Chief Information Officer should have responsibilities at three levels. You enable and support the business operationally. As a true partner, you work with the business day to day. But as a C-level executive, you must also help drive the business through innovation, new offerings and go-to-market strategies.

Of course, you can't focus on tactical or strategic issues unless you have a sound, smooth and reliable operation in place. The recent outages with BlackBerry service reminded me of how important it is to have a properly designed physical infrastructure. You can't build a skyscraper on sand.

When high reliability is required, as in the case of critical systems, extra steps must be taken to ensure they are always available. You can use this simple principle as your guide: any system will only be as reliable as the weakest component in the path from the keyboard to the application. In other words, you must consider the probability of failure of each and every element in both the physical and the logical path that data takes from end to end.
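As a rough illustration of the weakest-link principle, here is a minimal Python sketch. It assumes component failures are independent and that every component in the path must work for the user to reach the application. The component names and availability figures are made up for illustration, not measurements from any real shop.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a path in which every component must be working.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical path from the keyboard to the application.
# Each value is the assumed fraction of time the component is up.
path = {
    "desktop": 0.999,
    "switch": 0.9999,
    "router": 0.9999,
    "firewall": 0.9995,
    "internet_link": 0.99,
    "app_server": 0.999,
}

end_to_end = series_availability(path.values())
weakest = min(path, key=path.get)

print(f"End-to-end availability: {end_to_end:.4f}")
print(f"Weakest component: {weakest} ({path[weakest]})")
```

Under these assumptions the end-to-end figure can never be better than the weakest component, and it is usually somewhat worse, because every other element in the path shaves off a little more.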

A quick examination of your infrastructure should reveal single points of failure. These must be addressed first. Having two switches, two routers, two firewalls and one high-speed link to the internet only minimizes the probability of failure inside your shop. If that one line goes down, all the other redundancy is worthless. This is obvious, but you would be surprised how often the single point of failure is overlooked or tolerated for budget reasons.
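The arithmetic behind the single-link problem can be sketched in a few lines of Python. This assumes independent failures and instant, perfect fail-over between redundant units, which real equipment rarely delivers. The availability figures are hypothetical.

```python
def parallel_availability(availability, copies):
    """Availability of a stage with redundant copies; it is up if any copy is up."""
    return 1 - (1 - availability) ** copies


def path_availability(stages):
    """Multiply stage availabilities; each stage is (availability, copies)."""
    result = 1.0
    for availability, copies in stages:
        result *= parallel_availability(availability, copies)
    return result


# Hypothetical figures: switches, routers, firewalls, then the internet link.
one_link = [(0.999, 2), (0.999, 2), (0.999, 2), (0.99, 1)]
two_links = [(0.999, 2), (0.999, 2), (0.999, 2), (0.99, 2)]

print(f"Redundant boxes, single link: {path_availability(one_link):.6f}")
print(f"Redundant boxes, dual links:  {path_availability(two_links):.6f}")
```

With a single link, the whole path is capped at the availability of that link no matter how much is spent on duplicate boxes behind it.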

Once the single points of failure have been mitigated or at least noted, the balance of the infrastructure has to be tested component by component. Ideally, you should be able to pull the power or network cable from one box at a time and have no one notice. This is true real-time fail-over.
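Before anyone pulls a real cable, the same exercise can be run on paper. The Python sketch below models a made-up topology as a simple map of which box forwards traffic to which, removes one box at a time, and reports any box whose loss cuts the user off from the application. The topology and names are assumptions for illustration, and a model like this only checks the design, not whether fail-over actually works in practice.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed box."""
    if removed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Return every intermediate box whose removal breaks the path."""
    boxes = set(links) | {n for targets in links.values() for n in targets}
    candidates = boxes - {start, goal}
    return sorted(
        box for box in candidates
        if not reachable(links, start, goal, removed=box)
    )


# Hypothetical topology: everything is doubled except the internet link.
links = {
    "user": ["switch_a", "switch_b"],
    "switch_a": ["router_a", "router_b"],
    "switch_b": ["router_a", "router_b"],
    "router_a": ["firewall_a", "firewall_b"],
    "router_b": ["firewall_a", "firewall_b"],
    "firewall_a": ["isp_link"],
    "firewall_b": ["isp_link"],
    "isp_link": ["app"],
}

print(single_points_of_failure(links, "user", "app"))
```

Anything this check flags deserves attention first. Anything it does not flag still has to be proven with a real pull test.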

Not many companies can truly justify or afford total redundancy. However, today it is possible to purchase backup services and spare capacity on demand through cloud service providers. Where high reliability is desired, multiple components with complex fail-over logic will have to be put in place and regularly tested.

Testing is crucial and, ironically, another often overlooked or deferred task. While RIM had a backup unit in place, the fail-over did not happen, and ultimately this caused the outage. I bet someone there is reviewing their design, test procedures and logs today.

You should take a moment to ask when someone last pulled a power cord or network cable in your shop (on purpose), and what the result was. Are you prepared, or will that single point of failure or logical design flaw suddenly decide to make itself known at a potentially embarrassing or even career-limiting moment?

Captain Joe

Follow me on Twitter @JPuglisiLLC