When high reliability is required, as with critical systems, extra steps must be taken to keep those systems available. You can use this simple principle as your guide: any system can be no more reliable than the weakest component in the path from the keyboard to the application. In other words, you must consider the probability of failure of each and every element in both the physical and logical paths that data takes from end to end.
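To put rough numbers on the weakest-component principle, here is a minimal sketch. It assumes component failures are independent and that every component in the path must be working for the application to be reachable. The component names and availability figures are invented for illustration, not measurements.

```python
from math import prod

def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)

# Hypothetical path from keyboard to application (illustrative figures only).
path = {
    "workstation": 0.999,
    "switch": 0.9995,
    "router": 0.9995,
    "internet_link": 0.99,
    "app_server": 0.999,
}

end_to_end = series_availability(path.values())
weakest = min(path, key=path.get)  # component with the lowest availability

print(f"end-to-end availability: {end_to_end:.4f}")
print(f"weakest component: {weakest} ({path[weakest]})")
```

Under these assumptions the end-to-end figure can never exceed the weakest component's figure, and every additional element in the chain pulls it a little lower.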
A quick examination of your infrastructure should reveal single points of failure. These must be addressed first. Having two switches, two routers, two firewalls and one high-speed link to the internet only minimizes the probability of failure inside your shop. If that one line goes down, all the other redundancy is worthless. This is obvious, but you would be surprised how often the single point of failure is overlooked or tolerated for budget reasons.
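The two-of-everything-but-one-link scenario can be sketched the same way. The example below assumes identical units at each stage, independent failures, and a stage that stays up as long as any one of its units is up. The inventory and its availability figures are hypothetical.

```python
from math import prod

def stage_availability(unit_availability, count):
    """Availability of `count` identical units in parallel (any one suffices)."""
    return 1 - (1 - unit_availability) ** count

def end_to_end(stages):
    """All stages are in series, so multiply their availabilities."""
    return prod(stage_availability(a, n) for a, n in stages.values())

# Hypothetical inventory: stage -> (per-unit availability, number of units)
stages = {
    "switch": (0.999, 2),
    "router": (0.999, 2),
    "firewall": (0.999, 2),
    "internet_link": (0.995, 1),
}

# Any stage with a single unit is a single point of failure.
single_points = [name for name, (_, n) in stages.items() if n == 1]

current = end_to_end(stages)
with_second_link = end_to_end({**stages, "internet_link": (0.995, 2)})

print("single points of failure:", single_points)
print(f"as built: {current:.6f}  with a second link: {with_second_link:.6f}")
```

With these made-up figures the duplicated switches, routers and firewalls barely matter: the result is capped by the lone internet link until a second one is added.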
Once the single points of failure have been mitigated, or at least noted, the balance of the infrastructure has to be tested component by component. Ideally, you should be able to virtually pull the power or network cable from one box at a time and have no one notice. This is true real-time failover.
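Before touching real cables, the pull-one-box-at-a-time test can be rehearsed on paper. The following sketch models a hypothetical topology as a graph, takes one component down at a time, and checks whether the client can still reach the application. The node names and wiring are assumptions made for the example, and a model like this does not replace a live drill.

```python
from collections import deque

# Hypothetical topology: each key lists the nodes it forwards traffic to.
links = {
    "client": ["switch_a", "switch_b"],
    "switch_a": ["router_a", "router_b"],
    "switch_b": ["router_a", "router_b"],
    "router_a": ["firewall_a", "firewall_b"],
    "router_b": ["firewall_a", "firewall_b"],
    "firewall_a": ["isp_link"],
    "firewall_b": ["isp_link"],
    "isp_link": ["app"],
    "app": [],
}

def reachable(links, start, goal, down=frozenset()):
    """Breadth-first search from start to goal, skipping nodes in `down`."""
    if start in down or goal in down:
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links[node]:
            if nxt not in seen and nxt not in down:
                seen.add(nxt)
                queue.append(nxt)
    return False

def failure_drill(links, start, goal):
    """Return the components whose individual failure breaks the path."""
    candidates = [n for n in links if n not in (start, goal)]
    return [n for n in candidates if not reachable(links, start, goal, down={n})]

noticed = failure_drill(links, "client", "app")
print("failures users would notice:", noticed)
```

Any component the drill returns is one whose loss users would notice, which makes it a candidate for the next round of redundancy or at least for the documented-risk list.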
Not many companies can truly justify or afford total redundancy. However, it is now possible to purchase backup services and spare capacity on demand through cloud service providers. Where high reliability is desired, multiple components with complex failover logic will have to be put in place and regularly tested.
You should take a moment to ask when someone last pulled a power cord or network cable in your shop (on purpose) and what the result was. Are you prepared, or will that single point of failure or logical design flaw suddenly decide to make itself known at a potentially embarrassing or even career-limiting moment?
Follow me on Twitter @JPuglisiLLC