Maximum uptime is a philosophy. It begins with the planning of your facility and remains a continuous process through every step in its design, construction, commissioning, operations, failure analysis and recommissioning.
Data centre operators can follow the example of other industries that operate under mission-critical conditions, such as airlines. Whenever there’s some sort of air traffic incident, the National Transportation Safety Board (NTSB) investigates and eventually issues a “lessons learned” document. The idea is to prevent a repeat of the incident in the future. Ensuring maximum uptime in mission-critical data centres requires companies to take a similar approach.
According to the NTSB, accidents occur due to one or more types of failure: design failures, catastrophic failures, compounding failures and human-error failures. Downtime at mission-critical facilities can also be attributed to these four types of failure, and each requires a different approach to prevention and to “lessons learned” programs.
1. Design failures
Design failures can be eliminated through proper planning and by engaging with competent vendors. Begin with the end in mind, and come up with a design intent document that clearly spells out your requirements – in detail. Whether you are building a new mission-critical facility or upgrading or operating an existing one, it is important to carefully plan the work and to work the plan. It’s also crucial to have a good design firm, integration firm, construction company and commissioning team, along with a well-trained operations staff, to reduce failures.
2. Catastrophic failures
A comprehensive maintenance and operations program can identify and eliminate many potential problems, helping you avoid catastrophic failures. Your program should include well-defined maintenance windows, with appropriate redundancy built in so services are not interrupted while maintenance is performed. Predictive maintenance is another important consideration. It entails conducting a thorough failure analysis after each incident and using the results to predict and prevent future problems – just as the NTSB does with its “lessons learned” approach. It’s also important to have a comprehensive training program for the operations and maintenance staff, starting with training from equipment manufacturers or installers and continuing with regular sessions to keep staff current.
3. Compounding failures
At times, multiple events occur to create a failure, a situation known as a compounding failure. Lack of attention to detail is a leading cause of compounding failures. Consider what happens should your data centre suffer a power outage. Your generator should receive a start signal and fire up immediately. But if you’ve neglected to check the generator battery, fuel and coolant levels for months on end, it may let you down. Similarly, little nuisance items in a large facility are sometimes left unnoticed and by themselves cause no ill effect to the facility, but along with other problems, can combine to create a system failure.
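The generator example can be turned into a routine readiness check. The following is only a minimal sketch in Python: the `GeneratorReading` fields and every threshold value are hypothetical placeholders, not figures from any manufacturer, and a real facility would take its limits from the equipment documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneratorReading:
    """One inspection reading (hypothetical fields, for illustration only)."""
    battery_volts: float
    fuel_pct: float
    coolant_pct: float


# Assumed thresholds, chosen only to make the example concrete.
MIN_BATTERY_VOLTS = 24.0
MIN_FUEL_PCT = 75.0
MIN_COOLANT_PCT = 90.0


def readiness_issues(reading: GeneratorReading) -> List[str]:
    """Return every item that is below its threshold, not just the first."""
    issues = []
    if reading.battery_volts < MIN_BATTERY_VOLTS:
        issues.append("battery voltage low")
    if reading.fuel_pct < MIN_FUEL_PCT:
        issues.append("fuel level low")
    if reading.coolant_pct < MIN_COOLANT_PCT:
        issues.append("coolant level low")
    return issues


if __name__ == "__main__":
    reading = GeneratorReading(battery_volts=22.0, fuel_pct=80.0, coolant_pct=50.0)
    for issue in readiness_issues(reading):
        print("Needs attention:", issue)
```

Reporting all outstanding items together, rather than stopping at the first, is deliberate: it is the combination of small neglected items that tends to produce a compounding failure.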
4. Human-error failures
Human error is a leading cause of failures in mission-critical facilities. As noted above, training can help reduce the incidence of human error, but another requirement is detailed methods of procedure (MOPs). MOPs define in detail how to perform various maintenance functions, ensuring they are consistently performed in the same way. Too often, in the rush to bring the facility online, organisations fail to develop, document and deploy MOPs. These procedures should be developed early and tested before the facility is fully operational. Waiting to develop a procedure to transfer the UPS system to maintenance bypass could prove much more costly than investing the time upfront to prepare for the inevitable. MOPs should also be executed with a pilot/co-pilot approach, to ensure the procedure is followed to a T.
To learn more best practices, check out Schneider Electric white paper 7, Maximizing Uptime in Mission-Critical Facilities.
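The pilot/co-pilot idea can be illustrated with a small sketch in Python. Everything here is an assumption made for illustration: the function name `run_mop`, the `copilot_confirms` callback and the step wording are hypothetical and not taken from any real MOP. The point is only that each step is read out, confirmed by a second person, and the procedure halts at the first step that is not confirmed.

```python
from typing import Callable, List


def run_mop(steps: List[str], copilot_confirms: Callable[[str], bool]) -> List[str]:
    """Execute MOP steps in order, stopping at the first unconfirmed step.

    The pilot reads each step; the co-pilot must confirm it before it
    counts as done. Returns the steps completed so far.
    """
    completed = []
    for step in steps:
        if not copilot_confirms(step):
            # Halt rather than skip: later steps may depend on this one.
            break
        completed.append(step)
    return completed


if __name__ == "__main__":
    # Hypothetical step names for illustration only.
    steps = [
        "Verify bypass source is available",
        "Transfer UPS to maintenance bypass",
        "Confirm load is supplied via bypass",
    ]
    done = run_mop(steps, lambda step: input(f"Co-pilot, confirm '{step}' (y/n): ") == "y")
    print(f"Completed {len(done)} of {len(steps)} steps")
```

Stopping instead of skipping keeps the procedure in a known state, which also makes the record of completed steps useful for any later failure analysis.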