I just read an interesting article offering a nice contrarian perspective on airline safety, domestic and abroad. It compares the cost of a life saved through aviation safety with the cost of a life saved by spending in other areas. Keep in mind, I’m writing this at the airport, so I am painfully aware of the implications of the points I am making. While the idea of making something less safe may seem ludicrous at first, I realized it might help me make a point that I have struggled to articulate. The article points out how much more effective spending on vaccines, seat belts in cars, and other areas of our lives is than the money spent on aviation safety. Because of very effective FAA regulations, poor countries spend a lot of money on aviation safety while school children go unvaccinated and traffic deaths occur at significantly higher rates.
So what point, relating to data centers, could I be getting at? Are we spending money in the most effective manner to realize high availability? Much like the net loss of life in poor countries would be lower if they allocated resources where they have the most impact, we can reduce our loss of compute by investing where the impact is highest.
We realize the trade-offs between infrastructure and reliability, we can see the diminishing returns on capital as we move up from Tier 2 to Tier 3 to Tier 4, and we understand that buying the criticality your business needs is smart. But do we consider whether we are spending that money in the best manner possible? We could spend $1,000,000 adding a generator to our configuration (not an actual figure), or we could instead spend that money on well-trained operators. Which spend will more effectively enhance uptime? While I haven’t seen any statistical studies on the matter (and I think they would be unbelievably interesting and useful), I know from my background in the ultimate mission-critical environment, nuclear submarine operation, that operator training goes much further.
Human error causes a lot of outages, and I am willing to bet that if root causes were determined more rigorously, human error would turn out to be even more predominant. Even the most robust, redundant system can still drop a server or several servers if the operators are untrained. And as systems become more robust and sophisticated, the potential for human error increases too.
Here is another question: How much more reliable is a Tier 4 type data center than a Tier 3? Many Tier 3 data centers have operated for over a decade without dropping a single plug. Furthermore, if you analyzed the drops across a large portfolio, how many would have been prevented if the infrastructure had been built to a higher level? In fact, many data centers that would typically earn a Tier 2 rating have operated without dropping a plug for decades. They aren’t deferring or skipping maintenance; a trained staff finds ways to keep the lights on throughout. This may mean planning maintenance for slow times, or for cool days when the HVAC system has additional capacity, or even being flexible on space temperature and humidity requirements, but the lights stay on and the servers keep humming, year after year.
A well-trained operator who understands his system's impacts and responses, and who follows a well-rehearsed method of procedure that has been peer reviewed for accuracy, can do remarkable things for operations.
Being a Navy Nuclear Program graduate, I have an intimate understanding of how important these facets are: if they are not employed, people could die or the ship might not be able to perform its mission when needed. On a submarine, there is no room for system after system of redundancy. Instead, the US Navy relies on its highly skilled and trained operators.
Only two submarines in the US Navy have been lost since the advent of nuclear propulsion. One loss is believed to be related to a torpedo which activated itself. The other resulted from a procedure which, when the reactor had an emergency shutdown, removed the ability to provide propulsion, coupled with flooding and a design flaw in the emergency surfacing system. That lesson came early in the program and, as we say, is written in blood. Nobody with fish on his chest (the submarine warfare insignia) takes that subject lightly.
Anyway, I think the question expands beyond just where to put our safety capital. By not spending money on training, whether we realize it or not, we are placing a low value on reliability. The extra generators, redundant cooling systems, and extra UPS systems all become window dressing. Giving the appearance of high availability, but without the staff and knowledge to back it up, is not a good philosophy. The level of intensity demonstrated in the staffing and in the equipment should be similar. You don’t buy a plane with wings that last 10 years and a fuselage that lasts 50. Likewise, the weakest link in your chain will create the failure. If your staff creates 53 minutes of outage each year and your equipment creates 3 seconds, you still end up at roughly 99.99% availability instead of the 99.99999% your equipment was predicted to provide. But if you built a less resilient system designed around 5 minutes of annual downtime, with a staff trained to cause only 5 minutes of downtime a year, you would have a net availability of about 99.998%, much better than 99.99% and with less capital deployed (relative to generators, UPS systems, and switchgear, training is cheap).
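The arithmetic behind those percentages is easy to check. Here is a minimal Python sketch, assuming a 365-day year and assuming that staff-caused and equipment-caused downtime never overlap, so the two simply add. The function and variable names are mine, purely for illustration.

```python
# Minutes in a 365-day year (assumption: leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_pct(downtime_minutes_per_year: float) -> float:
    """Convert annual downtime in minutes into percent availability."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


# Scenario A: very robust equipment (3 seconds/year), untrained staff (53 minutes/year).
# Assumption: the two sources of downtime do not overlap, so they are summed.
equipment_only = availability_pct(3.0 / 60.0)
scenario_a = availability_pct(53.0 + 3.0 / 60.0)

# Scenario B: less resilient equipment (5 minutes/year), trained staff (5 minutes/year).
scenario_b = availability_pct(5.0 + 5.0)

print(f"Equipment alone: {equipment_only:.5f}%")
print(f"Scenario A:      {scenario_a:.3f}%")
print(f"Scenario B:      {scenario_b:.3f}%")
```

Under these assumptions, Scenario A comes out near 99.990% and Scenario B near 99.998%: the larger source of downtime dominates the total, no matter how good the other one looks on paper.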
Expanding beyond that, when you optimize spend to optimize uptime, it is easier to determine the trade-offs and right-size your budget. Extra capital can be diverted to areas which improve the business case, such as additional energy efficiency and capacity, and project costs can be reduced.
Also, remember the lessons of engineers past: designing something to be unbreakable often results in disappointment when all the money in the world fails to prevent the fifty-foot tsunami or a captain who fails to heed icebergs or rock outcroppings. Nearly 100 years separate the Titanic and the Costa Concordia. A piece of foam brought down the Columbia, part of a multi-billion (that’s billion with a B) program developed by the greatest engineers on the planet. Again, this is a lesson written in blood.
A trained operator who is intimately familiar with his systems, however, can often make the equipment handle events engineers never thought possible. Submarines have returned from deeper than crush depth, planes have landed on the Hudson, and SR-71 pilots have recovered from flat spins at several times the speed of sound, all because of training. No automated provision saved them; it was all operator.
Don’t focus on slashing costs. Your budget is likely to be dictated to you anyway. Rather, focus on how to maximize capacity, minimize operating costs, and cost-effectively improve your site's reliability, holistically.
One last thing: a trained operator will operate the site the way it was meant to run, and as efficiently as possible, while also finding opportunities to improve operation and utilization. Untrained operators will frequently override features, lock in set points, and in general wreak havoc on the efficient and safe operation of your site. Train, train, train!
*My use of tier terms above is not meant to imply Uptime certification. Rather, I use Tier 3 to mean concurrently maintainable, Tier 4 to mean concurrently maintainable and fault tolerant, and Tier 2 to mean requiring outages for certain maintenance activities.*