Unfortunately, fires sometimes happen. Right now I’m thinking about real fires: flames, danger, damage, injury, the whole deal. Fires aren’t something you think about very deeply while they’re happening: your goal is getting them out, while minimizing danger and injury. Actually, let’s take injury out of the analogy for a moment – pretend the building is empty of life and just burning on its own. Still bad, but perhaps easier to discuss clinically.
After the fire is out, the fire department almost always conducts an investigation. They may be looking for evidence of crime, of course, but they’re also gathering data. Insurance companies often investigate, too, looking to see if anything untoward or uncovered contributed to the fire – but again, also gathering data.
In many cases, the eventual outcome of all that data-gathering is substantive change to help minimize damage and prevent fires in the future. For example, back in the day, the designer of the ocean liner United States was really concerned about fires at sea. His generation had witnessed the Titanic and numerous other at-sea disasters, and he was determined that his country’s new commercial flagship would not suffer the same fate. He paneled the walls of his new ship with an equally new miracle substance that simply wouldn’t burn. Nobody knew the dangers of asbestos at the time, of course, and the guy’s heart was in the right place. There’s an excellent book about the ship, by the way, if you’re interested, and its hulk is, I believe, still in Philadelphia.
Closer to dry land, you’ve got things like fire escapes, sprinkler systems, construction codes, and other factors that have all come about because of the gradual accumulation of data on fire damage. One client of mine, FM Global, is an insurance company. They spend tons of money helping their customers figure out better ways to not need insurance (thereby saving the insurance company money, of course, but also preventing substantial injury and damage, so it’s win-win). Again, it’s that gradual accumulation of data.
What everyone basically does is look at the cost of the damage, the potential cost of mitigating it, and decide if the one outweighs the other. Is it cheaper to include a sprinkler system in every new home (as is required in Henderson, NV, near where I live), even though many of them will never actually experience a fire? The accumulated data often suggests that yes, it’s cheaper to include the system in new construction when it adds only a few thousand dollars to the cost of the structure, and when it can easily prevent tens or hundreds of thousands of dollars in damages.
Back to the IT industry. Or really, any industry.
We all “fight fires” from time to time. Something breaks, and we scurry to fix it. What we don’t often do is have a fire inspector or insurance adjustor come and look at what happened, figure out what the cost was, and gradually accumulate that data until a solution – and its cost – becomes apparent. In other words, we tend to stick with fighting fires rather than preventing them. That’s often because we’re never bothering to figure out the cost of the fire.
Companies, in my experience, absolutely suck at costing things like productivity. Fixing the e-mail server when it’s down is important because the CEO isn’t getting his mail, and because he’s screaming about it, not because the company is losing money in productivity. With no cost to assign to the fire, it’s difficult to decide whether a solution would save money. Is it worth the money to implement a mail server cluster, so that the entire system is more available? Tough to say, right? Clusters are expensive, and if you don’t know the cost of the fire, it’s hard to tell if the cost of the sprinkler system is worth it.
I regard myself, at heart, as an engineer. Maybe not the kind that can call himself that in, say, Canada, but for all of my career I’ve designed and built solutions to problems. I’ve designed them to be resilient, reliable, and sound, much like you’d want someone to engineer a structure. Like many folks in my industry, I’m a “see the problem, fix the problem” kind of guy. But when I fix a problem, I want to fix the problem. The problem isn’t that “the email server is down,” the problem is that “the email server can go down.” So I’ve become fairly adept at costing failure.
Obviously, there’s no guarantee of accuracy, but you’re just looking to put some scope around the problem. Solutions cost money, and so you need to try and state the problem in money terms, so that there’s an apples-to-apples decision. How many people were affected by the fire? How much productivity did they lose? Was it a bunch of salespeople who need email to close deals? Then maybe they were 75% unproductive. How much do we pay them? How much of their salary was just wasted? It’s easy math. Was the outage short enough that no deals were lost, and we’re just talking lost salary? Okay, fine. With a ballpark figure, you can at least start having an intelligent, objective conversation about engineering a preventative measure.
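That ballpark math is simple enough to sketch in a few lines of code. Everything below is hypothetical, a placeholder formula with made-up numbers, not a real costing model; plug in your own headcount, loaded salary rates, and outage durations:

```python
# Back-of-the-envelope outage cost: wasted salary for the people affected.
# All figures here are hypothetical placeholders.

def outage_cost(people_affected, loaded_hourly_rate, outage_hours,
                productivity_loss=0.75):
    """Estimate lost-productivity cost of an outage, in dollars."""
    return (people_affected * loaded_hourly_rate
            * outage_hours * productivity_loss)

# Example: 40 salespeople at a loaded rate of ~$60/hour, email down
# for 3 hours, and they're 75% unproductive without it.
cost = outage_cost(40, 60, 3, 0.75)
print(f"Ballpark outage cost: ${cost:,.0f}")  # Ballpark outage cost: $5,400
```

The point isn’t precision; it’s that a number like that, even a rough one, gives you something to weigh a solution’s price tag against.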
I find that one reason I’ve always been valued in the organizations where I’ve worked is that I always seek a way to turn things into rational, objective conversations. It isn’t about the CEO being mad. That’s something the CFO has trouble quantifying, although he or she can often empathize. It’s about the financial impact to the organization. Getting problems framed in an objective, impersonal, business-first way is how you get the conversation moving. You take it out of the squishy world of personality and into the more cut-and-dried world of numbers.
Of course, it means you need to know what your solutions cost, too. A mail cluster isn’t just a collection of hardware and software; it’s also management overhead. How many man-hours does it take to maintain that thing annually? Are you automating most routine maintenance so that the answer is near-zero (if not, why not)? What hardware depreciation will occur?
Let’s take that last one as an example, because it’s an interesting thing that a lot of folks overlook.
All hardware, in most organizations, needs to be depreciated over five years. Even if companies take a first-year write-off, they still have to track the asset’s value for five years. So let’s say you spend $50,000 on a new virtualization host. You’ve got to extract $10k in value from that each year for five years. If that host isn’t fully utilized, then you’re not realizing the full return on your investment. Let’s say you want to add a small mail server cluster node to that host, to provide yourself with some redundancy. You expect a mail cluster to help avert $500k in lost productivity per year, and that node will be half of the solution which does so. That means that node is worth $250k in value to the organization – easily helping to offset the depreciation of that virtualization host, along with its annual maintenance and possibly its licensing costs.
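The arithmetic in that example works out like this (the dollar figures are the ones from the scenario above, not real data):

```python
# Worked version of the depreciation example. Straight-line depreciation
# over five years, versus the annual value the cluster node provides.

host_cost = 50_000
depreciation_years = 5
annual_depreciation = host_cost / depreciation_years  # $10k/year to recover

averted_loss_per_year = 500_000  # productivity the mail cluster protects
node_share = 0.5                 # this host runs one of the two cluster nodes
node_value_per_year = averted_loss_per_year * node_share

print(f"Annual depreciation to cover: ${annual_depreciation:,.0f}")
print(f"Annual value of the cluster node: ${node_value_per_year:,.0f}")
# The node's $250k/year in averted losses easily offsets the host's
# $10k/year depreciation, with room left for maintenance and licensing.
```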
It can seem a little irritating to have to break things down like that, but once you get into the habit, you’ll just know a lot of the basic numbers. An OS costs us x per year to maintain; a single VM represents x power and cooling, each IP address we configure costs us x to provision and maintain, etc. Knowing the cost of solutions makes it a lot easier to sell them against the problems.
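One way to build that habit is to keep those basic numbers in a small table and price a proposed solution by summing what it consumes. This is a minimal sketch of the idea, and every unit cost below is a made-up placeholder, not a figure from any real environment:

```python
# A minimal sketch of the "know your unit costs" habit: a table of annual
# unit costs (all numbers hypothetical), and a function that prices a
# solution by summing the units it consumes.

ANNUAL_UNIT_COSTS = {
    "os_instance": 1_200,  # patching, monitoring, licensing per OS
    "vm": 400,             # power, cooling, host share per VM
    "ip_address": 50,      # provisioning and upkeep per address
}

def annual_solution_cost(bill_of_materials):
    """bill_of_materials maps unit name -> quantity; returns total $/year."""
    return sum(ANNUAL_UNIT_COSTS[unit] * qty
               for unit, qty in bill_of_materials.items())

# Example: a two-node mail cluster -- 2 OS instances, 2 VMs, 3 IP
# addresses (one of them floating between the nodes).
total = annual_solution_cost({"os_instance": 2, "vm": 2, "ip_address": 3})
print(f"Annual running cost: ${total:,.0f}")  # Annual running cost: $3,350
```

Once a table like that exists, comparing the cost of a solution to the cost of the fire it prevents becomes a one-line calculation instead of a research project.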
Consider becoming, if your organization doesn’t already have one, the fire inspector and insurance adjustor for your team.