The last few months at my new job I’ve been squishing small and medium bugs to get systems up to par: service packs, patching, firmware updates, software upgrades, or just getting things organized to make life easier for everyone involved.
Our server room has been one of those infested areas… I’ve been squashing the easy bugs, but the room is frankly a disaster waiting to happen. It’s not large by data center standards; it’s just a small classroom with three ceiling-mounted cooling units and seven racks of equipment: three two-post network racks and four cabinets for servers. Problem is, almost nothing is labeled. Power cables, random colors of Ethernet, and day-glo orange fiber are intertwined in a quilt of chaos behind the cabinets. The cable ladder above the racks is about 12 inches too far away and has a large power bus bar below it. But that’s not difficult to fix. Yes, it’s a time-consuming job, but working for a college has advantages.
The biggest problem: I don’t know how much power I have to work with. Thirty circuits, and I haven’t a clue what goes where or how much I’m using.
Today was the big day I was waiting for. An electrician arrived and performed a detailed analysis and audit of our power usage. He started at the UPS inputs, worked through the distribution panel, and finally labeled and measured the outlets in the server room. This is where my worst fears were realized…
We were this close to a massive cascading power failure. Three circuits were identified as running over 75% utilization; one is at 96%…
Bad news: Nine servers are connected to this circuit.
Worse news: Three servers are totally reliant on it; both of their power supplies are plugged into this one circuit.
Even worse news: Two of those servers are part of a three-node ESX cluster with twenty-two virtual machines hosted on them.
Worst news of all: If that circuit trips, it’ll force the other six servers to pull their full draw from another circuit that’s almost as loaded, which will most likely push it over the top and trip that second breaker too.
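To put rough numbers on that domino effect, here’s a back-of-the-envelope sketch. The per-server draws and the 80% figure for the second circuit are assumptions I made up for illustration; only the 20A breaker size and the 96% reading come from the audit.

```python
# Back-of-the-envelope cascade check. The per-server amp figures are made up
# for illustration; only the 20A breaker size reflects our actual circuits.

BREAKER_AMPS = 20.0

circuit_a_amps = 0.96 * BREAKER_AMPS   # the 96%-loaded breaker (~19.2A)
circuit_b_amps = 0.80 * BREAKER_AMPS   # the "almost as loaded" one, assumed ~80%

# Assume each of the six dual-supply servers normally splits its draw across
# both cords, pulling roughly 1.5A per cord.
dual_supply_servers = 6
amps_per_cord = 1.5

# If circuit A trips, each of those servers shifts its full draw onto its
# second cord, so circuit B absorbs the share circuit A was carrying for them.
shifted = dual_supply_servers * amps_per_cord
circuit_b_after = circuit_b_amps + shifted

print(f"Circuit B after failover: {circuit_b_after:.1f}A "
      f"({circuit_b_after / BREAKER_AMPS:.0%} of the breaker rating)")
# 16.0A + 9.0A = 25.0A, or 125% of a 20A breaker -- so it trips too.
```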
And, to top it all off: our UPS load is really unbalanced, but not in a way we can fix with medication. You see, this room is fed with three feeds of electricity called “phases” or “legs”. Equipment like large appliances or electric motors runs more efficiently on more than one phase. In this case, the UPS (our battery backup device for the servers) pulls electricity equally from all three phases, conditions it, charges its batteries, and then feeds it to a breaker box. In that breaker box are thirty 20A circuits, each connected to one of the three phases. Our core switches are large units, so they get two circuits (and two phases) for each of their power connections. It’s a bit complicated, but the simple rule is: load the boat evenly and it won’t capsize.
Right now, phase one is running 3% over, phase two is 33% under, and phase three is 24% over average. That puts the deviation between legs two and three at 58%! It’s no wonder our UPSs have only been living for two or three years. When a UPS has to supply power to a system, it performs best when the load across all of its connections is close to the same; deviations up or down simply chew up UPS components and spit them out. Oh, and there is no UPS maintenance bypass switch, so if the UPS dies, the room dies. If we want to replace the UPS, we have to kill the room until an electrician bypasses the hardwired connection.
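For anyone who wants to see how a spread like that falls out of raw readings, here’s a quick sketch. The per-leg amp numbers are placeholders I picked to land near the measured percentages, not the electrician’s actual figures.

```python
# Rough phase-balance math. The per-leg amp readings are placeholders chosen
# to land in the same ballpark as the audit, not the electrician's numbers.

leg_loads = {"L1": 41.2, "L2": 26.8, "L3": 49.6}   # amps drawn per UPS leg

average = sum(leg_loads.values()) / len(leg_loads)  # 39.2A

for leg, amps in leg_loads.items():
    deviation = (amps - average) / average
    print(f"{leg}: {amps:.1f}A ({deviation:+.0%} vs. the average)")

# Spread between the heaviest and lightest legs, relative to the average:
spread = (max(leg_loads.values()) - min(leg_loads.values())) / average
print(f"Spread between heaviest and lightest leg: {spread:.0%}")   # ~58%
```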
But all is not lost.
Now that I have a detailed map of our power usage and the outlets are labeled, I’m throwing together an emergency change plan to migrate servers onto other circuits, both to take load off the heavily loaded circuits AND to balance the load across phases.
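At its core this is a bin-packing exercise. Here’s a minimal sketch of the kind of greedy shuffle I’m working out on paper: biggest draws first, each onto a circuit sitting on the lightest phase. The servers, per-cord draws, and circuit layout are all invented, and a real plan also has to keep each server’s two supplies on separate circuits and phases, which this toy version ignores.

```python
# Minimal greedy rebalancing sketch: place the biggest loads first, each onto
# a circuit on the currently lightest phase. Servers, draws, and the circuit
# layout are invented for illustration.

from collections import defaultdict

BREAKER_AMPS = 20.0

# A small slice of the panel: circuit name -> phase (leg) it sits on.
circuits = {"C1": "L1", "C2": "L2", "C3": "L3", "C4": "L1", "C5": "L2", "C6": "L3"}

# Hypothetical per-cord draws in amps.
servers = {"esx1": 4.5, "esx2": 4.5, "db1": 3.0, "backup1": 2.0, "web1": 1.5, "web2": 1.5}

circuit_load = defaultdict(float)
phase_load = defaultdict(float)
plan = {}

for name, amps in sorted(servers.items(), key=lambda kv: kv[1], reverse=True):
    # Pick the circuit whose phase is lightest; break ties on circuit load.
    target = min(circuits, key=lambda c: (phase_load[circuits[c]], circuit_load[c]))
    plan[name] = target
    circuit_load[target] += amps
    phase_load[circuits[target]] += amps

for name, circuit in plan.items():
    print(f"{name} -> {circuit} (phase {circuits[circuit]})")

for phase in sorted(set(circuits.values())):
    print(f"{phase}: {phase_load[phase]:.1f}A total")
```

With those made-up numbers the greedy pass lands the three legs at roughly 6.0A, 6.0A, and 5.0A, which is the “load the boat evenly” goal in miniature.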
In August we plan on installing new three-phase power distribution units from APC, with onboard monitoring and access to all three phases on each PDU. That will make balancing and loading a lot easier. Until then, I’m juggling power cables between anonymous power strips… but at least NOW they’re labeled.
Knowing is half the battle.