Is redundant cooling necessary?
There is one constant in the data center universe, and that is the necessity of cooling. Regardless of where a data center or computer room is geographically located, it must be cooled to keep its servers running. Without sufficient cooling, servers and other electronic equipment will overheat and begin to fail in unpredictable ways.
Having said this, it is also true that advances in microprocessor design and storage technologies are dramatically affecting the data center cooling marketplace. To better understand cooling requirements, it’s necessary to understand the sources of heat and the effects of that heat.
Sources of heat
When electrical current flows through a conductor, the conductor’s resistance converts some of the electrical energy to heat. A good example of this is the simplest of electrical circuits: the light bulb. When a light bulb provides a path for current to flow from a voltage potential to neutral/ground, resistance is encountered along the entire path, from the switch through the bulb’s filament and on to neutral/ground, and that resistance generates heat. The filament of the bulb is designed to heat up to the point where light is emitted. The intent of the circuit is to generate light; the by-product of the circuit is heat.
It’s important to keep in mind that heat is never wanted in electrical circuits. Heat represents wasted electrical energy, and it causes components to age and eventually fail. No real circuit can eliminate resistive losses entirely, but excess heat is usually a sign of inefficient design. So it is the goal of every circuit designer to generate as little heat as possible.
Unfortunately, today’s engineering practice often consists of tying together other companies’ components to create a product. Or the engineering behind a product assumes a short service life: the product will operate for a while, be overtaken by newer products, and no longer be needed (the sales cycle). Length of service is not a concern; only time to market matters. It is also true that some materials are inherently inefficient with regard to heat, although advances in materials science continue to erode the heat-generating properties of these materials. Consequently, heat is here to stay unless market dynamics force manufacturers to redesign their components to be more power efficient.
There aren’t many bulbs in computers. Where does the heat come from?
Modern computers all contain integrated circuits (ICs) that act as switches, diodes that block paths of current, and coils of wire that act as chokes, transformers, filters, and so on. These components all make use of a voltage potential, and all are subject to inrush and operating current in their normal operation.
In an IC, when a switch closes on a circuit it allows current to flow into that circuit and on to others as necessary. A switch that opens and closes once per second operates at one hertz; 1,000 times a second is one kilohertz; 1,000,000 times a second is one megahertz. Most ICs in your computer operate in the gigahertz range (1,000,000,000 cycles per second). With each closure some current passes through the switch, so a switch that closes a billion times a second dissipates roughly a billion times the heat of a single closure. Now multiply that by the number of transistors on the IC and you multiply yet again the amount of current the IC draws and the amount of heat it generates. You get the picture: heat, heat, heat.
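The scaling described above can be sketched with the standard dynamic-power approximation P = α·C·V²·f. The capacitance, voltage, clock rate, activity factor, and transistor count below are illustrative assumptions, not figures for any real chip.

```python
# Dynamic (switching) power of a CMOS gate: P = alpha * C * V^2 * f,
# where alpha is the fraction of cycles the gate actually switches.
def dynamic_power_watts(capacitance_f, voltage_v, freq_hz, activity=0.1):
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

# Assumed figures: ~1 fF of switched capacitance, 1.0 V supply, 3 GHz clock.
per_gate = dynamic_power_watts(1e-15, 1.0, 3e9)

# Scale by an assumed one billion transistors switching at that rate.
chip_watts = per_gate * 1e9
print(f"per gate: {per_gate:.1e} W, whole chip: {chip_watts:.0f} W")
```

Even with a vanishingly small loss per gate, multiplying by clock rate and transistor count lands in the hundreds of watts, which is exactly the heat problem the text describes.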
Heat is quickly generated by an IC opening and closing its switches (gates). As an example of how much heat is involved, currently shipping Intel CPUs have a thermal design threshold of 100 °C (212 °F) before experiencing failures. That’s enough heat to boil water and sterilize against bacteria.
Other ICs generate heat in the same way and need to shed it so as not to affect their own performance or that of nearby components. Current flowing through coils, diodes, and other components also generates heat, adding to a computer’s overall heat production.
There is another source of heat that has to do with a property called induction. Induction occurs when a current is created in a coil of wire by a neighboring coil or a passing magnet. As the current increases in the primary coil, a current is induced in the adjacent secondary coil, and that current creates heat. Induction heating is often greatest in the power supplies of servers and other equipment.
An additional source of heat that is becoming less prominent in computers is mechanical friction. Mechanical friction is found in most spinning hard drives, which use ball bearings for support, and in the bearings of a server’s cooling fans. Each hard drive in a server or storage array is a little heater. As the drive spins, the heat created by friction is transferred to the drive’s case, which is exposed to airflow that carries the heat away. Most hard drives operate in the 24–37 °C range (75–98 °F) before experiencing data and/or drive degradation.
Another source of heat in the data center is the batteries used for the UPS. When batteries are charging or discharging they give off heat, and the amount of heat is a function of the current the battery is handling.
At what point does heat affect performance?
For each type of electrical or mechanical component there is a different heat threshold at which performance begins to degrade. In the case of the CPU, manufacturers have built in a failsafe circuit that shuts the CPU down before it becomes damaged; the threshold at which the CPU begins to protect itself is right around 105 °C. Other support ICs do not necessarily have this safety built into their design, and consequently they simply fail at about 90 °C.
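The failsafe behavior can be sketched as a simple trip-point check. The thresholds below follow the 105 °C figure in the text, but the exact values and the intermediate "throttle" stage are illustrative assumptions, not any vendor’s actual firmware logic.

```python
# Toy sketch of a CPU thermal failsafe: slow down near the trip point,
# hard-stop at the limit. Thresholds are illustrative assumptions.
def thermal_action(temp_c, throttle_at=100, shutdown_at=105):
    if temp_c >= shutdown_at:
        return "shutdown"    # protect the silicon before damage occurs
    if temp_c >= throttle_at:
        return "throttle"    # reduce clock speed to shed heat
    return "run"

print(thermal_action(98), thermal_action(102), thermal_action(106))
# → run throttle shutdown
```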
Coils and transformers will generally continue to operate in an excessive-heat situation. However, on occasion they overheat to the point of liquefying the insulation between conductors; the conductors then short to one another and the coil or transformer becomes useless.
The same is true for hard drives. As they heat beyond the 37 °C threshold, the platters can begin to warp and the drive will fail due to head crashes against the platters. Often the first symptom of an impending drive failure is an inability to retrieve data.
How long does it take for these devices to become overheated?
This is a difficult question to answer because of the number of variables in the mix. Factors like room size, number of servers, number of hard drives, UPS operation, building make-up air availability, and humidity each affect the rate of thermal rise in the room.
I ran an experiment at one point, in a data center I was responsible for, to determine how much time I had after a cooling failure before equipment needed to be shut down. I measured a 5 °F rise every 10 minutes in that room. Other rooms will vary, but it is a worthwhile thermal load test to run.
My test showed that I had about an hour after a cooling failure to shut down non-essential equipment in an effort to slow the heat rise in the room. Overall, I had two hours to get an AC technician on site to perform repairs before I had to shut down the data center. In this particular data center I needed a 4-hour window, so I installed over-temperature exhaust fans to vent the data center’s air into the building plenum. This bought me an additional 6 hours, for a total of 8 hours before I had to shut down the data center. Eight hours was more than enough time to effect repairs to the cooling equipment.
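The arithmetic behind that runway estimate is easy to reproduce. The 5 °F-per-10-minute rise is the figure measured above; the starting and shutdown temperatures below are illustrative assumptions for a typical room.

```python
# Back-of-envelope runway after a cooling failure, assuming a constant
# linear temperature rise (real rooms are only approximately linear).
def hours_until_limit(start_f, limit_f, rise_f_per_10min):
    headroom_f = limit_f - start_f
    minutes = headroom_f / rise_f_per_10min * 10
    return minutes / 60

# Assume the room starts at 68 F and equipment must be off by 95 F.
print(f"runway: {hours_until_limit(68, 95, 5):.1f} hours")
```

With those assumptions the runway is under an hour, consistent with the one-hour figure above, and it shows why anything that cuts the rise rate (shutting down non-essential gear, exhaust fans) buys so much time.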
What are you trying to protect anyway?
I mentioned earlier that a 4-hour window was needed to get a repair technician on site to repair the cooling equipment. The 4 hours included a 1-hour response time, a 2-hour repair time, and a 1-hour failover window. In the case of this data center, there were several AC suppliers within a 10-mile radius that could provide parts for repair if they were needed.
The question still remains: what is being protected? This is the question that each organization must come to terms with, and the answer will determine the necessity for redundant cooling. Is it merely a question of availability, or is it a question of protecting the technology investment? When excessive heat is present, failure rates of electronic components accelerate. If your organization has purchased replacement warranties for all your hardware, accelerated equipment failure may not be a significant consideration.
The nature of your organization will dictate how you make your decision. If your organization is involved with litigation and is in constant contact with the courts, then you have a high availability requirement for your data center. If your organization does not have to react to external events, such as court-issued requirements or medical feedback on patient cases, but you have highly compensated employees you don’t want sitting around being unproductive, then you also have a high availability requirement. If your organization has neither of those situations, but being without your data center would be a great inconvenience, then you have a standard availability environment.
Taking the analysis a bit further, if you have an organization that only uses the data center for research material or it is an archive infrequently accessed, then you might have a low availability data center.
Whichever category your data center falls into (high, standard, or low), that categorization will help you decide whether a secondary cooling system is necessary.
Wisdom from the trenches…
A small tidbit of advice: while it may seem on paper that your organization is protected from equipment failure because of your warranties, the loss of 1% of your server capacity due to component failure can be devastating if that 1% contains the briefing the president is about to give to the board that afternoon, and a printed copy is needed now. Warranties are a good tool, but you hope they are not needed in a crisis situation. It’s better to have equipment that has not been thermally stressed and is therefore not prone to early component failure.
How are advances in technology affecting this discussion?
Advances in technology are constant. They never seem to abate, even for a short time. With advances come solutions to existing problems, and in the case of heat abatement this is certainly the case.
Over the past decade Intel and its competitors have been addressing the heat and speed problems by reducing the operating voltage of their microprocessors, and they have been tremendously successful in this effort. Today’s microprocessors operate in the 0.65–1.4 volt range and, as a result, can support more cores on a single chip with significantly less operational heat.
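Switching power scales with the square of supply voltage at a fixed frequency, so the voltage drop quoted above pays off more than linearly. A quick check using the two endpoints of that range:

```python
# P is proportional to V^2 at a fixed frequency, so compare the two
# supply voltages quoted in the text (1.4 V down to 0.65 V).
v_old, v_new = 1.4, 0.65
power_ratio = (v_new / v_old) ** 2
print(f"relative switching power: {power_ratio:.0%}")
# → relative switching power: 22%
```

Roughly a fifth of the switching power per gate, which is the headroom that lets more cores share a single die.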
The hard drive manufacturers have been close behind in reducing their power requirements and heat dissipation. By switching from standard hard disk drives (HDDs) to solid state drives (SSDs), they have completely eliminated the drive’s mechanical friction, reducing power requirements, cooling requirements, and overall size.
With overall power requirements reduced, power supplies have correspondingly shrunk in size and heat output. However, and this is a big however, smaller components let manufacturers pack more equipment into a smaller space. So even though each component generates less heat, density has increased enough that the heat to be removed has largely kept pace. This will probably change as the technology continues to mature.
Do I need redundancy in cooling?
The answer is yes. For all the reasons identified earlier, a secondary cooling system is a good idea. However, the secondary system does not necessarily need to be a duplicate; it really just needs to buy you time to get the primary repaired. In an ideal world the secondary system would be identical to the primary, not because you need the same cooling capacity, but because you need to maintain humidity control in addition to cooling.
In addition to providing humidity control, a secondary cooling system will allow the primary system to have an off duty cycle. This will allow the systems to both last longer and require less maintenance. Be certain to specify a cooling controller that can cycle between the two.
Why do I need humidity?
Humidity is one of those components of cooling that is rarely discussed. The water vapor the air carries modestly increases the air’s capacity to absorb and transport heat away from electronic components. Humidity is important enough that commercial cooling systems for data centers all include humidifiers to raise the humidity in the room.
A typical specification for humidity in a data center would be 68 °F with 70% relative humidity, non-condensing. Keep this in mind when deciding on cooling equipment for your data center.
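One way to sanity-check a spec like that is to compute its dew point: any surface colder than the dew point will condense water. The sketch below uses the common Magnus approximation; the coefficients are standard published values, and the temperature and humidity are the spec figures from the text.

```python
import math

# Magnus approximation for dew point; a and b are the usual coefficients.
def dew_point_c(temp_c, rh_percent, a=17.27, b=237.7):
    gamma = a * temp_c / (b + temp_c) + math.log(rh_percent / 100.0)
    return b * gamma / (a - gamma)

temp_c = (68 - 32) * 5 / 9        # 68 F -> 20 C
dp = dew_point_c(temp_c, 70)      # dew point at the 70% RH spec
print(f"dew point: {dp:.1f} C")   # surfaces colder than this will sweat
```

At 68 °F and 70% RH the dew point comes out around 14 °C (about 58 °F), so any chilled surface below that temperature will condense; that is what the "non-condensing" qualifier in the spec is guarding against.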
In addition to helping move heat away from electronics, humidity lessens the potential for static buildup within the data center. When a technician comes into the data center, the technician carries a static charge. That potential needs to be discharged before the technician works on any electronic gear. Most technicians are trained to touch a grounded device to discharge themselves and make themselves safe. However, merely moving around can rebuild a static charge. With humidity in the air, static bleeds off into the air and a buildup of potential becomes far more difficult.
If you are considering installing a cooling system for your data center, remember to think in terms of two. Keep in mind that you will need to plumb for your humidifiers and add a condensate pump to remove the water that the cooling equipment pulls from the air.