In this paper, the impact of direct liquid cooling (DLC) system failure on the information technology (IT) equipment is studied experimentally. The main factors that are anticipated to affect the IT equipment response during failure are the central processing unit (CPU) utilization, coolant set point temperature (SPT), and the server type. These factors are varied experimentally and the IT equipment response is studied in terms of chip temperature and power, CPU utilization, and total server power. It was found that failure of this cooling system is hazardous and can lead to data center shutdown in less than a minute. Additionally, the CPU frequency throttling mechanism was found to be vital to understand the change in chip temperature, power, and utilization. Other mechanisms associated with high temperatures were also observed such as the leakage power and the fans' speed change. Finally, possible remedies are proposed to reduce the probability and the consequences of the cooling system failure.
Cooling was reported to consume 30–40% of the total energy used in legacy data centers. Despite the optimistic data center energy consumption trends in a recent survey, the average power usage effectiveness was reported to be 1.8, mostly due to inefficiencies in the cooling system. Several thermal management technologies are used in data center cooling to address the inefficiency challenges [4,5]. The notion of cooling electronic systems using liquids is not novel; however, potential leaks and the capital cost have greatly restricted its application in real data centers [6–8].
In direct liquid cooling (DLC) technology, a liquid cooled cold plate is situated on top of a chip, which reduces the thermal resistance between the chip junction and the cooling source. This provides an opportunity to enhance the thermal efficiency of the cooling system. A study showed that for computationally equivalent clusters, DLC can save 45% energy compared with standard hot/cold aisle air cooling. The low thermal resistance also promoted designs that operate at higher coolant temperatures (known as warm-water cooling or chillerless systems), in which a water-side economizer utilizing ambient temperature replaces a chilled water system. This leads to savings of more than 90% compared to conventional air cooling systems [10–12]. The large potential energy savings of DLC encouraged the industry to develop commercial products that may be used in data centers [13,14].
Few studies in the literature have focused on system-level analysis of DLC. These studies mainly investigated thermal performance, such as the effect of coolant inlet temperature, flow rate, ambient temperature, and chip power on the cooling of the DLC system [15–17]. The thermal performance research is mainly focused on increasing the electronics power absorbed by the liquid cooling to increase the cooling system efficiency. The pressure drop of a DLC system has also been investigated in the literature to reduce the pumping power and the cooling system energy consumption. The characteristics of a DLC system under exceptional failure situations are crucial to estimate the downtime and provide intelligent remedies to diminish it. The topic of DLC system failure and its impact on the information technology (IT) equipment has not been investigated in the literature.
Cooling failure in air cooling systems has been addressed in the literature. Shrivastava and Ibrahim conducted an experiment that shows the impact of a cold aisle containment system compared with a conventional hot/cold aisle configuration. The results showed the benefit of cold aisle containment in increasing the ride-through time in a computer room air handler failure situation. This conclusion agrees with a numerical study by Alkharabsheh et al., which shows that the IT equipment fans can enable a recirculation of the flow through the plenum and the failed computer room air handler. Alkharabsheh et al. have studied different models to simulate the computer room air conditioner (CRAC) cooling coil in computational fluid dynamics simulations. They found that the CRAC cooling coil has a significant impact on the rate of change in temperature during cooling failure.
This research addresses the research gap in characterizing the DLC system experiencing failure for the first time. Collaborative thermal engineering and computer science experimental measurements are used to study the impact of the cooling failure on the IT equipment. The major outcome from this work is to determine the available time in case of failure to prevent losing the computing capability of the IT equipment. The factors that are anticipated to affect the failure behavior are: the server central processing unit (CPU) utilization, server type, and cooling set point temperature (SPT) (chilled water cooling versus warm water cooling). These factors are addressed in this study.
Direct Liquid Cooling System Description
The tested system is a rack-level DLC solution for data centers. This system consists of two main loops: the primary loop and the secondary loop. The purpose of the primary loop is to carry the heat from the rack and to dispose of it in the environment outside the data center. It consists mainly of a flow regulating valve, facility-side piping, and either a dry cooler (for chillerless systems) or a chiller (for chilled water systems). The primary loop in this study is connected to an existing building chilled water system. The supply/return lines from the main building chilled water system to the rack under investigation were modified by adding a novel control system. The control system provides the flexibility to experimentally simulate the DLC behavior under different SPTs.
The purpose of the secondary loop is to carry the heat from the chips inside the servers (via direct contact cold plates) and to dispose of it in the primary loop. The coolant in the secondary loop is a propylene glycol (15%) and water (85%) mixture. The secondary loop consists of three modules: (a) the coolant distribution module (CDM), (b) the supply and return manifolds module, and (c) the server module. The CDM in Fig. 1(a) contains a liquid-to-liquid heat exchanger, two 25 recirculation pumps that are connected in series and permanently operative, a coolant reservoir, a flow meter, and temperature and pressure sensors at the supply and return lines. The cooling capacity of the CDM is 40 kW.
The manifold module in Fig. 1(b) is 6 ft long with square cross section and is made of a stainless material. The manifold can accommodate 42 server modules connected in parallel using dry-break quick disconnect sockets. The server module in Fig. 1(c) consists of corrugated hoses, fittings, dry-break quick disconnect plugs, and the microchannel cold plate component. The cold plate component consists of a plastic cover and a copper microchannel heat sink, as shown in Fig. 1(d). The copper heat sink contains a V-groove to split the impinging jet between two sets of parallel microchannels.
The experimental facility is located inside the Binghamton University data center laboratory. The data center contains 41 racks divided between three cold aisles with a total area of 215 m². A rack-level DLC solution is installed in one of the racks to perform this experiment, as shown in Fig. 2.
The DLC rack is equipped with 14 liquid cooled servers of three different types, as indicated in Table 1. Type A-C is used as a notation for the servers in this paper for simplicity. The rack power density is 2.9 kW. The power density of type A, B, and C is 350 W, 210 W, and 154 W, respectively. The airflow demand of type A server is 58 CFM and for type B and type C, it is 47 CFM. From a thermal perspective, the major difference between these servers is the thermal design power (TDP) for each server. The TDP of type A, B, and C is 160 W, 95 W, and 80 W, respectively. Therefore, for a certain CPU utilization, the dissipated power from each server model will be different. The servers are connected with the cooling system in parallel, which indicates that each server is connected independently with the cooling system distributor (manifold). The liquid system is a retrofit cooling solution that was installed on air cooled servers. The cold plates were used to cool the CPUs only while the remaining components of the IT equipment (dynamic random access memory, power supply unit, hard disk drive, etc.) were maintained air cooled. The inlet air temperature of the IT equipment is maintained at 16 °C. Since the other racks in the data center laboratory are air cooled, cold aisle containment with 110% provisioning is used.
|Server model|Processor|Quantity|Type|
|---|---|---|---|
|Dell PowerEdge R730|Intel Xeon (E5-2687WV3)|2|A|
|Dell PowerEdge R520|Intel Xeon (E5-2440)|9|B|
|Dell PowerEdge R520|Intel Xeon (E5-2430V2)|3|C|
A novel design for the primary side of the DLC system is used to conduct this research. A proportional control valve (PCV) is used to control the primary side inlet temperature, as shown in Fig. 3. For instance, if the DLC rack primary loop inlet temperature is sought to be higher than the building chilled water supply temperature, a signal is initiated to the actuator of the PCV. The PCV motor gradually closes the valve, forcing the return flow from the DLC rack to merge with the supply flow. This leads to an increase in the supply temperature of the rack based upon the flow rate of the forced return flow to the supply line.
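The mixing behavior described above can be approximated with a steady-state energy balance. The following is a minimal sketch, assuming ideal adiabatic mixing of two streams with identical fluid properties; the function names and the example temperatures (10 °C chilled water, 50 °C rack return) are illustrative assumptions and are not values reported for this facility.

```python
def mixed_supply_temperature(t_chilled_c, t_return_c, q_chilled, q_recirculated):
    """Flow-weighted temperature of two mixed streams.

    Assumes equal density and specific heat for both streams (an
    idealization), so any consistent flow unit can be used.
    """
    total = q_chilled + q_recirculated
    if total <= 0:
        raise ValueError("total flow must be positive")
    return (q_chilled * t_chilled_c + q_recirculated * t_return_c) / total


def recirculated_fraction(t_chilled_c, t_return_c, t_target_c):
    """Fraction of the rack supply flow that must come from the return line."""
    if t_return_c == t_chilled_c:
        raise ValueError("return and chilled temperatures must differ")
    fraction = (t_target_c - t_chilled_c) / (t_return_c - t_chilled_c)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("target temperature is not reachable by mixing")
    return fraction


if __name__ == "__main__":
    # Hypothetical operating point: reach a 45 degC set point.
    frac = recirculated_fraction(10.0, 50.0, 45.0)
    print(f"recirculated fraction: {frac:.3f}")
    print(mixed_supply_temperature(10.0, 50.0, 1.0 - frac, frac))
```

Under these assumptions, a higher SPT simply requires a larger share of return flow in the rack supply, which is consistent with the PCV closing gradually as the requested inlet temperature rises.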
The fluid temperature in the primary loop is measured using immersion liquid temperature sensors (i.e., in direct contact with the fluid) with a platinum resistance temperature detector (RTD) element. The temperature range of the sensors is −1 °C to 121 °C (30–250 °F), and the accuracy is ±(0.3 °C + 0.005 × |T|), with T in °C, as indicated by the manufacturer. This results in an accuracy of ±0.4 °C and ±0.53 °C at 20 °C and 45 °C, respectively. The flow rate is measured using an inline, direct-beam-path wetted ultrasonic sensor that uses differential transit-time velocity measurement. The flow rate range is 0.6–15 GPM with ±1% accuracy. The PCV assembly consists of a two-way Powermite 599 Series valve and an MT Series SSC electronic valve actuator. The PCV is chosen such that the valve is normally closed and fails open, to ensure that the PCV will not halt the primary-side flow to the DLC racks if the valve fails. The output signals of the temperature and flow sensors and of the PCV are connected to a central building management system for data logging and control.
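As a quick check of the sensor accuracy figures quoted above, the manufacturer's expression can be evaluated directly. This is a small sketch; the function name is an assumption introduced here for illustration.

```python
def rtd_accuracy_c(temp_c):
    """Accuracy band (plus/minus, in degC) from the quoted expression
    0.3 degC + 0.005 * |T|, with T in degC."""
    return 0.3 + 0.005 * abs(temp_c)


if __name__ == "__main__":
    # The two set point temperatures used in this study.
    for set_point_c in (20.0, 45.0):
        print(f"{set_point_c:.0f} degC: +/-{rtd_accuracy_c(set_point_c):.3f} degC")
```

The 45 °C value evaluates to 0.525 °C, which the text rounds to ±0.53 °C.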
The fluid temperature in the secondary loop is measured using Measurement Specialties 10K3D682 temperature sensors. The accuracy of these sensors is 2.45% according to the manufacturer specs. The flow rate is measured using an inline Adafruit flow meter. The flow meter has a magnet attached to a pinwheel and a magnetic sensor on the flow meter tube. The flow rate is measured by the number of spins the pinwheel makes. The flow meter is calibrated for this application with 3% accuracy. The secondary loop sensors are connected to an integrated control and monitoring system inside the CDM.
Internal sensors of the servers and performance counters are used to measure the fan's speed, chip's power and temperature, and CPU utilization. The readings of the internal sensors and performance counters are retrieved at the server level using the intelligent platform management interface (IPMI) and transmitted to a network management system via TCP/IP. Power distribution units (PDU) are used to measure the total power of servers. The PDUs are connected to the data center network; therefore, the readings can be retrieved using a network management system.
A Linux-based tool is used to retrieve the data from the building management system, the IPMI, and the PDU network interface. The tool uses the Simple Network Management Protocol (SNMP) and runs on an administrative machine in the data center. The data are collected at a 1 s time step and are made available to the user in Excel sheet format.
One or more components may fail in the primary or secondary loops of the DLC, which leads to a loss of cooling, loss of flow, or loss of flow and cooling combined. Power outage, mechanical failure, and human error are the primary reasons responsible for failures.
This study focuses on investigating complete and partial loss of flow in the secondary loop. The secondary loop contains two recirculation pumps in series, as shown in Fig. 1(a). The complete loss of flow is simulated experimentally by shutting down both recirculating pumps, which drops the flow to zero, as shown in Fig. 4(a). The partial loss of flow can occur due to failure of one of the two recirculating pumps. It is simulated experimentally by shutting down one of the pumps and keeping the other pump operative, which decreases the flow to approximately 68% of the original flow rate, as shown in Fig. 4(b). The recirculating pumps were turned back on after the chip temperatures reached a certain limit to avoid damaging the IT equipment due to overheating. The recirculating pump in the primary loop was kept operative at all times during the experiments.
The factors that are anticipated to influence the system response due to the loss of flow failure are the CPU utilization (affects the chip power), cooling SPT, and the server model. These factors are varied in this study to understand their influence on the IT equipment response. The IT equipment response is determined by measuring the chip temperature, server computing capability (utilization), chip power, fan's speed, and server total power.
Results and Analysis
This section presents the experimental results for the complete and partial failures. The presented results show the effect of each failure on the chip temperature and power, CPU utilization, fan's speed, and total server power.
The complete failure of the pumps leads to a complete loss of flow. The experiments are conducted at different CPU utilizations, coolant SPTs, and server types. At each coolant SPT of 20 °C and 45 °C, the CPU utilization was set to 100%, 25%, and 0% (idling state). The chosen coolant SPTs are intended to compare the response of a chilled water DLC system (20 °C) with that of a warm water DLC system (45 °C) experiencing a failure incident. Additionally, the DLC rack is equipped with three different models of servers, which allows for studying their behavior under the failure mode.
The simulated failure mode starts at time of 300 s for all cases by shutting down the recirculation pumps. This leads to completely halting the flow in the cold plates that are cooling the CPUs inside the servers. During failure, the forced convection cooling through the cold plate microchannels turns into natural convection in the liquid side and conduction through the printed circuit board to air. This change in the cooling mechanism increases the thermal resistance and thus the chip junction temperature.
The chip temperature of the type A server increases immediately after failure and then stabilizes at a certain temperature, as shown in Fig. 5(a). The chip state at which the temperature stabilizes is called the throttling state. The chip throttles its frequency when it undergoes an excessive temperature increase to reduce the generated power, which lowers the chip temperature. This can be noticed in Fig. 5(b) by observing the reduction in the chip power (marked as 1) during failure. On the other hand, the throttling state reduces the CPU utilization, which degrades the CPU computing capability. This is demonstrated experimentally in Fig. 5(c) (marked as 1). It should be noted that the CPU utilization exceeds 100% in the presented data due to the Intel Turbo Boost Technology. If the Intel Turbo Boost feature is turned off, the chip power is expected to decrease, which would decrease the rate of change in temperature after failure. Power and CPU utilization drops are barely noticeable for the idling state since the utilization at idling is almost 0%.
Figures 5(a) and 5(c) show that the rate of change of the chip temperature depends on the CPU utilization. The time that the chip takes to start throttling after failure is 23 s and 56 s for 100% and 25% utilization, respectively. This indicates that the available time before the CPU frequency throttles at 25% utilization is approximately 2.4 times that at 100% utilization (that is, about 1.4 times longer). At idling state, the CPU utilization is essentially 0% except for some intermittent spikes due to the data logging tool attempting to retrieve data from the server. Since utilization is 0% at idling, throttling is unlikely to be the mechanism that stabilizes the temperature after failure. Instead, the temperature stabilizes in the idling state because the chip reaches a new thermal steady state. The chip power is small at idling state, and it is anticipated that in the absence of liquid cooling, air cooling is sufficient to keep the server operational.
The coolant SPT is found to have a small impact on the server response during failure. At 20 °C SPT, the time that the chip takes to start throttling after failure is 34 s and 82.1 s for 100% and 25% utilization, respectively, as shown in Fig. 6(a). Relative to the 45 °C SPT results (23 s and 56 s), this is a 48% and 47% increase in the available time before throttling starts for 100% and 25% utilization, respectively. This comparison at different SPTs can be used to compare the response of a chilled water DLC system with that of a chillerless (or warm water cooling) DLC system.
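The comparisons above follow from simple ratios of the reported times to throttling. The sketch below reproduces that arithmetic; the dictionary layout and function names are assumptions introduced for illustration, while the four times are the values reported in the text.

```python
# Reported time to throttling in seconds, keyed by (SPT in degC, CPU utilization in %).
TIME_TO_THROTTLE_S = {
    (45, 100): 23.0,
    (45, 25): 56.0,
    (20, 100): 34.0,
    (20, 25): 82.1,
}


def percent_increase(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def utilization_ratio(spt_c):
    """Available time at 25% utilization divided by the time at 100%."""
    return TIME_TO_THROTTLE_S[(spt_c, 25)] / TIME_TO_THROTTLE_S[(spt_c, 100)]


if __name__ == "__main__":
    for util in (100, 25):
        gain = percent_increase(TIME_TO_THROTTLE_S[(20, util)],
                                TIME_TO_THROTTLE_S[(45, util)])
        print(f"{util}% utilization: 20 degC SPT adds {gain:.0f}% more time")
    for spt in (45, 20):
        print(f"{spt} degC SPT: 25%/100% time ratio = {utilization_ratio(spt):.2f}")
```

Lowering the SPT from 45 °C to 20 °C adds a little under 50% to the available time, whereas lowering the utilization from 100% to 25% gives roughly 2.4 times the available time (about 1.4 times longer) at either SPT.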
It can be concluded from the aforementioned results that the CPU utilization has a more significant impact on the server response in case of failure than the cooling SPT. However, if the server has a computing load of more than 25% utilization, it ultimately takes the CPU less than a minute to throttle at the 45 °C SPT. In the throttling state, the server is practically ineffective as its processing capability deteriorates. Also, the high chip temperature eventually affects the reliability of the chip if it is maintained for a long time. Therefore, the solution for this mode of failure should be proactive: the change in temperature must be sensed immediately and acted upon in less than 23 s for the worst-case scenario (100% utilization, 45 °C SPT). The time the system takes to recover from failure in the worst-case scenario is 9.5 s. Remedies for this mode of failure are proposed later in this paper.
In addition to the chip temperature and the associated throttling behavior, leakage power is observed in the experimental data. Leakage power arises when the chip operates at high temperatures, and it further increases the chip power and temperature. Figure 5(b) shows the increase in the power while the server is experiencing failure (marked as 2). The chip power keeps rising until the CPU throttling mechanism initiates. Since the power drops when the CPU throttles, the leakage current decreases as well. Whether the CPU utilization is 100% or idling, leakage current is observed after failure. The leakage power increases the chip power by 6%, 10%, and 15% for 100%, 25%, and idling state utilization, respectively. The leakage power is also observed at 20 °C SPT, as shown in Fig. 6(b). The leakage current in this case increases the chip power by 7.5%, 12.8%, and 37% for 100%, 25%, and idling state utilization, respectively.
The total server power is measured in this analysis as well. The PDU readings are assumed to represent the total power consumed by the various server components; however, the server's power supply unit has an efficiency associated with it. Figure 5(d) shows the total server power for different CPU utilizations. There are three distinct changes in the server power at 100% and 25% CPU utilization. The server power increases after failure because of the leakage power and the increase in the fan speed. The increase in the fan speed during failure is demonstrated in Fig. 5(e). When the CPU throttles, the chip power drops, as shown in Fig. 5(b), which reduces the total server power. The server power spikes when cooling is recovered since the fans are still ramped up and the chip power returns to the original level. When the chip cools down to the original level, the total server power stabilizes at the original steady-state power. Since the servers do not experience throttling at idling state, the server power never drops during failure.
In the previous analysis, type A server is used to understand the effect of failure on the IT equipment behavior. The type A server has the highest TDP; thus, it presents the worst cooling failure scenario when it runs at 100% utilization and 45 °C coolant SPT. The behavior of type B and type C servers along with type A is shown in Fig. 7.
During a 60 s failure incident, only servers of type A and type B reach a temperature at which the CPU frequency throttles. These two types are characterized by a higher TDP than type C. This can be observed in the chip temperature and utilization response in Figs. 7(a) and 7(c). The chip power increases and then decreases after failure for type A and type B servers. This behavior occurs due to the leakage power followed by the CPU throttling mechanism. Since the type C server does not reach the throttling state, its chip power only increases due to the leakage power. The server power of the type C server is also distinct compared with type A and type B servers. The server power of type C does not drop, as the chip power does not decrease during failure. The increase in the server power is due to the increase in the fan speed, as shown in Fig. 7(e).
The previous results showed the IT equipment response when the secondary loop of the cooling system fails. The main factors that are anticipated to affect the IT equipment response are the CPU utilization, coolant SPT, and the server type. These factors are varied experimentally and the IT equipment response is studied in terms of chip temperature and power, CPU utilization, and total server power. The CPU utilization and the type of servers were found to have a more noticeable effect on the IT equipment during failure than the coolant SPT. Additionally, the CPU frequency throttling mechanism was found to be crucial in understanding the change in chip temperature, power, and utilization. Other mechanisms associated with high temperatures were also observed such as the leakage power and increasing the fan's speed.
It is evident that this failure mode is hazardous. There is a high potential that a data center will go offline almost instantaneously if this mode of failure occurs. Data center designers should be aware of the consequences of this failure mode and attempt to prevent it from happening. In the Proposed Remedies section, we propose possible remedies that can reduce the possibility of this mode of failure and a solution to cope with it if it occurs.
Two recirculating pumps are used in the secondary loop to provide cooling for the servers. The two pumps are connected in series and are permanently operative. In the partial failure mode, only one of the pumps was shut down to simulate a partial loss of flow. The partial failure leads to a 32% reduction in the flow rate of the secondary loop, as opposed to a complete loss of flow in the complete failure mode. The failure is initiated at time 300 s. All servers in the rack are at 100% CPU utilization and the coolant SPT is 45 °C, representing the worst-case scenario.
Interestingly, it is found that the partial failure does not affect the IT equipment, as shown in Fig. 8. The chip temperature, CPU utilization, chip and server power, and fan's speed do not change after failure. The chip temperature is maintained well below the throttling temperature of 87 °C; that is also proven by the absence of the CPU utilization drop. The unchanged chip temperature also leads to a constant fan's speed and chip power.
An experiment is conducted to understand the effect of the coolant flow rate in the secondary loop on the cooling of a server, as shown in Fig. 9(a). In this experiment, the coolant inlet and outlet temperatures are measured for a type A server using J-type thermocouple probes. The coolant flow rate in the server module is measured using OMEGA FTB-314D microflow meter with accuracy of ±0.18 l/min. The junction temperature and chip power are measured using the server registry data via IPMI. The experiment is conducted at 45 °C coolant SPT and 100% CPU utilization for a type A server, while the coolant flow rate is varied.
Figure 9(b) presents the sensible power calculated from the measured coolant inlet and outlet temperatures and flow rate. The sensible power (power removed by the liquid coolant) is calculated using the energy balance equation on the coolant side. During normal operation, the sensible power is 105.1 W, indicating that the liquid coolant removes 75.1% of the total chip power. It is anticipated that the air side removes the remaining chip power, as the server fans were not removed when the liquid cooling module was installed in the server. The chip power can be transferred to the air side via conduction from chip to board and then convection from board to air, or through the exposed area of the cold plate to the air. Since 75.1% of the chip power is removed by the liquid coolant, the thermal resistance from the chip junction to the liquid coolant inlet is small compared with that from the chip junction to the air side. The chip junction temperature at normal operation is 61.1 °C, as shown in Fig. 9(b).
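The coolant-side energy balance mentioned above has the usual form Q = ρ · V̇ · c_p · (T_out − T_in). The sketch below illustrates it under stated assumptions: the default density and specific heat are rough, assumed values for a dilute propylene glycol and water mixture and are not properties reported in this paper, and the function names are hypothetical.

```python
def sensible_power_w(flow_l_min, t_in_c, t_out_c,
                     density_kg_m3=1010.0, cp_j_kg_k=4000.0):
    """Heat absorbed by the coolant: Q = rho * V_dot * cp * (T_out - T_in).

    The default fluid properties are assumed, approximate values.
    """
    if flow_l_min < 0:
        raise ValueError("flow rate must be non-negative")
    volumetric_flow_m3_s = flow_l_min / 1000.0 / 60.0  # l/min -> m^3/s
    mass_flow_kg_s = density_kg_m3 * volumetric_flow_m3_s
    return mass_flow_kg_s * cp_j_kg_k * (t_out_c - t_in_c)


def liquid_capture_fraction(sensible_w, chip_power_w):
    """Share of the chip power removed by the liquid coolant."""
    if chip_power_w <= 0:
        raise ValueError("chip power must be positive")
    return sensible_w / chip_power_w


# The reported 105.1 W at a 75.1% capture fraction implies the total chip power.
implied_chip_power_w = 105.1 / 0.751

if __name__ == "__main__":
    print(f"implied chip power: {implied_chip_power_w:.1f} W")
    print(f"capture fraction: {liquid_capture_fraction(105.1, implied_chip_power_w):.3f}")
```

The reported figures imply a total chip power of roughly 140 W for the type A server at 100% utilization, with the remaining quarter of the heat leaving through the air side.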
After failure of a single pump, the flow rate for the tested server coolant loop drops from 1.1 l/min to 0.75 l/min. By varying the liquid coolant flow rate from 0.3 l/min to 1.15 l/min, it can be noticed in Fig. 9(b) that the sensible power varies approximately from 67% to 75% of the total chip power. The reduction in the flow rate to 0.75 l/min due to partial failure will have a small impact on the cooling of the chip, which explains why the partial failure has a negligible impact on the IT equipment in Fig. 8. Furthermore, the negligible impact of partial failure on the IT equipment can be clearly observed by the chip temperature values of 61.1 °C and 61.8 °C at 1.1 l/min and 0.75 l/min, respectively, as shown in Fig. 9(b).
The observation that a single pump failure does not affect the cooling of the servers is not generic. In this study, 15 server modules are used to cool 14 two-rack-unit (2U) servers in the experimental setup; however, the DLC system is equipped with outlets that are sufficient to accommodate 42 server modules. In a hypothetical case of using 42 server modules, each server module would receive 0.32 l/min during normal operation and 0.22 l/min during partial failure. The flow rate values are calculated by assuming that the pumps maintain the same operating point when 42 server modules are used instead of 15 and that the flow is evenly distributed between the 42 server modules. The flow rate drop from 0.32 l/min to 0.22 l/min is expected to increase the chip temperature more noticeably than the drop from 1.1 l/min to 0.75 l/min, based on the data shown in Fig. 9(b). Nevertheless, the chip temperature will still be below the throttling temperature of 87 °C, and the partial failure will not affect the IT equipment computing capability.
In summary, the partial failure does not affect the IT equipment computing performance even when the cooling system is at maximum capacity of server modules.
In this section, possible remedies are proposed to reduce the possibility of cooling system failure and to limit the consequences if failure occurs.
The proposed remedies stem from the root cause of failure. The loss of flow due to pump failure can be caused by a power outage. Data center IT equipment is connected to an uninterruptible power supply (UPS) system to ensure continuous operation in case of a power outage. On the other hand, the cooling system is connected to a backup generator, which takes longer than a UPS system to kick in. It is of great interest to data center designers to estimate how long IT equipment can be kept operational on a UPS system without overheating until the backup generator reactivates the cooling system. From the results of this research, if a DLC system is used for cooling, such a power system scheme (IT equipment on a UPS system and cooling system on backup generators) will lead to data center shutdown. Therefore, it is recommended to connect the pumps of the secondary loop of a DLC system to a UPS system to prevent this scenario.
Mechanical failure of the pumps is another root cause that could lead to loss of flow in the secondary loop of a DLC system. The recommended action to diminish this issue is pump redundancy. If a pump fails, another pump becomes available to provide the required flow. Also, continuous maintenance for the pumps would reduce the possibility of mechanical failure.
Load migration is a mechanism used routinely in data centers to migrate the computing load running on virtual machines between different physical machines. Load migration using virtual machines is used primarily for consolidating workload onto as few machines as possible to improve the utilization of IT equipment and operate them in high energy-efficiency regions. Load migration using virtual machines can also be used to cope with cooling system failure when it happens.
The trigger for virtual machine migration on pump failures—partial or complete—can be derived from sensors that measure flow. These flow sensors are located within the rack-level heat exchanger unit and are accessible using a vendor-provided interface. Alternatively, the trigger for migration can be derived from the core temperature trend. Specifically, the CPU core temperatures can be sensed in all servers in the rack (using software interfaces in the kernel) and a rapid increase in the core temperature that is not attributed to an increase in the offered workload (which can be estimated from an immediate history of the server utilization) can be considered as indicative of the cooling system failure.
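The temperature-trend trigger described above can be sketched as a sliding-window check. This is only an illustration of the idea, not the detection logic used in this work: the class name, window length, and both thresholds are assumptions chosen for the example, and the sample values would come from whatever kernel or IPMI interface is available.

```python
from collections import deque


class CoolingFailureDetector:
    """Flags a rapid core-temperature rise that is not explained by a
    rise in utilization over the same window (thresholds are assumed)."""

    def __init__(self, window=5, temp_rise_threshold_c=5.0,
                 util_rise_threshold_pct=10.0):
        self.temps = deque(maxlen=window)
        self.utils = deque(maxlen=window)
        self.temp_rise_threshold_c = temp_rise_threshold_c
        self.util_rise_threshold_pct = util_rise_threshold_pct

    def update(self, core_temp_c, utilization_pct):
        """Add one sample (e.g., every 1 s); return True if migration
        should be triggered."""
        self.temps.append(core_temp_c)
        self.utils.append(utilization_pct)
        if len(self.temps) < self.temps.maxlen:
            return False  # not enough history yet
        temp_rise = self.temps[-1] - self.temps[0]
        util_rise = self.utils[-1] - self.utils[0]
        workload_explains_rise = util_rise >= self.util_rise_threshold_pct
        return (temp_rise >= self.temp_rise_threshold_c
                and not workload_explains_rise)


if __name__ == "__main__":
    detector = CoolingFailureDetector()
    # Synthetic samples: steady temperature, then a fast rise at constant load.
    for temp in [60, 60, 60, 60, 60, 62, 64, 66]:
        if detector.update(temp, 100.0):
            print(f"trigger migration at {temp} degC")
```

A window that is short relative to the 23 s worst-case time to throttling leaves more time for the migration itself, at the cost of more sensitivity to measurement noise.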
The key to dealing with pump failure is to live-migrate (that is, migrate running tasks) the virtual machines that serve requests as quickly as possible and well before CPU throttling begins to limit server performance and slow down the migration itself. In a typical data center setup, rack-to-rack migration of virtual machines typically goes through a common set of switches. The temptation to start migrating all virtual machines simultaneously off the rack with the failed pump can introduce network congestion and actually slow down the migration process. We therefore developed a technique that migrates virtual machines in small groups, one group at a time, to avoid network congestion. With this solution in place and using the migration trigger from the flow sensor within the rack-level heat exchanger, it was possible to migrate virtual machines quickly on a pump failure. The relevant results are shown in Fig. 10.
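The group-at-a-time idea can be illustrated with a short sketch. It is not the implementation developed in this work: `migrate_one` is a hypothetical callback standing in for whatever hypervisor call performs a live migration, and the group size is an assumed tuning parameter.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("group size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def migrate_in_groups(vm_ids, group_size, migrate_one):
    """Migrate VMs one group at a time to limit concurrent network load.

    `migrate_one` is a hypothetical callable that live-migrates a single
    VM and returns a result. VMs within a group migrate concurrently; the
    next group starts only after the current group has finished.
    """
    completed = []
    for group in chunked(list(vm_ids), group_size):
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            # pool.map preserves input order; leaving the with-block
            # waits for every migration in this group.
            completed.append(list(pool.map(migrate_one, group)))
    return completed


if __name__ == "__main__":
    def fake_migrate(vm_id):
        # Stand-in for a real live-migration call.
        return f"{vm_id}:migrated"

    print(migrate_in_groups(["vm1", "vm2", "vm3", "vm4", "vm5"], 2, fake_migrate))
```

The group size trades total migration time against congestion on the shared switches, so in practice it would be chosen from the available link bandwidth and the time budget before throttling.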
The results presented encourage the use of the migration technique that was developed. The results show that despite the complete loss of liquid cooling at idling state, the chip does not go through excessive temperature increases. Even if servers are running at 100% utilization when the cooling system fails, the computing load can be moved (migrated) to other servers whose cooling systems are operating normally. This scenario is simulated and evaluated experimentally, with results as shown in Fig. 10. The chip temperature increases drastically after failure. If the load is removed through virtual machine migration, the CPU utilization drops rapidly from 100% to almost 0% (idling) and the chip temperature stays within safe limits.
Summary and Conclusions
This study presented an experimental investigation of the impact of failure in a DLC cooling system on the IT equipment. The CPU utilization, coolant SPT, and the server type are varied and the IT equipment response is studied in terms of chip temperature and power, CPU utilization, and total server power.
The CPU utilization and the type of servers were found to have a more significant impact on the IT equipment during failure than the coolant SPT. Additionally, the CPU frequency throttling mechanism was found to be crucial in understanding the change in chip temperature, power, and utilization. Other mechanisms associated with high temperatures were also observed such as the leakage power, and increasing the fan's speed.
Ultimately, it is evident that the DLC system failure is hazardous. There is a high potential that a data center will go offline almost instantaneously if this failure occurs. Data center designers should be aware of this consequence and take the proper precautions. We proposed possible remedies that can reduce the possibility of this mode of failure and a solution to cope with it if it occurs. It is recommended to connect the pumps of the secondary loop of a DLC system to a UPS system and use pump redundancy to avoid failure due to power outage and mechanical failure. Additionally, proactive implementation of load migration technique can protect the computing data if this failure actually happens.
National Science Foundation (Grant No. IIP-1134867).