AI racks push liquid loops into the same decision space as UPS and switchgear: you’re not just sizing capacity—you’re defining fault domains, maintenance windows, and what “redundant” actually means when a pump trips at 2 a.m.
This guide gives a vendor-neutral method to:
calculate rack flow (GPM) from GPU/rack heat load and allowable ΔT,
roll that into CDU capacity at row/pod scale,
choose N+1 vs 2N based on what you need to survive (component failure vs whole-path failure),
implement primary/secondary loop isolation and integrate the CDU into DCIM/BMS with testable alarms and interlocks.
Light brand note: Coolnetpower works in this space; where a vendor page helps with context, it’s linked as optional reading. The sizing logic and schematics below are not tied to any one product.
Prerequisites: define the inputs you will not “discover later”
Before you touch a CDU datasheet, lock these inputs. They determine flow, pressure, redundancy count, and whether controls can actually protect the IT load.
Rack heat load (kW)
Use measured rack power when available. Otherwise, build a rack estimate from nameplate GPU power, CPU, memory, NICs, and PSU losses.
Liquid-cooled fraction (%)
Many AI racks are hybrid: direct-to-chip for GPUs/CPUs plus air for everything else.
Allowed coolant temperature rise across the rack (ΔT)
This is a design choice with operational consequences. Smaller ΔT means higher flow (bigger pipes, higher pump power).
Cooling topology (at high level)
Cold plate (direct-to-chip) and immersion both ultimately map to the same physics for loop sizing: you’re moving heat with a fluid. The differences show up in allowable ΔT, acceptable fluid quality, and failure response.
Fault tolerance target
Are you trying to tolerate a single component failure (N+1), or an entire path failure (2N)? If you don’t define the fault domain, redundancy language becomes marketing.
Key Takeaway: Redundancy starts with a sentence: “The system must maintain coolant delivery if ___ fails.” Fill in the blank (pump, CDU, branch valve, power feed, controls network, etc.).
Step 1 — Convert rack heat load into required rack flow (GPM): liquid cooling GPM calculation
At steady state, liquid cooling capacity is governed by the energy balance:
\(Q = \dot{m} \, C_p \, \Delta T\)
In common hydronic shorthand for water, this is often expressed as:
BTU/h = 500 × GPM × ΔT°F (water; the 500 is roughly 8.33 lb/gal × 60 min/h × 1 BTU/(lb·°F), so the factor changes with glycol concentration and temperature)
A convenient conversion:
kW → BTU/h: 1 kW ≈ 3412 BTU/h
So the rack flow requirement (water) is approximately:
GPM ≈ (kW × 3412) / (500 × ΔT°F)
Simplified: GPM ≈ 6.82 × kW / ΔT°F
This is the same relationship given in Advantage Engineering’s heat transfer formulas for circulating water systems: BTU formulas for water circulating heat transfer.
Worked example (60 kW rack)
Assume the liquid loop removes the full rack heat load (worst-case for loop sizing) and pick a ΔT.
| Assumption | Value |
|---|---|
| Rack heat load | 60 kW |
| Heat load | 60 × 3412 = 204,720 BTU/h |
| Option A: ΔT | 10°F |
| Option B: ΔT | 14°F |
Flow:
At 10°F: GPM ≈ 204,720 / (500×10) ≈ 40.9 GPM
At 14°F: GPM ≈ 204,720 / (500×14) ≈ 29.2 GPM
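The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of the water shorthand only; the function name `rack_gpm` and the `fluid_factor` parameter are illustrative, and any value other than 500 (for example, for a glycol mix) would need to come from the fluid supplier's data.

```python
BTU_PER_H_PER_KW = 3412.0   # 1 kW ≈ 3412 BTU/h (from the text)
WATER_FACTOR = 500.0        # BTU/h per GPM per °F for plain water (from the text)


def rack_gpm(heat_kw: float, delta_t_f: float,
             fluid_factor: float = WATER_FACTOR) -> float:
    """Return the required flow in GPM for a heat load (kW) and coolant ΔT (°F)."""
    if heat_kw < 0:
        raise ValueError("heat_kw must be non-negative")
    if delta_t_f <= 0 or fluid_factor <= 0:
        raise ValueError("delta_t_f and fluid_factor must be positive")
    return heat_kw * BTU_PER_H_PER_KW / (fluid_factor * delta_t_f)


if __name__ == "__main__":
    # Same two options as the table: 60 kW at ΔT = 10°F and 14°F.
    for delta_t_f in (10.0, 14.0):
        print(f"60 kW at ΔT = {delta_t_f:.0f}°F: {rack_gpm(60.0, delta_t_f):.1f} GPM")
```

Keeping the fluid factor as a parameter makes the glycol caveat explicit in the calculation rather than leaving it as a footnote.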
What to do with “GPU specs” in practice
If you’re starting from GPU/CPU TDP rather than a measured rack number:
Sum GPU TDPs → kW
Add CPU + memory + networking + storage → kW
Decide what % is liquid cooled (e.g., 70–90% for direct-to-chip-heavy designs)
Apply the same flow formula to the liquid-cooled portion
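The four steps above can be sketched as a short calculation. All inputs here are hypothetical (a rack with 32 GPUs at 1000 W each, 12 kW of other IT load, and an 80% liquid-cooled fraction); substitute your own nameplate or measured values.

```python
def liquid_cooled_kw(gpu_tdp_w: float, gpu_count: int,
                     other_it_w: float, liquid_fraction: float) -> float:
    """Return the part of the rack heat load (kW) carried by the liquid loop."""
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    total_kw = (gpu_tdp_w * gpu_count + other_it_w) / 1000.0
    return total_kw * liquid_fraction


def flow_gpm(heat_kw: float, delta_t_f: float) -> float:
    """Water shorthand from Step 1: GPM = kW × 3412 / (500 × ΔT°F)."""
    return heat_kw * 3412.0 / (500.0 * delta_t_f)


if __name__ == "__main__":
    # Hypothetical rack, not taken from any specific product.
    heat_kw = liquid_cooled_kw(gpu_tdp_w=1000.0, gpu_count=32,
                               other_it_w=12_000.0, liquid_fraction=0.80)
    print(f"liquid-cooled load: {heat_kw:.1f} kW")
    print(f"flow at ΔT = 10°F: {flow_gpm(heat_kw, 10.0):.1f} GPM")
```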
Pro Tip: For hybrid racks, size the IT-side loop for the liquid-cooled fraction, but size your controls and alarms as if the entire rack depends on it. Partial liquid cooling still creates “no-flow = rapid temperature rise” failure modes.
Step 2 — Roll rack flow into CDU flow/capacity at row or pod scale (CDU sizing for AI loads)
Once you have rack flow, size the CDU around:
Total flow (GPM) at the design pod/rack count
Total heat load (kW) the CDU must reject
Head pressure the CDU pumps must overcome (often the real limiter)
Turndown (can the system stay stable at partial load?)
2.1 Determine your “cooling unit of redundancy”
Pick the boundary you want to be able to isolate:
Rack-level (most isolation, more units)
Row-level
Pod/cluster-level
A common engineering pattern is to define a pod (e.g., 6–16 racks) as the loop segment that shares a CDU and can be isolated for maintenance.
2.2 Pod flow example
Say you have 8 racks × 60 kW/rack, and you design around ΔT = 10°F for the secondary loop.
Rack flow ≈ 40.9 GPM
Pod flow ≈ 8 × 40.9 ≈ 327 GPM
Now add allowance for:
balancing and control valve authority
filter fouling
uncertainty in actual ΔT during transients
Many teams carry a conservative design margin (documented) and then verify in commissioning.
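As a sketch of how the pod roll-up and margin might be recorded, the snippet below multiplies rack flow by rack count and applies a margin. The 10% default is a placeholder assumption, not a recommendation; the actual margin is a documented project decision.

```python
def pod_flow_gpm(rack_count: int, gpm_per_rack: float,
                 margin: float = 0.10) -> tuple:
    """Return (base_gpm, design_gpm); margin is a fraction, e.g. 0.10 for 10%."""
    if rack_count <= 0 or gpm_per_rack <= 0:
        raise ValueError("rack_count and gpm_per_rack must be positive")
    if margin < 0:
        raise ValueError("margin must be non-negative")
    base = rack_count * gpm_per_rack
    return base, base * (1.0 + margin)


if __name__ == "__main__":
    # 8 racks at 40.9 GPM each, as in the pod example above.
    base, design = pod_flow_gpm(8, 40.9)
    print(f"base pod flow: {base:.0f} GPM, with margin: {design:.0f} GPM")
```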
2.3 Don’t stop at GPM: confirm head and pump staging
Flow alone is incomplete. You must confirm the pump can deliver that flow at the required differential pressure across:
cold plates / immersion heat exchangers
manifolds and quick disconnects
hoses, fittings, and distribution piping
CDU internal piping, filters, heat exchanger
If you treat pressure drop as an afterthought, you’ll end up with a CDU that “meets GPM on paper” but can’t maintain flow at the rack.
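One way to make this check concrete is to compare a system curve against a pump curve. The sketch below assumes a closed loop with no static head (so system head scales with flow squared), a simplified quadratic pump curve, and about 2.31 ft of water per psi. Every pressure drop and pump coefficient shown is hypothetical; real values come from component datasheets and the vendor's pump curve.

```python
import math

PSI_TO_FT_WATER = 2.31  # approximate, for water near room temperature


def system_head_ft(drops_psi: dict) -> float:
    """Sum component pressure drops (psi at design flow) and convert to feet."""
    return sum(drops_psi.values()) * PSI_TO_FT_WATER


def pump_head_ft(flow_gpm: float, shutoff_head_ft: float, pump_k: float) -> float:
    """Simplified pump curve: H = H_shutoff - k * Q^2."""
    return shutoff_head_ft - pump_k * flow_gpm ** 2


def operating_flow_gpm(design_gpm: float, design_head_ft: float,
                       shutoff_head_ft: float, pump_k: float) -> float:
    """Flow where the pump curve meets the system curve H_sys = c * Q^2."""
    c = design_head_ft / design_gpm ** 2
    return math.sqrt(shutoff_head_ft / (pump_k + c))


if __name__ == "__main__":
    # Hypothetical pressure drops at the design flow of 327 GPM.
    drops = {
        "cold plates": 8.0,
        "manifolds + quick disconnects": 5.0,
        "hoses + distribution piping": 6.0,
        "CDU internals, filter, HX": 7.0,
    }
    head = system_head_ft(drops)
    q_op = operating_flow_gpm(327.0, head, shutoff_head_ft=110.0, pump_k=4.0e-4)
    print(f"required head at design flow: {head:.1f} ft")
    print(f"pump meets design flow: {q_op >= 327.0}")
```

If the operating flow lands below the design flow, the CDU "meets GPM on paper" only at a head the loop does not actually present.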
Step 3 — Choose redundancy: N+1 vs 2N redundancy for cooling (based on fault domain)
Two industry-standard definitions are helpful:
N+1: one extra unit beyond what’s required to carry the load, so the system can tolerate a single unit failure or allow one unit to be serviced while still meeting load.
2N: two fully independent systems, each capable of carrying the full load. CoreSite summarizes 2N as mirrored, fully duplicated infrastructure in What is data center redundancy? N, N+1, 2N.
3.1 Map redundancy to what you’re protecting
Use this table to avoid “2N in name only.”
| You need to tolerate… | N+1 usually means… | 2N usually means… |
|---|---|---|
| A single pump failure inside a CDU | extra pump (online spare) in the CDU pump set | two independent pump trains, each sized for full load (often with independent power) |
| A CDU module failure | one extra CDU module beyond duty | two complete CDUs (A and B), each capable of full load |
| A plate heat exchanger fault | spare HX module or bypass strategy + capacity margin | duplicated heat rejection paths / duplicated CDUs |
| Maintenance without downtime | isolate and service one component while remaining components carry load | isolate an entire path (A or B) while the other path carries full load |
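The unit-count arithmetic behind the table can be sketched as follows. The 180 GPM module capacity is a hypothetical figure chosen only to illustrate the difference between schemes.

```python
import math


def units_required(load_gpm: float, unit_capacity_gpm: float, scheme: str) -> int:
    """Return the installed unit count for scheme 'N', 'N+1', or '2N'."""
    if load_gpm <= 0 or unit_capacity_gpm <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_gpm / unit_capacity_gpm)  # duty units needed to carry the load
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Pod flow from Step 2 against a hypothetical 180 GPM module.
    for scheme in ("N", "N+1", "2N"):
        print(scheme, units_required(327.0, 180.0, scheme))
```

The count alone does not show whether the extra units sit on an independent path, which is the distinction the next section addresses.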
3.2 Redundancy is not just equipment count
For liquid systems, the “path” includes:
power source (A/B feeds, UPS-backed controls vs not)
controls and comms (what happens if the controller or network drops?)
valves and actuators (fail-open vs fail-closed)
sensors (single-point failure on flow or temperature)
A common failure mode in otherwise “redundant” designs is: pumps are N+1, but a single control panel, single power feed, or single sensor makes the protection non-redundant.
⚠️ Warning: If your shutdown logic relies on one flow switch or one leak sensor, test its failure mode. A failed sensor can be as dangerous as a failed pump.
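A simple way to surface hidden single points of failure is to list each path element with its installed and required counts, then flag anything that cannot lose one instance. The element names and counts below are hypothetical and only illustrate the check.

```python
def single_points_of_failure(installed: dict, required: dict) -> list:
    """Return elements where losing one instance drops below the required count."""
    return sorted(
        name for name, count in installed.items()
        if count - 1 < required.get(name, 1)
    )


if __name__ == "__main__":
    # Hypothetical pod: pumps are N+1, but several supporting elements are not.
    installed = {
        "pumps": 3,
        "power feeds": 1,
        "controllers": 1,
        "flow sensors": 1,
        "leak sensors per zone": 2,
    }
    required = {"pumps": 2}  # anything not listed is assumed to need 1
    print(single_points_of_failure(installed, required))
```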
Step 4 — Design for dual-loop isolation (and decide where you want the boundary)
Most modern liquid-cooled data center architectures separate:
the Facility Water System (FWS) loop (primary), and
the Technology Cooling System (TCS) loop (secondary)
These loops are typically coupled through a liquid-to-liquid heat exchanger (often inside the CDU).
4.1 Why dual-loop isolation matters
Dual-loop designs let you:
keep the IT loop cleaner/tighter-controlled than facility water
decouple rack stability from plant-side pressure/temperature fluctuations
contain leaks or chemistry issues within a defined fault domain
4.2 Example schematics (vendor-neutral)
A) Single CDU serving a pod (illustrative N+1 inside CDU)
Facility / Primary (FWS)
Supply -----> [Isolation Valve] ---> [CDU: Plate HX] ---> [Isolation Valve] -----> Return
                                       |         |
                                       |         |
Technology / Secondary (TCS)           |         |
                               [CDU Pumps: N+1]  (sensors)
                                       |
                                       +--> [Pod Supply Manifold] --> [Rack branches w/ balancing + QDs] --> [Pod Return] --> back to CDU
B) 2N concept at the pod level (A/B CDUs, separated fault domains)
FWS Supply --> [CDU-A HX + Pumps] --> TCS-A --> Rack/pod load
FWS Supply --> [CDU-B HX + Pumps] --> TCS-B --> Rack/pod load
(Design choice)
- Either: Each rack has A/B liquid feeds (true 2N distribution)
- Or: The pod can be switched between A and B with valving (less isolation, more switching complexity)
4.3 Isolation hardware you should specify explicitly
Isolation valves at CDU primary connections (service boundary)
Isolation + balancing at pod and rack branches
Strainers/filters with differential pressure measurement
Provisions for flush, fill, and sampling
Leak detection zones and response logic (see Step 5)
Step 5 — Integrate CDU telemetry into DCIM/BMS (and define alarm + interlock behavior) — DCIM/BMS integration for liquid cooling
Treat controls as part of capacity. If you can’t detect and respond to loss-of-flow quickly and predictably, redundancy won’t save you.
If you want a deeper control-integration view, Coolnetpower has a practical explainer on protocols and supervisory control framing in DCIM Integration for Liquid Cooling: How to Add AI Supervisory Controls.
5.1 Minimum points list (what to trend)
At a minimum, trend:
Supply and return temperatures (primary + secondary)
Supply/return pressure (secondary) and differential pressure
Flow rate (pod and/or rack aggregate)
Pump speed/status (duty/standby)
Filter differential pressure
Valve position (where modulating)
Leak detection status by zone
5.2 Alarm set categories (what to alert)
Define alarms in categories that operations teams can route:
Thermal risk: high supply temp, high rack return temp, ΔT collapse
Hydraulic risk: low flow, low pressure, abnormal differential pressure
Mechanical/electrical: pump fault, VFD fault, power loss
Quality/condition: high filter ΔP, conductivity out of band (if instrumented)
Leak: leak detected, leak sensor fault
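The categories above map naturally onto a small rule set that tags each alarm with a routing category. This is a sketch only: the field names and every limit are hypothetical placeholders, and real thresholds come from the site's design basis.

```python
# Hypothetical limits; replace with values from the design basis.
LIMITS = {
    "max_supply_temp_c": 32.0,
    "min_delta_t_c": 2.0,
    "min_flow_gpm": 300.0,
    "max_filter_dp_psi": 10.0,
}


def classify_alarms(sample: dict, limits: dict = LIMITS) -> list:
    """Return a list of (category, message) tuples for one telemetry sample."""
    alarms = []
    if sample["supply_temp_c"] > limits["max_supply_temp_c"]:
        alarms.append(("thermal", "high supply temp"))
    if sample["return_temp_c"] - sample["supply_temp_c"] < limits["min_delta_t_c"]:
        alarms.append(("thermal", "delta-T collapse"))
    if sample["flow_gpm"] < limits["min_flow_gpm"]:
        alarms.append(("hydraulic", "low flow"))
    if sample["filter_dp_psi"] > limits["max_filter_dp_psi"]:
        alarms.append(("quality", "high filter dP"))
    if sample["leak_detected"]:
        alarms.append(("leak", "leak detected"))
    if not sample["leak_sensor_ok"]:
        alarms.append(("leak", "leak sensor fault"))
    return alarms


if __name__ == "__main__":
    sample = {
        "supply_temp_c": 30.0, "return_temp_c": 31.0, "flow_gpm": 280.0,
        "filter_dp_psi": 4.0, "leak_detected": False, "leak_sensor_ok": False,
    }
    for category, message in classify_alarms(sample):
        print(f"[{category}] {message}")
```

Treating a leak sensor fault as its own alarm reflects the earlier warning that a failed sensor can be as dangerous as a failed pump.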
5.3 Interlocks (what the system does automatically)
Examples (site-specific):
On leak detect in a rack zone → close rack branch valve(s) + alarm + keep other branches active
On low-flow with rising supply temp → stage standby pump (N+1) and alarm
On controller loss / comms loss → fail to a safe state (explicitly defined) and alarm
The goal is not “automation for its own sake.” The goal is predictable, testable behavior.
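Writing interlocks as a pure function of inputs is one way to make them testable before commissioning. The sketch below encodes the three example behaviors above; the action strings and the choice to let controller loss override everything else are illustrative assumptions, not a prescribed sequence.

```python
from typing import Optional


def interlock_actions(leak_zone: Optional[str], low_flow: bool,
                      supply_temp_rising: bool, controller_ok: bool) -> list:
    """Return the ordered list of actions for the given plant state."""
    if not controller_ok:
        # Assumption: controller/comms loss takes priority over other logic.
        return ["enter_defined_safe_state", "alarm:controller_or_comms_loss"]
    actions = []
    if leak_zone is not None:
        # Isolate only the affected branch; other branches stay active.
        actions += [f"close_branch_valves:{leak_zone}", f"alarm:leak:{leak_zone}"]
    if low_flow and supply_temp_rising:
        actions += ["stage_standby_pump", "alarm:low_flow"]
    return actions


if __name__ == "__main__":
    print(interlock_actions("rack-07", low_flow=False,
                            supply_temp_rising=False, controller_ok=True))
```

Because the function has no side effects, each row of an interlock test matrix becomes a single assertion.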
Step 6 — Space and power footprint: translate design choices into rack/pod reality
High-level trade-offs to make explicit in procurement:
Lower ΔT → higher flow
More GPM means bigger piping and often higher pump power.
N+1 vs 2N
2N usually increases: footprint, power feeds, controls complexity, and commissioning scope.
Isolation strategy
More valves and zones can reduce blast radius, but they increase points count, spares, and testing.
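To put rough numbers on the ΔT trade-off, the sketch below uses two common water shorthands: pipe velocity in ft/s ≈ 0.4085 × GPM / d² (d in inches) and hydraulic horsepower ≈ GPM × head (ft) / 3960. The 4.026 in inner diameter is a hypothetical pipe choice, and the cube-law comparison assumes the same piping system at both flows.

```python
def velocity_ft_s(flow_gpm: float, inner_diameter_in: float) -> float:
    """Mean pipe velocity for water-like fluids: v = 0.4085 * Q / d^2."""
    if inner_diameter_in <= 0:
        raise ValueError("inner_diameter_in must be positive")
    return 0.4085 * flow_gpm / inner_diameter_in ** 2


def hydraulic_hp(flow_gpm: float, head_ft: float) -> float:
    """Hydraulic (water) horsepower, before pump and motor efficiency."""
    return flow_gpm * head_ft / 3960.0


def relative_power(flow_new_gpm: float, flow_ref_gpm: float) -> float:
    """Hydraulic power ratio on a fixed system curve (head scales with Q^2)."""
    return (flow_new_gpm / flow_ref_gpm) ** 3


if __name__ == "__main__":
    # Pod flow at ΔT = 10°F (about 327 GPM) vs ΔT = 14°F (about 234 GPM).
    for gpm in (327.0, 234.0):
        print(f"{gpm:.0f} GPM -> {velocity_ft_s(gpm, 4.026):.1f} ft/s in a 4.026 in ID pipe")
    print(f"relative hydraulic power at 234 vs 327 GPM: {relative_power(234.0, 327.0):.2f}")
```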
A practical way to keep this grounded is to require a submittal that includes:
footprint and weight
power draw (normal / worst-case)
heat rejection method and connection sizes
maintenance clearances
consumables (filters, fluid)
For broader context on high-density AI cooling options (direct-to-chip, immersion, hybrid), see The Ultimate Guide to AI Data Center Cooling.
Step 7 — Maintenance SOPs: make redundancy real with procedures
A redundancy design that can’t be maintained safely becomes “N until the first service event.” Write and test SOPs for:
7.1 Planned maintenance (MOP)
Isolate the service boundary (which valves, which zones)
Verify flow remains within limits on remaining path(s)
Replace filters / service pumps without introducing air
Restore, vent/bleed, and verify (see verification section)
7.2 Spare parts strategy
At minimum, define spares for:
pump(s) or pump cartridges (if modular)
critical sensors (flow, temperature)
valve actuators
filter elements
leak detection probes
7.3 Water quality / fluid management
Even when you keep IT loops isolated from facility water, you still need:
sampling plan
acceptance criteria (documented)
flushing/cleanliness criteria at commissioning
periodic inspection cadence
Common mistakes (and how to catch them early)
Sizing only by kW and ignoring head
Fix: require a pressure/flow curve check for worst-case branch and include balancing valve authority.
Calling it 2N without duplicating the real single points of failure
Fix: map fault domains (power, controls, sensors, valves) in a one-page FMEA-lite table.
No defined response for leak detection
Fix: define zone containment behavior, and test it.
No plan for partial load stability
Fix: confirm pump turndown, control valve behavior, and temperature control at low load.
Verification checklist: “done when…”
Use this as a commissioning-ready checklist (adapt to your site standards).
For the 60 kW/rack design case, calculated rack GPM matches the chosen ΔT assumptions and is documented.
Pod total GPM includes documented margin and is matched to pump + HX capability at required head.
N+1 behavior is demonstrated (duty pump failure → standby takes load without unacceptable temperature excursion).
If 2N is required, each path demonstrates full-load carry independently, including controls and power.
Leak detection causes the intended isolation action (rack/pod boundary) and logs the event.
DCIM/BMS trends show the minimum points list and alarms route to the right responders.
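Checklist items such as the N+1 failover test are easier to sign off when the pass criterion is written down as a check against trend data. A minimal sketch, assuming the trend is exported as (time in seconds, supply temperature in °C) pairs and that "unacceptable excursion" means any post-trip sample above a site-defined limit; the sample data and the 32 °C limit are hypothetical.

```python
def failover_passes(samples, trip_time_s: float, max_supply_temp_c: float) -> bool:
    """True if no sample at or after the pump trip exceeds the temperature limit."""
    post_trip = [temp for t, temp in samples if t >= trip_time_s]
    if not post_trip:
        raise ValueError("no samples recorded after the pump trip")
    return max(post_trip) <= max_supply_temp_c


if __name__ == "__main__":
    # Hypothetical trend: duty pump tripped at t = 60 s, standby staged shortly after.
    trend = [(0, 25.0), (30, 25.1), (60, 25.2), (75, 27.8), (90, 26.4), (120, 25.3)]
    print("N+1 failover test passed:", failover_passes(trend, 60, 32.0))
```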
Next steps
If you want, you can request a CDU commissioning + alarm-point checklist (points list, alarm categories, and interlock tests) to attach to your procurement package and MOP/SOP documentation.
Optional reading on CDU context: Coolnetpower.







