
CDU sizing for AI loads: a how-to guide for flow, redundancy, and controls

AI racks push liquid loops into the same decision space as UPS and switchgear: you’re not just sizing capacity—you’re defining fault domains, maintenance windows, and what “redundant” actually means when a pump trips at 2 a.m.

This guide gives a vendor-neutral method to:

  • calculate rack flow (GPM) from GPU/rack heat load and allowable ΔT,

  • roll that into CDU capacity at row/pod scale,

  • choose N+1 vs 2N based on what you need to survive (component failure vs whole-path failure),

  • implement primary/secondary loop isolation and integrate the CDU into DCIM/BMS with testable alarms and interlocks.

Light brand note: Coolnetpower works in this space; where a vendor page helps with context, it’s linked as optional reading. The sizing logic and schematics below are not tied to any one product.


Prerequisites: define the inputs you will not “discover later”

Before you touch a CDU datasheet, lock these inputs. They determine flow, pressure, redundancy count, and whether controls can actually protect the IT load.

  1. Rack heat load (kW)

  • Use measured rack power when available. Otherwise, build a rack estimate from nameplate GPU power, CPU, memory, NICs, and PSU losses.

  2. Liquid-cooled fraction (%)

  • Many AI racks are hybrid: direct-to-chip for GPUs/CPUs plus air for everything else.

  3. Allowed coolant temperature rise across the rack (ΔT)

  • This is a design choice with operational consequences. Smaller ΔT means higher flow (bigger pipes, higher pump power).

  4. Cooling topology (at high level)

  • Cold plate (direct-to-chip) and immersion both ultimately map to the same physics for loop sizing: you’re moving heat with a fluid. The differences show up in allowable ΔT, acceptable fluid quality, and failure response.

  5. Fault tolerance target

  • Are you trying to tolerate a single component failure (N+1), or an entire path failure (2N)? If you don’t define the fault domain, redundancy language becomes marketing.

Key Takeaway: Redundancy starts with a sentence: “The system must maintain coolant delivery if ___ fails.” Fill in the blank (pump, CDU, branch valve, power feed, controls network, etc.).


Step 1 — Convert rack heat load into required rack flow (GPM)

At steady state, liquid cooling capacity is governed by the energy balance:

  • \(Q = \dot{m} \, C_p \, \Delta T\)

In common hydronic shorthand for water, this is often expressed as:

  • BTU/h = 500 × GPM × ΔT°F (water; the 500 comes from about 8.33 lb/gal × 60 min/h × 1 BTU/lb·°F, so the factor changes with glycol concentration and temperature)

A convenient conversion:

  • kW → BTU/h: 1 kW ≈ 3412 BTU/h

So the rack flow requirement (water) is approximately:

  • GPM ≈ (kW × 3412) / (500 × ΔT°F)

  • Simplified: GPM ≈ 6.82 × kW / ΔT°F

This aligns with the same relationship described in Advantage Engineering’s heat transfer formulas for circulating water systems: BTU formulas for water circulating heat transfer.
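Because the 500 factor is specific to water, it can help to evaluate the underlying energy balance directly in SI units so that other fluid properties can be swapped in. The sketch below is one hedged way to do that. The function name `flow_from_energy_balance` and the default properties (density about 997 kg/m³, specific heat about 4.18 kJ/kg·K, roughly water near room temperature) are assumptions for illustration; for glycol mixes, use the properties published by your fluid supplier.

```python
# Volumetric flow from Q = m_dot * Cp * dT, evaluated in SI units.
# Default properties approximate water near room temperature (assumption).
LITERS_PER_US_GALLON = 3.785411784


def flow_from_energy_balance(heat_kw, delta_t_f,
                             density_kg_m3=997.0, cp_kj_kg_k=4.18):
    """Return the required flow in US GPM for a heat load and coolant rise."""
    if heat_kw < 0 or delta_t_f <= 0:
        raise ValueError("heat_kw must be >= 0 and delta_t_f must be > 0")
    # A temperature *difference* converts from °F to K with 5/9 only.
    delta_t_k = delta_t_f * 5.0 / 9.0
    # kW / (kJ/(kg*K) * K) = kg/s
    mass_flow_kg_s = heat_kw / (cp_kj_kg_k * delta_t_k)
    liters_per_min = mass_flow_kg_s / density_kg_m3 * 1000.0 * 60.0
    return liters_per_min / LITERS_PER_US_GALLON


if __name__ == "__main__":
    print(f"{flow_from_energy_balance(60, 10):.1f} GPM")
```

For 60 kW and a 10°F rise, this lands within about 1% of the shorthand formula, which is a useful sanity check before substituting glycol properties.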

Worked example (60 kW rack)

Assume the liquid loop removes the full rack heat load (worst-case for loop sizing) and pick a ΔT.

Assumption       | Value
Rack heat load   | 60 kW
Heat load        | 60 × 3412 = 204,720 BTU/h
Option A: ΔT     | 10°F
Option B: ΔT     | 14°F

Flow:

  • At 10°F: GPM ≈ 204,720 / (500×10) ≈ 40.9 GPM

  • At 14°F: GPM ≈ 204,720 / (500×14) ≈ 29.2 GPM
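The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the water shorthand used in this guide; the function name `rack_gpm` is just an illustrative choice.

```python
BTU_PER_HOUR_PER_KW = 3412  # conversion used in this guide
WATER_FACTOR = 500          # hydronic shorthand factor for water


def rack_gpm(heat_kw, delta_t_f):
    """Water flow in US GPM: (kW x 3412) / (500 x delta-T in °F)."""
    if delta_t_f <= 0:
        raise ValueError("delta_t_f must be positive")
    return heat_kw * BTU_PER_HOUR_PER_KW / (WATER_FACTOR * delta_t_f)


if __name__ == "__main__":
    for delta_t in (10, 14):
        print(f"dT = {delta_t}°F -> {rack_gpm(60, delta_t):.1f} GPM")
```

Sweeping ΔT this way makes the trade-off visible early: moving from 10°F to 14°F cuts the required flow by roughly 29%.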

What to do with “GPU specs” in practice

If you’re starting from GPU/CPU TDP rather than a measured rack number:

  1. Sum GPU TDPs → kW

  2. Add CPU + memory + networking + storage → kW

  3. Decide what % is liquid cooled (e.g., 70–90% for direct-to-chip-heavy designs)

  4. Apply the same flow formula to the liquid-cooled portion
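The four steps above can be sketched as a small calculation. Every number in the example call (GPU count, per-GPU wattage, other IT load, liquid fraction) is a hypothetical placeholder rather than vendor data, and the helper names are illustrative only.

```python
def liquid_cooled_kw(gpu_count, gpu_tdp_w, other_it_w, liquid_fraction):
    """Return (total rack kW, kW carried by the liquid loop)."""
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    total_kw = (gpu_count * gpu_tdp_w + other_it_w) / 1000.0
    return total_kw, total_kw * liquid_fraction


def gpm_for(heat_kw, delta_t_f):
    """Water shorthand from Step 1: (kW x 3412) / (500 x dT°F)."""
    return heat_kw * 3412 / (500 * delta_t_f)


# Hypothetical rack: 32 GPUs at 1000 W each plus 16 kW of CPU, memory,
# networking and storage, with 80% of the heat captured by liquid.
total_kw, liquid_kw = liquid_cooled_kw(
    gpu_count=32, gpu_tdp_w=1000, other_it_w=16000, liquid_fraction=0.8
)
loop_gpm = gpm_for(liquid_kw, delta_t_f=10)

if __name__ == "__main__":
    print(f"rack {total_kw:.1f} kW, liquid {liquid_kw:.1f} kW, "
          f"loop {loop_gpm:.1f} GPM")
```

The remaining air-cooled share (here 20% of the rack) still has to be handled by room or rear-door cooling, so it should be carried into the air-side design rather than dropped.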

Pro Tip: For hybrid racks, size the IT-side loop for the liquid-cooled fraction, but size your controls and alarms as if the entire rack depends on it. Partial liquid cooling still creates “no-flow = rapid temperature rise” failure modes.


Step 2 — Roll rack flow into CDU flow/capacity at row or pod scale (CDU sizing for AI loads)

Once you have rack flow, size the CDU around:

  • Total flow (GPM) at the design pod/rack count

  • Total heat load (kW) the CDU must reject

  • Head pressure the CDU pumps must overcome (often the real limiter)

  • Turndown (can the system stay stable at partial load?)

2.1 Determine your “cooling unit of redundancy”

Pick the boundary you want to be able to isolate:

  • Rack-level (most isolation, more units)

  • Row-level

  • Pod/cluster-level

A common engineering pattern is to define a pod (e.g., 6–16 racks) as the loop segment that shares a CDU and can be isolated for maintenance.

2.2 Pod flow example

Say you have 8 racks × 60 kW/rack, and you design around ΔT = 10°F for the secondary loop.

  • Rack flow ≈ 40.9 GPM

  • Pod flow ≈ 8 × 40.9 ≈ 327 GPM

Now add allowance for:

  • balancing and control valve authority

  • filter fouling

  • uncertainty in actual ΔT during transients

Many teams carry a conservative design margin (documented) and then verify in commissioning.
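As a sketch of how the pod number and its margin might be documented together, the function below returns both the base flow and the design flow. The 10% default margin is purely an assumed placeholder; the right value is a project decision that should be written down and then checked in commissioning.

```python
def pod_design_gpm(racks, rack_kw, delta_t_f, margin=0.10):
    """Return (base GPM, design GPM including margin) for a pod on water.

    margin is a fractional allowance (assumed value) for balancing,
    filter fouling, and delta-T uncertainty during transients.
    """
    if racks <= 0 or delta_t_f <= 0 or margin < 0:
        raise ValueError("racks and delta_t_f must be > 0, margin >= 0")
    base = racks * rack_kw * 3412 / (500 * delta_t_f)
    return base, base * (1.0 + margin)


if __name__ == "__main__":
    base, design = pod_design_gpm(racks=8, rack_kw=60, delta_t_f=10)
    print(f"base {base:.1f} GPM, design {design:.1f} GPM")
```

Using unrounded rack flow gives a base of about 327.6 GPM, consistent with the ≈327 GPM above; the margin is applied once at pod level rather than stacked per rack.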

2.3 Don’t stop at GPM: confirm head and pump staging

Flow alone is incomplete. You must confirm the pump can deliver that flow at the required differential pressure across:

  • cold plates / immersion heat exchangers

  • manifolds and quick disconnects

  • hoses, fittings, and distribution piping

  • CDU internal piping, filters, heat exchanger

If you treat pressure drop as an afterthought, you’ll end up with a CDU that “meets GPM on paper” but can’t maintain flow at the rack.
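One way to make that check concrete is to sum the pressure drops along the worst-case branch and compare the total with what the pump curve offers at design flow. The sketch below does this with linear interpolation between curve points. All pressure drops and pump-curve points are hypothetical placeholders; real values come from component datasheets and the CDU vendor's pump curve.

```python
# Hypothetical pressure drops (psi) along the worst-case branch at design flow.
BRANCH_DROPS_PSI = {
    "cold plates / immersion heat exchangers": 6.0,
    "manifolds and quick disconnects": 4.0,
    "hoses, fittings, distribution piping": 5.0,
    "CDU internal piping, filters, heat exchanger": 8.0,
}

# Hypothetical pump curve: (flow in GPM, available differential pressure in psi).
PUMP_CURVE = [(0, 40.0), (200, 36.0), (350, 28.0), (450, 18.0)]


def pump_pressure_at(flow_gpm, curve):
    """Linearly interpolate available pressure at a given flow."""
    points = sorted(curve)
    if not points[0][0] <= flow_gpm <= points[-1][0]:
        raise ValueError("flow is outside the pump curve")
    for (q0, p0), (q1, p1) in zip(points, points[1:]):
        if q0 <= flow_gpm <= q1:
            return p0 + (p1 - p0) * (flow_gpm - q0) / (q1 - q0)


def head_margin_psi(flow_gpm, drops, curve):
    """Available pump pressure minus required pressure; negative means short."""
    return pump_pressure_at(flow_gpm, curve) - sum(drops.values())


if __name__ == "__main__":
    margin = head_margin_psi(330, BRANCH_DROPS_PSI, PUMP_CURVE)
    print(f"head margin at 330 GPM: {margin:.1f} psi")
```

Note that the branch drops themselves rise with flow, so the table of drops should be re-evaluated for each flow point you check rather than reused unchanged.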


Step 3 — Choose redundancy: N+1 vs 2N redundancy for cooling (based on fault domain)

Two industry-standard definitions are helpful:

  • N+1: one extra unit beyond what’s required to carry the load, so the system can tolerate a single unit failure or allow one unit to be serviced while still meeting load.

  • 2N: two fully independent systems, each capable of carrying the full load. CoreSite summarizes 2N as mirrored, fully duplicated infrastructure in What is data center redundancy? N, N+1, 2N.
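In unit-count terms, the two definitions can be sketched as follows. The 250 kW unit capacity in the example is a hypothetical figure, and the function only counts units; it says nothing about whether power, controls, or piping are actually independent.

```python
import math


def units_required(load_kw, unit_capacity_kw, scheme):
    """Number of CDU (or pump) units needed for a redundancy scheme."""
    if load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_kw / unit_capacity_kw)  # units needed to carry the load
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1      # one spare beyond duty
    if scheme == "2N":
        return 2 * n      # two full, independent sets
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("N", "N+1", "2N"):
        print(scheme, units_required(480, 250, scheme))
```

For the 480 kW pod with hypothetical 250 kW units, N is 2, so N+1 needs 3 units and 2N needs 4.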

3.1 Map redundancy to what you’re protecting

Use this table to avoid “2N in name only.”

You need to tolerate… | N+1 usually means… | 2N usually means…
A single pump failure inside a CDU | extra pump (online spare) in the CDU pump set | two independent pump trains, each sized for full load (often with independent power)
A CDU module failure | one extra CDU module beyond duty | two complete CDUs (A and B), each capable of full load
A plate heat exchanger fault | spare HX module or bypass strategy + capacity margin | duplicated heat rejection paths / duplicated CDUs
Maintenance without downtime | isolate and service one component while remaining components carry load | isolate an entire path (A or B) while the other path carries full load

3.2 Redundancy is not just equipment count

For liquid systems, the “path” includes:

  • power source (A/B feeds, UPS-backed controls vs not)

  • controls and comms (what happens if the controller or network drops?)

  • valves and actuators (fail-open vs fail-closed)

  • sensors (single-point failure on flow or temperature)

A common failure mode in otherwise “redundant” designs is: pumps are N+1, but a single control panel, single power feed, or single sensor makes the protection non-redundant.

⚠️ Warning: If your shutdown logic relies on one flow switch or one leak sensor, test its failure mode. A failed sensor can be as dangerous as a failed pump.
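A quick way to surface these hidden single points of failure is to list what each path depends on and look for components that appear in every path. The sketch below does exactly that; the path names, roles, and component tags (`P-A`, `PLC-1`, and so on) are hypothetical examples, not a recommended naming scheme.

```python
# Hypothetical dependency map: path -> {role: component tag}.
PATHS = {
    "A": {"pump": "P-A", "power": "feed-1",
          "controller": "PLC-1", "flow sensor": "FT-A"},
    "B": {"pump": "P-B", "power": "feed-2",
          "controller": "PLC-1", "flow sensor": "FT-B"},
}


def shared_dependencies(paths):
    """Return {role: component} for components that every path relies on."""
    common_roles = set.intersection(*(set(deps) for deps in paths.values()))
    shared = {}
    for role in sorted(common_roles):
        components = {deps[role] for deps in paths.values()}
        if len(components) == 1:  # same physical component in all paths
            shared[role] = components.pop()
    return shared


if __name__ == "__main__":
    print(shared_dependencies(PATHS))
```

In this made-up example both paths hang off the same controller, so the design would be redundant in pumps and power but not in controls.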


Step 4 — Design for dual-loop isolation (and decide where you want the boundary)

Most modern liquid-cooled data center architectures separate:

  • the Facility Water System (FWS) loop (primary), and

  • the Technology Cooling System (TCS) loop (secondary)

These loops are typically coupled through a liquid-to-liquid heat exchanger (often inside the CDU), so heat crosses between them but fluid does not.

4.1 Why dual-loop isolation matters

Dual-loop designs let you:

  • keep the IT loop cleaner/tighter-controlled than facility water

  • decouple rack stability from plant-side pressure/temperature fluctuations

  • contain leaks or chemistry issues within a defined fault domain

4.2 Example schematics (vendor-neutral)

A) Single CDU serving a pod (illustrative N+1 inside CDU)

Facility / Primary (FWS)
  Supply -----> [Isolation Valve] ---> [CDU: Plate HX] ---> [Isolation Valve] -----> Return
                               |                         |
                               |                         |
Technology / Secondary (TCS)   |                         |
                         [CDU Pumps: N+1]            (sensors)
                               |
                               +--> [Pod Supply Manifold] --> [Rack branches w/ balancing + QDs] --> [Pod Return] --> back to CDU

B) 2N concept at the pod level (A/B CDUs, separated fault domains)

FWS Supply --> [CDU-A HX + Pumps] --> TCS-A --> Rack/pod load
FWS Supply --> [CDU-B HX + Pumps] --> TCS-B --> Rack/pod load

(Design choice)
- Either: Each rack has A/B liquid feeds (true 2N distribution)
- Or: The pod can be switched between A and B with valving (less isolation, more switching complexity)

4.3 Isolation hardware you should specify explicitly

  • Isolation valves at CDU primary connections (service boundary)

  • Isolation + balancing at pod and rack branches

  • Strainers/filters with differential pressure measurement

  • Provisions for flush, fill, and sampling

  • Leak detection zones and response logic (see Step 5)


Step 5 — Integrate CDU telemetry into DCIM/BMS (and define alarm + interlock behavior)

Treat controls as part of capacity. If you can’t detect and respond to loss-of-flow quickly and predictably, redundancy won’t save you.

If you want a deeper control-integration view, Coolnetpower has a practical explainer on protocols and supervisory control framing in DCIM Integration for Liquid Cooling: How to Add AI Supervisory Controls.

5.1 Minimum points list (what to trend)

At a minimum, trend:

  • Supply and return temperatures (primary + secondary)

  • Supply/return pressure (secondary) and differential pressure

  • Flow rate (pod and/or rack aggregate)

  • Pump speed/status (duty/standby)

  • Filter differential pressure

  • Valve position (where modulating)

  • Leak detection status by zone

5.2 Alarm set categories (what to alert)

Define alarms in categories that operations teams can route:

  • Thermal risk: high supply temp, high rack return temp, ΔT collapse

  • Hydraulic risk: low flow, low pressure, abnormal differential pressure

  • Mechanical/electrical: pump fault, VFD fault, power loss

  • Quality/condition: high filter ΔP, conductivity out of band (if instrumented)

  • Leak: leak detected, leak sensor fault
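To show how these categories might map onto telemetry, here is a minimal sketch that evaluates one sample against a set of limits and tags each alarm with its category. The field names and every threshold in `LIMITS` are hypothetical; real setpoints come from the cold-plate, CDU, and fluid vendors and from your site standards.

```python
# Hypothetical limits, for illustration only.
LIMITS = {
    "supply_temp_c_max": 40.0,
    "delta_t_c_min": 2.0,
    "flow_gpm_min": 280.0,
    "filter_dp_psi_max": 10.0,
}


def evaluate_alarms(sample, limits=LIMITS):
    """Return a list of (category, message) tuples for one telemetry sample."""
    alarms = []
    if sample["supply_temp_c"] > limits["supply_temp_c_max"]:
        alarms.append(("thermal", "high supply temperature"))
    if sample["return_temp_c"] - sample["supply_temp_c"] < limits["delta_t_c_min"]:
        alarms.append(("thermal", "delta-T collapse"))
    if sample["flow_gpm"] < limits["flow_gpm_min"]:
        alarms.append(("hydraulic", "low flow"))
    if sample["pump_fault"]:
        alarms.append(("mechanical/electrical", "pump fault"))
    if sample["filter_dp_psi"] > limits["filter_dp_psi_max"]:
        alarms.append(("quality/condition", "high filter differential pressure"))
    if sample["leak_detected"]:
        alarms.append(("leak", "leak detected"))
    if not sample["leak_sensor_ok"]:
        alarms.append(("leak", "leak sensor fault"))
    return alarms
```

Keeping the category as data (rather than burying it in message text) is what lets operations route thermal, hydraulic, and leak alarms to different responders.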

5.3 Interlocks (what the system does automatically)

Examples (site-specific):

  • On leak detect in a rack zone → close rack branch valve(s) + alarm + keep other branches active

  • On low-flow with rising supply temp → stage standby pump (N+1) and alarm

  • On controller loss / comms loss → fail to a safe state (explicitly defined) and alarm

The goal is not “automation for its own sake.” The goal is predictable, testable behavior.
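One way to keep interlock behavior testable is to express it as a pure function from plant state to an ordered list of actions, which can then be exercised in a unit test or a commissioning script. The sketch below encodes the three example interlocks above; the state keys and action strings are hypothetical, and the "safe state" itself still has to be defined per site.

```python
SAFE_STATE = "enter defined safe state"
ALARM = "raise alarm"


def interlock_actions(state):
    """Map a plant state dict to an ordered list of actions.

    Expected keys (hypothetical): controller_ok, comms_ok, leak_zone
    (zone name or None), low_flow, supply_temp_rising.
    """
    # Loss of controller or comms overrides everything else.
    if not (state["controller_ok"] and state["comms_ok"]):
        return [SAFE_STATE, ALARM]

    actions = []
    if state["leak_zone"] is not None:
        # Only the affected zone is isolated; other branches stay active.
        actions.append(f"close branch valve(s) for zone {state['leak_zone']}")
    if state["low_flow"] and state["supply_temp_rising"]:
        actions.append("start standby pump")
    if actions:
        actions.append(ALARM)
    return actions
```

Because the function has no side effects, each interlock in the site's test plan can be paired with one input state and one expected action list.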


Step 6 — Space and power footprint: translate design choices into rack/pod reality

High-level trade-offs to make explicit in procurement:

  1. Lower ΔT → higher flow

  • More GPM means bigger piping and often higher pump power.

  2. N+1 vs 2N

  • 2N usually increases footprint, power feeds, controls complexity, and commissioning scope.

  3. Isolation strategy

  • More valves and zones can reduce blast radius, but they increase points count, spares, and testing.
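To put a rough number on the first trade-off, pump power can be estimated from hydraulic power (flow × differential pressure) divided by pump efficiency. The sketch below compares the 14°F and 10°F pod cases. It assumes the same piping in both cases, a pressure drop that scales with the square of flow, a 23 psi reference drop, and 60% pump efficiency; all of these are illustrative assumptions, not design values.

```python
M3_PER_S_PER_GPM = 6.30902e-5  # US gallons per minute -> cubic meters per second
PA_PER_PSI = 6894.757


def pump_shaft_kw(flow_gpm, dp_psi, efficiency=0.6):
    """Approximate shaft power: hydraulic power divided by pump efficiency."""
    hydraulic_w = (flow_gpm * M3_PER_S_PER_GPM) * (dp_psi * PA_PER_PSI)
    return hydraulic_w / efficiency / 1000.0


def scaled_dp_psi(dp_ref_psi, flow_ref_gpm, flow_gpm):
    """Pressure drop in fixed piping, assuming it scales with flow squared."""
    return dp_ref_psi * (flow_gpm / flow_ref_gpm) ** 2


# 8 racks x 60 kW on water: flow at dT = 14°F and at dT = 10°F.
flow_14f = 8 * 60 * 3412 / (500 * 14)
flow_10f = 8 * 60 * 3412 / (500 * 10)

kw_14f = pump_shaft_kw(flow_14f, 23.0)  # 23 psi is a hypothetical reference
kw_10f = pump_shaft_kw(flow_10f, scaled_dp_psi(23.0, flow_14f, flow_10f))

if __name__ == "__main__":
    print(f"{kw_14f:.1f} kW at 14°F vs {kw_10f:.1f} kW at 10°F")
```

Under these assumptions, 1.4 times the flow costs about 1.4³ ≈ 2.7 times the pump power, which is why lower ΔT designs usually come with larger pipe sizes to pull the pressure drop back down.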

A practical way to keep this grounded is to require a submittal that includes:

  • footprint and weight

  • power draw (normal / worst-case)

  • heat rejection method and connection sizes

  • maintenance clearances

  • consumables (filters, fluid)

For broader context on high-density AI cooling options (direct-to-chip, immersion, hybrid), see The Ultimate Guide to AI Data Center Cooling.


Step 7 — Maintenance SOPs: make redundancy real with procedures

A redundancy design that can’t be maintained safely becomes “N until the first service event.” Write and test SOPs for:

7.1 Planned maintenance (MOP)

  • Isolate the service boundary (which valves, which zones)

  • Verify flow remains within limits on remaining path(s)

  • Replace filters / service pumps without introducing air

  • Restore, vent/bleed, and verify (see the verification checklist below)

7.2 Spare parts strategy

At minimum, define spares for:

  • pump(s) or pump cartridges (if modular)

  • critical sensors (flow, temperature)

  • valve actuators

  • filter elements

  • leak detection probes

7.3 Water quality / fluid management

Even when you keep IT loops isolated from facility water, you still need:

  • sampling plan

  • acceptance criteria (documented)

  • flushing/cleanliness criteria at commissioning

  • periodic inspection cadence


Common mistakes (and how to catch them early)

  1. Sizing only by kW and ignoring head

  • Fix: require a pressure/flow curve check for worst-case branch and include balancing valve authority.

  2. Calling it 2N without duplicating the real single points of failure

  • Fix: map fault domains (power, controls, sensors, valves) in a one-page FMEA-lite table.

  3. No defined response for leak detection

  • Fix: define zone containment behavior, and test it.

  4. No plan for partial load stability

  • Fix: confirm pump turndown, control valve behavior, and temperature control at low load.
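For the partial-load item, a simple first check is what ΔT the loop would see if the pumps cannot slow below a minimum flow. The sketch below rearranges the Step 1 water formula to solve for ΔT. The 30% minimum-flow fraction and 20% load point are hypothetical, and the calculation ignores control valves and bypass arrangements that may change the picture on a real system.

```python
def delta_t_f(heat_kw, flow_gpm):
    """Coolant rise in °F for water, from GPM = kW x 3412 / (500 x dT)."""
    if flow_gpm <= 0:
        raise ValueError("flow_gpm must be positive")
    return heat_kw * 3412 / (500 * flow_gpm)


def partial_load_delta_t(design_kw, design_dt_f, load_fraction, min_flow_fraction):
    """Delta-T at partial load when flow cannot drop below a minimum fraction."""
    design_flow = design_kw * 3412 / (500 * design_dt_f)
    # Flow tracks load to hold design dT, but is floored by pump turndown.
    flow = design_flow * max(load_fraction, min_flow_fraction)
    return delta_t_f(design_kw * load_fraction, flow)


if __name__ == "__main__":
    dt = partial_load_delta_t(480, 10, load_fraction=0.2, min_flow_fraction=0.3)
    print(f"dT at 20% load: {dt:.1f}°F")
```

In this made-up case the loop runs at about 6.7°F instead of the 10°F design value, which matters if alarms or control loops are keyed to ΔT.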


Verification checklist: “done when…”

Use this as a commissioning-ready checklist (adapt to your site standards).

  • For the 60 kW/rack design case, calculated rack GPM matches the chosen ΔT assumptions and is documented.

  • Pod total GPM includes documented margin and is matched to pump + HX capability at required head.

  • N+1 behavior is demonstrated (duty pump failure → standby takes load without unacceptable temperature excursion).

  • If 2N is required, each path demonstrates full-load carry independently, including controls and power.

  • Leak detection causes the intended isolation action (rack/pod boundary) and logs the event.

  • DCIM/BMS trends show the minimum points list and alarms route to the right responders.
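The last item lends itself to an automated check: compare the points actually trended in DCIM/BMS against the minimum points list from Step 5. In the sketch below, the point names are hypothetical identifiers chosen for illustration; map them to whatever naming your BMS uses.

```python
# Minimum points list from Step 5, using hypothetical point names.
MINIMUM_POINTS = {
    "primary_supply_temp", "primary_return_temp",
    "secondary_supply_temp", "secondary_return_temp",
    "secondary_supply_pressure", "secondary_return_pressure",
    "secondary_differential_pressure",
    "flow_rate", "pump_speed", "pump_status",
    "filter_differential_pressure", "valve_position", "leak_status",
}


def missing_points(trended_points):
    """Return the sorted minimum points that are not being trended."""
    return sorted(MINIMUM_POINTS - set(trended_points))


if __name__ == "__main__":
    trended = MINIMUM_POINTS - {"flow_rate", "leak_status"}
    print("missing:", missing_points(trended))
```

An empty result is a necessary condition for sign-off, not a sufficient one: alarm routing and interlock tests still need to be demonstrated on the live system.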


Next steps

If you want, you can request a CDU commissioning + alarm-point checklist (points list, alarm categories, and interlock tests) to attach to your procurement package and MOP/SOP documentation.

Optional reading on CDU context: Coolnetpower.
