
AI data center cooling for 20–40 kW/rack: the practical retrofit guide

Retrofitting an existing air-cooled data hall to host AI racks in the 20–40 kW/rack (average) band is rarely a single “swap the cooling” project. It’s typically a zoning and integration problem: you’re carving out an AI pod with different heat flux, different failure modes, and a different operating cadence than the rest of the room.

This guide is a neutral, engineering-first landscape of the main options—rear door heat exchanger (RDHx), direct-to-chip (DTC) liquid cooling, immersion, and a hybrid air–liquid cooling retrofit—and how they integrate alongside legacy CRAC/CRAH. It also covers two topics that often decide whether a retrofit succeeds:

  • Density envelopes (what is typically practical at 20–40 kW/rack, and what must be true for it to work)

  • GPU thermal throttling monitoring (what your GPUs and your cooling loop will tell you before performance is impacted)

Key Takeaway: At 20–40 kW/rack average, the right answer is often not a single cooling method. It’s a phased hybrid design that removes most of the AI heat close to the rack (RDHx and/or liquid) while keeping room air cooling stable for the rest of the facility.


AI data center cooling for 20–40 kW/rack starts with a scope lock

A “20–40 kW/rack” target is only useful if everyone means the same thing. For retrofit work, align these definitions early:

  • Average vs peak: Training jobs and job transitions can create load steps. If your “average” is computed over long windows, the short-window peak may still exceed local cooling capacity.

  • Rack composition: A rack can be 70–90% accelerator heat, or it can include more networking/storage where heat is distributed differently.

  • Residual air load: Even with liquid cooling, many configurations still reject a portion of heat to air (fans remain; not every component is liquid-cooled).

  • Boundary conditions: Are you limited by plant capacity, white-space constraints, water routing, electrical capacity, or operational tolerance for downtime during cutover?

If you don’t lock this scope, teams end up arguing about “cooling capacity” when the real conflict is the time window, rack mix, or allowable risk.

A retrofit-friendly way to document “average”

If you need to make the scope procurement-friendly, write down:

  • averaging window (e.g., 15 minutes, 1 hour, 24 hours)

  • whether average excludes or includes power-capped periods

  • expected job mix (training vs fine-tuning vs inference)

  • ramp behavior (how quickly load can step up)

That definition becomes your commissioning acceptance criteria later.
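To see why the averaging window matters, here is a minimal Python sketch that reports the worst-case windowed average for the same power trace under different windows. The trace values, the one-minute sample period, and the function names are hypothetical and chosen only for illustration.

```python
from statistics import mean

def window_averages(samples_kw, sample_period_s, window_s):
    """Average of each consecutive, non-overlapping window of the trace."""
    n = max(1, int(window_s // sample_period_s))  # samples per window
    return [mean(samples_kw[i:i + n])
            for i in range(0, len(samples_kw) - n + 1, n)]

def summarize(samples_kw, sample_period_s, windows_s):
    """Map each window length (seconds) to its worst-case windowed average.

    Assumes the trace is at least as long as the longest window.
    """
    return {w: max(window_averages(samples_kw, sample_period_s, w))
            for w in windows_s}

# Hypothetical one-hour trace at 1-minute resolution:
# 45 minutes at 24 kW, then a 15-minute training burst at 38 kW.
trace = [24.0] * 45 + [38.0] * 15
result = summarize(trace, sample_period_s=60, windows_s=[900, 3600])
print(result)  # {900: 38.0, 3600: 27.5}
```

The same rack reads as 27.5 kW on a one-hour average and 38 kW on a 15-minute average, which is why the window belongs in the scope document.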

Cooling methods at a glance (directional envelope, not a guarantee)

The ranges below are directional and strongly dependent on airflow discipline, containment, rack configuration, and the plant. They’re useful for selecting an architecture and a retrofit sequence, but they’re not a guarantee of “safe density.”

The Uptime Institute provides a useful operator-oriented taxonomy and capacity framing in “AI and cooling: methods and capacities” (2025): perimeter air → close-coupled air (in-row, RDHx) → direct liquid cooling (sidecar, cold plates) → immersion.

| Cooling method | What it removes (where) | Typical retrofit role at 20–40 kW/rack average | Why it fails when it fails |
| --- | --- | --- | --- |
| Perimeter air (CRAC/CRAH) | Room air, far from racks | Baseline for the hall; can support the lower end with strong airflow management | Local hot spots, recirculation, top-of-rack inlet instability, poor containment |
| Close-coupled air (in-row, RDHx) | Near the rack/row | Common “bridge” for AI carve-outs; reduces burden on room airflow | Water routing/space constraints; poor integration; maintenance clearance; insufficient water-side capacity |
| Direct-to-chip (DTC) | At the chip (CPU/GPU) | Adds predictability for accelerator heat; enables phased “AI pod” builds | Water quality, leak management, flow/pressure control, commissioning complexity |
| Immersion | Entire server submerged | Strategic choice when you want an all-liquid operations model and higher density headroom | Operational change (hardware handling, service workflows), standardization constraints |

A common success pattern for an AI pod inside an existing facility is:

  1. Stabilize room air behavior (containment, blanking, airflow discipline).

  2. Use RDHx and/or in-row to localize heat in the AI zone.

  3. Add DTC where GPU heat dominates and stability matters.

  4. Retain CRAC/CRAH as the environmental backbone for non-AI loads and residual heat.

What CRAC/CRAH can and can’t do in a mixed legacy + AI hall

CRAC/CRAH systems are designed to condition the room. They work well when heat flux and airflow are predictable.

In high-density AI racks, the failure mode is often localized:

  • hot exhaust recirculation paths behind or above racks

  • unstable inlet temperatures at the top of rack

  • short load steps that exceed local airflow and heat capacity

Before you add liquids, it’s worth tightening the baseline.

⚠️ Warning: Many “plant capacity” problems in retrofits are actually air management problems: missing blanking panels, cable cutouts, poor containment, and unexpected recirculation paths. Fixing airflow discipline can buy headroom—and makes any hybrid design more stable.

Air-side instrumentation that pays off in retrofits

If you’re doing AI data center cooling for 20–40 kW/rack, you want to be able to answer “is this an airflow issue or a plant issue?” without debate. A practical minimum set:

  • rack inlet temperature at multiple heights (bottom/middle/top)

  • hot-aisle temperature near the top of rack

  • differential pressure across containment zones (if used)

  • CRAC/CRAH supply temperature and fan speed trends

Those signals become your baseline before you introduce liquids.

Rear door heat exchanger (RDHx): the pragmatic bridge for 20–40 kW/rack

An RDHx mounts on the back of a rack and captures heat from the rack exhaust air before it enters the room. For retrofits, the appeal is that you can apply it to a subset of racks without rebuilding the entire hall.

When RDHx is a good fit

RDHx is typically strongest when:

  • you want to keep the majority of the room air-cooled

  • you want to carve out an AI zone with moderate disruption

  • you want a design that still “looks like a rack” operationally (server handling stays familiar)

It also pairs well with a phased approach: you can deploy RDHx on the first AI row, validate stability, then extend.

The integration question: where does the heat go?

With RDHx, you’re choosing to move heat from an air problem to a water-side problem. That means you must decide how the AI pod will reject heat:

  • chilled water via a heat exchanger (if available)

  • dry/fluid coolers

  • cooling towers (if your facility uses them)

  • a heat recovery loop (where applicable)

The “best” answer depends on site constraints. The most important part is that the water-side design has clear capacity and redundancy targets.
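A quick way to put numbers on the water-side question is the sensible heat relation Q = m·cp·ΔT. The sketch below estimates the coolant flow needed to carry a given rack load at a given supply/return temperature difference. The default fluid properties assume plain water (roughly 4.18 kJ/kg·K and 1 kg/L); glycol mixes differ, and the 30 kW / 10 K example is illustrative only.

```python
def required_flow_lpm(heat_kw, delta_t_k,
                      cp_kj_per_kg_k=4.18, density_kg_per_l=1.0):
    """Volumetric flow (litres/minute) needed to carry heat_kw at delta_t_k.

    Defaults approximate plain water; pass glycol-mix properties if needed.
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    # kW = (kg/L) * (L/s) * (kJ/kg.K) * K, so solve for L/s first.
    litres_per_s = heat_kw / (density_kg_per_l * cp_kj_per_kg_k * delta_t_k)
    return litres_per_s * 60.0

# Illustrative case: one 30 kW rack with a 10 K water-side temperature rise.
flow = required_flow_lpm(30.0, 10.0)
print(f"{flow:.1f} L/min")
```

Under these assumptions a 30 kW rack needs about 43 L/min. Halving ΔT doubles the flow, which is one reason branch flow and pump capacity show up later as hidden limits.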

Practical RDHx design checks (retrofit reality)

An RDHx project fails more often from layout and maintainability constraints than from heat exchanger physics. Before you commit:

  • Aisle clearance: confirm door swing and service access with the rack fully cabled.

  • Piping routing: plan supply/return paths that don’t obstruct egress, fire suppression, or overhead trays.

  • Isolation strategy: ensure each rack/row has an isolation approach that doesn’t require draining the entire pod.

  • Condensate/condensation risk: if you’re using chilled water, ensure your control strategy keeps water-cooled surfaces above the room dew point, with margin.

  • Leak response: define what happens on a leak alarm (isolate which segment, who responds, what runs in degraded mode).

These checks are also useful procurement inputs because they determine real installation cost and schedule.
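The dew-point margin check in the list above can be estimated from room temperature and relative humidity. This sketch uses the Magnus approximation with commonly cited coefficients (17.62 and 243.12 °C); the room conditions and the 18 °C supply temperature are hypothetical, and a real control sequence should rely on measured dew point where available.

```python
import math

MAGNUS_A = 17.62    # dimensionless Magnus coefficient
MAGNUS_B = 243.12   # degrees C

def dew_point_c(air_temp_c, rel_humidity_pct):
    """Approximate dew point (C) from dry-bulb temperature and RH (%)."""
    if not 0.0 < rel_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be in (0, 100]")
    gamma = (math.log(rel_humidity_pct / 100.0)
             + MAGNUS_A * air_temp_c / (MAGNUS_B + air_temp_c))
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)

def dew_point_margin_c(supply_water_c, air_temp_c, rel_humidity_pct):
    """Positive margin means the supply water is warmer than the dew point."""
    return supply_water_c - dew_point_c(air_temp_c, rel_humidity_pct)

# Hypothetical room: 24 C at 50% RH, with 18 C water supplied to the doors.
margin = dew_point_margin_c(18.0, 24.0, 50.0)
print(f"dew point margin: {margin:.1f} K")
```

For this example the dew point is roughly 12.9 °C, so 18 °C supply water leaves about 5 K of margin. How much margin is enough is a site decision, not something this sketch decides.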

What changes operationally with RDHx

  • You introduce piping to the rack and you add constraints around door swing, aisle clearance, and serviceability.

  • You create new failure modes: loss of water flow, fouling/filtration issues, valve failures, and heat rejection bottlenecks.

  • You must integrate alarms and interlocks into the same operational model as your room cooling.

Direct-to-chip liquid cooling: make the AI pod predictable

Direct-to-chip cooling uses cold plates on CPUs/GPUs to remove heat via a liquid coolant loop. The advantage for AI is predictability: you reduce dependence on perfect room airflow distribution.

Coolant distribution unit (CDU) and dual-loop isolation

In many retrofits, the defining design choice is not the cold plate. It’s the boundary between IT cooling and facility water.

ASHRAE TC 9.9 discusses water-cooled server designs, including the role of CDUs and loop separation, in the white paper “Water-Cooled Servers: Common Designs, Components…” (PDF). The operational point is straightforward: a CDU is commonly used to separate the facility water system from the technology cooling system, which simplifies water quality management and helps control condensation risk.

A practical dual-loop model clarifies responsibilities:

  • Facility loop (plant-side): heat rejection, redundancy, overall energy strategy.

  • Technology loop (rack-side): stable delivery of coolant temperature, flow, pressure; monitoring; leak response.

If you blur these loops, troubleshooting gets harder and risks rise.

CDU types (how to think about the choice)

Vendor educational explainers can help with terminology and basic architecture framing. For retrofit carve-outs, it’s usually more useful to compare CDU options by what they change operationally—loop separation, temperature control and dew-point margin, flow/pressure stability, filtration requirements, alarm coverage, and service access—than by brand-specific feature lists.

For retrofit carve-outs, the decision often simplifies to:

  • Do you have facility water access and capacity?

  • If not, are you building a bridge solution with constrained capacity until plumbing/plant upgrades are complete?

Water quality and filtration: treat it as an engineering deliverable

In DTC retrofits, “water quality” isn’t a nice-to-have. It’s part of reliability.

A practical way to scope it:

  • define coolant type (water/glycol mix or other) for the technology loop

  • define filtration and particulate control strategy

  • define corrosion control approach and allowable materials

  • define sampling and maintenance cadence

The key operational principle: the technology loop should be stable and predictable, even if the facility loop has seasonal or operational variability.

Leak management: design for detection and isolation, not for perfection

Liquid in white space changes risk posture. Your goal is to make leak events small, detectable, and isolatable.

Design/ops practices commonly used in successful retrofits:

  • leak detection where leaks actually appear: under manifolds, under CDUs, at connection points

  • segmentation so you can isolate a rack or row without draining the whole pod

  • procedures for planned maintenance (drain, purge, reconnect, test) that don’t require heroics

  • training and spares that match the new failure modes

How much of the rack should be liquid-cooled at 20–40 kW/rack average?

In this band, DTC is often used in one of two patterns:

  • Selective DTC: liquid-cool the accelerators/CPUs in the racks that truly need it; keep the rest air.

  • DTC + RDHx: use cold plates for dominant heat sources and RDHx to capture residual air heat, minimizing what the room must absorb.

The goal isn’t “go liquid everywhere.” It’s to protect performance stability and keep the AI pod from destabilizing the room.
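A rough way to compare the two patterns is to estimate how much heat still reaches the room. The sketch below treats DTC and RDHx as sequential capture stages. The capture fractions are placeholders, not vendor or measured figures; they should come from the rack bill of materials and commissioning data.

```python
def residual_heat_to_room_kw(rack_kw, dtc_capture=0.0, rdhx_capture=0.0):
    """Heat (kW) that reaches room air after DTC and RDHx capture.

    dtc_capture:  fraction of rack heat removed by cold plates (0..1)
    rdhx_capture: fraction of the remaining exhaust-air heat removed by
                  the rear door (0..1)
    Both fractions are assumptions to be replaced with measured values.
    """
    for fraction in (dtc_capture, rdhx_capture):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("capture fractions must be between 0 and 1")
    heat_to_air_kw = rack_kw * (1.0 - dtc_capture)
    return heat_to_air_kw * (1.0 - rdhx_capture)

# Hypothetical 40 kW rack, cold plates capturing 75% of the heat.
selective_dtc = residual_heat_to_room_kw(40.0, dtc_capture=0.75)
# Same rack with a rear door capturing 80% of what is left in the air.
dtc_plus_rdhx = residual_heat_to_room_kw(40.0, dtc_capture=0.75,
                                         rdhx_capture=0.80)
print(f"{selective_dtc:.1f} kW vs {dtc_plus_rdhx:.1f} kW to the room")
```

With these placeholder fractions, selective DTC leaves about 10 kW per rack for the room to absorb, while DTC + RDHx leaves about 2 kW. The point is the order of magnitude, since the residual load is what the legacy CRAC/CRAH units actually see.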

Immersion cooling: choose it for operational fit, not just heat capacity

Immersion is often described as the “highest ceiling” option, but retrofits succeed or fail based on operational fit.

Immersion changes:

  • service workflows (handling submerged hardware)

  • standardization (which server designs and SKUs are compatible)

  • facility interfaces (fluid management, heat rejection, maintenance procedures)

At 20–40 kW/rack average, immersion is usually selected when:

  • you’re standardizing a large AI fleet under one operational model, and/or

  • you have a credible roadmap beyond 40 kW/rack average where immersion’s headroom becomes valuable.

Practical tradeoffs to surface early

If immersion is on the table, make these procurement questions explicit:

  • hardware compatibility and supported server variants

  • service workflow (what is removed, where it goes, how long it takes)

  • fluid handling and contamination control

  • safety considerations and training requirements

Immersion can be an excellent long-term path—but it’s a bigger operational shift than RDHx or selective DTC.

Hybrid air–liquid cooling retrofit: a reference architecture for an AI carve-out

Hybrid is not a compromise; it’s often the intended steady state for mixed facilities.

A practical hybrid reference architecture for an AI pod inside an existing air-cooled hall looks like this:

  • AI racks: RDHx and/or DTC (depending on rack mix and growth trajectory)

  • Adjacent racks (networking/storage): enhanced airflow discipline; in-row as needed

  • Room backbone: legacy CRAC/CRAH retained and tuned for remaining room load

Design with density bands, not a single “target density”

Instead of asking “Can we do 40 kW/rack?”, band the pod:

  • 20–25 kW average: room air can work if containment and airflow discipline are strong.

  • 25–35 kW average: close-coupled approaches (RDHx and/or in-row) become attractive.

  • 35–40 kW average: design a clear liquid path (selective DTC, possibly paired with RDHx).

This banding is retrofit-friendly: it avoids overbuilding the entire hall for the worst rack and it gives you a phased pathway.
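The banding can be written down as a simple lookup so that every rack in the pod plan gets the same treatment. In this sketch the boundary handling (25 and 35 kW falling into the higher band) is an assumption, since the bands above share their edges.

```python
def architecture_band(avg_kw):
    """Map a rack's average load (kW) to the directional band from this guide.

    Shared boundaries are assigned to the higher band (an assumption).
    """
    if avg_kw < 20:
        return "below band"
    if avg_kw < 25:
        return "room air with strong containment"
    if avg_kw < 35:
        return "close-coupled (RDHx and/or in-row)"
    if avg_kw <= 40:
        return "liquid path (selective DTC, possibly with RDHx)"
    return "above band"

# Hypothetical pod plan: rack name -> expected average kW.
pod_plan = {"ai-01": 22.0, "ai-02": 31.5, "ai-03": 38.0}
for rack, kw in pod_plan.items():
    print(rack, kw, "->", architecture_band(kw))
```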

Where hybrid retrofits usually fail

In mixed halls, instability typically comes from boundary mistakes:

  • The AI pod rejects more residual heat to air than expected, and the room cooling becomes unstable.

  • The water-side design has hidden capacity limits (e.g., limited flow, limited heat rejection in certain weather conditions).

  • Monitoring is fragmented: GPUs alert in one tool, CDU/plant alarms in another, and correlations are slow.

If you need a baseline on how “precision cooling” differs from comfort HVAC in IT spaces (a common source of stakeholder misalignment), Coolnetpower’s precision vs comfort cooling is a useful internal reference.

DTC-first vs RDHx-first: how to choose a retrofit sequence

Use the criteria below to select a retrofit sequence (not a forever answer).

RDHx-first is often the better starting point when:

  • You need the lowest-disruption path that still lifts density in the AI zone.

  • You expect to stay closer to the 20–35 kW/rack average band for a period.

  • You want to minimize new service procedures and training requirements early.

  • Workloads tolerate moderate thermal variation, and the goal is to stabilize the room.

  • Your plant interface makes it easier to add rack/row water services than to build a full technology-loop controls package immediately.

DTC-first is often the better starting point when:

  • You can allocate space/time for CDUs, piping, and a more involved commissioning phase.

  • You expect a near-term push toward the upper end of 20–40 kW/rack average (and beyond).

  • You have operational maturity to manage liquid loops, alarms/interlocks, and response playbooks.

  • Training workloads are performance-sensitive and you want tighter control at the accelerator.

  • You want a clean dual-loop boundary and a repeatable “AI pod” template you can scale.

CDUs and dual-loop isolation: what they do in day-to-day operations

A CDU is often described as “the box between the racks and the plant.” Operationally, it’s also the risk control point:

  • water quality and corrosion control

  • condensation margin relative to dew point

  • flow/pressure control to manifolds and cold plates

  • alarms and interlocks for leak detection, pump failure, flow loss, abnormal ΔT

Treat the CDU as a monitored, maintained asset with its own acceptance tests—not as a minor accessory.

GPU thermal throttling monitoring: the cues that confirm headroom

For AI pods, the best “capacity test” is not a spreadsheet. It’s sustained workload behavior plus telemetry correlation.

GPU-side signals (what the server can tell you)

Monitor a small set consistently:

  • temperature (GPU core/board; memory temperature if available)

  • clock behavior (core/SM clock and memory clock)

  • power behavior (board power, enforced power limit)

  • throttling/violation indicators

NVIDIA explains why clocks, thermal status, and throttling events matter for cluster efficiency in its developer guidance on data center monitoring with DCGM (Data Center GPU Manager) for GPU clusters (2025).

Server OEM telemetry can surface similar signals out-of-band. Dell provides examples of GPU telemetry fields (including violation durations and event reasons) in iDRAC10 GPU management for PowerEdge AI servers (2025).
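As one possible way to collect the GPU-side signals in-band, the sketch below parses CSV output from nvidia-smi into records. The query shown in the comment and its field names are assumptions that can vary by driver version (newer drivers may rename the throttle-reason field), and the thermal bit values should be verified against the NVML documentation for your driver. The sample text is made up.

```python
import csv
import io

# Assumed query (check field names with `nvidia-smi --help-query-gpu`):
#   nvidia-smi --format=csv,noheader,nounits --query-gpu=index,temperature.gpu,
#     clocks.sm,clocks.mem,power.draw,enforced.power.limit,
#     clocks_throttle_reasons.active
FIELDS = ["index", "temp_c", "sm_mhz", "mem_mhz",
          "power_w", "power_limit_w", "throttle_mask"]

# Assumed NVML throttle-reason bits for software (0x20) and hardware (0x40)
# thermal slowdown; verify against your driver's documentation.
THERMAL_BITS = 0x20 | 0x40

def parse_gpu_rows(text):
    """Turn nvidia-smi CSV text into a list of per-GPU dictionaries."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        rec = dict(zip(FIELDS, (cell.strip() for cell in raw)))
        mask = int(rec["throttle_mask"], 16)
        rows.append({
            "index": int(rec["index"]),
            "temp_c": float(rec["temp_c"]),
            "sm_mhz": float(rec["sm_mhz"]),
            "mem_mhz": float(rec["mem_mhz"]),
            "power_w": float(rec["power_w"]),
            "power_limit_w": float(rec["power_limit_w"]),
            "thermal_throttle": bool(mask & THERMAL_BITS),
        })
    return rows

# Made-up sample output for two GPUs.
SAMPLE = """0, 71, 1410, 1593, 389.12, 400.00, 0x0000000000000000
1, 86, 1155, 1593, 352.80, 400.00, 0x0000000000000040
"""
gpus = parse_gpu_rows(SAMPLE)
print([g["index"] for g in gpus if g["thermal_throttle"]])
```

Keeping the parser separate from the command invocation makes it easy to replay saved output when correlating with facility data after the fact.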

Facility-side signals (what the cooling loop should tell you)

Correlate GPU signals with:

  • liquid supply/return temperatures

  • ΔT across the rack/pod/CDU

  • flow rate and differential pressure

  • CDU/pump state and alarms

  • rack inlet/outlet air temperatures (for RDHx and mixed-mode racks)

A useful framing is to design observability so you can correlate facility cooling signals with GPU behavior. Modius outlines this cross-layer mindset in “DCIM for AI: designing power, cooling and observability for GPU-heavy data centers” (2026).

Practical correlation pattern (what “running out of margin” looks like)

When cooling margin shrinks, the pattern is often:

  1. facility-side temperature/flow signals drift under sustained load, then

  2. GPU temperatures trend upward, then

  3. clocks flatten or drop and throttling events appear.

Your monitoring should be tuned to catch steps 1–2 early.
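The three-step pattern above can be turned into a simple staged check. This sketch fits a least-squares slope to recent facility-side and GPU temperature samples and reports the furthest stage reached. The drift limits are hypothetical and would need tuning against your own baseline data.

```python
def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def margin_stage(supply_temps_c, gpu_temps_c, throttle_events,
                 supply_drift_limit=0.02, gpu_drift_limit=0.05):
    """Return 0 (stable) or the furthest stage (1-3) of the pattern reached.

    Drift limits are in degrees C per sample and are placeholder values.
    """
    if throttle_events > 0:
        return 3  # clocks already affected
    if slope_per_sample(gpu_temps_c) > gpu_drift_limit:
        return 2  # GPU temperatures trending upward
    if slope_per_sample(supply_temps_c) > supply_drift_limit:
        return 1  # facility-side drift only: the earliest warning
    return 0

# Hypothetical 20-sample window: coolant supply drifting, GPUs still flat.
supply = [30.0 + 0.05 * i for i in range(20)]
gpu_flat = [70.0] * 20
print(margin_stage(supply, gpu_flat, throttle_events=0))  # 1
```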

A daily headroom vs risk checklist

Use these correlation patterns as a daily operational check for the AI pod:

  • GPU temperatures trend up under steady workload + clocks flatten/drop or throttling events appear → shrinking thermal margin.

    • First verify: airflow recirculation, RDHx water availability, DTC flow/pressure.

  • GPU temperatures stable but performance lower than expected + enforced power limit / power violation state → likely power-bound rather than cooling-bound.

    • First verify: power capping policy, rack PDU limits, PSU headroom.

  • CDU return temperature rises with normal flow + hall air temperatures rise → more residual heat reaching the room than expected.

    • First verify: containment integrity, RDHx effectiveness, CRAC/CRAH tuning.

  • Flow falls or differential pressure rises + GPU temperatures become more variable → restriction/valve/pump issue.

    • First verify: filters/strainers, pump curves, branch balancing.
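The checklist above maps naturally onto a small rule table. The sketch below encodes the four patterns as they are written here; the snapshot fields are hypothetical names, and deciding when a trend counts as "rising" or "variable" is left to whatever monitoring feeds it.

```python
from dataclasses import dataclass

@dataclass
class PodSnapshot:
    gpu_temp_trend: str        # "rising", "stable", or "variable"
    clocks_degraded: bool      # clocks flatten/drop or throttling events
    perf_below_expected: bool
    power_limited: bool        # enforced power limit / power violation
    cdu_return_rising: bool
    flow_state: str            # "normal" or "falling"
    hall_air_rising: bool
    dp_rising: bool            # differential pressure rising

def triage(s):
    """Return (finding, first checks) pairs for every matching pattern."""
    findings = []
    if s.gpu_temp_trend == "rising" and s.clocks_degraded:
        findings.append(("shrinking thermal margin",
                         ["airflow recirculation", "RDHx water availability",
                          "DTC flow/pressure"]))
    if (s.gpu_temp_trend == "stable" and s.perf_below_expected
            and s.power_limited):
        findings.append(("likely power-bound",
                         ["power capping policy", "rack PDU limits",
                          "PSU headroom"]))
    if (s.cdu_return_rising and s.flow_state == "normal"
            and s.hall_air_rising):
        findings.append(("excess residual heat to room",
                         ["containment integrity", "RDHx effectiveness",
                          "CRAC/CRAH tuning"]))
    if ((s.flow_state == "falling" or s.dp_rising)
            and s.gpu_temp_trend == "variable"):
        findings.append(("restriction/valve/pump issue",
                         ["filters/strainers", "pump curves",
                          "branch balancing"]))
    return findings

# Hypothetical snapshot: flow is falling and GPU temperatures are erratic.
snap = PodSnapshot("variable", False, False, False, False, "falling",
                   False, False)
for finding, checks in triage(snap):
    print(finding, "->", ", ".join(checks))
```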

Commissioning and acceptance tests that de-risk a hybrid AI pod

Treat commissioning as proof that the system behaves correctly under load steps—not just that it turns on.

Liquid-side checks (RDHx and DTC loops)

  • pressure test and verify leak detection coverage (CDU, manifolds, under-rack)

  • confirm flow at the branches (not just at the pump)

  • validate supply/return temps and ΔT under staged load

  • test alarms and interlocks: leak alarm → valve close / pump changeover / notification

Air-side checks (legacy hall stability)

  • verify containment integrity (doors, grommets, blanking panels)

  • measure rack inlet temps at multiple heights (especially top-of-rack)

  • confirm the AI pod doesn’t destabilize adjacent rows (pressure balance, recirculation)

Load-step test (the AI reality check)

  • start at partial AI load, then step up

  • watch for thermal lag: delayed temperature rise can hide margin issues

  • confirm both the AI pod and the surrounding hall remain within targets
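Thermal lag is the easy one to get wrong: a reading taken too soon after a step can look fine simply because the temperature has not finished rising. A minimal sketch of a settle-then-judge check follows. The window length, tolerance, limit, and sample values are all hypothetical.

```python
def has_settled(temps_c, window=5, tolerance_c=0.5):
    """True once the last `window` samples stay within tolerance_c."""
    if len(temps_c) < window:
        return False
    tail = temps_c[-window:]
    return max(tail) - min(tail) <= tolerance_c

def evaluate_step(temps_c, limit_c, window=5, tolerance_c=0.5):
    """Judge a load step only after the temperature has stopped moving."""
    if not has_settled(temps_c, window, tolerance_c):
        return "hold"   # still rising or oscillating: keep the load applied
    if max(temps_c) > limit_c:
        return "fail"   # settled, but exceeded the target at some point
    return "pass"

# Hypothetical GPU temperatures sampled after a step to full load.
early = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]
later = early + [72.0, 72.2, 72.3, 72.3, 72.4, 72.4]
print(evaluate_step(early, limit_c=85.0))  # hold
print(evaluate_step(later, limit_c=85.0))  # pass
```

The same check applies to hall air and coolant return temperatures, which typically settle more slowly than GPU temperatures.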

Pro Tip: Treat your monitoring model as part of commissioning. If you can’t see (and alarm on) flow/ΔT and GPU thermal events, you’ll be troubleshooting blind later.

A procurement-friendly retrofit checklist (what to ask before you buy anything)

Most retrofit failures are “integration gaps,” not component failures. Use this checklist to align IT, facilities, and procurement early.

Scope

  • Questions: average vs peak definition; rack mix; growth path.

  • Evidence to request: load profile assumptions; rack bill of materials; acceptance criteria.

Air baseline

  • Questions: containment plan; sensor placement; airflow discipline.

  • Evidence to request: measured inlet temps (top/middle/bottom); recirculation survey.

Water-side

  • Questions: heat rejection method; water routing; redundancy; water quality.

  • Evidence to request: one-line diagram; water chemistry plan; isolation strategy.

CDU and controls

  • Questions: loop separation; dew point strategy; alarms/interlocks; integration to BMS/DCIM.

  • Evidence to request: alarm matrix; commissioning test plan; control sequences.

Operations

  • Questions: leak response; maintenance access; spares; training.

  • Evidence to request: SOPs; spare parts list; training plan; isolation/drain procedure.

FAQ

How do direct-to-chip and immersion differ for GPUs?

Direct-to-chip uses cold plates to remove heat from specific components (GPUs/CPUs), while the rest of the system may still rely on air for some residual heat. Immersion changes the entire server thermal environment by submerging hardware in dielectric fluid, which typically implies a different servicing and standardization model.

Can hybrid air–liquid work alongside legacy CRAC/CRAH?

Yes—this is a common retrofit pattern. The key is boundary clarity: use close-coupled and/or liquid cooling to keep AI rack heat from destabilizing room airflow, while retaining CRAC/CRAH for the rest of the hall and residual heat.

What rack-level densities are safe without throttling?

There isn’t a universal “safe” number because throttling depends on GPU model, airflow discipline, and cooling design. A better method is to use directional density bands for architecture selection, then validate under sustained workload by confirming stable GPU temperatures and clocks with no throttling events.

How do CDUs and dual-loop isolation actually function?

A CDU sits between facility water and the technology cooling loop, providing controlled coolant delivery to racks and enabling loop separation for water quality, pressure, and temperature management.

What telemetry confirms cooling headroom for AI training?

Cooling headroom is confirmed when GPU and facility signals agree under sustained load:

  • GPU temperatures are stable

  • clocks remain stable (no flattening or drops under steady workload)

  • no recurring throttling events

  • the cooling loop shows stable supply/return temperatures, healthy ΔT, stable flow and differential pressure

Next steps

If you’re planning a phased AI carve-out inside an existing facility, start by banding racks by average load, then choose an RDHx-first vs DTC-first pathway, and finally commission against telemetry thresholds.


Last updated: 2026-05-27
