Introduction
AI/HPC rack density is no longer a linear scaling problem. Once you push beyond the practical limits of room-based air cooling, every additional kilowatt per rack starts to demand changes in mechanical architecture, operational workflows, and risk controls—not just “more cooling.”
This article compares immersion cooling with cold plate cooling (also called direct-to-chip, or DTC, liquid cooling) for AI/HPC environments.
Audience: CTOs, HPC architects, and thermal engineers evaluating high-density rack roadmaps.
Scope: roughly 20–200+ kW/rack, with attention to what happens at the 40–80 kW transition and at ≥80 kW where architecture choices become harder to reverse.
Metrics lens: PUE/WUE/ERE impacts, with an emphasis on measurement boundaries and heat-temperature “grade” (how reusable the waste heat is).
Standards lens: alignment with ASHRAE TC 9.9 guidance (including H1 high-density considerations) and the Open Compute Project (OCP) ecosystem around direct liquid cooling.
If you’re looking for a shortcut: don’t start with “which is more efficient?” Start with: what density do you need, how fast, and what failure modes can your site safely absorb during rollout?
Decision framework: immersion cooling vs cold plate
Trade-offs overview
Below is a high-level comparison to anchor the rest of the decision.
| Dimension | Cold plate (direct-to-chip) | Immersion cooling |
|---|---|---|
| Heat capture | Captures most CPU/GPU heat; residual air heat remains for other components | Near-total IT heat capture because the whole server is submerged |
| Service model | Familiar rack-based service with quick disconnects and manifolds | Tank-based service; workflows shift to fluid handling and lift/maintenance ergonomics |
| Retrofit friendliness | Strong: incremental rollout by rows/zones | Weaker: usually requires dedicated pods/areas and operational retraining |
| Standards/ecosystem | Strong momentum via OCP ACS (interfaces, safety practices) | Ecosystem is improving; hardware compatibility and SOP maturity vary |
| Scaling speed | Often fastest path for brownfield density uplift | Fastest in greenfield when designed around immersion from day one |
This is the heart of the immersion cooling vs cold plate decision: are you optimizing for incremental rollout in an existing hall—or for maximum heat capture and density in a purpose-built pod?
Two practical observations that often get missed:
Both architectures still need heat rejection (dry coolers, towers, heat reuse, etc.). “Liquid cooling” isn’t a heat sink; it’s a transport method.
Your fastest ramp is usually the approach that best matches your existing ops model—not the one that looks best on a single KPI slide.
Workload and density targets
Use density bands to narrow the choice early:
20–40 kW/rack: Many operators land on hybrid air + direct-to-chip as the lowest-friction transition. This aligns well with AI/HPC clusters that want higher chip-level thermal stability without rewriting the whole facility.
40–80 kW/rack: This is the “architecture commitment” zone. You can still scale with cold plates, but you must design for residual air heat and for liquid distribution (CDU sizing, manifold pressure budgets, leak detection).
≥80 kW/rack (and especially 100 kW+): Immersion becomes more compelling when you need near-total heat capture, fan elimination, and the ability to scale density without a parallel air system carrying meaningful residual load.
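As a rough illustration, the density bands above can be written as a small lookup. This is a minimal sketch, not a sizing tool: the function name `density_band` is hypothetical, and treating each band as half-open (so exactly 40 kW falls into the 40–80 band) is an assumption, since the bands in the text share their endpoints.

```python
def density_band(rack_kw: float) -> str:
    """Map a rack power target in kW to the density bands used in this article.

    Assumption: bands are half-open, so 40 kW counts as "40-80" and 80 kW as ">=80".
    """
    if rack_kw < 0:
        raise ValueError("rack_kw must be non-negative")
    if rack_kw < 20:
        return "below 20 kW: outside the scope of this comparison"
    if rack_kw < 40:
        return "20-40 kW: hybrid air + direct-to-chip is a common landing point"
    if rack_kw < 80:
        return "40-80 kW: architecture commitment zone"
    return ">=80 kW: immersion becomes more compelling"


if __name__ == "__main__":
    for kw in (15, 30, 60, 120):
        print(f"{kw:>4} kW/rack -> {density_band(kw)}")
```

The output is only a starting point for the conversation; site constraints and rollout risk (next section) can override the band.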
Industry overviews often cite roughly 15 kW/rack as the practical limit for traditional air cooling, with liquid cooling extending the usable density range substantially (for a neutral overview, see the Lawrence Berkeley National Lab’s liquid cooling explainer).
Site constraints and risk
Most “immersion cooling vs cold plate” decisions are really decisions about constraints:
Change windows: If you can’t tolerate disruptive construction or retraining, cold plate rollouts typically fit better.
Floor loading and space: Immersion tanks and service clearances can reshape the white space plan.
Operational readiness: Cold plates add many joints (connectors, hoses, manifolds). Immersion reduces that joint count at rack level but raises the stakes of fluid containment and human factors during servicing.
Compliance and audit: If your organization is audit-heavy (energy, water, safety), choose the architecture you can instrument, document, and operate consistently.
Key Takeaway: Scaling speed isn’t just kW/rack—it’s how quickly you can standardize SOPs for leaks, alarms, isolation, and recovery.
Thermal performance
Heat capture and density
The core physics advantage of liquid cooling is straightforward: liquids carry far more heat per unit volume than air (for water, the volumetric heat capacity is on the order of thousands of times higher). The practical difference between cold plates and immersion is where the boundary between “liquid-cooled” and “air-cooled” ends up.
Cold plate (direct-to-chip) focuses liquid flow on the highest heat-flux components (GPUs/CPUs). That can stabilize junction temperatures and reduce throttling risk, but some portion of rack heat remains in the air stream from memory, VRMs, NICs, and power conversion.
Immersion moves the whole server into the liquid domain, which tends to raise total heat capture and reduce dependence on airflow—and can eliminate server fans entirely.
A useful mental model:
Cold plate = targeted heat capture (high-value components first)
Immersion = whole-server heat capture (service model changes with it)
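To make the mental model concrete, here is a minimal sketch that splits a rack's heat load into the part carried by liquid and the part left in the air stream. The function name `split_rack_heat` and the capture fractions used in the example are illustrative assumptions, not measured or vendor values.

```python
from __future__ import annotations


def split_rack_heat(rack_kw: float, liquid_capture_fraction: float) -> tuple[float, float]:
    """Return (liquid_kw, residual_air_kw) for one rack.

    liquid_capture_fraction is the share of IT heat removed by liquid (0.0 to 1.0).
    """
    if rack_kw < 0:
        raise ValueError("rack_kw must be non-negative")
    if not 0.0 <= liquid_capture_fraction <= 1.0:
        raise ValueError("liquid_capture_fraction must be between 0 and 1")
    liquid_kw = rack_kw * liquid_capture_fraction
    residual_air_kw = rack_kw - liquid_kw
    return liquid_kw, residual_air_kw


if __name__ == "__main__":
    # Assumed capture fractions, for illustration only.
    for label, fraction in (("cold plate (assumed 0.75)", 0.75), ("immersion (assumed 1.0)", 1.0)):
        liquid, air = split_rack_heat(60.0, fraction)
        print(f"{label}: {liquid:.1f} kW to liquid, {air:.1f} kW to air")
```

The useful output is the residual air figure: it is the load the hall's air system still has to carry per rack, and it grows with rack power even when the capture fraction stays constant.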
For a market-level comparison that explicitly contrasts heat capture and adoption realities, C&EN’s 2025 overview (“Data centers take the plunge”) is a helpful reference.
Cooling energy and PUE/WUE/ERE
Cooling architecture affects energy and water in three ways:
Transport energy: pumping power (liquid) vs fan power (air) vs chiller lift.
Heat rejection strategy: how many hours you can run economized (dry cooler/tower) instead of compressors.
Measurement boundary: what is counted as “IT” vs “facility” energy.
Two cautions when you compare PUE:
Server fans are metered as IT load, so removing them (as immersion does) changes the IT side of the ratio; part of any PUE difference between architectures therefore reflects an accounting shift rather than a facility-side gain. The Uptime Institute highlights these kinds of boundary issues when discussing newer thermal guidance and high-density classes (see the summary in “New ASHRAE guidelines challenge efficiency drive” (2021)).
For procurement-grade decisions, pair PUE with WUE (water risk and cost) and with an explicit heat-temperature grade (useful for ERE/heat reuse discussions).
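The sketch below shows why the measurement boundary matters, using the standard ratio definitions: PUE is total facility energy divided by IT energy, WUE is site water use divided by IT energy, and ERE subtracts reused energy from the total before dividing. All input numbers and the helper name `summarize` are made up for illustration; they are not benchmarks for either architecture.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT energy."""
    return total_facility_kwh / it_kwh


def wue(site_water_liters: float, it_kwh: float) -> float:
    """Water usage effectiveness in liters per IT kWh."""
    return site_water_liters / it_kwh


def ere(total_facility_kwh: float, reused_kwh: float, it_kwh: float) -> float:
    """Energy reuse effectiveness: (total facility energy - reused energy) / IT energy."""
    return (total_facility_kwh - reused_kwh) / it_kwh


def summarize(site: dict) -> dict:
    """Server fans sit inside the IT meter, so they land in the PUE denominator."""
    it_kwh = site["compute_kwh"] + site["server_fan_kwh"]
    total_kwh = it_kwh + site["facility_overhead_kwh"]
    return {"it_kwh": it_kwh, "total_kwh": total_kwh, "pue": pue(total_kwh, it_kwh)}


# Hypothetical sites doing the same compute work (illustrative values only).
air_cooled = {"compute_kwh": 900.0, "server_fan_kwh": 100.0, "facility_overhead_kwh": 300.0}
immersion = {"compute_kwh": 900.0, "server_fan_kwh": 0.0, "facility_overhead_kwh": 180.0}

if __name__ == "__main__":
    for name, site in (("air-cooled", air_cooled), ("immersion", immersion)):
        s = summarize(site)
        print(f"{name}: IT={s['it_kwh']:.0f} kWh, total={s['total_kwh']:.0f} kWh, PUE={s['pue']:.2f}")
```

With these assumed inputs, PUE moves from 1.30 to 1.20 while total energy for the same compute falls from 1,300 to 1,080 kWh. The ratio alone understates the change because the fan energy disappeared from the denominator, which is why total energy, WUE, and ERE belong next to PUE in any comparison.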
Heat reuse readiness
Heat reuse is not automatic. The deciding factor is typically the temperature and stability of the recovered heat.
Cold plate systems can operate with “warm water” loops depending on design choices, which improves economizer hours and can raise heat reuse viability.
Immersion can often run at higher coolant temperatures (architecture- and fluid-dependent), which can make heat reuse easier if there is a nearby sink and you have the governance to treat heat as a product.
From an operator standpoint, the biggest blocker is usually not thermodynamics—it’s commercial and contractual alignment (who owns the heat, who maintains the interface, what happens in summer).

Deployment and retrofit
Brownfield: DTC rollout
In brownfield sites, cold plate rollouts often win on speed because they fit the existing data hall geometry.
A practical rollout pattern looks like this:
Start with a single AI row or small zone where you can isolate risk.
Deploy a coolant distribution unit (CDU) and rack manifolds sized for your first density target.
Preserve an air system (at least initially) to handle residual heat and provide operational margin.
Instrument aggressively: flow, ΔT, pressure, leak detection, and alarms to BMS.
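For the CDU and manifold sizing step, a first-pass flow estimate follows from the steady-state heat balance Q = m_dot × c_p × ΔT. The sketch below assumes a water-like coolant (c_p of about 4.18 kJ/(kg·K), density of about 1 kg/L); glycol mixtures and other fluids have different properties and need their own values. The function name and the example load are hypothetical.

```python
def required_flow_lpm(
    heat_kw: float,
    delta_t_k: float,
    cp_kj_per_kg_k: float = 4.18,   # assumption: water-like coolant
    density_kg_per_l: float = 1.0,  # assumption: water-like coolant
) -> float:
    """Coolant flow in liters per minute needed to carry heat_kw at a given loop delta-T."""
    if heat_kw < 0:
        raise ValueError("heat_kw must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_kw / (cp_kj_per_kg_k * delta_t_k)  # kW = kJ/s
    return mass_flow_kg_per_s / density_kg_per_l * 60.0


if __name__ == "__main__":
    # Example: 60 kW of liquid-side load at a 10 K supply/return difference.
    print(f"{required_flow_lpm(60.0, 10.0):.1f} L/min")
```

Halving the design ΔT doubles the required flow, which is the trade-off that feeds directly into pump sizing and manifold pressure budgets.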
ASHRAE’s liquid cooling guidance emphasizes dew point awareness and condensation risk management: the ASHRAE TC 9.9 water-cooled servers white paper notes that control systems should keep coolant supply temperature safely above dew point.
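A dew point check of this kind can be sketched with the Magnus approximation, a common formula for estimating dew point from dry-bulb temperature and relative humidity. The coefficients below are one widely used set; the 2 K safety margin and the function names are assumptions for illustration and are not taken from ASHRAE guidance.

```python
import math


def dew_point_c(air_temp_c: float, rh_percent: float) -> float:
    """Estimate dew point in deg C using the Magnus approximation."""
    if not 0.0 < rh_percent <= 100.0:
        raise ValueError("rh_percent must be in (0, 100]")
    a, b = 17.62, 243.12  # Magnus coefficients (b in deg C)
    gamma = math.log(rh_percent / 100.0) + a * air_temp_c / (b + air_temp_c)
    return b * gamma / (a - gamma)


def supply_is_above_dew_point(
    supply_temp_c: float,
    air_temp_c: float,
    rh_percent: float,
    margin_k: float = 2.0,  # assumption: site-specific safety margin
) -> bool:
    """True if coolant supply temperature clears the room dew point by margin_k."""
    return supply_temp_c >= dew_point_c(air_temp_c, rh_percent) + margin_k


if __name__ == "__main__":
    td = dew_point_c(25.0, 50.0)
    print(f"Dew point at 25 C / 50% RH: {td:.1f} C")
    print("18 C supply OK:", supply_is_above_dew_point(18.0, 25.0, 50.0))
    print("15 C supply OK:", supply_is_above_dew_point(15.0, 25.0, 50.0))
```

In practice this logic would live in the CDU controller or BMS and use measured room conditions; the point of the sketch is that the safe supply temperature moves with humidity, so a fixed setpoint is not enough.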
Greenfield: immersion pods
For greenfield builds, immersion can scale extremely fast when the building is designed around it:
Pods can be standardized as repeatable units (power, tanks, heat exchangers/heat rejection, containment).
The air side of the hall is simplified because the IT heat largely stays in the liquid domain.
Commissioning focuses on fluid management, containment, and service procedures as much as thermal performance.
Greenfield immersion also makes it easier to align your mechanical design to your target density rather than inheriting constraints from an air-cooled legacy layout.
Hybrid phased zones
Many operators land on a hybrid plan because it matches how capacity actually grows: pilots → first production rows → expansion pods.
A pragmatic hybrid blueprint is:
Zone A: legacy air or close-coupled air for lower-density racks
Zone B: cold plate (DTC) for mainstream AI racks
Zone C: immersion pods for the highest-density, highest-growth clusters
Coolnetpower can support reference architectures with CDUs, manifolds, and dual-loop separation so operators can phase DTC and immersion zones without redesigning the facility water side.

Reliability and standards
Leak and fluid management
Reliability is not “no leaks.” Reliability is early detection + fast isolation + controlled recovery.
Cold plate systems tend to have more connection points (quick disconnects, hoses, manifolds). That increases the importance of leak detection and isolation valves.
Immersion reduces the number of small leak points at the rack level, but increases the importance of containment, spill response, and human-factor discipline during maintenance.
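The detection, isolation, and recovery sequence can be written down as a small state machine, which is a useful exercise when drafting SOPs because it forces every transition to be explicit. This is a minimal sketch: the state names, event names, and transition table are hypothetical and would need to match your own alarm and valve logic.

```python
from enum import Enum


class LoopState(Enum):
    NORMAL = "normal"
    ISOLATED = "isolated"      # isolation valves closed on the affected loop
    RECOVERING = "recovering"  # leak cleared, loop being checked before return to service


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (LoopState.NORMAL, "leak_detected"): LoopState.ISOLATED,
    (LoopState.ISOLATED, "leak_cleared"): LoopState.RECOVERING,
    (LoopState.RECOVERING, "checks_passed"): LoopState.NORMAL,
    (LoopState.RECOVERING, "leak_detected"): LoopState.ISOLATED,
}


def next_state(state: LoopState, event: str) -> LoopState:
    """Return the next loop state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = LoopState.NORMAL
    for event in ("leak_detected", "checks_passed", "leak_cleared", "checks_passed"):
        state = next_state(state, event)
        print(f"{event:>14} -> {state.value}")
```

Note that in this sketch a "checks_passed" event while the loop is still isolated is ignored, so the loop cannot return to service without an explicit "leak_cleared" step first.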
If you’re building a standards-aligned SOP set, OCP’s cold plate requirements are a useful benchmark because they include process expectations (safety documentation, spill management, containment). See the canonical OCP document: OCP ACS liquid cooling cold plate requirements.
Materials and filtration
Material compatibility and fluid quality are where “fast scaling” projects often stumble.
Key engineering questions to standardize early:
What wetted materials are allowed (metals, elastomers), and what’s forbidden?
What is your filtration/strainer strategy, and how will you monitor differential pressure over time?
How will you prevent coolant mixing across vendors or loops?
⚠️ Warning: Treat coolant as an engineered consumable. Mixing fluids or ignoring filtration isn’t a maintenance issue—it becomes a reliability event.
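For the filtration question, one simple monitoring approach is to compare recent strainer differential pressure against the clean-filter baseline and flag when the ratio crosses agreed thresholds. The sketch below assumes readings in kPa; the warn and service ratios, the averaging window, and the function name are illustrative assumptions to be replaced by the filter vendor's guidance.

```python
from statistics import mean
from typing import Sequence


def strainer_status(
    dp_kpa_readings: Sequence[float],
    clean_dp_kpa: float,
    warn_ratio: float = 1.5,     # assumption: early-warning threshold
    service_ratio: float = 2.0,  # assumption: change-out threshold
    window: int = 3,
) -> str:
    """Classify a strainer as 'ok', 'warn', or 'service' from recent delta-P readings."""
    if clean_dp_kpa <= 0:
        raise ValueError("clean_dp_kpa must be positive")
    if not dp_kpa_readings:
        raise ValueError("at least one reading is required")
    # Average the most recent readings to avoid reacting to a single noisy sample.
    recent = mean(dp_kpa_readings[-window:])
    ratio = recent / clean_dp_kpa
    if ratio >= service_ratio:
        return "service"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


if __name__ == "__main__":
    print(strainer_status([10.0, 10.5, 11.0], clean_dp_kpa=10.0))
    print(strainer_status([14.0, 16.0, 18.0], clean_dp_kpa=10.0))
    print(strainer_status([19.0, 21.0, 23.0], clean_dp_kpa=10.0))
```

Recording the clean baseline at commissioning is the step that makes this kind of trend check possible later.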
ASHRAE/OCP alignment
A clean way to think about standards alignment:
ASHRAE TC 9.9 helps you frame environmental envelopes (including high-density classes) and condensation risk management.
OCP helps you reduce ecosystem lock-in by aligning interfaces, safety documentation, and qualification expectations.
Your procurement checklist should explicitly ask vendors how they align to both.
Economics and roadmap
CAPEX/OPEX and TCO
CAPEX and OPEX break differently across the two approaches:
Cold plate can concentrate spend in CDUs, manifolds, and server liquid kits, while preserving some of the existing air infrastructure.
Immersion can shift spend into tanks/pods, fluid inventory/handling, and new service workflows, while simplifying parts of the air-side build.
TCO is typically dominated by a handful of levers:
energy (cooling kWh and chiller hours)
space efficiency and deferred building expansion
availability/performance (avoiding throttling and unplanned downtime)
operations (labor, training, spares, incident response)
ROI levers and sensitivities
ROI usually becomes attractive when one or more of these are true:
You can’t meet GPU density targets with air without wasting floor space.
Your power cost, water risk, or carbon reporting pressure makes mechanical efficiency a first-class constraint.
Your cluster economics punish throttling (lost training throughput is effectively “hidden OPEX”).
The sensitivity to watch is not just the cost of the cooling kit—it’s how much additional usable compute you unlock per square meter and per megawatt.
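A back-of-the-envelope version of that sensitivity is sketched below: for a fixed IT power budget, it counts the racks needed and the floor area they occupy at different rack densities. The 2.5 m² per-rack footprint (including aisle share) and the function names are assumptions for illustration; immersion tanks and service clearances will have their own footprint figures.

```python
import math


def racks_needed(it_load_kw: float, rack_kw: float) -> int:
    """Number of racks required to host it_load_kw at rack_kw per rack."""
    if it_load_kw < 0 or rack_kw <= 0:
        raise ValueError("it_load_kw must be >= 0 and rack_kw must be > 0")
    return math.ceil(it_load_kw / rack_kw)


def floor_area_m2(it_load_kw: float, rack_kw: float, m2_per_rack: float = 2.5) -> float:
    """Floor area for the load, assuming a fixed footprint per rack (aisles included)."""
    return racks_needed(it_load_kw, rack_kw) * m2_per_rack


if __name__ == "__main__":
    it_load_kw = 2000.0  # hypothetical 2 MW IT budget
    for rack_kw in (15.0, 40.0, 80.0):
        n = racks_needed(it_load_kw, rack_kw)
        area = floor_area_m2(it_load_kw, rack_kw)
        print(f"{rack_kw:>5.0f} kW/rack: {n:>3} racks, {area:>6.1f} m2, {it_load_kw / area:.1f} kW/m2")
```

Under these assumptions the same 2 MW needs 134 racks at 15 kW/rack but 25 racks at 80 kW/rack, which is the space and deferred-expansion lever the TCO discussion refers to.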
24–36 month outlook
Expect the next 24–36 months to be shaped by:
more standardization around direct liquid cooling interfaces and safety practices (OCP momentum)
continued tightening of high-density operational guidance (ASHRAE and operator best practices)
higher scrutiny on fluid lifecycle and environmental impact (especially for certain two-phase fluids)
growth of hybrid architectures: cold plate as “default,” immersion as the high-density tier
Conclusion
A practical decision often comes down to two thresholds:
20–40 kW/rack and into the 40–80 kW band: cold plate (direct-to-chip) is usually the fastest scale path, especially in brownfield sites, provided you design for residual air heat and standardize leak detection/isolation.
≥80 kW/rack: immersion becomes more attractive when you need near-total heat capture and want to avoid running a meaningful parallel air system at extreme density.
Retrofit vs greenfield guidance:
Retrofit: start with cold plate zones to protect change windows and preserve service familiarity.
Greenfield: if your target state is very high density, immersion pods can be the cleanest way to scale without inheriting air-side constraints.
Standards checklist and next steps:
Verify dew point/condensation controls and alarms align to ASHRAE guidance.
Confirm direct liquid cooling components map cleanly to OCP expectations (documentation, containment, qualification).
Define your “stop conditions” and recovery playbooks before first production rollout.
If you want a low-friction next step, request a rack-density roadmap review: define your 20–40 kW baseline, your ≥80 kW target zone, and the CDU/manifold/loop boundaries you’ll standardize across phases.






