Five-year TCO for GPU racks (direct-to-chip): a procurement-ready model

Key takeaways

A defensible five-year TCO for GPU racks (direct-to-chip) starts with a clear system boundary and a measured baseline, then separates one-time CAPEX from annual OPEX and models energy charges and demand charges separately.
At row/pod scale (8–16 racks), results are usually driven by (1) what fan power you can actually retire, (2) demand charges and peak coincidence, and (3) how you scope redundancy + commissioning.
Use sensitivity analysis to keep arguments honest. If payback only works in one perfect scenario, it’s not a business case.

1) Define the decision: what “five-year TCO” should include for a GPU pod

Most liquid-cooling ROI discussions fail because they compare a single quote to a single savings assumption.

For procurement-grade decisions, define three things up front:

The unit of analysis: in this guide we use a GPU pod (8–16 racks) because that’s how projects are budgeted and phased.
The boundary: what sits inside your cost model vs what you treat as “facility shared infrastructure.”
The tariff model: energy charges are not the same as demand charges.

A simple, defensible structure:

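5-year TCO (undiscounted) = CAPEX + 5 × annual OPEX (assumes flat annual OPEX; if you expect tariff escalation, sum each year's OPEX instead)
5-year NPV TCO (optional) = CAPEX + Σ over t = 1…5 of (annual OPEX ÷ (1+r)^t), with r = discount rate and t = year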

Where annual OPEX includes energy, maintenance, consumables, and any recurring service.
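
The two formulas above can be sketched in Python as follows. This is a minimal illustration, assuming flat annual OPEX paid at the end of each year. The function names and the example figures are hypothetical placeholders, not values from any quote.

```python
def five_year_tco(capex, annual_opex, years=5):
    """Undiscounted TCO: one-time CAPEX plus flat annual OPEX for each year."""
    return capex + years * annual_opex


def npv_tco(capex, annual_opex, discount_rate, years=5):
    """Discounted TCO: CAPEX at year 0 plus each year's OPEX discounted to today.

    Assumes OPEX falls at the end of years 1..years (an assumption of this sketch).
    """
    discounted_opex = sum(
        annual_opex / (1 + discount_rate) ** t for t in range(1, years + 1)
    )
    return capex + discounted_opex


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    print(five_year_tco(capex=1_000_000, annual_opex=120_000))
    print(round(npv_tco(capex=1_000_000, annual_opex=120_000, discount_rate=0.08)))
```

With a discount rate of zero the two functions return the same number, which is a quick sanity check when wiring this into a spreadsheet export.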

Key Takeaway: The “right” model is the one a skeptical facilities leader and a skeptical procurement lead will both sign.

2) Establish the baseline: air-cooled GPU pod cost model (what you must measure)

If your baseline is vague, your savings will be imaginary.

2.1 Baseline inputs (minimum viable)

For an 8–16 rack pod, capture (or estimate with measurement windows):

IT load profile: average kW and peak kW at the pod boundary (not nameplate only)
Server fan power baseline: measured per server or inferred from server telemetry
Room cooling power attributed to the pod: CRAH/CRAC, pumps, and chiller share (use your internal allocation method)
Ops costs driven by density: hot-spot response, airflow balancing, containment fixes, change windows

2.2 Baseline traps that break ROI math

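Double-counting savings: if your PUE-based model already credits reduced room-cooling power (including CRAH/CRAC fans), don’t subtract that power again as “fan savings.” Server fans sit inside the IT load, so keep server-fan savings and room-cooling savings in separate buckets.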
Assuming perfect utilization: AI pods often run bursty. Use a utilization factor rather than assuming 100% load.
Ignoring peak coincidence: demand-charge savings only materialize if your kW reduction occurs during peak billing windows.

3) Direct-to-chip at row/pod scale: what you’re actually buying

Direct-to-chip (DTC) isn’t “a cold plate.” It’s a distribution, controls, and risk-management system that has to behave predictably under change (swapped servers, changing loads, partial deployment).

A practical system view:

Primary loop (facility side): building heat-rejection loop (chilled water, warm water, or another plant approach)
Secondary loop (technology cooling): pumps + heat exchanger + controls → supply/return headers → branch lines → rack manifolds → server cold plates → return

An operator-focused description of loops, CDUs, rack manifolds, and leak detection is laid out in Equinix’s 2026 explainer on the anatomy of a direct-to-chip liquid cooling system.

If your team needs a component primer, Coolnetpower’s internal overview of direct-to-chip liquid cooling with cold plates is a helpful starting point.

4) CAPEX model: CDUs, manifolds, piping, and the costs people forget

A clean CAPEX model should be line-itemed, not bundled. Below is a template you can adapt.

4.1 CAPEX line-item template (row/pod)

CAPEX bucket	Typical line items	Units	Scope notes
CDUs	CDU(s), pumps, plate heat exchanger, controls, filtration	per CDU	Decide redundancy target (N, N+1). Clarify monitoring and alarms
Distribution	supply/return headers, rack manifolds, quick disconnects, hoses	per rack + per pod	Define ownership boundary and spares (couplings, valves, sensors)
Piping & valves	secondary loop piping, isolation valves, supports, balancing/control valves	per meter/foot + per pod	Route complexity matters more than per-rack averages
Leak detection & safety	leak cable/spot sensors, drip trays, automatic shutoff valves (if used)	per pod	Make detection + response explicit (not implied)
Commissioning	flushing, pressure test, controls integration, balancing, training	per pod	Often the difference between “works on day 1” and “works in month 6”
Retrofit enabling works (if retrofit)	rack mods, change windows, temporary cooling, remediation	per site	Keep separate so retrofit vs greenfield comparisons stay fair

4.2 CDU sizing and redundancy (why it belongs in the TCO model)

Even if you’re not designing the loop yourself, the inputs that drive CDU sizing are procurement inputs:

design load (kW) and expected growth
target ΔT and supply temperature
redundancy target and allowable failure behavior
monitoring scope and integration into DCIM/BMS
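
To see why design load and target ΔT are procurement inputs, it can help to look at the basic heat-balance relation between them and the required coolant flow. The sketch below is only a first-order illustration: it assumes all of the design load is captured by the liquid loop, and the default specific heat and density are approximate plain-water values (glycol mixtures differ). The function name is hypothetical and this is not a substitute for a vendor sizing calculation.

```python
def required_flow_lpm(heat_load_kw, delta_t_k, cp_kj_per_kg_k=4.18, density_kg_per_l=1.0):
    """Coolant flow (litres/minute) needed to carry heat_load_kw at a given delta-T.

    Uses Q = m_dot * cp * delta_T. Default cp and density are approximate
    values for plain water and are assumptions of this sketch.
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_kw / (cp_kj_per_kg_k * delta_t_k)
    return mass_flow_kg_per_s / density_kg_per_l * 60


if __name__ == "__main__":
    # Placeholder pod load and two candidate delta-T targets.
    for delta_t in (8, 12):
        print(delta_t, round(required_flow_lpm(heat_load_kw=600, delta_t_k=delta_t), 1))
```

A smaller ΔT target means proportionally more flow for the same load, which feeds directly into pump selection and pipe sizing and therefore into both CAPEX and pump OPEX.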

If you want a structured checklist, use Coolnetpower’s guide on how to size a CDU for AI data centers step by step to align stakeholders on required inputs before quotes arrive.

4.3 CAPEX scoping rules that prevent surprises

State the density target and whether you’re sizing for today or the next GPU refresh.
Make redundancy explicit: pumps, power feeds, controls, and “what happens if a CDU is down?”
Document the ownership boundary (often at rack quick disconnects). This affects spares, training, and service responsibilities.

⚠️ Warning: If piping routes and isolation plans aren’t scoped, your model will understate retrofit cost and overstate deployment speed.

5) OPEX model: fan energy removed vs pump energy added (plus maintenance)

The two most common modeling errors are:

assuming you retire all fan power, and
ignoring pump power and maintenance.

5.1 Fan energy savings (baseline → DTC)

Model fan savings as a factor applied to a baseline you can defend.

Baseline annual fan energy (kWh/year):

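average fan kW × 8,760 × utilization factor (if the fan kW figure is already a time-average over a representative measurement window, set the utilization factor to 1 so utilization is not applied twice)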

DTC annual fan energy:

baseline × (1 − fan_reduction_fraction)

Annual savings:

kWh saved × $/kWh

Be conservative: even with DTC, some airflow may remain necessary for non-liquid-cooled components.
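
The three steps in this subsection can be sketched as one small function. This is a minimal illustration; the function name, dictionary keys, and example inputs are hypothetical, and fan_reduction_fraction should come from your own measurements or a conservative estimate.

```python
HOURS_PER_YEAR = 8760


def annual_fan_savings(avg_fan_kw, utilization, fan_reduction_fraction, price_per_kwh):
    """Baseline fan energy, remaining fan energy under DTC, and the energy-charge saving."""
    if not 0 <= fan_reduction_fraction <= 1:
        raise ValueError("fan_reduction_fraction must be between 0 and 1")
    baseline_kwh = avg_fan_kw * HOURS_PER_YEAR * utilization
    dtc_kwh = baseline_kwh * (1 - fan_reduction_fraction)
    kwh_saved = baseline_kwh - dtc_kwh
    return {
        "baseline_kwh": baseline_kwh,
        "dtc_kwh": dtc_kwh,
        "kwh_saved": kwh_saved,
        "annual_savings": kwh_saved * price_per_kwh,
    }


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    result = annual_fan_savings(
        avg_fan_kw=40, utilization=0.7, fan_reduction_fraction=0.6, price_per_kwh=0.10
    )
    for key, value in result.items():
        print(f"{key}: {value:,.0f}")
```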

5.2 Added pump energy (CDU + distribution)

Add pump energy explicitly rather than burying it.

annual pump energy (kWh/year) = pump kW × 8,760 × utilization factor

If you don’t have pump kW during early-stage budgeting, use a range and let sensitivity analysis show whether it matters.
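
One way to carry a pump-power range through early budgeting is to compute the annual cost for each estimate side by side. The sketch below assumes a labelled low/base/high range; the function names and all numbers are hypothetical placeholders.

```python
HOURS_PER_YEAR = 8760


def annual_pump_kwh(pump_kw, utilization):
    """Annual pump energy: pump kW x 8,760 x utilization factor."""
    return pump_kw * HOURS_PER_YEAR * utilization


def pump_cost_range(pump_kw_range, utilization, price_per_kwh):
    """Annual pump energy cost for each labelled pump kW estimate."""
    return {
        label: annual_pump_kwh(kw, utilization) * price_per_kwh
        for label, kw in pump_kw_range.items()
    }


if __name__ == "__main__":
    # Placeholder range; replace with vendor data once quotes arrive.
    costs = pump_cost_range(
        {"low": 3.0, "base": 5.0, "high": 8.0}, utilization=0.7, price_per_kwh=0.10
    )
    for label, cost in costs.items():
        print(f"{label}: {cost:,.0f} per year")
```

If the spread between the low and high cost is small next to the fan savings, pump power is not worth debating further at this stage.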

5.3 Maintenance: make it auditable

A practical way to keep this procurement-friendly is to model an annual “O&M allowance” that covers:

filters and coolant quality management (where applicable)
sensor checks and calibration
periodic inspection of couplings and valves
training refresh and leak-response drills

This also makes it easier to compare to alternatives (RDHx and immersion have different labor profiles, even if they’re all “liquid”).

For a general overview of DTC system considerations (including CDU variants and deployment topics), see Vertiv’s 2024 deep dive on direct-to-chip cooling in HPC infrastructure.

6) Tariffs that can flip the answer: energy, demand charges, escalation

For multi-region organizations, the safest approach is to treat tariffs as inputs with scenario bands rather than argue about one “correct” rate.

6.1 Separate energy and demand charges

Model two components:

Energy charge: $/kWh × kWh saved
Demand charge (where applicable): $/kW-month × kW reduced at peak × 12

This is why the same kWh savings can pay back slowly at one site and quickly at another: the demand-charge component depends on the local tariff and on whether the kW reduction coincides with the billing peak.
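
A short sketch of the two-component calculation, comparing two hypothetical sites with identical technical savings. The function name, site labels, and rates are placeholders chosen only to show the mechanics.

```python
def annual_tariff_savings(kwh_saved, energy_price, kw_reduced_at_peak, demand_charge):
    """Split annual savings into an energy component and a demand component.

    energy_price is in $/kWh; demand_charge is in $/kW-month.
    """
    energy = energy_price * kwh_saved
    demand = demand_charge * kw_reduced_at_peak * 12
    return {"energy": energy, "demand": demand, "total": energy + demand}


if __name__ == "__main__":
    # Same kWh and kW reduction at both sites; only the tariff differs (placeholder rates).
    sites = {
        "site_a_no_demand_charge": {"energy_price": 0.10, "demand_charge": 0.0},
        "site_b_demand_heavy": {"energy_price": 0.10, "demand_charge": 20.0},
    }
    for name, tariff in sites.items():
        result = annual_tariff_savings(
            kwh_saved=150_000, kw_reduced_at_peak=25, **tariff
        )
        print(name, {k: round(v) for k, v in result.items()})
```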

6.2 A tariff input block you can reuse (US/EU/MEA/APAC)

Instead of asserting exact rates, build an input block.

Region	Energy charge ($/kWh)	Demand charge ($/kW-month)	Notes
US	low / base / high	low / base / high	In some territories, demand charges dominate the economics
EU	low / base / high	low / base / high	Contract structure and volatility matter; confirm hedging terms
MEA	low / base / high	low / base / high	Site-specific; confirm subsidies and plant constraints
APAC	low / base / high	low / base / high	Time-of-use can materially change payback
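
If the model lives in code rather than a spreadsheet, the same input block can be expressed as a small data structure that rejects inconsistent bands. The class names are hypothetical, and every number below is a placeholder to be replaced with your contracted or quoted rates; none of them are real regional tariffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A low / base / high scenario band for one tariff input."""
    low: float
    base: float
    high: float

    def __post_init__(self):
        if not (self.low <= self.base <= self.high):
            raise ValueError(f"expected low <= base <= high, got {self}")


@dataclass(frozen=True)
class TariffInputs:
    energy_per_kwh: Band          # $/kWh
    demand_per_kw_month: Band     # $/kW-month
    notes: str = ""


# PLACEHOLDER values only: fill in from your own utility contracts.
TARIFFS = {
    "US": TariffInputs(Band(0.0, 0.0, 0.0), Band(0.0, 0.0, 0.0), "check demand charges"),
    "EU": TariffInputs(Band(0.0, 0.0, 0.0), Band(0.0, 0.0, 0.0), "confirm hedging terms"),
    "MEA": TariffInputs(Band(0.0, 0.0, 0.0), Band(0.0, 0.0, 0.0), "confirm subsidies"),
    "APAC": TariffInputs(Band(0.0, 0.0, 0.0), Band(0.0, 0.0, 0.0), "check time-of-use"),
}

if __name__ == "__main__":
    for region, tariff in TARIFFS.items():
        print(region, tariff.energy_per_kwh, tariff.demand_per_kw_month, tariff.notes)
```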

If you want an external, non-vendor perspective on why cooling architecture and “warm water” matter to energy outcomes, the LBNL Data Center Efficiency Center maintains an overview page on liquid cooling.

7) Sensitivity analysis: which assumptions drive five-year TCO and payback

A good sensitivity section does two jobs:

tells decision-makers what to debate
tells engineers what to measure

7.1 One-way sensitivity (tornado-ready variables)

Rank these by impact in your spreadsheet:

energy price ($/kWh)
demand charge ($/kW-month)
fan_reduction_fraction (how much fan power you can actually retire)
CDU + distribution CAPEX
retrofit enabling works CAPEX (if brownfield)
pump kW (CDU + distribution)
annual maintenance allowance
utilization factor (how steady the AI load is)
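
A one-way sensitivity ranking can be sketched in a few lines: vary one input between its low and high value while holding the others at base, record the swing in the output, and sort by swing. The model function, parameter names, base values, and ranges below are all hypothetical placeholders; only the ranking technique is the point.

```python
HOURS_PER_YEAR = 8760


def five_year_net_benefit(p):
    """Undiscounted five-year net benefit of DTC versus the air-cooled baseline."""
    fan_kwh_saved = (
        p["fan_kw"] * HOURS_PER_YEAR * p["utilization"] * p["fan_reduction_fraction"]
    )
    pump_kwh_added = p["pump_kw"] * HOURS_PER_YEAR * p["utilization"]
    energy_savings = (fan_kwh_saved - pump_kwh_added) * p["energy_price"]
    demand_savings = p["demand_charge"] * p["peak_kw_reduced"] * 12
    annual_net = energy_savings + demand_savings - p["maintenance"]
    return 5 * annual_net - p["capex"]


def one_way_sensitivity(model, base, ranges):
    """Return (name, output_at_low, output_at_high, swing) rows, largest swing first."""
    rows = []
    for name, (low, high) in ranges.items():
        out_low = model({**base, name: low})
        out_high = model({**base, name: high})
        rows.append((name, out_low, out_high, abs(out_high - out_low)))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    # Placeholder base case and ranges for illustration only.
    base = {
        "fan_kw": 40, "utilization": 0.7, "fan_reduction_fraction": 0.6,
        "pump_kw": 5, "energy_price": 0.10, "demand_charge": 10,
        "peak_kw_reduced": 20, "maintenance": 8_000, "capex": 60_000,
    }
    ranges = {
        "energy_price": (0.06, 0.18),
        "demand_charge": (0, 25),
        "fan_reduction_fraction": (0.4, 0.8),
        "capex": (45_000, 90_000),
        "pump_kw": (3, 8),
        "utilization": (0.5, 0.9),
    }
    for name, low, high, swing in one_way_sensitivity(five_year_net_benefit, base, ranges):
        print(f"{name:24s} low={low:>10,.0f} high={high:>10,.0f} swing={swing:>10,.0f}")
```

The sorted rows map directly onto a tornado chart: one bar per variable, ordered from the largest swing to the smallest.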

7.2 Scenario sets that match real GPU pod behavior

Run at least three:

Conservative: smaller fan reduction, higher retrofit cost, lower utilization
Base: best estimate
High tariff exposure: demand-charge-heavy or TOU-heavy; keep the same technical assumptions as Base so that only the tariff changes
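
Scenario sets are easiest to audit when each one is written as a set of overrides on the base case. The sketch below assumes a deliberately simplified annual-savings model; all names and numbers are hypothetical placeholders. Note that the high-tariff scenario overrides only tariff inputs.

```python
HOURS_PER_YEAR = 8760


def annual_net_savings(p):
    """Simplified annual net savings: fan energy removed, pump energy added, demand, O&M."""
    net_kw = p["fan_kw"] * p["fan_reduction_fraction"] - p["pump_kw"]
    energy = net_kw * HOURS_PER_YEAR * p["utilization"] * p["energy_price"]
    demand = p["demand_charge"] * p["peak_kw_reduced"] * 12
    return energy + demand - p["maintenance"]


def run_scenarios(model, base, scenarios):
    """Evaluate the model once per scenario, applying each scenario's overrides to base."""
    return {name: model({**base, **overrides}) for name, overrides in scenarios.items()}


if __name__ == "__main__":
    # Placeholder base case for illustration only.
    base = {
        "fan_kw": 40, "fan_reduction_fraction": 0.6, "pump_kw": 5, "utilization": 0.7,
        "energy_price": 0.10, "demand_charge": 10, "peak_kw_reduced": 20,
        "maintenance": 8_000,
    }
    scenarios = {
        "conservative": {"fan_reduction_fraction": 0.4, "utilization": 0.5},
        "base": {},
        # Tariff inputs only; technical assumptions stay at base.
        "high_tariff_exposure": {"energy_price": 0.16, "demand_charge": 25},
    }
    for name, value in run_scenarios(annual_net_savings, base, scenarios).items():
        print(f"{name:22s} {value:>10,.0f}")
```

Higher retrofit cost in the conservative case is a CAPEX input, so it would be applied in the payback or NPV step rather than in this annual-savings model.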

Pro Tip: If your finance team won’t accept an “optimistic” case, don’t argue—rename it by condition (“high utilization / high tariff exposure”). The point is to reveal dependence.

8) Direct-to-chip vs RDHx vs immersion: where each wins (high-level)

This section is intentionally high-level. Your goal is to identify which architecture deserves detailed design effort.

For an internal orientation, Coolnetpower provides a comparison primer: direct-to-chip vs immersion vs rear-door heat exchanger compared.

8.1 Comparison matrix (what changes your TCO inputs)

Dimension	RDHx	Direct-to-chip (DTC)	Immersion
Retrofit disruption	Lower (no server mods)	Medium–high (server loop integration)	High (ops + layout changes)
Where CAPEX concentrates	rack doors + facility water routing	CDU(s) + secondary loop + rack manifolds	tanks + fluid handling + CDU/HX
Biggest OPEX drivers	partial fan and room cooling reduction	fan reduction + pump power + maintenance	high heat capture; different maintenance model
Ops model shift	Moderate	Moderate–high	High
When it’s a strong fit	density uplift without changing servers	sustained high-density GPU pods	greenfield / extreme density / dedicated ops workflows

If your near-term decision is “RDHx now vs DTC now,” Coolnetpower’s overview of RDHx vs direct-to-chip for 30–80 kW racks: how to choose can help structure evaluation criteria.

9) A buildable template: the 5-year TCO table you can paste into your spreadsheet

Below is a deliberately simple table layout. The goal is to give you a model structure that survives scrutiny.

9.1 TCO line item table (template)

Category	Line item	Notes / assumptions
CAPEX	CDU(s)	quantity, redundancy level, included monitoring
CAPEX	Rack manifolds + QDs + hoses	per rack + spare couplings
CAPEX	Secondary piping + valves	route length, isolation plan
CAPEX	Leak detection + safety	sensor coverage + response plan
CAPEX	Commissioning + training	flush/test/balance + runbooks
OPEX	Energy: fan kWh removed	baseline fan kWh × fan_reduction_fraction (see 5.1)
OPEX	Energy: pump kWh added	pump kW × 8,760 × utilization factor (see 5.2)
OPEX	Maintenance allowance	filters, checks, calibration
OPEX	Spares allowance	couplings, sensors, small valves

9.2 Payback and NPV (optional)

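Annual net savings = baseline annual OPEX − new annual OPEX
Simple payback (years) = CAPEX ÷ annual net savings (only meaningful when annual net savings is positive)
NPV = Σ over t = 1…5 of (annual net savings ÷ (1+r)^t) − CAPEX, with r = discount rate and t = year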
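
A minimal sketch of the three formulas above, assuming flat annual savings received at the end of each year. The function names and example inputs are hypothetical placeholders.

```python
def simple_payback_years(capex, annual_net_savings):
    """CAPEX divided by annual net savings; None if savings are zero or negative."""
    if annual_net_savings <= 0:
        return None  # the project never pays back under these inputs
    return capex / annual_net_savings


def npv(capex, annual_net_savings, discount_rate, years=5):
    """Discounted savings over the horizon minus up-front CAPEX."""
    discounted = sum(
        annual_net_savings / (1 + discount_rate) ** t for t in range(1, years + 1)
    )
    return discounted - capex


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    baseline_opex, new_opex, capex = 200_000, 150_000, 180_000
    savings = baseline_opex - new_opex
    print("payback (years):", simple_payback_years(capex, savings))
    print("NPV at 8%:", round(npv(capex, savings, 0.08)))
```

A project can show a payback inside five years and still have a negative NPV at a high enough discount rate, so it is worth reporting both.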

10) Procurement checklist: questions that protect SLA, auditability, and change control

Use these to keep your TCO model tied to real delivery scope.

10.1 Scope and responsibility

Where is the responsibility boundary (at quick disconnects, at the CDU, elsewhere)?
Who owns coolant quality management and what monitoring is included?
What redundancy is included (pumps, power feeds, controls) and what is the failover behavior?

10.2 Design and commissioning

What facility-side water temperature range is assumed?
What ΔT and flow are assumed on the secondary loop?
What is the commissioning plan (flush, pressure test, leak detection validation, alarm runbooks)?

10.3 O&M readiness

What spares are recommended (couplings, valves, sensors, pump components)?
What training is included and what are the runbooks for leak response?

10.4 Cost transparency

Provide CAPEX broken down by CDU(s), manifolds/distribution, piping/valves, leak detection, commissioning.
Provide OPEX assumptions: pump kW, maintenance visits/year, consumables.

FAQ: common business-case questions

What’s the TCO of direct-to-chip retrofits over five years?

It depends mainly on your baseline (fan power + room cooling), retrofit enabling works (piping routes and change windows), and tariff structure. A model that explicitly separates CAPEX from annual OPEX—and treats demand charges separately—is usually sufficient to compare options without pretending there is one universal number.

What CAPEX should we expect for CDUs, manifolds, and piping?

Treat CDU(s), distribution/manifolds, and secondary piping/valves as separate line items, and insist on commissioning as a scoped deliverable. If any quote collapses these into a single “system price,” you lose the ability to validate assumptions and compare vendors fairly.

Where does OPEX improvement usually come from?

Typically from reduced server fan energy and reduced reliance on room-level airflow for the pod (while adding pumping energy and a different maintenance profile). Exact outcomes depend on the fraction of heat captured and how the facility loop rejects heat.

Next steps

Build your baseline with measurement windows (fan power and peak kW matter more than perfect precision).
Fill in the TCO table template, then run sensitivity on the three assumptions your engineering lead distrusts most.
If you’re evaluating vendors, start with a neutral component overview like Coolnetpower’s liquid cooling solutions to align on common building blocks and interfaces.

If you want to pressure-test your draft inputs, Coolnetpower can review your scope assumptions (piping route, redundancy level, commissioning plan) and flag the places where projects typically drift.