
Own Data Center vs Public Cloud for AI Training (2025): 5‑Year TCO at 30–80 kW/rack

If your AI program depends on sustained training runs, the real question isn’t “on‑prem or cloud?”—it’s “what’s my 5‑year cost per effective GPU‑hour at the utilization I can actually achieve?” For high‑density clusters (30–80 kW/rack) enabled by liquid cooling, total cost pivots on a handful of levers: GPU pricing and availability, electricity price, PUE, utilization and scheduling efficiency, egress patterns, and time‑to‑value. For density feasibility and thermal implications, see how liquid cooling enables 30–80 kW/rack in this overview of high‑density liquid cooling for AI/HPC on Coolnet’s site: advanced liquid cooling for AI/HPC.

Key takeaways
  • If you can sustain ≥60% GPU utilization for 36–60 months, an owned or private/colo cluster with liquid cooling and PUE ≈1.05–1.15 often beats public cloud on 5‑year $/GPU‑hour in U.S. power bands; Europe’s higher power prices narrow or erase the gap at the margin.

  • For pilots, bursts, or uncertain roadmaps (≤30% utilization or <9 months), public cloud usually wins due to instant scale and zero lead time, even after 2025 GPU price cuts.

  • Electricity and PUE matter: every 0.05 PUE swing moves energy cost ~5%; at $0.08–$0.12/kWh U.S. industrial/commercial rates, energy can dominate OpEx for dense racks.

  • Data gravity and egress bills are silent killers in cloud TCO; training near data and minimizing cross‑region/Internet egress preserves your effective throughput budget.

  • Modular builds can land usable AI capacity in ~4–9 months; cloud is near‑instant when capacity exists. Your time‑to‑first‑model may be the deciding factor.

How we model 5‑year TCO (at a glance)

For an owned or private deployment, the 5‑year stack combines capital for the compute nodes (8× H100 GPUs each), racks, CDUs/manifolds, UPS/PDU, the InfiniBand or Ethernet fabric with optics, plus installation and integration. We amortize CapEx over five years (with any financing/residual assumptions) and add operating spend: energy calculated as IT kW × PUE × hours × $/kWh, along with staffing, maintenance contracts, spares, DCIM/monitoring, and space. Uptime’s 2024 survey shows a mixed‑fleet average PUE near 1.56, while optimized liquid deployments can approach ~1.1; treat 1.05–1.15 as a realistic planning band for an owned liquid‑cooled build and design accordingly. See the evidence in the Uptime 2024 Global Data Center Survey.
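To make that arithmetic concrete, here is a minimal sketch of the owned‑side stack. Every input (node price, capital overhead factor, per‑node IT power, tariff, utilization) is an illustrative assumption, not a quote; swap in your own figures and financing terms.

```python
# Minimal 5-year owned-cluster TCO sketch. All inputs are illustrative
# assumptions, not vendor quotes; replace them with your own numbers.

nodes = 16                       # 8x H100 SXM servers (assumed cluster size)
capex_per_node = 300_000         # $ per server (assumed)
capex_overhead = 0.5             # racks, fabric, CDUs, install, as a fraction of server CapEx (assumed)
it_kw_per_node = 10.2            # IT load per node in kW (assumed)
pue = 1.10                       # liquid-cooled design point
price_kwh = 0.10                 # $/kWh, U.S.-style tariff (assumed)
non_energy_opex = 0.15           # staff, maintenance, spares, as a fraction of energy spend (assumed)
years = 5
utilization = 0.60               # share of GPU-hours doing useful training (assumed)

capex = nodes * capex_per_node * (1 + capex_overhead)
energy_kwh = nodes * it_kw_per_node * pue * 8_760 * years
opex = energy_kwh * price_kwh * (1 + non_energy_opex)

effective_gpu_hours = nodes * 8 * 8_760 * years * utilization
cost_per_gpu_hour = (capex + opex) / effective_gpu_hours
print(f"5-year owned cost ${capex + opex:,.0f} -> ~${cost_per_gpu_hour:.2f} per effective GPU-hour")
```

With these placeholders the owned cluster lands in the low single digits per effective GPU‑hour; the value of the exercise is seeing which lever (utilization, tariff, PUE, or CapEx) moves your number most.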

In public cloud, compute spend is driven by $/GPU‑hour times provisioned hours and your effective utilization, plus storage and data transfer. Note the mid‑2025 reductions on NVIDIA GPU instances; AWS announced up to 45% price cuts for H100/H200/A100 families in June 2025—always anchor to your target region and commitment model when you run the numbers: AWS 2025 GPU price reduction announcement. Internet egress is tiered; for EC2/S3 the first 100 GB/month is free, then common tiers around $0.09/GB down to ~$0.05/GB at very high volumes, with inter‑AZ/region charges depending on architecture: see AWS data transfer pricing. For energy benchmarks when you model owned capacity in the U.S., 2025 retail averages commonly sit near ~8.5–9¢/kWh for industrial and ~12–14¢/kWh for commercial customers, per the EIA electricity price series.
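On the cloud side, a similar sketch keeps egress from surprising you. The tier schedule below mirrors the commonly published US‑region Internet egress bands noted above (100 GB free, then roughly $0.09/GB stepping down toward $0.05/GB); the exact thresholds, the $4/GPU‑hour rate, and the 70% utilization are illustrative assumptions, so re‑check your region and commitment model before budgeting.

```python
# Sketch: one month of cloud training cost = GPU spend + Internet egress.
# Tier schedule and rates are assumed placeholders; verify current regional pricing.

def internet_egress_cost(gb_out: float) -> float:
    tiers = [                 # (tier size in GB, $/GB) - assumed schedule
        (100, 0.00),          # free monthly allowance
        (10_240, 0.09),       # next ~10 TB
        (40_960, 0.085),      # next ~40 TB
        (102_400, 0.07),      # next ~100 TB
        (float("inf"), 0.05), # beyond that
    ]
    cost, remaining = 0.0, gb_out
    for size, rate in tiers:
        step = min(remaining, size)
        cost += step * rate
        remaining -= step
        if remaining <= 0:
            break
    return cost

gpus_provisioned = 64
hours = 730                      # roughly one month
gpu_rate = 4.00                  # $/GPU-hour, on-demand (assumed; check your region)
utilization = 0.70               # share of provisioned hours doing useful work (assumed)

compute = gpus_provisioned * hours * gpu_rate
egress = internet_egress_cost(gb_out=20_000)   # 20 TB out to the Internet (assumed)
effective_per_gpu_hour = compute / (gpus_provisioned * hours * utilization)
print(f"Compute ${compute:,.0f} + egress ${egress:,.0f} -> ~${effective_per_gpu_hour:.2f} per effective GPU-hour")
```

Note how utilization divides straight into the effective rate: provisioned hours you pay for but do not use raise your true $/GPU‑hour just as surely as the list price does.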

On density and cooling feasibility, 30–80 kW/rack is realistic with direct‑to‑chip liquid cooling at ~30°C supply temperatures (W30). ASHRAE’s 2024 guidance frames this as a durable path for high‑density AI: ASHRAE 30°C Coolant — A Durable Roadmap for the Future (2024).

Owned vs Public Cloud for AI Training: parity view (2025)

Options are listed in alphabetical order; assumptions are indicative and should be tailored to your region and contracts.

| Field | Owned / Private (liquid‑cooled) | Public Cloud GPU (hyperscalers) |
| --- | --- | --- |
| Specs / assumptions | 30–80 kW/rack with direct‑to‑chip liquid, PUE 1.05–1.15; 8× H100 nodes; NDR 400G fabric; modular build | H100/A100 instance families; on‑demand or committed; managed fabrics; region‑specific availability |
| 5‑year cost stack (typical) | CapEx (servers/fabric/cooling) + energy (IT kW × PUE × hours × $/kWh) + ops/staff/support | GPU $/hr × hours × utilization + storage + egress + support/managed‑service premiums |
| Pricing, as of 2025 | Hardware varies by vendor; plan roughly $300k per 8× H100 SXM node, often $400k–$600k at rack level with fabric/CDU; energy based on local tariffs | AWS cut H100 on‑demand prices in 2025; use your target region’s calculator. Example GCP A3 H100 per‑GPU on‑demand reference is visible on Compute Engine pricing |
| Evidence links | Uptime 2024 PUE report; ASHRAE 30°C coolant guidance; EIA U.S. electricity prices | AWS 2025 price cuts; GCP Compute Engine H100 pricing page |
| Pros | Lowest $/GPU‑hr at high utilization; predictable throughput; control over data residency; reduced egress; stable performance at high density | Instant scale (when available); no CapEx; easy burst capacity and pilots; multi‑region availability |
| Cons | Up‑front CapEx; 4–9 month lead time; staffing and maintenance; supply‑chain risk | Ongoing egress/storage fees; region capacity constraints; effective utilization can be lower; potential lock‑in |
| Constraints | Facility power and heat‑rejection capacity; liquid‑cooling readiness; procurement cycles | GPU availability by region; committed terms for best pricing; preemption risk on spot |

Reference for example cloud pricing snapshots: for Google Cloud H100 (A3) on Compute Engine, see the on‑demand machine rates, where a single‑GPU a3‑highgpu‑1g rate is published and multi‑GPU variants scale accordingly: Google Cloud Compute Engine VM instance pricing. Re‑verify your region at purchase time.

Scenario conclusions (use these as decision guardrails)
  • Best for sustained utilization (≥60% for ≥36 months): Owned/private or dedicated colo typically wins on 5‑year TCO in regions with power ≤$0.12/kWh and liquid‑cooled PUE ~1.05–1.15. You gain predictability and avoid persistent egress.

  • Best for burst/short‑term (≤30% utilization or <9 months): Public cloud. The elasticity and time‑to‑first‑model outweigh CapEx, especially for experimentation, seasonal projects, or headcount‑limited teams.

  • Hybrid for data gravity and overflow: Train where the data lives to reduce egress; use cloud for overflow peaks, new geographies, or managed services that accelerate specific phases (pretraining vs finetuning).

Deployment speed and density feasibility

Need capacity this quarter or this fiscal year? Cloud is near‑instant when H100 capacity exists. But a well‑planned modular build can deliver liquid‑cooled AI halls in roughly 4–9 months, especially when prefabricated row or container modules are on the bill of materials. For context on deployment options, see Coolnet’s modular MetaRow category page: MetaRow modular data center.

On density, think of liquid cooling like adding express lanes to a congested highway: by moving heat directly off the die through cold plates and a warm‑water loop, you keep thermals stable and maintain performance even as rack power climbs. With W30 supply temperatures and careful CDU/manifold design, 30–80 kW/rack is practical. If you’re mapping loads to existing space, use a structured plan; this primer covers capacity planning steps you can adapt: data center capacity planning best practices.

Methods you can replicate (sanity‑check your numbers)
  • Owned energy cost per year ≈ IT kW × PUE × 8,760 × $/kWh. Add 10–20% for non‑energy OpEx if you lack precise quotes. Calibrate with your maintenance contracts and staffing plan.

  • Cloud effective compute cost ≈ listed $/GPU‑hr × hours × utilization. If you rely on spot/preemptible, subtract the savings you expect and add an interruption overhead for restarts.

  • Breakeven utilization: Solve for the utilization at which owned $/GPU‑hr equals cloud $/GPU‑hr over 5 years, as in the sketch after this list. In many U.S. scenarios with PUE ~1.1 and power ~$0.10/kWh, the breakeven often falls in the 50–60% band, moving higher as power costs rise.
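A compact breakeven sketch, reusing the illustrative owned‑cluster cost from the earlier example and an assumed committed cloud rate; with these placeholders the crossover lands in the low‑to‑mid 50s percent, consistent with the 50–60% band above.

```python
# Breakeven sketch: the sustained owned-cluster utilization at which 5-year
# owned $/effective GPU-hour matches the cloud alternative.
# All figures are illustrative assumptions carried over from the sketches above.

owned_5yr_cost = 8_100_000        # CapEx + 5 years of energy/ops for 16 nodes (assumed)
gpus = 128                        # 16 nodes x 8 GPUs
capacity_gpu_hours = gpus * 8_760 * 5

cloud_rate = 2.20                 # $/provisioned GPU-hour on a committed plan (assumed)
cloud_utilization = 0.80          # useful share of provisioned cloud hours (assumed)
cloud_cost_per_eff_hour = cloud_rate / cloud_utilization

breakeven = owned_5yr_cost / (capacity_gpu_hours * cloud_cost_per_eff_hour)
print(f"Owned wins on $/effective GPU-hour above ~{breakeven:.0%} sustained utilization")
```

Cheaper power or higher cloud rates pull the breakeven down; expensive power, weak commitment discounts, or poor owned scheduling push it up.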

Risks and caveats to plan for
  • Supply chain and lead times: H100 systems, NDR switches, optics, and CDUs can carry multi‑month lead times; plan buffers. Cloud can also have regional GPU scarcity.

  • Preemption and scheduling overhead: Spot/preemptible savings can vanish without robust checkpointing and workflow design.

  • Compliance and lock‑in: On‑prem gives stronger control over residency and audit paths; clouds offer sovereign features but review them carefully. Consider the cost of changing platforms in both directions.

Also consider
  • Coolnet provides modular data centers, precision and liquid cooling, power, and DCIM that enable 30–80 kW/rack AI clusters. If you are exploring the owned path, you can review their solution portfolio here: Coolnet Solutions overview. Disclosure: Coolnet is our product.


Ready to pressure‑test your numbers and compress deployment timelines? Book a modular data center design workshop. We’ll model your 5‑year TCO across utilization and power scenarios, and outline a phased build plan aligned to your AI roadmap.
