Datadog, Inc. makes GPU Monitoring generally available

Datadog, Inc. made GPU Monitoring generally available for customers everywhere, positioning the offering around visibility into GPU fleet health, cost, and performance. The company said the update addresses ongoing challenges in managing expanding AI costs.

Datadog said GPU instances account for 14 percent of compute costs, and that teams cannot charge back GPU spend across business units, see workload context, or identify clear next steps for improvement. It also said most GPU tools provide high-level device health metrics but do not surface cross-functional resource contention issues or explain why training and inference workloads fail.

GPU Monitoring was described as providing unified visibility across the AI stack, linking fleet telemetry directly to the workloads consuming those resources. The release stated it connects GPU fleet health, cost, and performance to the teams using the systems for faster troubleshooting of slow workloads and for avoiding wasted spend. It also cited visibility into which devices are idle or ineffectively used.

Datadog said the product streams telemetry to support shared investigation by platform engineering and machine learning teams.

Kai Huang, Head of Product at Hyperbolic, said, “Datadog GPU Monitoring has made it easy for us to stay on top of our multi-tenant GPU infrastructure. We get per-instance, per-device visibility into core utilization, memory, power and thermals right out of the box with no extra setup. The dashboards are rich out of the gate and simple to customize, and standing up isolated views per customer takes minutes.”

Yanbing Li, Chief Product Officer at Datadog, added, “GPU instances account for 14 percent of compute costs, which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways. While these companies can see their costs climbing, they can’t chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways.”

Datadog's announcement included forward-looking statements about the benefits of new products and features; actual results may differ because of risk factors and other uncertainties.