
Datadog’s State of AI Engineering 2026 Report Links Production Failures to Capacity Limits

Datadog said its State of AI Engineering 2026 report found that nearly 1 in 20 AI requests fails in production and that capacity limits have become a main bottleneck to scaling AI reliably. The report draws on real-world usage data from organizations running AI in production.

According to the report, about 5% of AI model requests failed in production. Nearly 60% of those failures were caused by capacity limits, which contributed to slowdowns, errors, and broken experiences in AI-powered applications. The report also found that 69% of companies used three or more models, alongside increasingly complex agent workflows.
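
As a rough reading of those two figures, the share of all requests that failed because of capacity limits can be estimated by multiplying them. The sketch below assumes the report's rounded values (about 5% and nearly 60%) and uses a hypothetical volume of one million requests purely for illustration; neither the function name nor the volume comes from the report.

```python
# Approximate figures as stated in the report summary.
failure_rate = 0.05      # about 5% of AI model requests failed in production
capacity_share = 0.60    # nearly 60% of those failures were tied to capacity limits

# Implied share of ALL requests that failed because of capacity limits.
capacity_failure_rate = failure_rate * capacity_share


def expected_capacity_failures(total_requests: int) -> float:
    """Estimate capacity-related failures for a hypothetical request volume."""
    return total_requests * capacity_failure_rate


if __name__ == "__main__":
    # Hypothetical volume, chosen only to make the proportion concrete.
    volume = 1_000_000
    print(f"Capacity-related failure rate: {capacity_failure_rate:.1%}")
    print(f"Estimated failures per {volume:,} requests: "
          f"{round(expected_capacity_failures(volume)):,}")
```

Under these rounded inputs, roughly 3 in every 100 requests would fail for capacity reasons. Since both inputs are approximations, the result is an order-of-magnitude guide, not a figure stated in the report.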

The report also detailed the growth of multi-model deployments and the rising complexity of agent workflows. It stated that OpenAI accounted for a 63% share, and that adoption of Google Gemini and Anthropic Claude rose by 20 and 23 percentage points, respectively. It also said the average number of tokens sent per request more than doubled for median-use teams and quadrupled for heavy users, based on usage at the 50th and 90th percentiles.

The report framed these issues around operational control and observability across systems, citing speed without control as a risk factor as AI deployment expands. Yanbing Li, Chief Product Officer at Datadog, said, “AI is starting to look a lot like the early days of cloud.” Guillermo Rauch, CEO at Vercel, said, “The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” and Li added, “In this new era, AI observability becomes as essential as cloud observability was a decade ago.”