Aviz details ONES 4.0 GPU monitoring

Aviz introduced ONES 4.0 to deliver vendor-neutral monitoring, health dashboards, and customizable alerts. The release follows Meta's Llama-3 training run, in which 78% of interruptions were hardware-related, a finding that underscores the risk to enterprise Graphics Processing Unit (GPU) clusters at scale.

Research overview

Meta trained Llama-3 405B across 24,000 GPUs on a three-layer Clos fabric and completed a 54-day training run that recorded interruption types and frequencies.

The dataset from that run attributed 78% of interruptions to hardware issues rather than training strategies or model design.

Key findings

Faulty GPUs accounted for about 30% of failures, memory errors for roughly 17%, and network switches or cabling for about 8%, with host maintenance and software bugs contributing additional incidents.

The reported distribution shows failures concentrated in the compute and network hardware layers during large-scale training.

Technical breakdown

Observed GPU problems included uneven device utilization, Error-Correcting Code (ECC) memory events, in which single-bit errors are treated as warnings and double-bit errors as critical, thermal or power throttling, and progressive hardware degradation caught by health checks.

Network-level causes included switch and cable faults, while host maintenance and software issues generated further interruptions that affected workloads.

Product update

ONES 4.0 provides end-to-end monitoring across GPUs, CPUs, NICs, memory, SSDs, power, and switches and surfaces severity levels mapped to Green, Yellow, and Red; example outputs include single-bit and double-bit ECC events from utilities such as nvidia-smi.
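The severity mapping described above can be illustrated with a minimal Python sketch. It is not the ONES 4.0 implementation: the `EccCounts` type, the function names, and the specific rule (any double-bit error is Red, any single-bit error is Yellow, otherwise Green) are assumptions chosen for illustration, and the counts are presumed to have already been collected from a utility such as nvidia-smi.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical ordering used to compare severities; not taken from ONES.
SEVERITY_ORDER = {"Green": 0, "Yellow": 1, "Red": 2}


@dataclass
class EccCounts:
    """ECC error counters for one GPU (hypothetical container type)."""
    gpu_id: str
    single_bit: int  # single-bit (correctable) errors
    double_bit: int  # double-bit (uncorrectable) errors


def classify_ecc(counts: EccCounts) -> str:
    """Map ECC counters to a color, assuming any double-bit error is critical."""
    if counts.double_bit > 0:
        return "Red"
    if counts.single_bit > 0:
        return "Yellow"
    return "Green"


def worst_severity(all_counts: Iterable[EccCounts]) -> str:
    """Return the most severe color across GPUs; Green if there are none."""
    return max(
        (classify_ecc(c) for c in all_counts),
        key=SEVERITY_ORDER.__getitem__,
        default="Green",
    )
```

A dashboard could call `classify_ecc` per device and `worst_severity` to roll the result up to a node or cluster view; a real deployment would likely use configurable count thresholds rather than the zero-tolerance rule assumed here.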

The release includes a GPU dashboard with inventory and error visibility, configurable thresholds and rules for alerts, and integrations with ticketing and messaging platforms such as ServiceNow, Zendesk, and Slack; it also supports NVIDIA, AMD, SONiC, and Arista environments.

Operational impact

Configured thresholds and alerting rules let operators map utilization and error conditions to warning and critical states and automate notifications when conditions persist, enabling earlier intervention before workloads fail.
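One way to picture a threshold rule with a persistence condition is the sketch below. It is an illustration under stated assumptions, not the ONES 4.0 rule engine: `ThresholdRule`, `evaluate`, the field names, and the choice to measure persistence in consecutive samples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: warning and critical levels plus persistence."""
    metric: str
    warning: float
    critical: float
    persist_samples: int  # consecutive samples that must breach a level

    def __post_init__(self) -> None:
        if self.persist_samples < 1:
            raise ValueError("persist_samples must be at least 1")
        if self.critical < self.warning:
            raise ValueError("critical must not be below warning")


def evaluate(rule: ThresholdRule, samples: Sequence[float]) -> str:
    """Return 'ok', 'warning', or 'critical' for the most recent samples.

    A state is reported only when every one of the last `persist_samples`
    readings is at or above that state's threshold, so a single spike
    does not trigger a notification.
    """
    if len(samples) < rule.persist_samples:
        return "ok"  # not enough history to say the condition persisted
    recent = samples[-rule.persist_samples:]
    if all(value >= rule.critical for value in recent):
        return "critical"
    if all(value >= rule.warning for value in recent):
        return "warning"
    return "ok"
```

For example, a rule such as `ThresholdRule("gpu_temp_c", warning=80, critical=90, persist_samples=3)` would report a warning only after three consecutive readings at or above 80; the metric name and values are placeholders. The returned state could then drive a notification to a ticketing or messaging integration.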

The vendor brief emphasizes that early warnings, root-cause clarity, and customizable policies can shorten recovery time and provide consistent monitoring across vendor ecosystems.

Enterprises can apply continuous, cross-vendor monitoring and configurable alerts to detect the GPU, memory, and network faults observed in Meta's Llama-3 run and to reduce service interruptions. This “Blog Signals brief” is a fact-based summary of the vendor blog.

Full content from aviznetworks.com.