NVIDIA details ONES 3.1 update for Spectrum-X on Cumulus Linux
The Open Networking Enterprise Suite (ONES) 3.1 release introduces enhanced telemetry capabilities for NVIDIA Spectrum-X switches operating with Cumulus Linux versions 5.9 through 5.11, providing comprehensive, real-time observability across Artificial Intelligence (AI) and GPU-driven data center fabrics. This level of visibility offers network and security professionals detailed insights into performance, state, and traffic flows, which is critical for managing complex enterprise infrastructures.
Research Overview
ONES 3.1 extends its monitoring support to Spectrum-X switches running Cumulus Linux 5.9 to 5.11 using an agentless approach. It relies on NVIDIA's NVUE daemon, exposed through Representational State Transfer (REST) APIs and an NGINX front end, so telemetry is collected without installing additional software agents on the devices. The suite aggregates performance, health, and traffic data to support timely troubleshooting and operational management within AI/ML network fabrics.
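As a rough illustration of the agentless model, the following Python sketch builds and sends an authenticated GET request to a switch's REST endpoint using only the standard library. The port (8765), the /nvue_v1 path prefix, the "interface" resource, and the function names build_request and fetch_json are assumptions made for this sketch, not details taken from the ONES 3.1 announcement; check them against the documentation for your Cumulus Linux release.

```python
import base64
import json
import ssl
import urllib.request


def build_request(host, path, username, password, port=8765):
    """Build a GET request with HTTP Basic auth for a REST resource.

    The port and the /nvue_v1 prefix are assumed defaults for this sketch.
    """
    url = f"https://{host}:{port}/nvue_v1/{path.lstrip('/')}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, verify_tls=True, timeout=5.0):
    """Send the request and decode the JSON body of the response."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Only for lab switches with self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical host and credentials; nothing is installed on the switch.
    req = build_request("leaf01.example.net", "interface", "monitor", "secret")
    print(req.full_url)
```

A collector would call fetch_json(req) on a schedule and forward the decoded payload to its own storage; the switch itself only needs its REST service enabled and reachable.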
Technical Breakdown
The platform captures live telemetry, including detailed RDMA over Converged Ethernet (RoCE) metrics such as Priority Flow Control (PFC) and queue-level statistics, which are critical for tuning Remote Direct Memory Access (RDMA) paths. ONES provides unified dashboards that consolidate monitoring across environments running both SONiC and Cumulus Linux, streamlining operations under a single interface. Additionally, an AI/ML topology visualization enables operators to identify network imbalances and hot spots across data center interconnects.
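To make the PFC metric concrete, here is a minimal Python sketch of how a collector might turn two cumulative pause-frame counter samples into a per-priority rate. The PfcSample type, its field names, and the choice to treat a decreasing counter as a reset are all assumptions for illustration; the announcement does not describe how ONES computes its RoCE metrics.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PfcSample:
    """One poll of cumulative PFC pause-frame counters (hypothetical shape)."""
    timestamp_s: float
    rx_pause_frames: Dict[int, int]  # traffic priority (0-7) -> cumulative count


def pause_rates(prev: PfcSample, curr: PfcSample) -> Dict[int, float]:
    """Return pause frames per second for each priority present in both samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    rates = {}
    for priority, current_count in curr.rx_pause_frames.items():
        previous_count = prev.rx_pause_frames.get(priority)
        if previous_count is None:
            continue  # no baseline yet for this priority
        delta = current_count - previous_count
        if delta < 0:
            # Assumption: a smaller value means the counter was reset,
            # so count only the frames seen since the reset.
            delta = current_count
        rates[priority] = delta / elapsed
    return rates


if __name__ == "__main__":
    before = PfcSample(0.0, {3: 100})
    after = PfcSample(10.0, {3: 350})
    print(pause_rates(before, after))
```

A sustained non-zero rate on the priority that carries RoCE traffic is the kind of signal an operator would examine alongside queue depth when tuning RDMA paths.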
Operational Impact
ONES incorporates an advanced rule engine designed to automate anomaly detection and response, with customizable thresholds and integrations for notifications via systems like Slack and Zendesk. This facilitates rapid awareness and remediation of network issues. The unified platform supports troubleshooting efficiency by delivering granular insights coupled with contextual information relevant to enterprise-scale Graphics Processing Unit (GPU) cluster environments.
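The threshold-and-notify pattern can be sketched in a few lines of Python. This is not the ONES rule engine: the Rule type, the metric names, and the evaluate function are hypothetical, and the notifier is a plain callable standing in for whatever Slack or Zendesk integration a deployment uses.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """A single threshold rule (hypothetical shape for this sketch)."""
    name: str
    metric: str
    comparison: str  # one of ">", ">=", "<", "<="
    threshold: float


_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def evaluate(rules: List[Rule], metrics: Dict[str, float],
             notify: Callable[[str], None]) -> List[str]:
    """Check each rule against the latest metrics and notify on every breach.

    Returns the names of the rules that fired. Metrics missing from the
    snapshot are skipped rather than treated as breaches.
    """
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue
        if _COMPARATORS[rule.comparison](value, rule.threshold):
            notify(f"{rule.name}: {rule.metric}={value} {rule.comparison} {rule.threshold}")
            fired.append(rule.name)
    return fired


if __name__ == "__main__":
    rules = [Rule("pfc-storm", "pfc_pause_rate", ">", 1000.0)]
    evaluate(rules, {"pfc_pause_rate": 1500.0}, print)
```

Passing the notifier in as a callable keeps the evaluation logic independent of any particular messaging system, so the same rules can be exercised in tests with a list's append method.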
Benefits Analysis
The integration provides enterprises with a single monitoring solution that bridges different network Operating System (OS) environments, reducing operational complexity. It contributes to capacity planning and security compliance by offering comprehensive visibility into network traffic and device health. Moreover, the suite’s support for Priority Flow Control (PFC) and queue analytics aids in optimizing network performance, particularly for latency-sensitive RoCE traffic.
This release is intended to scale as GPU cluster demands grow, sustaining network performance and reliability in data center deployments that use Spectrum-X and Cumulus Linux.
ONES 3.1 delivers agentless telemetry collection, detailed RoCE insights, proactive rule-driven monitoring, and unified dashboards aimed at supporting real-time management of AI/ML and cloud workloads on Spectrum-X fabrics. This Blog Signals brief presents a summary of the vendor’s blog content relevant for enterprise IT and security leaders.