
Aviz details monitoring capabilities for NVIDIA Spectrum-X and Cumulus Linux environments

Aviz ONES gives enterprises visibility into NVIDIA Spectrum-X and Cumulus Linux environments through real-time monitoring of Graphics Processing Unit (GPU) and Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) traffic, agentless telemetry, and automated alerts. Together these support efficient network management in AI-focused data centers.

Research Overview

Modern data centers running Artificial Intelligence (AI) workloads require detailed observability spanning switches, network interface cards, and GPUs to maintain low latency and high performance. Monitoring RoCE traffic and fabric health is crucial to support demanding AI and Machine Learning Operations (MLOps) workloads.

Product Update

Aviz ONES integrates network monitoring across NVIDIA Spectrum-X and Cumulus Linux platforms, providing agentless telemetry collection via NVIDIA NVUE for real-time hardware and protocol insights. The platform supports multi-vendor network fabrics and delivers visibility into AI-specific topologies and RoCE traffic essential for GPU communication.

Technical Breakdown

By utilizing NVIDIA NVUE and NGINX, Aviz ONES offers high-fidelity telemetry from Cumulus Linux devices without impacting device performance. It monitors device health, interface statistics, and protocol states, ensuring consistent integration across different software versions within AI network clusters.
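As a rough illustration of agentless collection, the sketch below shows how an off-box collector might address an NVUE REST resource and turn two interface counter snapshots into throughput. It is not Aviz's implementation. The port and URL prefix follow Cumulus Linux defaults as we understand them, and the `CounterSample` fields and helper names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


def nvue_url(host: str, resource: str, port: int = 8765) -> str:
    """Build the URL of an NVUE REST resource.

    Assumption: the switch exposes NVUE at https://<host>:8765/nvue_v1/,
    which is the default in recent Cumulus Linux releases.
    """
    return f"https://{host}:{port}/nvue_v1/{resource.lstrip('/')}"


@dataclass(frozen=True)
class CounterSample:
    """One poll of an interface's byte counters (hypothetical shape)."""
    timestamp: float  # seconds since an arbitrary epoch
    rx_bytes: int
    tx_bytes: int


def throughput_bps(prev: CounterSample, curr: CounterSample) -> dict[str, float]:
    """Convert two counter snapshots into receive/transmit bits per second."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # A negative delta means the counter was reset; report 0 for that interval.
    rx_delta = max(0, curr.rx_bytes - prev.rx_bytes)
    tx_delta = max(0, curr.tx_bytes - prev.tx_bytes)
    return {
        "rx_bps": rx_delta * 8 / elapsed,
        "tx_bps": tx_delta * 8 / elapsed,
    }
```

A collector built this way would poll each switch on a fixed interval from a central host and keep only the previous sample per interface, so nothing needs to be installed on the device itself.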

Operational Impact

Enterprise network teams can use Aviz ONES to optimize distributed AI training through congestion control and fabric balancing. The solution also facilitates multi-tenant isolation in AI-as-a-Service platforms, speeds up troubleshooting by correlating hardware alerts with telemetry anomalies, and enhances throughput monitoring with Non-Volatile Memory Express (NVMe) over Fabrics visibility.

Leadership Perspective

Aviz ONES incorporates an advanced rule engine to automate alerting and identify anomalies by tracking Central Processing Unit (CPU) and memory utilization, configuration deviations, device faults, and link statuses. This automation aids in proactive network management and operational efficiency.
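To make the idea of a rule engine concrete, here is a minimal sketch of threshold and state rules evaluated against one device's metrics. The rule names, thresholds, and metric keys are all hypothetical and do not reflect how Aviz ONES defines or stores its rules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str
    predicate: Callable[[dict], bool]  # returns True when the rule fires


# Hypothetical rules covering the categories named above:
# CPU and memory utilization, link status, and configuration deviation.
RULES = [
    Rule("cpu_high", "warning", lambda m: m.get("cpu_pct", 0) > 85),
    Rule("memory_high", "warning", lambda m: m.get("mem_pct", 0) > 90),
    Rule(
        "link_down",
        "critical",
        lambda m: any(state != "up" for state in m.get("links", {}).values()),
    ),
    Rule(
        "config_drift",
        "warning",
        lambda m: m.get("running_config_hash") != m.get("baseline_config_hash"),
    ),
]


def evaluate(device: str, metrics: dict, rules: list[Rule] = RULES) -> list[dict]:
    """Return one alert per rule whose predicate matches the device metrics."""
    return [
        {"device": device, "rule": rule.name, "severity": rule.severity}
        for rule in rules
        if rule.predicate(metrics)
    ]
```

For example, calling `evaluate("leaf01", {"cpu_pct": 92, "links": {"swp1": "up", "swp2": "down"}})` yields a `cpu_high` warning and a `link_down` critical alert. Keeping each rule as data plus a predicate makes it easy to add or tune rules without touching the evaluation loop.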

Additional Capabilities

Beyond telemetry, the solution supports real-time monitoring of protocols such as Border Gateway Protocol (BGP), Link Aggregation Control Protocol (LACP), and Quality of Service (QoS), offers configuration management features including backup and restore operations, provides GPU health metrics, and enables detailed traffic analysis with multi-fabric orchestration capabilities.

These features contribute to an integrated approach for managing complex, AI-driven network fabrics with scalability and reliability.

Aviz ONES centralizes telemetry, configuration management, and performance analytics for Spectrum-X environments, with broad compatibility across multiple network operating systems (NOS). Its architecture is designed to scale to hyperscale AI workloads and offers automated alerting to maintain network reliability within zero-trust frameworks.

This Blog Signals brief is a factual summary of Aviz ONES capabilities as presented in the vendor blog, highlighting considerations for enterprise IT leaders overseeing AI and Machine Learning (ML) networks.