NVIDIA details Spectrum-X and Aviz ONES integration for AI network management
NVIDIA and Aviz Networks demonstrated an integrated solution that combines the NVIDIA Spectrum-X Ethernet fabric, which is optimized for Artificial Intelligence (AI) workloads, with Aviz's ONES orchestration platform to address scaling challenges in distributed Graphics Processing Unit (GPU) clusters. The integration gives enterprise IT and security leaders a framework for managing high-performance AI networks with greater automation, segmentation, and real-time visibility.
Research overview
NVIDIA Spectrum-X is an Ethernet fabric tailored to AI. It offers InfiniBand-like features such as Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE), adaptive routing, and congestion control to keep GPU clusters performing efficiently. The Spectrum-X Reference Architecture (RA 1.3.0) has been validated on supercomputer-scale deployments, combining open network operating systems and NVIDIA digital twin simulations to support reliability and scalability.
Aviz Networks' ONES platform integrates with Spectrum-X to provide end-to-end orchestration. Its capabilities include declarative fabric design, Zero-Touch Provisioning (ZTP), and operational lifecycle management, all designed to simplify deployment and management of multi-tenant AI environments.
Key findings
The integration enables Full Stack Observability (FSO) across network switches, hosts, and GPUs without relying on additional agents, delivering real-time telemetry and automated alerting via communication and incident management tools. Multi-tenant AI workloads benefit from EVPN/VRF segmentation and GPU-aware resource allocation, enforcing isolation policies to secure data flow and workload separation.
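The tenant isolation idea above can be illustrated with a minimal Python sketch. It is not the ONES API: the Tenant and TenantRegistry classes, the vrf-<name> naming, and the VNI numbers are all hypothetical, chosen only to show how GPU assignments and segmentation identifiers might be kept disjoint across tenants.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    """Hypothetical tenant record: a name, a VRF, an EVPN VNI, and its GPUs."""
    name: str
    vrf: str
    vni: int
    gpus: set = field(default_factory=set)


class TenantRegistry:
    """Illustrative allocator that keeps GPU assignments and VNIs disjoint."""

    def __init__(self, gpu_pool):
        self.free_gpus = set(gpu_pool)
        self.tenants = {}

    def create_tenant(self, name, vni, gpu_count):
        if name in self.tenants:
            raise ValueError(f"tenant {name!r} already exists")
        if any(t.vni == vni for t in self.tenants.values()):
            raise ValueError(f"VNI {vni} is already in use")
        if gpu_count > len(self.free_gpus):
            raise ValueError("not enough free GPUs")
        # Sort so the allocation is deterministic for a given pool.
        assigned = set(sorted(self.free_gpus)[:gpu_count])
        self.free_gpus -= assigned
        tenant = Tenant(name=name, vrf=f"vrf-{name}", vni=vni, gpus=assigned)
        self.tenants[name] = tenant
        return tenant

    def may_communicate(self, gpu_a, gpu_b):
        """Two GPUs may exchange traffic only if a single tenant owns both."""
        return any(
            gpu_a in t.gpus and gpu_b in t.gpus for t in self.tenants.values()
        )


registry = TenantRegistry(gpu_pool=[f"gpu{i}" for i in range(8)])
team_a = registry.create_tenant("team-a", vni=10010, gpu_count=4)
team_b = registry.create_tenant("team-b", vni=10020, gpu_count=2)
```

In this sketch a GPU can belong to at most one tenant because it leaves the free pool when assigned, so a check such as registry.may_communicate("gpu0", "gpu4") fails across tenants. A real fabric would enforce the same boundary in the network through the VRF and EVPN configuration rather than in application code.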
Operational workflows demonstrated during the bootcamp included automated orchestration of Spectrum-X fabrics, tenant creation with GPU assignment, policy enforcement validation, and support for configuration comparison and structured hardware replacement procedures.
Technical breakdown
Spectrum-X extends conventional Ethernet with Remote Direct Memory Access (RDMA) for direct memory transfers between GPUs, adaptive routing to steer traffic around network congestion, and congestion control mechanisms to sustain consistent throughput under heavy load. The Reference Architecture (RA) 1.3.0 blueprint uses SONiC or Cumulus Linux as the network operating system (NOS), NetQ telemetry, and NVIDIA Air digital twins, alongside BlueField accelerators, to achieve predictable performance at scale.
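To make the contrast with conventional Ethernet concrete, the following Python sketch compares static hash-based path selection with a load-aware choice. It is a deliberately simplified illustration and not Spectrum-X's actual algorithm: the function names, the uplink names, and the use of queue depth as the load signal are assumptions made for the example.

```python
import zlib


def static_ecmp(flow_id, uplinks):
    """Pick an uplink from a hash of the flow identifier, ignoring load."""
    return uplinks[zlib.crc32(flow_id.encode()) % len(uplinks)]


def adaptive(queue_depth):
    """Pick the uplink with the shallowest queue; ties go to the lowest name."""
    return min(sorted(queue_depth), key=lambda link: queue_depth[link])


uplinks = ["up1", "up2", "up3"]
# Hypothetical queue depths (packets waiting) observed on each uplink.
queue_depth = {"up1": 10, "up2": 3, "up3": 3}

hashed_choice = static_ecmp("gpu0->gpu5", uplinks)  # same link on every call
adaptive_choice = adaptive(queue_depth)             # "up2", the least loaded
```

The hash-based choice always maps a given flow to the same uplink, even when that uplink is congested. The load-aware choice follows current conditions, which is the general idea behind routing around congestion.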
Aviz ONES uses a declarative approach to network design, enabling administrators to define and validate fabric topology through digital twins before applying configurations automatically. It supports multi-tenant provisioning, leveraging segmentation protocols and GPU awareness to maintain workload separation and policy adherence. The platform collects telemetry directly from native sources in the network and compute layers to provide agentless monitoring.
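A declarative workflow of this kind can be sketched in Python as a desired-state description that is validated first and only then compared with the running fabric. The FabricSpec structure, the validate and plan functions, and the device names are hypothetical and do not represent the ONES data model. The rule that every leaf connects to every spine is an assumption typical of leaf-spine designs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSpec:
    """Hypothetical desired state for a two-tier leaf-spine fabric."""
    spines: tuple
    leaves: tuple
    links: frozenset  # set of (leaf, spine) pairs


def validate(spec):
    """Return a list of problems; an empty list means the spec is consistent."""
    problems = []
    for leaf, spine in sorted(spec.links):
        if leaf not in spec.leaves or spine not in spec.spines:
            problems.append(f"link ({leaf}, {spine}) references an unknown device")
    for leaf in spec.leaves:
        for spine in spec.spines:
            if (leaf, spine) not in spec.links:
                problems.append(f"missing uplink {leaf} -> {spine}")
    return problems


def plan(desired, actual_links):
    """Compute the link changes that move the running fabric to the spec."""
    return {
        "add": sorted(desired.links - actual_links),
        "remove": sorted(actual_links - desired.links),
    }


spec = FabricSpec(
    spines=("spine1", "spine2"),
    leaves=("leaf1", "leaf2"),
    links=frozenset(
        {("leaf1", "spine1"), ("leaf1", "spine2"),
         ("leaf2", "spine1"), ("leaf2", "spine2")}
    ),
)
problems = validate(spec)
# Pretend the running fabric is missing one uplink.
changes = plan(spec, actual_links={("leaf1", "spine1"), ("leaf1", "spine2"),
                                   ("leaf2", "spine1")})
```

Because the operator states only the intended end state, validation can run against a simulated copy of the fabric before any device is touched, and the computed plan contains just the differences that need to be applied.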
Operational impact
By uniting Spectrum-X and ONES, enterprises can automate the AI network lifecycle from initial deployment (Day 0) through ongoing configuration management and operational monitoring (Day 2). The solution addresses challenges specific to AI networking at scale, such as workload isolation, resource allocation, and visibility into both network and compute components. The agentless telemetry can feed alert notifications into platforms such as Slack, ServiceNow, and Zabbix.
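The path from telemetry to alert notification can be pictured with a short Python sketch. The metric names, thresholds, and sample layout below are invented for illustration, and the sinks are plain callables standing in for whatever chat or ticketing client a deployment uses. No real Slack, ServiceNow, or Zabbix API is called.

```python
def evaluate(sample, thresholds):
    """Return one alert per metric in the sample that exceeds its threshold."""
    alerts = []
    for metric, limit in thresholds.items():
        value = sample["metrics"].get(metric)
        if value is not None and value > limit:
            alerts.append(
                {"source": sample["source"], "metric": metric,
                 "value": value, "limit": limit}
            )
    return alerts


def dispatch(alerts, sinks):
    """Format each alert as text and hand it to every configured sink."""
    for alert in alerts:
        message = (
            f"{alert['source']}: {alert['metric']}={alert['value']} "
            f"exceeds {alert['limit']}"
        )
        for sink in sinks:
            sink(message)


# Hypothetical thresholds and one telemetry sample from a switch.
thresholds = {"link_utilization_pct": 90, "gpu_temp_c": 85}
sample = {
    "source": "leaf1",
    "metrics": {"link_utilization_pct": 97, "rx_errors": 0},
}

sent = []  # stand-in sink that just records messages
dispatch(evaluate(sample, thresholds), sinks=[sent.append])
```

Keeping evaluation separate from dispatch means the same alert can reach several destinations, and a new destination is added by supplying another callable.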
This integration enables GPU cluster operators and AI cloud providers to manage complex environments with a reduced operational burden while supporting high performance and secure multi-tenancy.
This Blog Signals brief presents an impartial summary of the NVIDIA and Aviz Networks blog post, highlighting the combined capabilities of Spectrum-X and ONES suitable for enterprise decision-makers considering AI network infrastructure.