Aviz ONES details orchestration for NVIDIA Spectrum-X AI fabrics
Aviz ONES software provides automation and orchestration for NVIDIA Spectrum-X Ethernet fabrics, targeting artificial intelligence (AI) and high-performance computing (HPC) clusters to streamline the deployment and scaling of graphics processing unit (GPU), central processing unit (CPU), and storage fabrics.
Integration of ONES and NVIDIA Spectrum-X
Spectrum-X hardware, including HGX H100/H200 GPUs, Spectrum-4 switches, and BlueField-3 SuperNICs, delivers an Ethernet-based, lossless GPU fabric. ONES complements this by automating network design, configuration, and ongoing management to support AI workloads.
Features of NVIDIA Spectrum-X Hardware
Spectrum-X incorporates high-throughput switches with 128 ports of 400 Gb/s Ethernet, enabling dense connectivity. The system also includes GPU accelerators and hardware components that handle RoCEv2 traffic and security within an open networking framework.
Capabilities of ONES Software Orchestration
ONES offers intent-based topology design, Zero-Touch Provisioning (ZTP) for device and fabric configuration without manual command-line interface (CLI) intervention, and continuous validation to detect configuration drift. It integrates telemetry that monitors metrics including Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) performance and GPU hardware statistics.
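As an illustration of what continuous validation against configuration drift involves, the following minimal Python sketch compares an intended configuration with the configuration observed on a device. It is not the ONES implementation or API. The function name `find_drift` and the setting names (`mtu`, `pfc_priority`, `ecn`, `extra_vlan`) are hypothetical and chosen only for this example.

```python
from typing import Any


def find_drift(intended: dict[str, Any], observed: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (intended_value, observed_value)} for every mismatch.

    A setting present on only one side is reported with None on the other side.
    """
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(intended.keys() | observed.keys()):
        want = intended.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical intended vs. observed settings for one switch port.
intended = {"mtu": 9216, "pfc_priority": 3, "ecn": True}
observed = {"mtu": 9216, "pfc_priority": 4, "ecn": True, "extra_vlan": 100}

drift = find_drift(intended, observed)
for setting, (want, have) in drift.items():
    print(f"{setting}: intended={want!r} observed={have!r}")
```

A production validator would also need to handle nested configuration and distinguish a setting that is absent from one that is explicitly set to None; the flat comparison here only shows the basic idea.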
Multi-fabric Orchestration and Operational Workflow
ONES manages East-West GPU fabrics optimized for intra-cluster traffic, storage fabrics for data access, and North-South CPU fabrics responsible for management and external communication. It supports automation from initial setup (Day-0) through operational management (Day-1) to ongoing maintenance and scaling (Day-2), covering activities such as tenant creation, GPU allocation, alerting, and repair workflows.
Scaling Approaches and Visibility Features
Scaling strategies include phased growth, in which spine switches are added incrementally, and upfront investment in maximum capacity to avoid re-cabling later. ONES maintains consistent IP mapping and network profiles as the fabric grows. For visibility, ONES collects metrics on fabric health, routing states, flow control, GPU utilization, and device component statuses, and it integrates alerting with communication platforms for rapid issue response.
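The trade-off between phased and upfront scaling can be illustrated with generic two-tier leaf-spine arithmetic. The Python sketch below is not taken from the blog. It assumes 128-port switches with all links at the same speed, half of each leaf's ports facing endpoints, and one uplink from every leaf to every spine. The function names and the phase sizes (16, 32, 64 spines) are hypothetical.

```python
from fractions import Fraction


def oversubscription(downlinks_per_leaf: int, spines: int, uplinks_per_spine: int = 1) -> Fraction:
    """Ratio of endpoint-facing ports to spine-facing ports in use on one leaf."""
    if spines <= 0 or uplinks_per_spine <= 0:
        raise ValueError("spines and uplinks_per_spine must be positive")
    return Fraction(downlinks_per_leaf, spines * uplinks_per_spine)


def max_endpoints(radix: int) -> int:
    """Endpoint count of a non-blocking two-tier fabric built from one switch radix.

    Each leaf uses radix // 2 ports for endpoints, and each spine port
    connects one leaf, so at most `radix` leaves can attach.
    """
    return radix * (radix // 2)


RADIX = 128                # assumed ports per switch
DOWNLINKS = RADIX // 2     # assumed endpoint-facing ports per leaf

# Phased growth: the same leaves, with spines added in stages.
phases = {spines: oversubscription(DOWNLINKS, spines) for spines in (16, 32, 64)}
for spines, ratio in phases.items():
    print(f"{spines} spines -> {ratio}:1 oversubscription")

print("max endpoints at full build-out:", max_endpoints(RADIX))
```

Under these assumptions, each added spine lowers the oversubscription ratio without touching the endpoint-facing cabling, while the full build-out reaches a 1:1 ratio from the start at a higher initial cost.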
Implications for AI and HPC Environments
The integrated solution aims to reduce deployment times from weeks to hours, deliver consistent GPU performance in multi-tenant settings, support scaling without infrastructure redesign, and provide operational reliability through validation and automated remediation. It is aimed at environments such as AI cloud service providers, research institutions, and enterprises managing HPC infrastructure.
This Blog Signals brief summarizes a vendor blog detailing how Aviz ONES combined with NVIDIA Spectrum-X components offers automated orchestration and management to address the demands of large-scale AI fabric deployment and operation.