Aviz Networks details ONES 3.1 real-time observability for GPU-accelerated compute environments

ONES 3.1 introduces integrated real-time monitoring across the network, compute, and storage components that serve artificial intelligence (AI) and high-performance computing (HPC) workloads. It gives IT and security teams a single operational view for managing cluster performance and reliability.

Unified Monitoring Approach

ONES 3.1 aggregates telemetry from host systems, accelerators, and network interfaces to provide visibility into the entire data path, including Peripheral Component Interconnect Express (PCIe) links and memory operations. This aggregation lets operators trace latency, packet loss, or performance throttling to its source anywhere in the stack.

Network Interface Controller Insights

The solution delivers detailed interface-level metrics such as administrative and operational status, MTU settings, port speeds, and auto-negotiation states. It tracks forward error correction (FEC) modes and collects Link Layer Discovery Protocol (LLDP) counters to aid in validating network topology and detecting configuration inconsistencies.
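As an illustration of where such interface-level values come from on a Linux host, the sketch below reads a few of them from sysfs. It is a minimal example, not ONES code: the function names `read_attr` and `nic_snapshot` are hypothetical, and it covers only administrative state, operational state, MTU, and speed. FEC mode, auto-negotiation state, and LLDP counters typically need other tools and are left out.

```python
from pathlib import Path
from typing import Optional

IFF_UP = 0x1  # administrative "up" bit in the interface flags


def read_attr(path: Path) -> Optional[str]:
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        # Some drivers refuse to report 'speed' while the link is down.
        return None


def nic_snapshot(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Collect basic state for one interface (hypothetical helper)."""
    base = Path(sysfs_root) / iface
    flags = read_attr(base / "flags")    # hex string such as "0x1003"
    mtu = read_attr(base / "mtu")
    speed = read_attr(base / "speed")    # Mb/s; may be "-1" when unknown
    speed_mbps = int(speed) if speed is not None else None
    if speed_mbps is not None and speed_mbps < 0:
        speed_mbps = None
    return {
        "iface": iface,
        "admin_up": bool(int(flags, 16) & IFF_UP) if flags is not None else None,
        "oper_state": read_attr(base / "operstate"),
        "mtu": int(mtu) if mtu is not None else None,
        "speed_mbps": speed_mbps,
    }


if __name__ == "__main__":
    # "lo" exists on virtually every Linux host; substitute a real NIC name.
    print(nic_snapshot("lo"))
```

Comparing `admin_up` with `oper_state` across nodes is one simple way to surface ports that are enabled in configuration but have no link.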

GPU and Compute Performance Tracking

Performance monitoring covers both host central processing units (CPUs) and graphics processing units (GPUs), highlighting bottlenecks and load imbalances across nodes and devices over various time frames. The system continuously measures CPU utilization, memory pressure, temperature, platform metadata, and uptime to support proactive resource management and to catch thermal or resource constraints before they degrade workloads.
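To make "CPU utilization" concrete, the following sketch derives a percentage from two samples of the aggregate `cpu` line in `/proc/stat`. It shows the general technique on Linux under stated assumptions, not how ONES computes the figure. The helper names are hypothetical, and counting iowait as idle time is a convention chosen for this example.

```python
import time
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:]]
    # Columns: user nice system idle iowait irq softirq steal guest guest_nice
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    total = sum(values[:8])  # guest time is already included in user/nice
    return idle, total


def cpu_utilization(before: str, after: str) -> float:
    """Percent of non-idle time between two samples of the 'cpu' line."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    d_total = total1 - total0
    if d_total <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / d_total)


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as fh:
        return fh.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(1.0)  # sampling interval; shorter intervals are noisier
    print(f"CPU utilization: {cpu_utilization(first, read_cpu_line()):.1f}%")
```

Because the counters are cumulative, the utilization figure always describes the interval between two samples, which is why the sampling period determines how quickly short load spikes become visible.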

GPU-Specific Metrics

ONES leverages NVIDIA System Management Interface (SMI) to capture GPU-specific data, including temperature, utilization, power consumption, memory use, bus identifiers, and serial numbers. The collected metrics correlate power and thermal dynamics with workload phases to support resource placement and failure mitigation.
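For readers who want to see what this data looks like at the source, the sketch below queries the `nvidia-smi` command-line tool in CSV mode and parses the result. It is an illustrative example only; the article does not describe how ONES itself invokes the tool. The function names are hypothetical, and values are kept as strings because the tool can report placeholders such as `[N/A]` for fields a device does not expose.

```python
import csv
import io
import subprocess
from typing import Dict, List

# Field names accepted by `nvidia-smi --query-gpu=...`
QUERY_FIELDS = [
    "index",
    "temperature.gpu",
    "utilization.gpu",
    "power.draw",
    "memory.used",
    "pci.bus_id",
    "serial",
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (v.strip() for v in record))))
    return rows


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Polling this at a fixed interval and storing the rows with a timestamp is enough to line up power and temperature readings against the phases of a training job.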

CPU and Memory Utilization Overview

The platform monitors CPU load patterns and memory consumption at both the host and GPU levels, and combines them with uptime data to support stability assessments, maintenance scheduling, and enforcement of service level agreements (SLAs) for AI training and inference cycles.
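On the host side, memory consumption and uptime of this kind can be read from `/proc/meminfo` and `/proc/uptime` on Linux. The sketch below is a minimal example under that assumption, with hypothetical helper names. Treating `MemTotal - MemAvailable` as "used" memory is one common definition, not necessarily the one ONES applies.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_used_percent(meminfo_text: str) -> float:
    """Percent of memory in use, defined here as total minus available."""
    info = parse_meminfo(meminfo_text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def uptime_seconds(uptime_text: str) -> float:
    """First field of /proc/uptime: seconds since boot."""
    return float(uptime_text.split()[0])


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        print(f"memory used: {memory_used_percent(fh.read()):.1f}%")
    with open("/proc/uptime") as fh:
        print(f"uptime: {uptime_seconds(fh.read()) / 3600:.1f} h")
```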

Storage and Platform Health Monitoring

ONES tracks disk health and utilization, providing metrics such as percentage use, absolute capacity in megabytes, temperature, and health status. It supplements these storage indicators with chassis and platform health data to offer a comprehensive overview of node readiness and to anticipate input/output slowdowns.
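The utilization half of these metrics is straightforward to reproduce with the Python standard library, as the sketch below shows. It is an illustration rather than the product's method: the function names are hypothetical, a megabyte is taken as 2**20 bytes by assumption, and disk temperature and health status are omitted because they require device-specific tooling.

```python
import shutil

BYTES_PER_MB = 1024 * 1024  # assumption: "megabyte" means 2**20 bytes


def summarize(total_bytes: int, used_bytes: int) -> dict:
    """Convert raw byte counts into capacity in MB and percentage used."""
    percent = round(100.0 * used_bytes / total_bytes, 1) if total_bytes else 0.0
    return {
        "total_mb": total_bytes // BYTES_PER_MB,
        "used_mb": used_bytes // BYTES_PER_MB,
        "percent_used": percent,
    }


def disk_report(mount_point: str = "/") -> dict:
    """Report utilization for the filesystem containing mount_point."""
    usage = shutil.disk_usage(mount_point)
    report = summarize(usage.total, usage.used)
    report["mount"] = mount_point
    return report


if __name__ == "__main__":
    print(disk_report("/"))
```

Tracking `percent_used` over time, rather than as a single reading, is what makes it possible to anticipate a volume filling up during a long-running job.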

Vendor Support and Scalability

The monitoring architecture is built on standard Linux interfaces and accommodates network interface controllers (NICs) from multiple vendors, including Intel and Mellanox/NVIDIA. It supports large deployments by centralizing monitoring across servers hosting GPUs and NICs without adding resource overhead or hardware dependency.

ONES 3.1 delivers consolidated, real-time telemetry across compute, network, and storage domains. This broad visibility enables operational teams to manage multi-vendor AI and HPC systems effectively, reducing downtime and optimizing performance.