ONES 3.1 Improves Observability for High-Performance Computing

This update covers the launch of ONES 3.1, which offers real-time telemetry for high-performance computing (HPC) and artificial intelligence (AI) workloads. It is relevant to IT decision-makers responsible for monitoring and optimizing complex compute environments.

Product Update

ONES 3.1 introduces vendor-agnostic telemetry covering NICs, GPUs, CPUs, memory, and storage. It helps operators identify bottlenecks faster and prevent system failures in HPC and AI infrastructure.

Unified Monitoring Capabilities

The software correlates data from hosts and network components to provide insight into the entire data path, so operators can diagnose performance issues effectively. This correlation is designed to reveal the true sources of latency and loss, supporting a shift from reactive troubleshooting to proactive performance management.

Insights on NIC Performance

ONES offers detailed metrics on network interfaces, including operational status and port speed. These metrics help keep network links clean and reliable by validating network configuration and catching potential issues early.
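As an illustration of the kind of check described above, the following is a minimal sketch that reads link state and port speed from the standard Linux sysfs tree (/sys/class/net). It is not ONES code: the function names read_nic_link and link_problems, and the expected-speed parameter, are hypothetical and chosen only for this example.

```python
from pathlib import Path


def read_nic_link(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Return operational state and speed (Mb/s) for one interface."""
    base = Path(sysfs_root) / iface
    state = (base / "operstate").read_text().strip()
    try:
        speed = int((base / "speed").read_text().strip())
    except (OSError, ValueError):
        # Reading speed can fail on virtual or unsupported interfaces.
        speed = None
    if speed is not None and speed < 0:
        # The kernel reports -1 when the speed is unknown (e.g. link down).
        speed = None
    return {"iface": iface, "operstate": state, "speed_mbps": speed}


def link_problems(link: dict, expected_mbps: int) -> list:
    """List human-readable problems for a link (hypothetical policy)."""
    problems = []
    if link["operstate"] != "up":
        problems.append(f"{link['iface']}: operstate is {link['operstate']}")
    speed = link["speed_mbps"]
    if speed is not None and speed < expected_mbps:
        problems.append(
            f"{link['iface']}: negotiated {speed} Mb/s, expected {expected_mbps}"
        )
    return problems
```

A link that negotiates below its expected speed is a common sign of a cabling or configuration problem, which is why the sketch compares against an expected value instead of only checking that the link is up.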

GPU and Compute Performance Tracking

The system monitors both host and graphics processing unit (GPU) performance and identifies imbalances that could affect operations, so that performance issues can be addressed across nodes and devices.

Health Monitoring for Compute Resources

Monitoring includes tracking Central Processing Unit (CPU) utilization, memory pressure, and temperature. By establishing proactive thresholds, organizations can avoid conditions that lead to reduced performance or system failures.
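Threshold-based alerting of this kind can be sketched as follows. This is an illustrative example, not the ONES rule engine: the Threshold class, the metric names, and every limit value are assumptions made up for the sketch and should be replaced with values that fit the actual hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # key in the sample dictionary
    warn: float   # value at or above which a warning is raised
    crit: float   # value at or above which the state is critical


# Hypothetical limits, chosen only to make the example concrete.
DEFAULT_THRESHOLDS = [
    Threshold("cpu_util_pct", warn=85.0, crit=95.0),
    Threshold("mem_used_pct", warn=80.0, crit=90.0),
    Threshold("temp_c", warn=80.0, crit=90.0),
]


def evaluate(sample: dict, thresholds: list) -> dict:
    """Map each metric present in the sample to ok, warning, or critical."""
    result = {}
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric not collected on this host
        if value >= t.crit:
            result[t.metric] = "critical"
        elif value >= t.warn:
            result[t.metric] = "warning"
        else:
            result[t.metric] = "ok"
    return result
```

Keeping a warning level below the critical level is what makes the thresholds proactive: operators see the warning while there is still headroom to act.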

GPU Performance Insights

Using the NVIDIA System Management Interface (nvidia-smi), ONES captures GPU metrics such as power draw and memory allocation. This helps teams correlate workload phases with performance data and improve resource utilization.
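For readers who want to see what such a collection step might look like, here is a minimal sketch that queries nvidia-smi in CSV mode and parses the result. How ONES itself invokes the tool is not described in this update, so the field list and the function names query_gpus and parse_gpu_csv are assumptions for illustration only.

```python
import csv
import io
import subprocess

# Query fields accepted by nvidia-smi --query-gpu.
FIELDS = ["index", "power.draw", "memory.used", "memory.total", "utilization.gpu"]


def query_gpus() -> str:
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_csv(text: str) -> list:
    """Turn CSV text into one dictionary per GPU; unparsable values become None."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row = {}
        for name, raw in zip(FIELDS, record):
            try:
                row[name] = float(raw.strip())
            except ValueError:
                row[name] = None  # e.g. "[N/A]" on GPUs without that sensor
        if row.get("index") is not None:
            row["index"] = int(row["index"])
        rows.append(row)
    return rows
```

Separating the query from the parsing keeps the parser testable on machines without a GPU; on a GPU host the two compose as parse_gpu_csv(query_gpus()).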

CPU & Memory Utilization

The solution monitors CPU load and memory use, supporting assessments of system stability and infrastructure readiness against Service Level Agreements (SLAs).
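On Linux, these two signals can be derived from standard interfaces. The sketch below parses text in the /proc/meminfo format and normalizes the load average by core count; it is an assumption-laden illustration, not the ONES collector, and the function names are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}, values in kB where given."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_pct(meminfo: dict) -> float:
    """Share of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def load_per_core(load1: float, cores: int) -> float:
    """One-minute load average divided by the number of cores."""
    return load1 / max(cores, 1)
```

On a live host the inputs would come from reading /proc/meminfo and from os.getloadavg() and os.cpu_count() in the Python standard library. A sustained load per core above 1.0 suggests more runnable work than the CPUs can serve.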

Storage and Platform Health Monitoring

ONES tracks disk health and utilization so that operators can catch I/O slowdowns before storage-related issues hamper running jobs.
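A simple utilization check can be sketched with the Python standard library alone. This example covers only capacity utilization, not disk health indicators, and the function names and the 90 percent limit are assumptions made for illustration.

```python
import shutil


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no capacity
    return 100.0 * usage.used / usage.total


def nearly_full(path: str = "/", limit_pct: float = 90.0) -> bool:
    """True when utilization is at or above the (hypothetical) limit."""
    return disk_used_pct(path) >= limit_pct
```

Checking the filesystems that jobs write to, such as scratch or checkpoint directories, is usually more informative than checking the root filesystem alone.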

Vendor-Agnostic and Scalable Design

ONES supports multiple Network Interface Controller (NIC) vendors and relies on standard Linux interfaces, making it suitable for monitoring large, heterogeneous server environments. It is designed to scale without overloading resources.

Conclusion

The ONES 3.1 telemetry solution aids teams in optimizing multi-vendor AI and HPC systems by providing a unified real-time view of performance metrics, thus enhancing operational confidence in these complex environments.