ONES Rule Engine: Enhanced Monitoring and Alerting for AI-Fabric

The ONES Rule Engine has been updated with an integrated alert and notification system for network management. Rules can now be created at both the device level and the interface level against detailed monitoring metrics, which matters for IT decision-makers responsible for AI-Fabric environments.

Enhanced Monitoring Capabilities

This update focuses on monitoring AI-Fabric metrics, including queue counters, Priority Flow Control (PFC), traffic rates, and link and node failures. Administrators gain better insight into network performance and can proactively address issues affecting RDMA over Converged Ethernet (RoCE) applications.

Anomaly Detection and Alerting Features

The updated ONES Rule Engine tracks key metrics such as packet transmit and receive rates, dropped packets, and congestion notifications. Rules evaluated against these metrics help identify anomalies and keep RoCE traffic within expected performance limits.
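As an illustration of how a threshold rule scoped to a device or a single interface might be evaluated, here is a minimal Python sketch. It is not the ONES Rule Engine's API: the `Rule` and `Sample` types, the `matches` function, and the metric name `rx_drops_per_sec` are all hypothetical names chosen for this example.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Comparison operators a rule may use (an assumed, illustrative set).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str                       # hypothetical metric name, e.g. "rx_drops_per_sec"
    op: str                           # one of the keys in _OPS
    threshold: float
    device: str
    interface: Optional[str] = None   # None means the rule applies device-wide


@dataclass(frozen=True)
class Sample:
    device: str
    interface: Optional[str]
    metric: str
    value: float


def matches(rule: Rule, sample: Sample) -> bool:
    """Return True when the sample is in the rule's scope and crosses its threshold."""
    if rule.metric != sample.metric or rule.device != sample.device:
        return False
    # An interface-level rule only applies to its own interface.
    if rule.interface is not None and rule.interface != sample.interface:
        return False
    return _OPS[rule.op](sample.value, rule.threshold)
```

A rule created with `interface=None` matches samples from any interface on the device, while a rule that names an interface ignores all others.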

Performance Monitoring Details

The performance counters now include PFC events. PFC signals a sender to pause traffic on a congested priority so that packets are not dropped, so tracking these events is essential for the reliability and performance of RoCE traffic.
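Counters such as PFC events are usually exposed as cumulative totals, so a rule normally compares a rate instead of the raw value. The sketch below shows one way to derive a per-second rate from two successive readings. The function name, the 64-bit counter width, and the single-wrap handling are assumptions made for illustration, not details from the ONES documentation.

```python
def counter_rate(prev: int, curr: int, interval_s: float, width_bits: int = 64) -> float:
    """Per-second rate between two readings of a cumulative counter.

    Assumes the counter wraps at 2**width_bits and wrapped at most once
    between readings. A counter reset (for example after a reboot) is
    indistinguishable from a wrap here and would need separate handling.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev
    if delta < 0:
        # The counter rolled over between the two readings.
        delta += 1 << width_bits
    return delta / interval_s
```

For example, readings of 1000 and 1600 PFC frames taken 30 seconds apart give a rate of 20 frames per second.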

Alert Notification System

When a rule's conditions are met, alert notifications are sent through channels such as Slack and Zendesk and also appear on the internal Watcher – Alerts page. Each alert carries context about the issue, which helps speed up resolution.
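To make the notification step concrete, the following Python sketch formats an alert and posts it to a Slack incoming webhook, which accepts a JSON body with a `text` field. The message layout, the function names, and the parameters are assumptions for illustration. The payload ONES actually sends is not described in the source.

```python
import json
import urllib.request
from typing import Optional


def build_slack_payload(rule_name: str, device: str, interface: Optional[str],
                        metric: str, value: float, threshold: float) -> dict:
    """Build a Slack incoming-webhook payload describing one alert."""
    # Device-level alerts show only the device; interface-level alerts show both.
    scope = device if interface is None else f"{device}/{interface}"
    text = f"[ALERT] {rule_name}: {metric}={value} crossed threshold {threshold} on {scope}"
    return {"text": text}


def send_to_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to a webhook URL supplied by the caller; return the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Keeping payload construction separate from delivery lets the same alert context be reused for other channels, such as a ticketing system.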

Continuous Monitoring for Performance

The ONES Rule Engine continuously monitors network links for stability. When instability is detected, it generates alerts with comprehensive details about the affected components so that corrective action can be taken quickly.
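One common way to define link instability is a link that changes state too many times within a short window (a "flap"). The sketch below implements that idea with a sliding window. The class name, the default of three transitions in 60 seconds, and the definition of instability itself are assumptions for illustration, since the source does not say how ONES detects it.

```python
from collections import deque
from typing import Deque, Optional


class FlapDetector:
    """Flags a link as unstable when it changes state too often within a time window."""

    def __init__(self, max_transitions: int = 3, window_s: float = 60.0) -> None:
        self.max_transitions = max_transitions
        self.window_s = window_s
        self._last_state: Optional[bool] = None
        self._transitions: Deque[float] = deque()  # timestamps of state changes

    def observe(self, timestamp: float, is_up: bool) -> bool:
        """Record one link-state reading; return True if the link is currently flapping."""
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = is_up
        # Drop state changes that have fallen out of the sliding window.
        while self._transitions and timestamp - self._transitions[0] > self.window_s:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions
```

One detector instance would be kept per monitored link, with `observe` called on every polling cycle or link-state event.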

Conclusion

This update to the ONES Rule Engine strengthens network management through improved monitoring and alerting. The ability to detect and alert on a range of AI-Fabric metrics makes it useful for enterprise IT leaders. This summary outlines the essentials as detailed in the original blog post.