Cisco ONES Rule Engine details enhanced monitoring and alerting for AI-Fabric
The latest update to the ONES Rule Engine expands monitoring of AI-Fabric metrics, giving enterprise network managers greater visibility into the performance of RDMA over Converged Ethernet (RoCE) networks and into potential issues. The development is relevant to IT and security leaders who need detailed alerting and proactive maintenance to support critical workloads.
Expanded Monitoring Metrics
The updated ONES Rule Engine now tracks AI-Fabric parameters including queue counters, Priority Flow Control (PFC) events, traffic rates, and link or node failures. This extension supports oversight at both the device and interface levels, enabling the identification of network congestion, hardware health issues, and traffic performance anomalies.
Queue Counters and Performance Indicators
Key metrics monitored include packet transmit and receive rates on RoCE queues, counts of packets dropped due to overflow or other causes, and counts of packets marked by Explicit Congestion Notification (ECN). Monitoring these counters helps detect bottlenecks or congestion points that can degrade network efficiency. Additionally, PFC pause events, which temporarily halt traffic on a priority class to prevent packet loss, are tracked to highlight congestion hotspots.
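To make the role of these counters concrete, the following Python sketch turns two snapshots of cumulative queue counters into per-interval indicators. It is illustrative only: the QueueSample fields and the queue_indicators function are hypothetical names chosen for this example and do not reflect the ONES data model or API. It also assumes counters only increase between snapshots, so counter wrap or reset is not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    """One snapshot of cumulative counters for a RoCE queue (hypothetical schema)."""
    timestamp_s: float
    tx_packets: int
    dropped_packets: int
    ecn_marked_packets: int
    pfc_pause_frames: int


def queue_indicators(prev: QueueSample, curr: QueueSample) -> dict:
    """Convert two cumulative snapshots into rates for the interval between them."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    tx = curr.tx_packets - prev.tx_packets
    drops = curr.dropped_packets - prev.dropped_packets
    ecn = curr.ecn_marked_packets - prev.ecn_marked_packets
    pfc = curr.pfc_pause_frames - prev.pfc_pause_frames

    return {
        "tx_pps": tx / elapsed,                # transmit rate, packets per second
        "drops_per_s": drops / elapsed,        # drop rate, packets per second
        "ecn_ratio": ecn / tx if tx else 0.0,  # share of sent packets marked by ECN
        "pfc_per_s": pfc / elapsed,            # PFC pause frames per second
    }


if __name__ == "__main__":
    earlier = QueueSample(0, 1000, 0, 10, 0)
    later = QueueSample(60, 7000, 30, 610, 12)
    print(queue_indicators(earlier, later))
```

A rising ecn_ratio alongside a non-zero pfc_per_s on the same queue would be one plausible sign of the congestion hotspots described above, although the thresholds that matter depend on the fabric.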
Rule Configuration and Alerting Mechanisms
The ONES 2.1 Rule Engine supports customizable alert rules based on threshold breaches in PFC receive and transmit counters over user-defined time intervals ranging from five minutes to one hour. When these conditions are met, alerts are issued via communication platforms such as Slack and Zendesk, as well as internally through the Watcher Alerts interface. Each alert provides specific information about the affected device, interface, and queue for targeted troubleshooting.
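The threshold-over-interval logic described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ONES rule syntax: PfcRule, evaluate, and the sample dictionary keys (device, interface, queue, timestamp_s, pfc_rx, pfc_tx) are all hypothetical. Only the five-minute to one-hour interval range comes from the description above. Delivery to Slack, Zendesk, or Watcher Alerts is left out.

```python
from dataclasses import dataclass

MIN_WINDOW_S = 5 * 60   # five minutes, the shortest interval mentioned
MAX_WINDOW_S = 60 * 60  # one hour, the longest interval mentioned


@dataclass(frozen=True)
class PfcRule:
    """Hypothetical rule: alert when a PFC counter grows past a threshold."""
    counter: str      # "pfc_rx" or "pfc_tx"
    threshold: int    # allowed increase within the window
    window_s: int     # evaluation interval in seconds

    def __post_init__(self):
        if self.counter not in ("pfc_rx", "pfc_tx"):
            raise ValueError("counter must be 'pfc_rx' or 'pfc_tx'")
        if not MIN_WINDOW_S <= self.window_s <= MAX_WINDOW_S:
            raise ValueError("window must be between five minutes and one hour")


def evaluate(rule: PfcRule, samples: list, now_s: float) -> list:
    """Return one alert per (device, interface, queue) that breaches the rule.

    Each sample is a dict holding cumulative PFC counters for one queue.
    """
    window_start = now_s - rule.window_s
    first, last = {}, {}
    for s in sorted(samples, key=lambda s: s["timestamp_s"]):
        if not window_start <= s["timestamp_s"] <= now_s:
            continue  # ignore samples outside the evaluation window
        key = (s["device"], s["interface"], s["queue"])
        first.setdefault(key, s[rule.counter])  # oldest value in window
        last[key] = s[rule.counter]             # newest value in window

    alerts = []
    for key in sorted(first):
        increase = last[key] - first[key]
        if increase > rule.threshold:
            device, interface, queue = key
            alerts.append({
                "device": device,
                "interface": interface,
                "queue": queue,
                "counter": rule.counter,
                "increase": increase,
            })
    return alerts
```

Keying the evaluation on the device, interface, and queue triple mirrors the detail each alert is said to carry, so an operator can go straight to the affected queue.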
Link Failures and Stability Monitoring
The system continuously monitors links for instability events like flapping, which are particularly critical in RoCE environments. Upon detecting such failures, the rule engine generates detailed alerts including device and physical layer information, facilitating prompt corrective measures such as optical component replacements or traffic rerouting to maintain network stability.
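One common way to detect flapping is to count link state changes within a recent time window. The Python sketch below illustrates that general technique. It is not a description of how ONES detects flaps, and the function names, the (timestamp, state) event format, and the default window and threshold values are assumptions made for the example.

```python
def count_transitions(events, window_s, now_s):
    """Count up/down state changes among events inside the window.

    events: iterable of (timestamp_s, state) tuples, state is "up" or "down".
    """
    recent = sorted(
        (t, state) for t, state in events if now_s - window_s <= t <= now_s
    )
    transitions = 0
    for (_, prev_state), (_, state) in zip(recent, recent[1:]):
        if state != prev_state:
            transitions += 1
    return transitions


def is_flapping(events, now_s, window_s=300, max_transitions=4):
    """Flag a link as flapping when it changes state too often (assumed policy)."""
    return count_transitions(events, window_s, now_s) >= max_transitions


if __name__ == "__main__":
    unstable = [(0, "up"), (10, "down"), (20, "up"), (30, "down"), (40, "up")]
    print(is_flapping(unstable, now_s=40))
```

In practice the window and transition limit would be tuned per link type, since an optic that bounces a few times during maintenance is different from one that flaps continuously under load.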
The ONES Rule Engine update offers a comprehensive solution for monitoring critical AI-Fabric network metrics and managing alerts, providing enterprise infrastructure teams with tools to maintain RoCE application performance and address network anomalies proactively. This Blog Signals brief presents a fact-based overview of the vendor's enhancements as described in the source material.