Apstra works by maintaining a real-time repository of configuration, telemetry and validation data to ensure a network is doing what the organization wants it to do. Companies can use Apstra's automation capabilities to deliver consistent network and security policies for workloads across physical and virtual infrastructures. In addition, Apstra performs regular network checks to safeguard configurations. It's hardware agnostic, so it can be integrated to work with Juniper's networking products as well as boxes from Cisco, Arista, Dell, Microsoft and Nvidia.
Load balancing and visibility enhancements
On the load balancing front, Juniper has added support for dynamic load balancing (DLB), which selects the optimal network path and delivers lower latency, better network utilization, and faster job completion times. From the AI workload perspective, this results in better AI workload performance and higher utilization of pricey GPUs, according to Sanyal.
"Compared to traditional static load balancing, DLB significantly improves fabric bandwidth utilization. But one of DLB's limitations is that it only tracks the quality of local links instead of understanding the whole path quality from ingress to egress node," Sanyal wrote. "Let's say we have a CLOS topology and server 1 and server 2 are both trying to send data called flow-1 and flow-2, respectively. In the case of DLB, leaf-1 only knows the local link utilization and makes decisions based solely on the local switch quality table, where local links may be in a good state. But if you use GLB, you can understand the whole path quality, where congestion issues are present at the spine-leaf level."
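The distinction Sanyal draws can be sketched in a few lines. The following is a minimal illustration with made-up quality scores (not an actual switch algorithm): DLB picks from leaf-1's local quality table only, while GLB scores the full ingress-to-egress path, so it can route around congestion that sits beyond the local links.

```python
# Hypothetical two-tier CLOS example. Quality scores are illustrative
# (higher is better); real switches derive them from utilization and
# queue depth. DLB sees only leaf-1's uplinks; GLB also accounts for
# conditions downstream of the spine.

local_link_quality = {"spine-1": 0.9, "spine-2": 0.8}   # leaf-1's local view
downstream_quality = {"spine-1": 0.2, "spine-2": 0.7}   # congestion past leaf-1

def dlb_pick(local):
    # DLB: decide from the local switch quality table alone.
    return max(local, key=local.get)

def glb_pick(local, downstream):
    # GLB: score the whole path; its quality is limited by its worst hop.
    return max(local, key=lambda spine: min(local[spine], downstream[spine]))

print(dlb_pick(local_link_quality))                      # spine-1: best local link
print(glb_pick(local_link_quality, downstream_quality))  # spine-2: avoids congested spine-1
```

Here DLB sends flow-1 toward spine-1 because that uplink looks healthiest locally, even though the spine-1 path is congested further along; GLB picks spine-2 because its end-to-end quality is higher.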
In terms of visibility, Sanyal pointed out limitations in existing network performance management technologies:
"Today, admins can find out where congestion occurs by observing only the network switches. But they have no visibility into which endpoints (GPUs, in the case of AI data centers) are impacted by the congestion. This leads to challenges in identifying and resolving performance issues. In a multi-training-job environment, just by looking at switch telemetry, it's impossible to find which training jobs were slowed down due to congestion without manually checking the NIC RoCE v2 stats on all the servers, which isn't practical," Sanyal wrote.
Juniper is addressing the issue by integrating RoCE v2 streaming telemetry from the AI server SmartNICs with Juniper Apstra and correlating it with existing network switch telemetry; that integration and correlation "greatly enhances the observability and debugging workflows when performance issues occur," Sanyal wrote. "This correlation allows for a more holistic network view and a better understanding of the relationships between AI servers and network behaviors. The real-time data provides insights into network performance, traffic patterns, potential congestion points, and impacted endpoints, helping identify performance bottlenecks and anomalies."
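The kind of correlation described above can be sketched roughly as a join across three data sources. This is an illustrative mock-up (all names, fields, and counters are hypothetical, not Apstra's actual data model): congested switch ports are mapped through fabric inventory to the servers behind them, then matched against per-NIC RoCE v2 congestion counters to name the affected training jobs.

```python
# Hypothetical correlation of switch telemetry with server RoCE v2 stats.
# Switch side: ports currently reporting congestion.
congested_ports = {("leaf-1", "et-0/0/12"), ("leaf-3", "et-0/0/7")}

# Fabric inventory: which GPU server's NIC sits behind each switch port.
port_to_server = {
    ("leaf-1", "et-0/0/12"): "gpu-server-04",
    ("leaf-3", "et-0/0/7"):  "gpu-server-09",
    ("leaf-2", "et-0/0/3"):  "gpu-server-02",
}

# Streamed per-NIC RoCE v2 stats; received CNPs indicate the NIC was
# asked to slow down, i.e. its traffic hit congestion.
nic_roce_stats = {
    "gpu-server-04": {"cnp_received": 1250, "job": "train-job-a"},
    "gpu-server-09": {"cnp_received": 980,  "job": "train-job-b"},
    "gpu-server-02": {"cnp_received": 0,    "job": "train-job-a"},
}

def impacted_jobs(congested, inventory, nic_stats):
    # Correlate: congested port -> attached server -> job with CNP activity.
    jobs = set()
    for port in congested:
        server = inventory.get(port)
        stats = nic_stats.get(server, {})
        if stats.get("cnp_received", 0) > 0:
            jobs.add(stats["job"])
    return sorted(jobs)

print(impacted_jobs(congested_ports, port_to_server, nic_roce_stats))
```

Without the NIC-side stats, the switch telemetry alone would stop at the congested ports; the join is what turns "port et-0/0/12 is congested" into "training job A is being slowed down."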