Artificial intelligence (AI) and machine learning (ML) are about more than algorithms: the right hardware to boost your AI and ML calculations is essential.
To accelerate task completion, AI and ML training clusters require high bandwidth and reliable transport with predictably low tail latency (the slowest 1% or 2% of responses, which can hold up completion of the whole task). A high-performance interconnect can optimize data center and high-performance computing (HPC) workloads across a portfolio of hyper-converged AI and ML training clusters, resulting in lower latency for better model training, higher network utilization, and lower operational costs.
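Tail latency matters because a training step often cannot finish until the slowest response arrives. A minimal sketch of how a p99 "tail" is measured (the function and sample data here are illustrative, not from any vendor tooling):

```python
import random

def tail_latency(latencies, percentile=99.0):
    """Return the latency at the given percentile (e.g., the p99 'tail')."""
    ordered = sorted(latencies)
    # Index of the sample at or above the requested percentile.
    k = min(len(ordered) - 1, int(len(ordered) * percentile / 100.0))
    return ordered[k]

# 1,000 samples: most complete in ~1 ms, but 2% are stragglers at 5 ms.
random.seed(0)
samples = [1.0 + random.random() * 0.1 for _ in range(980)] + [5.0] * 20
p50 = tail_latency(samples, 50.0)  # typical response: ~1 ms
p99 = tail_latency(samples, 99.0)  # tail response: dominated by stragglers
```

Even though 98% of responses finish in about a millisecond, the p99 figure is set entirely by the stragglers, which is why predictable tail latency, not just average latency, determines job completion time.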
As AI and ML training tasks become more common, it is critical to have higher radix switches, which reduce latency and power, and faster port speeds for building larger training clusters with flat network topology.
Ethernet Switching for Performance Optimization
While network bandwidth requirements in data centers continue to rise dramatically, there is also a strong push to combine common compute and storage infrastructure with optimized AI and ML training processors. As a result, AI and ML training clusters — where you specify multiple machines for training — drive demand for fabrics with high bandwidth connectivity, high radix, and faster task completion while operating with high network usage.
To speed up task completion, it is critical to have effective load balancing to achieve high network utilization, as well as congestion control mechanisms to achieve predictable tail latency. Virtualized and efficient data infrastructures, combined with capable hardware, can also improve CPU offloads and help network accelerators improve neural network training.
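Congestion control is what keeps tail latency predictable under load. A common family of mechanisms follows an additive-increase/multiplicative-decrease (AIMD) pattern: grow the sending window steadily while the network delivers, and back off sharply on a congestion signal. The sketch below is a generic illustration of that pattern, with illustrative parameter values, not a description of any specific switch or NIC implementation:

```python
def aimd_window(events, start=10.0, increase=1.0, decrease=0.5, floor=1.0):
    """Additive-increase / multiplicative-decrease congestion window.

    events: sequence of "ack" (data delivered) or "congestion"
            (a drop or congestion mark was observed).
    """
    window = start
    for event in events:
        if event == "ack":
            window += increase                       # gently probe for more bandwidth
        else:
            window = max(floor, window * decrease)   # back off sharply on congestion
    return window

# Four clean acks grow the window linearly; one congestion signal halves it.
w = aimd_window(["ack"] * 4 + ["congestion"])
```

The asymmetry, slow growth and fast backoff, is what lets many senders share a fabric at high utilization without queues building up and inflating the tail.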
Ethernet-based infrastructures currently offer the best solution for a unified network. They combine low power consumption with high bandwidth and radix, and the fastest serializer and deserializer (SerDes) speeds, predictably doubling bandwidth every 18 to 24 months. With these benefits, as well as the large ecosystem, Ethernet can provide the highest-performing interconnect per watt and dollar for AI and ML and cloud-scale infrastructure.
According to IDC, the global Ethernet switch market grew 12.7% year-over-year to $7.6 billion in the first quarter of 2022 (1Q22). Broadcom offers its Tomahawk family of Ethernet switches to enable the next generation of unified networking.
Today, San Jose-based Broadcom announced the StrataXGS Tomahawk 5 switch series, which offers 51.2 Tbps Ethernet switching capability in a single, monolithic device — more than double the bandwidth of its contemporaries, the company claims.
“Tomahawk 5 has twice the capacity of Tomahawk 4. As a result, it is one of the world’s fastest switching chips,” said Ram Velaga, senior vice president and general manager of Broadcom’s core switching group. “The newly added features and capabilities to optimize performance for AI and ML networks make [the] Tomahawk 5 twice as fast as the previous version.”
The Tomahawk 5 switch chips are designed to support data centers and HPC environments and to accelerate AI and ML capabilities. The chip uses a Broadcom approach known as cognitive routing, which combines advanced shared packet buffering, programmable in-band telemetry, and hardware-based link failover built into the chip.
Cognitive routing optimizes network link usage by automatically selecting the system’s least heavily loaded links for each stream passing through the switch. This is especially important for AI and ML workloads, which often combine high bandwidth short and long streams with low entropy.
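The contrast with static routing is easiest to see in code. Conventional ECMP hashes a flow's identifier to pick a link, so a handful of long, low-entropy flows can hash onto the same link and saturate it while others sit idle; a load-aware scheme instead steers each new flow to the least-loaded link. This is a simplified illustration of that distinction, not Broadcom's actual algorithm, and all names in it are hypothetical:

```python
import hashlib

def ecmp_link(flow_id, links):
    """Static ECMP: hash the flow ID to pick a link, ignoring current load."""
    digest = int(hashlib.md5(flow_id.encode()).hexdigest(), 16)
    return links[digest % len(links)]

def least_loaded_link(loads):
    """Load-aware choice: pick the link with the lowest current utilization."""
    return min(loads, key=loads.get)

# Three links with very different utilization (0.0 idle .. 1.0 saturated).
loads = {"link_a": 0.82, "link_b": 0.35, "link_c": 0.61}
best = least_loaded_link(loads)  # steers the new flow onto link_b
```

With few flows and low entropy, the hash in `ecmp_link` gives no guarantee of spreading load, which is exactly the situation AI and ML training traffic creates; load-aware selection sidesteps the problem by consulting utilization directly.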
“Cognitive routing goes a step further than adaptive routing,” Velaga said. “When you use adaptive routing, you are only aware of data congestion between two points, but you are not aware of the other ends.”
Cognitive routing, he added, makes the system aware of conditions beyond its immediate neighbor, rerouting traffic onto an optimal path that balances load and avoids congestion.
Tomahawk 5 includes real-time dynamic load balancing, which monitors the use of all links at the switch and downstream in the network to determine the best path for each flow. It also monitors the health of hardware links and automatically diverts traffic away from failed connections. These features improve network utilization and reduce congestion, resulting in shorter job turnaround times.
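Combining the two behaviors described above, load awareness plus automatic failover, amounts to filtering out unhealthy links before picking the least-loaded survivor. A minimal sketch under that assumption (the data structure and function are illustrative, not the chip's interface):

```python
def select_path(links):
    """Choose the least-loaded healthy link; links marked down are skipped."""
    healthy = {name: info["load"] for name, info in links.items() if info["up"]}
    if not healthy:
        raise RuntimeError("no healthy links available")
    return min(healthy, key=healthy.get)

links = {
    "link_a": {"load": 0.20, "up": False},  # failed: excluded despite low load
    "link_b": {"load": 0.55, "up": True},
    "link_c": {"load": 0.70, "up": True},
}
chosen = select_path(links)  # traffic is diverted to link_b
```

Note that the failed link is skipped even though it reports the lowest load, which is the point of monitoring link health alongside utilization: stale or failed paths never attract traffic.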
The Future of Ethernet for AI and ML Infrastructures
Ethernet has the features needed for high-performance AI and ML training clusters: high bandwidth, end-to-end congestion management, load balancing, and fabric management at a lower cost than its contemporaries such as InfiniBand.
It is clear that Ethernet is a robust ecosystem that is constantly evolving at a rapid pace of innovation. “Ethernet is relentless and I would expect it to continue to affect areas like AI/ML,” Craig Matsumoto, senior research analyst at 451 Research, told VentureBeat. “The reward is homogeneity – if I can run any workload on Ethernet, assuming the performance is good enough, I can have one homogeneous network that all workloads can share. It’s simpler and it gives me more redundant paths for forwarding of traffic.”
Broadcom has shown that it will continue to improve its Ethernet switches to keep up with the pace of innovation in the AI and ML industries, and continue to be part of the HPC infrastructure going forward.
VentureBeat’s mission is to be a digital town square for tech decision-makers to gain knowledge about transformative business technology and transact. Learn more about membership.