Beating the AI bottleneck

Novel communication solution drastically speeds up LLM training

By John Bogna, Special to Rice News

Artificial intelligence (AI) is infamous for its resource-heavy training, but a new study may have found a solution in a novel communications system, called ZEN, that markedly improves the way large language models (LLMs) train.


The research team at Rice University was helmed by doctoral graduate Zhuang Wang and computer science professor T.S. Eugene Ng with contributions from two other computer science faculty members: assistant professor Yuke Wang and professor Anshumali Shrivastava. Zhaozhuo Xu of Stevens Institute of Technology and Jingyi Xi, who is unaffiliated, also contributed to the project.

Distributed training, sparsity and communication

Wang said there are two phases where LLMs can bottleneck during the distributed training process: computation and communication.

The first occurs when the model needs to crunch through a large amount of data. It can bog down the system, consuming time and computing power. Splitting the data among hundreds, sometimes thousands, of graphics processing units (GPUs) helps manage that problem. Each GPU processes its own data samples separately, and the results are then fed back into the model.

The second bottleneck happens when all those GPUs need to sync up so they can “talk” to the model and convey what they’ve learned. They need to communicate efficiently with one another to complete each training run smoothly, and that synchronization can slow down if the model gradients they have to sync are very large, which they often are.

“The previous solution was to send all the data out. But in practice, we observe that the data has a lot of zero values in the ‘talk,’” Wang said. “We need a data structure to represent the communication information correctly.”

Removing those zero or near-zero values and leaving only the relevant ones to be synchronized during communication is called “sparsification.” The values that are left are aptly named “sparse tensors.” Sparsification is common practice in LLM training and can save the system the effort of communicating billions of extra gradients. But it does not by itself eliminate the communication bottleneck, which is where the team focused its research.
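
To make the idea concrete, here is a minimal Python sketch of sparsification. It is an illustration only, not code from the ZEN paper: the function names `sparsify` and `densify`, the `threshold` parameter and the toy gradient are all assumptions made for this example. It keeps only the nonzero entries of a gradient as index-value pairs, which is one common way to represent a sparse tensor.

```python
from typing import List, Tuple


def sparsify(grad: List[float], threshold: float = 0.0) -> Tuple[List[int], List[float]]:
    """Keep only entries whose magnitude exceeds `threshold`.

    Returns the positions of the kept entries and their values.
    (Hypothetical helper for illustration; not ZEN's data structure.)
    """
    indices = [i for i, g in enumerate(grad) if abs(g) > threshold]
    values = [grad[i] for i in indices]
    return indices, values


def densify(indices: List[int], values: List[float], length: int) -> List[float]:
    """Rebuild the full gradient, filling every dropped position with zero."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Toy gradient: mostly zeros, as Wang describes.
grad = [0.0, 0.0, 0.5, 0.0, -1.25, 0.0, 0.0, 0.0]
indices, values = sparsify(grad)

# Only the index-value pairs would need to be communicated.
restored = densify(indices, values, len(grad))
```

In this toy case, eight values shrink to two index-value pairs. Because the indices have to be sent along with the values, a representation like this only pays off when the gradient is sparse enough.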

“There’s actually not a lot of fundamental understanding of how to support these sparse tensors inside of distributed training,” Ng said. “People propose the idea, but they don’t understand what the optimal way of handling them is. One of the contributions of our work is to analyze these sparse tensors to understand how they behave.”

Mapping the system, finding the structure

There were essentially three parts to this research: Part one was figuring out the characteristics of sparse tensors in popular models. The nonzero gradients left after sparsification aren’t uniformly distributed; their location and tensor density depend on factors like the training model and dataset used.

That scattering of nonzero gradients leads to an imbalance during the communication phase that slows down synchronization and, by extension, slows down the training process. This new understanding shed light on how to design better communication schemes to use with sparse tensors.
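
The imbalance can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not the analysis or the scheme from the ZEN paper: the worker count, the tensor length, the clustered indices and both assignment functions are hypothetical. It counts how many nonzero gradients each worker would handle when the tensor is split into contiguous chunks, then compares that with a simple strided assignment.

```python
from typing import List


def contiguous_loads(nonzero_indices: List[int], tensor_len: int, num_workers: int) -> List[int]:
    """Count nonzeros per worker when each worker owns one contiguous chunk."""
    chunk = -(-tensor_len // num_workers)  # ceiling division
    loads = [0] * num_workers
    for i in nonzero_indices:
        loads[i // chunk] += 1
    return loads


def strided_loads(nonzero_indices: List[int], num_workers: int) -> List[int]:
    """Count nonzeros per worker when index i goes to worker i % num_workers.

    Shown only as a simple point of comparison; it is an assumption of this
    sketch and not the communication scheme that ZEN uses.
    """
    loads = [0] * num_workers
    for i in nonzero_indices:
        loads[i % num_workers] += 1
    return loads


def imbalance(loads: List[int]) -> float:
    """Ratio of the busiest worker's load to the average load (1.0 is perfect)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Hypothetical tensor of length 100 whose nonzeros cluster near the front.
nonzero = list(range(12)) + [40, 90]
by_chunk = contiguous_loads(nonzero, tensor_len=100, num_workers=4)
by_stride = strided_loads(nonzero, num_workers=4)
```

With these made-up numbers, the contiguous split leaves one worker holding 12 of the 14 nonzero gradients while another holds none. Since synchronization finishes only when the busiest worker does, a skew like this can slow every step.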

Once they knew how to approach their design, part two was figuring out the optimal communication schemes to use. Wang and Ng analyzed several options to figure out what those were.

Because there was no optimal solution before this research, the third and final step was building a real-world system based on their research and applying that system to practical LLM training to see if it worked. ZEN was that system, and it displayed a stark difference in training speed when used on real-world LLMs.

“What we basically show is that we can accelerate the time to completion of the training because the communication is more efficient. … The time it takes to perform one step in the training is much faster,” Ng said.

Since sparse tensors are used often and the field of LLM training is so broad, this discovery can be applied to just about any model with, as Ng phrased it, “the characteristics of sparsity.” Be it text or image generation, ZEN can speed up model training if sparse tensors are present.

Wang isn’t new to this area of research. He and Ng previously collaborated on GEMINI, a project to minimize the recovery overhead of LLM training after a hardware or software failure. GEMINI was unveiled at the ACM Symposium on Operating Systems Principles in 2023.

Wang recently presented his paper on this newer research, entitled “ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization,” at the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI) held in Boston.
