Boosting AI Model Size and Training Speed with Lightwave-Connected Chips

AI growth is capped by data transfer rates between computing chips, but transferring data with light could remove the ceiling.

A new chip-connection system could help topple the “memory wall,” which limits computing speed and the growth of AI models today, by transferring data along reconfigurable pathways of light rather than electrical wires.

The technology will be developed by a U-M-led project funded by a $2M grant from the National Science Foundation’s Future of Semiconductors program. The project also includes researchers from the University of Washington, the University of Pennsylvania and Lawrence Berkeley National Laboratory, as well as input and guidance from four industry partners: Google, Hewlett Packard Enterprise, Microsoft and Nvidia.

While data processing speed is 60,000 times higher today than it was 20 years ago, the speed of data transfer between computer memory and processors is only 30 times faster. Because of that lag, data transfer has become a bottleneck for the size of AI models, which have been growing 400 times larger every two years since 1998. Faster communication between chips is essential for addressing these limits on AI performance.
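As a back-of-the-envelope illustration of the gap those figures imply, the short sketch below plugs in the growth factors quoted above; the numbers come from this article, and the calculation is illustrative only.

```python
# Back-of-the-envelope only, using the growth factors quoted in the article.
compute_speedup = 60_000    # gain in processing speed over roughly 20 years
bandwidth_speedup = 30      # gain in memory-to-processor transfer speed over the same span

# How far compute has pulled ahead of data transfer:
gap = compute_speedup / bandwidth_speedup
print(f"Compute has outpaced memory bandwidth by about {gap:,.0f}x")  # ~2,000x
```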

“Our proposed technology could enable high-performance computing to keep up with the massive amount of data that’s being fed to rapidly growing AI models,” said Di Liang, U-M professor of electrical and computer engineering and lead principal investigator of the project. “With optical connections between chips, we think we can transfer tens of terabits per second, which is more than 100 times faster than state-of-the-art electric connections.”

Today, data moves between multiple memory and processor chips via metal connections soldered onto a single physical package called an interposer, which is similar to a motherboard. Data can be transferred within a single interposer or across interposers on interconnected servers called computing nodes.

The metal connections are hardwired into the interposer, which limits data transfer bandwidth and signal integrity because faster electrical signals lose energy as heat and can electromagnetically interfere with neighboring connections. As a result, hardwiring connections to all of the different processors and memory chips isn’t tractable. A single supercomputer chip today can contain over 900,000 cores, or individual processing units, and that number will continue to grow with AI model size.

“All of those processors will need to talk to a large amount of memory,” said Mo Li, professor of electrical and computer engineering at the University of Washington, and a co-principal investigator of the project. “Controlling the communication within the whole package is very important. In my view, optical connections will be the only tractable solution in the future.”

Light can travel farther than electrons and transfer a much larger amount of data with far less energy loss, and the researchers will tap into these properties in their new interposer design. Pulses of light will travel between chips through refractive channels in the interposer called optical waveguides. A receiver on each chip translates the data back into an electrical signal for the computer to interpret.
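As a toy picture of that round trip, the sketch below encodes bytes as on/off light pulses and decodes them back at a receiver. It is a conceptual illustration only; real waveguides carry modulated, multiplexed optical signals rather than simple on/off pulses.

```python
# Toy picture of an optical link: electrical data -> light pulses -> electrical data.
# Purely conceptual; function names and the on/off encoding are illustrative assumptions.

def to_optical(data: bytes) -> list[int]:
    """Transmitter: represent each bit as a light pulse (1) or no pulse (0), MSB first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

def to_electrical(pulses: list[int]) -> bytes:
    """Receiver: a photodetector turns pulses back into bits, regrouped into bytes."""
    out = bytearray()
    for i in range(0, len(pulses), 8):
        byte = 0
        for bit in pulses[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

payload = b"weights"
assert to_electrical(to_optical(payload)) == payload  # data survives the round trip
```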

The waveguide network can also be reconfigured—during manufacturing as well as inside a computer—thanks to a special phase-changing material in the interposer. When hit with a laser or exposed to a voltage, the material’s refractive index changes, meaning the light will be bent in different directions as it passes through the waveguide.
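One way to picture the reconfigurable waveguide network is as a graph whose links can be opened or closed by switching the phase-change material along each path. The sketch below is a purely conceptual model; the class name, the switch interface and the routing logic are illustrative assumptions, not the project's actual design.

```python
# Conceptual model of a reconfigurable optical interposer (illustrative assumptions only).
# Each waveguide segment has a phase-change "switch" that either passes or blocks light.
from collections import deque

class OpticalInterposer:
    def __init__(self):
        # Adjacency list: chip -> set of chips reachable via currently open waveguides.
        self.links = {}

    def set_link(self, a, b, enabled):
        """Model applying a laser or voltage pulse to the phase-change cell on the a<->b
        waveguide, opening or closing that optical path (hypothetical control interface)."""
        self.links.setdefault(a, set())
        self.links.setdefault(b, set())
        if enabled:
            self.links[a].add(b)
            self.links[b].add(a)
        else:
            self.links[a].discard(b)
            self.links[b].discard(a)

    def route(self, src, dst):
        """Find a light path from src to dst over currently open waveguides (BFS)."""
        queue, seen = deque([[src]]), {src}
        while queue:
            path = queue.popleft()
            if path[-1] == dst:
                return path
            for nxt in self.links.get(path[-1], ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None  # no open optical path

# Example: "opening and closing roads" without moving any chips.
fabric = OpticalInterposer()
fabric.set_link("GPU0", "HBM0", True)
fabric.set_link("GPU0", "HBM1", False)   # close this road
print(fabric.route("GPU0", "HBM0"))      # ['GPU0', 'HBM0']
```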

“It’s a bit like opening and closing roads,” said Liang Feng, professor of materials science and electrical and systems engineering at the University of Pennsylvania and a co-principal investigator. “If a company sells a chip based on this technology, they will be able to rewrite the connections on different batches of chips and servers without changing the layout of the other components.”

The researchers will also design traffic-control software that monitors which parts of the interposer need to communicate at any given time and makes the necessary voltage switches to create the best connections on the fly.
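A minimal sketch of what such traffic-control software might look like is shown below, assuming a table of pending chip-to-chip transfers and a simple set of open optical links. The threshold, function name and data layout are illustrative assumptions, not part of the project.

```python
# Minimal sketch of a traffic controller (illustrative assumptions only).
# It watches which chip pairs need to talk and opens or closes optical links accordingly.

def reconfigure(open_links, demand, threshold_bytes=1_000_000):
    """demand maps (src, dst) -> bytes currently queued for transfer.
    Heavy flows get a direct waveguide; idle pairs have theirs closed."""
    for pair, pending in demand.items():
        if pending >= threshold_bytes:
            open_links.add(pair)          # apply voltage: open this light path
        else:
            open_links.discard(pair)      # release it for other traffic

# Example: gradient traffic between two accelerators dominates during training,
# so the controller opens that path and leaves an idle one closed.
open_links = set()
demand = {("GPU0", "GPU1"): 8_000_000, ("GPU0", "HBM1"): 0}
reconfigure(open_links, demand)
print(open_links)   # {('GPU0', 'GPU1')}
```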

“Changing the connections allows us to reconfigure the network based on what AI models we want to run, or whether we want to train or run a model,” said Reetuparna Das, associate professor of computer science and engineering and a co-investigator of the project.

Beyond advancing technology, the project will also connect U-M students with industry partners and provide valuable real-world experience.

“These connections allow students to appreciate real-world challenges in designing rapidly evolving technology,” Liang said. “Textbooks don’t address these modern problems sufficiently because the rate of development makes it impossible for textbooks to keep up. The best way to gain relevant skills is by working with industry on the problems they care about.”
