This NSF-funded project will develop new methods to connect the multiple chips within a computer using light instead of electrical wires. Transferring data between chips with light can make communication faster and more energy efficient, which is crucial for the large, complex datasets that underpin societal applications such as artificial intelligence, climate modeling, and biomedical research. The project will engage closely with industry partners to facilitate the adoption of the proposed research into practice, and this collaboration will help train a new generation of scientists and engineers with interdisciplinary expertise. The skills and insights gained through this project will prepare them to tackle future challenges at the intersection of multiple scientific fields, aligning with the NSF's mission to advance the frontiers of knowledge and innovation.

The project proposes to optically interconnect accelerators within compute servers using newly viable reconfigurable chip-to-chip optical interconnects. Today's commercial multi-accelerator compute servers, the workhorses of machine learning, instead use electrical interconnects to network the accelerator chips in a server. Recent trends, however, point to an interconnect bandwidth wall: accelerator compute performance is scaling at a far faster rate than the bandwidth of the interconnect between accelerators in the same server, leading to under-utilization and idling of Graphics Processing Unit (GPU) resources in cloud datacenters. It is therefore important to scale interconnect bandwidth in multi-accelerator servers to keep power-hungry and expensive accelerators adequately fed with data and parameters. This project will use novel silicon photonics to create optical interconnects between accelerators within a server to meet this need. The research will complement the efforts of hyper-scale cloud providers by unlocking customized multi-accelerator topologies that achieve bandwidth-optimal collective communication between accelerators during distributed machine learning and can minimize the blast radius of accelerator failures.
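To see why the bandwidth wall leads to accelerator idling, consider a back-of-envelope model of one common collective, ring all-reduce: each of N accelerators must move 2(N-1)/N times the gradient size over its link per synchronization step, so communication time is bounded below by that volume divided by per-link bandwidth. The Python sketch below is purely illustrative; the gradient size, FLOP count, and bandwidth figures are assumptions chosen for exposition, not measurements or targets from the project.

```python
# Back-of-envelope model of the interconnect "bandwidth wall" for
# ring all-reduce gradient synchronization across N accelerators.
# All parameter values below are illustrative assumptions.

def allreduce_time_s(grad_bytes: float, n_accel: int, link_gbps: float) -> float:
    """Lower bound on ring all-reduce time: each accelerator moves
    2*(N-1)/N times the gradient size over its interconnect link."""
    volume_bytes = 2 * (n_accel - 1) / n_accel * grad_bytes
    return volume_bytes * 8 / (link_gbps * 1e9)

def compute_time_s(flops_per_step: float, accel_tflops: float) -> float:
    """Idealized compute time for one training step on one accelerator."""
    return flops_per_step / (accel_tflops * 1e12)

if __name__ == "__main__":
    grad_bytes = 10e9      # assumed 10 GB of gradients per step
    flops_per_step = 1e14  # assumed FLOPs per step per accelerator
    n_accel = 8            # accelerators in one server

    # Accelerator throughput grows 8x across rows, link bandwidth only 1.5x.
    for accel_tflops, link_gbps in [(125, 300), (312, 450), (1000, 450)]:
        t_comm = allreduce_time_s(grad_bytes, n_accel, link_gbps)
        t_comp = compute_time_s(flops_per_step, accel_tflops)
        util = t_comp / (t_comp + t_comm)  # fraction of a step spent computing
        print(f"{accel_tflops:5.0f} TFLOP/s, {link_gbps:4.0f} Gb/s link -> "
              f"comm {t_comm:.3f}s, compute {t_comp:.3f}s, utilization {util:.0%}")
```

Under these assumed numbers, per-step accelerator utilization falls from roughly 63% to about 24% as compute throughput grows 8x while link bandwidth grows only 1.5x, which is the under-utilization pattern the bandwidth wall describes.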