This was written after my time at EXO Labs, but I wrote the initial version of it for my own experimentation with SPARTA, Wash Parallel, and DiLoCo. Experimenting with distributed training algorithms is difficult because it generally requires access to a large number of GPUs. The purpose of this repository is to simulate distributed training algorithms when the number of nodes exceeds the number of available GPUs, allowing rapid iteration despite hardware constraints.
Traditional algorithms for training multi-billion-parameter models require clusters of GPUs connected via proprietary high-bandwidth networking equipment. Modern low-bandwidth training algorithms such as DiLoCo and SPARTA promise to remove this bandwidth constraint; however, testing them still demands multi-node hardware and complex orchestration. We introduce EXO Gym, an open-source library that emulates up to M virtual workers on N physical accelerators, letting researchers prototype and benchmark distributed-training strategies from a single workstation. Communication behavior is encapsulated in modular Strategy classes, so new optimizers, sparsity schedules, or compression schemes can be expressed in a few lines of code and evaluated with full telemetry (loss, wall-clock time, GPU utilization, bytes transferred). In experiments, EXO Gym reproduces published DiLoCo scaling results on language models, extends the algorithm to convolutional networks, and enables a rapid sweep over SPARTA sparsity rates that would otherwise cost weeks of cloud resources. By collapsing the infrastructure barrier, EXO Gym puts exploratory distributed training within reach of small teams and paves the way for broader, faster progress in open-source AI.
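To make the Strategy idea concrete, here is a minimal sketch of how a communication strategy could be written as a small, swappable class. The names (`Strategy`, `SpartaLikeStrategy`, the `communicate` hook, and `p_sparsity`) are illustrative assumptions, not EXO Gym's actual API; the simulated workers are just in-process model copies.

```python
# Illustrative sketch only -- class and method names are assumptions,
# not the real EXO Gym interface.
import torch
import torch.nn as nn


class Strategy:
    """Decides what each virtual worker exchanges and how updates are merged."""

    def communicate(self, models: list[nn.Module], step: int) -> None:
        raise NotImplementedError


class SpartaLikeStrategy(Strategy):
    """Averages a random sparse subset of parameters across workers each step."""

    def __init__(self, p_sparsity: float = 0.005):
        self.p = p_sparsity

    @torch.no_grad()
    def communicate(self, models: list[nn.Module], step: int) -> None:
        params = [list(m.parameters()) for m in models]
        for tensors in zip(*params):  # same parameter tensor across all workers
            mask = torch.rand_like(tensors[0]) < self.p  # entries exchanged this step
            mean = torch.stack([t[mask] for t in tensors]).mean(dim=0)
            for t in tensors:
                t[mask] = mean  # only ~p of the entries are "transferred"


# Usage sketch: after each local optimizer step, the simulator would call
# strategy.communicate(workers, step) on the in-process worker copies.
workers = [nn.Linear(8, 2) for _ in range(4)]
SpartaLikeStrategy(p_sparsity=0.01).communicate(workers, step=0)
```

Swapping in a different subclass (for example, one that averages all weights every H steps, DiLoCo-style) changes the communication pattern without touching the training loop, which is the design point the Strategy abstraction is meant to capture.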
Seth Howes, Matt Beton, Mohamed Baioumy, Alex Cheema, Matthew Reed