Title: Reducing communications in decentralized learning via randomization and asynchrony
Decentralized learning is a paradigm in which distant nodes collaboratively train a machine learning model without a central node orchestrating computations and communications. While typically associated with computers collaborating over the internet, the paradigm also extends to cluster computing. Scaling decentralized training to a large number of computing nodes requires careful communication management. My approach uses randomization and asynchrony to reduce communication overhead. I will give a brief introduction to the field and describe Acid, a principled algorithm that significantly reduces the communication costs of training deep neural networks on clusters. Notably, Acid achieves substantial communication savings when training on ImageNet with 64 asynchronous workers (A100 GPUs), at nearly no additional cost.
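To make the communication pattern concrete, below is a minimal toy sketch of randomized pairwise gossip interleaved with asynchronous local updates, a standard building block behind such decentralized methods. It is an illustration under my own assumptions, not the Acid/A2CiD2 algorithm itself, and all names in it (n_workers, local_step, the quadratic toy loss) are hypothetical.

```python
# Toy simulation of randomized pairwise gossip with asynchronous local steps.
# Illustrative only: not the A2CiD2 algorithm from the referenced paper.
import numpy as np

rng = np.random.default_rng(0)

n_workers, dim = 8, 4
# Each worker holds its own copy of the model parameters.
params = [rng.normal(size=dim) for _ in range(n_workers)]

def local_step(x, lr=0.1):
    # Placeholder local update: one gradient step on the toy loss ||x||^2 / 2.
    return x - lr * x

for event in range(200):
    # A randomly chosen worker performs a local (asynchronous) gradient step.
    i = rng.integers(n_workers)
    params[i] = local_step(params[i])

    # Randomized pairwise communication: two workers picked at random average
    # their parameters, instead of a global synchronous all-reduce.
    a, b = rng.choice(n_workers, size=2, replace=False)
    avg = 0.5 * (params[a] + params[b])
    params[a], params[b] = avg.copy(), avg.copy()

# Workers drift toward consensus while optimizing, with no central coordinator.
spread = max(np.linalg.norm(p - np.mean(params, axis=0)) for p in params)
print(f"max distance to mean after {event + 1} events: {spread:.4f}")
```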
Nabli A., Belilovsky E., and Oyallon E. A2CiD2: Accelerating Asynchronous Communication in Decentralized Deep Learning. NeurIPS 2023. https://hal.science/hal-04124318/