The paper introduces a routing algorithm that cuts mixture-of-experts (MoE) training time by 35%.
Memory overhead drops enough to enable training on smaller clusters.
Code and weights are available under a permissive license.