The paper introduces a routing algorithm that cuts mixture-of-experts (MoE) training time by 35%.

Memory overhead also drops enough to make training feasible on smaller clusters.

Code and weights are available under a permissive license.