Mixture-of-Experts (MoE) 101
Mixture-of-Experts (MoE) is an architecture used in transformer models. Unlike traditional dense models, MoEs take a "sparse" approach: only a subset of the model's components (the "experts") is activated for each input. This lets an MoE scale to a much larger total parameter count while keeping pretraining more efficient and inference faster than a dense model with comparable capacity.
In an MoE layer, each expert is a neural network, typically a feed-forward network (FFN), and a gate network (or router) decides which tokens are sent to which experts. The experts can specialize in different aspects of the input data, letting the model handle a wider range of inputs with less compute per token. A minimal sketch of this routing is shown below.
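Below is a minimal PyTorch sketch of a sparse MoE layer with top-k routing, included only to make the idea concrete. The names (`Expert`, `MoELayer`, `d_model`, `top_k`, the choice of 8 experts with 2 active) are illustrative assumptions, not taken from any specific MoE implementation; production systems add load-balancing losses, capacity limits, and batched expert dispatch that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A single expert: a standard two-layer feed-forward network (FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoELayer(nn.Module):
    """Sparse MoE layer: a router picks the top-k experts for each token."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)  # the gate network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- tokens flattened across batch and sequence.
        logits = self.router(x)                                  # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token's output is a weighted sum of its top-k experts' outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: route 10 tokens of width 64 through 8 experts, 2 active per token.
layer = MoELayer(d_model=64, d_hidden=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

Note the key property: although the layer holds 8 experts' worth of parameters, each token only passes through 2 of them, which is what makes the computation sparse.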