MoE (Mixture of Experts)

A model architecture where the total parameter count is large, but only a subset of “expert” sub-networks is activated for any given token: a learned router picks a few experts per token, so per-token compute and memory-bandwidth requirements track the active parameter count rather than the total. Examples: Llama 4 Scout (109B total, 17B active across 16 experts), Qwen3 235B-A22B (235B total, ~22B active per token).
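The active-to-total ratio is the key number. A quick back-of-envelope check using the Qwen3 235B-A22B figures above (a sketch, not a benchmark):

```python
# Qwen3 235B-A22B figures from above: per-token cost tracks the
# active parameters, not the total.
total_params = 235e9    # total parameters
active_params = 22e9    # parameters activated per token

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of the weights are touched per token")  # ~9.4%
```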

Most frontier models today are MoE: it is how labs scale to very large parameter counts without proportionally scaling inference cost. The efficiency comes from expert specialization. Through training, different experts naturally converge on handling different kinds of input (e.g. code, reasoning, particular languages), and model developers actively encourage this through how they structure training. The result is that for any given token, only a small fraction of the network needs to activate, making a 200B+ MoE model practical to run at a cost closer to that of a much smaller dense model.
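The routing step above is typically top-k gating: a small gate network scores every expert for the current token, the top k scores win, and their outputs are combined with renormalized weights. A minimal sketch in plain Python; the expert functions and gate logits here are toy stand-ins, not any real model's values:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits, k=2):
    """Pick the top-k experts for one token; renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in topk)
    return [(i, probs[i] / total) for i in topk]

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum over only the selected experts; the rest never run."""
    out = 0.0
    for idx, weight in route(gate_logits, k):
        out += weight * experts[idx](x)
    return out

# Toy usage: 4 "experts" (each just scales its input), top-2 routing.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, 0.3, 1.5]
y = moe_forward(1.0, experts, gate_logits, k=2)  # only experts 1 and 3 run
```

Real implementations route per token per layer and batch the expert computation, but the shape of the idea is the same: the gate decides, most experts stay idle.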

Running MoE models locally is possible but remains hardware-intensive: even though only a fraction of the weights is read per token, the full parameter set still has to be resident in memory, whether that is VRAM, unified memory, or system RAM with experts offloaded at reduced speed. This keeps large MoE models impractical on most consumer hardware today.
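A rough footprint estimate makes the gap concrete. Assuming 4-bit quantization and the Qwen3 235B-A22B total from above (an illustrative assumption, not a measured figure):

```python
total_params = 235e9    # Qwen3 235B-A22B: full parameter count must be resident
bytes_per_param = 0.5   # assumed 4-bit quantization

weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for weights alone")  # ~118 GB, vs. 24 GB on a high-end consumer GPU
```

Activations, KV cache, and quantization overhead push the real number higher still.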