
Image source: Meta AI Blog - The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
Llama 4 Architecture and Training
Visualizing the architecture and training techniques behind Meta's latest natively multimodal large language models
Llama 4: The Next Generation
Key innovations in the Llama 4 model architecture and training process
Key Innovations
1. Mixture of Experts (MoE): A more efficient architecture that activates only 17B of Maverick's 400B total parameters per token
2. Native Multimodality: Early fusion integrates text and vision tokens into a unified model backbone
3. MetaP Training: A new technique for reliably setting critical model hyper-parameters (see the sketch after this list)
4. FP8 Precision: Efficient model training without sacrificing quality
5. Mid-training: Enhanced capabilities through continued training on specialized datasets
6. Behemoth Model: A 2T-parameter teacher model that distills knowledge into the smaller Llama 4 models
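
Meta hasn't published MetaP's internals beyond that it sets critical hyper-parameters (reportedly per-layer learning rates and initialization scales). As a rough illustration of the mechanism only, the PyTorch sketch below assigns each layer its own learning rate through optimizer parameter groups; the width-based scaling rule is a hypothetical placeholder, not MetaP's actual formula.

```python
import torch
from torch import nn

# Toy model: three linear layers of different widths.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.Linear(4096, 4096),
    nn.Linear(4096, 1024),
)

base_lr = 3e-4

# One optimizer parameter group per layer, each with its own learning rate.
# The width-based scaling rule here is a hypothetical placeholder, not MetaP's formula.
param_groups = []
for layer in model:
    fan_in = layer.weight.shape[1]              # in_features of this layer
    param_groups.append({
        "params": list(layer.parameters()),
        "lr": base_lr * (1024 / fan_in),        # hypothetical per-layer scaling
    })

optimizer = torch.optim.AdamW(param_groups)
print([group["lr"] for group in optimizer.param_groups])
```
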
Model Comparison
| Model | Total parameters | Active parameters | Experts |
| --- | --- | --- | --- |
| Llama 3 (dense) | 70B | 70B | n/a |
| Llama 4 Maverick | 400B | 17B | 128 |
| Llama 4 Scout | 109B | 17B | 16 |
| Llama 4 Behemoth | 2T | 288B | 16 |
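
To make the MoE efficiency concrete, here is a quick calculation (plain Python) of what fraction of each model's weights is active for any single token, using the figures from the table above.

```python
# Fraction of parameters active per token, from the comparison table above.
models = {
    "Llama 4 Maverick": (400e9, 17e9),   # (total, active)
    "Llama 4 Scout": (109e9, 17e9),
    "Llama 4 Behemoth": (2e12, 288e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
# Roughly: Maverick ~4%, Scout ~16%, Behemoth ~14%
```
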
Mixture of Experts
Alternating dense and MoE layers with 128 routed experts and a shared expert for inference efficiency.
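A minimal PyTorch sketch of this routing pattern: every token passes through the shared expert and is additionally dispatched to one routed expert chosen by a learned router (top-1 routing). The dimensions and expert MLP shape here are illustrative assumptions, not Llama 4's exact implementation, and in the real model these MoE blocks alternate with ordinary dense feed-forward layers.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sketch of an MoE block: a shared expert plus top-1 routed experts."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, num_experts: int = 128):
        super().__init__()
        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
            )
        self.shared_expert = make_expert()
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # routing probabilities per token
        weight, expert_idx = gate.max(dim=-1)          # pick one routed expert per token
        routed = torch.zeros_like(x)
        for i in expert_idx.unique().tolist():         # dispatch tokens to their expert
            mask = expert_idx == i
            routed[mask] = weight[mask, None] * self.routed_experts[i](x[mask])
        return self.shared_expert(x) + routed          # shared expert sees every token

# Toy usage: 16 tokens of width 512 through a block with 8 routed experts
block = MoEFeedForward(num_experts=8)
print(block(torch.randn(16, 512)).shape)               # torch.Size([16, 512])
```
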
Multimodality
Early fusion to seamlessly integrate text and vision tokens into a unified model backbone.
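A minimal sketch of early fusion, using a generic PyTorch transformer stack rather than Llama's actual decoder: image patches are projected into the same embedding space as text tokens, and the two are concatenated into a single sequence before the first layer, so every layer of the backbone attends over both modalities.

```python
import torch
from torch import nn

class EarlyFusionBackbone(nn.Module):
    """Sketch: vision and text tokens share one sequence and one backbone."""

    def __init__(self, vocab_size=32000, d_model=1024, patch_dim=768, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(patch_dim, d_model)   # map patch features to model width
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_ids: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # text_ids: (batch, text_len); image_patches: (batch, num_patches, patch_dim)
        text_tokens = self.text_embed(text_ids)
        vision_tokens = self.vision_proj(image_patches)
        # Early fusion: one concatenated sequence from the very first layer onward.
        fused = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.backbone(fused)

# Toy usage: 2 samples, 64 image patches and 16 text tokens each
model = EarlyFusionBackbone()
out = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 64, 768))
print(out.shape)   # torch.Size([2, 80, 1024])
```
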
Training Process
30+ trillion tokens, FP8 precision, and mid-training for enhanced capabilities.
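Production FP8 training relies on specialized matmul kernels, but the core idea is that tensors are stored in an 8-bit float format with a per-tensor scale that maps them into FP8's narrow dynamic range. The sketch below uses PyTorch's float8_e4m3fn dtype; the max-based per-tensor scaling is a common convention, shown here as an illustrative assumption rather than Meta's recipe.

```python
import torch

# FP8 (e4m3) covers a much smaller dynamic range than BF16/FP32, so tensors are
# stored together with a scale that maps their values into the representable range.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0

def to_fp8(x: torch.Tensor):
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

def from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

w = torch.randn(4096, 4096)
w_fp8, scale = to_fp8(w)
print(w_fp8.element_size())                      # 1 byte per value (vs 2 for BF16)
print((w - from_fp8(w_fp8, scale)).abs().max())  # small quantization error
```
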
2T Behemoth
Massive 2T parameter teacher model with 288B active parameters used to train other Llama 4 variants.
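Distilling from Behemoth means training the smaller models to match the teacher's output distribution as well as the ground-truth tokens. Meta describes a distillation loss that dynamically weights soft and hard targets; the fixed-weight version below is a generic sketch of that idea, not the exact Llama 4 loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary next-token cross-entropy against the data.
    hard = F.cross_entropy(student_logits, labels)
    # Fixed alpha here; Llama 4 reportedly weights the two terms dynamically.
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 token positions, vocabulary of 10
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```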