Llama 4 Models Overview

Llama 4 Architecture and Training

Visualizing the architecture and training techniques behind Meta's latest multimodal large language models

Llama 4: The Next Generation
Key innovations in the Llama 4 model architecture and training process

Key Innovations

1. Mixture of Experts (MoE): More efficient architecture with 17B active parameters out of 400B total
2. Native Multimodality: Early fusion to integrate text and vision tokens
3. MetaP Training: New technique for setting critical model hyper-parameters (see the sketch after this list)
4. FP8 Precision: Efficient model training without sacrificing quality
5. Mid-training: Enhanced capabilities with specialized datasets
6. Behemoth Model: 2T parameter teacher model that distills knowledge to smaller models
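
Meta has not published MetaP's internals; purely as an illustration of the general idea of hyper-parameter transfer (tune per-layer learning rates and initialization scales at a small proxy width, then rescale them for the target width, in the spirit of μP), here is a hypothetical sketch. The function name, base width, and scaling rules are assumptions, not the actual MetaP recipe.

```python
import math

def metap_like_scales(width: int, base_width: int = 1024,
                      base_lr: float = 3e-4, base_init_std: float = 0.02):
    """Hypothetical muP-style transfer rule: tune hyper-parameters at a small
    base width, then rescale the per-layer learning rate and initialization
    std as the model gets wider. Illustrative only, NOT Meta's MetaP recipe."""
    ratio = width / base_width
    return {
        # hidden-layer learning rate shrinks as the model widens
        "hidden_lr": base_lr / ratio,
        # init std shrinks like 1/sqrt(width) to keep activation scales stable
        "init_std": base_init_std / math.sqrt(ratio),
    }

# Example: hyper-parameters tuned at width 1024, transferred to an 8192-wide model
print(metap_like_scales(8192))
# {'hidden_lr': 3.75e-05, 'init_std': 0.00707...}
```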

Model Comparison

  • Llama 3: 70B parameters (dense)
  • Llama 4 Maverick: 400B total / 17B active / 128 experts
  • Llama 4 Scout: 109B total / 17B active / 16 experts
  • Llama 4 Behemoth: 2T total / 288B active / 16 experts
Mixture of Experts

Alternating dense and MoE layers with 128 routed experts and a shared expert for inference efficiency.

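A minimal sketch of what an MoE layer with routed experts plus an always-on shared expert can look like. The layer sizes, top-1 routing, and class names below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy MoE feed-forward layer: each token goes to one routed expert
    (top-1 routing) plus a shared expert that always runs. Maverick-style
    layers use 128 routed experts; tiny sizes keep this demo runnable."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)              # token -> expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.shared_expert = nn.Sequential(                      # always active
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                                        # x: (tokens, d_model)
        weight, idx = F.softmax(self.router(x), dim=-1).max(dim=-1)  # top-1 expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                routed[mask] = weight[mask, None] * expert(x[mask])
        return routed + self.shared_expert(x)                    # combine routed + shared outputs

print(MoELayer()(torch.randn(16, 64)).shape)                     # torch.Size([16, 64])
```
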
Multimodality

Early fusion to seamlessly integrate text and vision tokens into a unified model backbone.

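A minimal sketch of what early fusion can look like in practice: image patch features are projected into the same embedding space as text tokens, and the combined sequence runs through a single transformer backbone. Shapes, layer counts, and the projection details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Toy early-fusion model: vision patches are projected into the text
    embedding space, concatenated with text token embeddings, and processed
    by one shared transformer stack (sizes are illustrative)."""
    def __init__(self, vocab=32000, d_model=256, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        self.vision_proj = nn.Linear(patch_dim, d_model)    # patch features -> token space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_patches):
        text_tok = self.text_embed(text_ids)                 # (B, T_text, d_model)
        img_tok = self.vision_proj(image_patches)            # (B, T_img, d_model)
        fused = torch.cat([img_tok, text_tok], dim=1)        # one unified token sequence
        return self.backbone(fused)

model = EarlyFusionBackbone()
out = model(torch.randint(0, 32000, (2, 10)), torch.randn(2, 16, 768))
print(out.shape)                                             # torch.Size([2, 26, 256])
```
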
Training Process

30+ trillion tokens, FP8 precision, and mid-training for enhanced capabilities.

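FP8 training recipes typically keep a per-tensor scale so values fit the narrow e4m3 range. The sketch below only illustrates that scale/cast/dequantize round trip; it assumes PyTorch ≥ 2.1 (which exposes `torch.float8_e4m3fn`) and is not Meta's actual training stack, which would manage such scales inside the forward and backward passes.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def fp8_round_trip(x: torch.Tensor):
    """Illustrate per-tensor scaling for FP8: scale into the e4m3 range,
    cast to float8, then dequantize so we can inspect the rounding error."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)      # quantize (PyTorch >= 2.1)
    return x_fp8.to(torch.float32) / scale           # dequantize for comparison

w = torch.randn(4, 4)
w_hat = fp8_round_trip(w)
print((w - w_hat).abs().max())   # small quantization error, typically on the order of 1e-2
```
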
2T Behemoth

Massive 2T parameter teacher model with 288B active parameters used to train other Llama 4 variants.

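Meta describes distilling from Behemoth with a loss that dynamically weights soft (teacher-logit) and hard (ground-truth) targets during training. The sketch below shows a generic distillation loss of that shape with a fixed weight; the temperature, weight, and tensor sizes are illustrative assumptions, not the published recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, soft_weight=0.5):
    """Generic knowledge-distillation loss: a weighted mix of KL divergence to
    the teacher's softened distribution (soft targets) and cross-entropy on the
    ground-truth tokens (hard targets). Llama 4 reportedly weights these
    dynamically over training; a fixed weight is used here for simplicity."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return soft_weight * soft + (1.0 - soft_weight) * hard

vocab = 1000
student = torch.randn(8, vocab)        # student logits for 8 tokens
teacher = torch.randn(8, vocab)        # teacher (Behemoth-style) logits
labels = torch.randint(0, vocab, (8,)) # ground-truth token ids
print(distillation_loss(student, teacher, labels))
```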