Llama 4 Models Overview

Llama 4 Architecture and Training

Visualizing the architecture and training techniques behind Meta's latest multimodal large language models

Llama 4: The Next Generation
Key innovations in the Llama 4 model architecture and training process

Key Innovations

1. Mixture of Experts (MoE): More efficient architecture with 17B active parameters out of 400B total
2. Native Multimodality: Early fusion to integrate text and vision tokens
3. MetaP Training: New technique for setting critical model hyper-parameters (see the sketch after this list)
4. FP8 Precision: Efficient model training without sacrificing quality
5. Mid-training: Enhanced capabilities with specialized datasets
6. Behemoth Model: 2T parameter teacher model that distills knowledge to smaller models
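
Meta has not published MetaP's internals; purely as an illustration of the general idea of hyper-parameter transfer (tune per-layer learning rates and initialization scales at a small proxy width, then rescale them for the target width, in the spirit of μP), here is a hypothetical sketch. The function name, base width, and scaling rules are assumptions, not the actual MetaP recipe.

```python
import math

def metap_like_scales(width: int, base_width: int = 1024,
                      base_lr: float = 3e-4, base_init_std: float = 0.02):
    """Hypothetical muP-style transfer rule: tune hyper-parameters at a small
    base width, then rescale the per-layer learning rate and initialization
    std as the model gets wider. Illustrative only, NOT Meta's MetaP recipe."""
    ratio = width / base_width
    return {
        # hidden-layer learning rate shrinks as the model widens
        "hidden_lr": base_lr / ratio,
        # init std shrinks like 1/sqrt(width) to keep activation scales stable
        "init_std": base_init_std / math.sqrt(ratio),
    }

# Example: hyper-parameters tuned at width 1024, transferred to an 8192-wide model
print(metap_like_scales(8192))
# {'hidden_lr': 3.75e-05, 'init_std': 0.00707...}
```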

Model Comparison

  • Llama 3: 70B parameters (dense)
  • Llama 4 Maverick: 400B total / 17B active / 128 experts
  • Llama 4 Scout: 109B total / 17B active / 16 experts
  • Llama 4 Behemoth: 2T total / 288B active / 16 experts
Mixture of Experts

Alternating dense and MoE layers with 128 routed experts and a shared expert for inference efficiency.

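A minimal sketch of what an MoE layer with routed experts plus an always-on shared expert can look like. The layer sizes, top-1 routing, and class names below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy MoE feed-forward layer: each token goes to one routed expert
    (top-1 routing) plus a shared expert that always runs. Maverick-style
    layers use 128 routed experts; tiny sizes keep this demo runnable."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)              # token -> expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.shared_expert = nn.Sequential(                      # always active
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                                        # x: (tokens, d_model)
        weight, idx = F.softmax(self.router(x), dim=-1).max(dim=-1)  # top-1 expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                routed[mask] = weight[mask, None] * expert(x[mask])
        return routed + self.shared_expert(x)                    # combine routed + shared outputs

print(MoELayer()(torch.randn(16, 64)).shape)                     # torch.Size([16, 64])
```
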
Multimodality

Early fusion to seamlessly integrate text and vision tokens into a unified model backbone.

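A minimal sketch of what early fusion can look like in practice: image patch features are projected into the same embedding space as text tokens, and the combined sequence runs through a single transformer backbone. Shapes, layer counts, and the projection details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Toy early-fusion model: vision patches are projected into the text
    embedding space, concatenated with text token embeddings, and processed
    by one shared transformer stack (sizes are illustrative)."""
    def __init__(self, vocab=32000, d_model=256, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        self.vision_proj = nn.Linear(patch_dim, d_model)    # patch features -> token space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_patches):
        text_tok = self.text_embed(text_ids)                 # (B, T_text, d_model)
        img_tok = self.vision_proj(image_patches)            # (B, T_img, d_model)
        fused = torch.cat([img_tok, text_tok], dim=1)        # one unified token sequence
        return self.backbone(fused)

model = EarlyFusionBackbone()
out = model(torch.randint(0, 32000, (2, 10)), torch.randn(2, 16, 768))
print(out.shape)                                             # torch.Size([2, 26, 256])
```
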
Training Process

30+ trillion tokens, FP8 precision, and mid-training for enhanced capabilities.

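FP8 training recipes typically keep a per-tensor scale so values fit the narrow e4m3 range. The sketch below only illustrates that scale/cast/dequantize round trip; it assumes PyTorch ≥ 2.1 (which exposes `torch.float8_e4m3fn`) and is not Meta's actual training stack, which would manage such scales inside the forward and backward passes.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def fp8_round_trip(x: torch.Tensor):
    """Illustrate per-tensor scaling for FP8: scale into the e4m3 range,
    cast to float8, then dequantize so we can inspect the rounding error."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)      # quantize (PyTorch >= 2.1)
    return x_fp8.to(torch.float32) / scale           # dequantize for comparison

w = torch.randn(4, 4)
w_hat = fp8_round_trip(w)
print((w - w_hat).abs().max())   # small quantization error, typically on the order of 1e-2
```
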
2T Behemoth

Massive 2T parameter teacher model with 288B active parameters used to train other Llama 4 variants.

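Meta describes distilling from Behemoth with a loss that dynamically weights soft (teacher-logit) and hard (ground-truth) targets during training. The sketch below shows a generic distillation loss of that shape with a fixed weight; the temperature, weight, and tensor sizes are illustrative assumptions, not the published recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, soft_weight=0.5):
    """Generic knowledge-distillation loss: a weighted mix of KL divergence to
    the teacher's softened distribution (soft targets) and cross-entropy on the
    ground-truth tokens (hard targets). Llama 4 reportedly weights these
    dynamically over training; a fixed weight is used here for simplicity."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return soft_weight * soft + (1.0 - soft_weight) * hard

vocab = 1000
student = torch.randn(8, vocab)        # student logits for 8 tokens
teacher = torch.randn(8, vocab)        # teacher (Behemoth-style) logits
labels = torch.randint(0, vocab, (8,)) # ground-truth token ids
print(distillation_loss(student, teacher, labels))
```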