This document does not cover Llama Guard 3 (a content safety classification model) or Prompt Guard (a prompt injection detection model); look up those two auxiliary safety models separately.
Every model in the Llama 3 series is dense (no MoE).
Initially released in April 2024, context length 8K tokens, training data 15T tokens (public data + synthetic data)
Llama-3-8B
Llama-3-70B
Released in July 2024; context length 128K tokens (extended via RoPE scaling, sketched after the model list); training data ~15T tokens (including more multilingual data)
Llama-3.1-8B
Llama-3.1-70B
Llama-3.1-405B
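The "extended via RoPE scaling" above refers to changing how fast rotary position embeddings rotate, so that positions far beyond the original training window map back into a range the model has seen. A minimal sketch of the generic idea (position-interpolation-style scaling; the `scale` value and dimensions are illustrative, and this is not Meta's exact scheme):

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 500_000.0) -> torch.Tensor:
    """Per-pair inverse frequencies for rotary position embeddings.
    base=500000 matches the value published for the Llama 3 family."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(positions: torch.Tensor, inv_freq: torch.Tensor,
                scale: float = 1.0) -> torch.Tensor:
    """Rotation angles per (position, dim-pair). Dividing positions by `scale`
    (e.g. 16 when stretching an 8K window toward 128K) compresses distant
    positions back into the range covered during pre-training."""
    return (positions[:, None] / scale) * inv_freq[None, :]

inv_freq = rope_inv_freq(head_dim=128)
angles_8k = rope_angles(torch.arange(8_192).float(), inv_freq)                   # original window
angles_128k = rope_angles(torch.arange(131_072).float(), inv_freq, scale=16.0)  # stretched window
```

Llama 3.1's published configuration scales low- and high-frequency components differently and pairs the change with continued pre-training on long documents, but the principle is the same.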
Released September 25, 2024 (quantized 1B/3B variants followed in October 2024); context length 128K tokens; adds multimodal support. Maximum single-image resolution: 1120×1120 pixels.
The text-only models (1B/3B) were trained on up to 9 trillion tokens (from publicly available sources).
The vision models (11B/90B) were trained on 6 billion image-text pairs: large-scale noisy data for pre-training, then medium-scale high-quality data for refinement; vision is attached to the language model through adapter layers (see the cross-attention sketch after the model list).
The 1B/3B models used knowledge distillation during pre-training (using logits from Llama-3.1-70B as soft targets; a loss sketch follows these notes).
The 11B/90B models relied on post-training: synthetic data generated by Llama 3.1 plus over 3 million visual instruction-tuning examples.
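The logit distillation mentioned above is conventionally implemented as a soft-target loss: the student learns to match the teacher's temperature-softened output distribution alongside the ordinary next-token loss. A minimal sketch of that standard loss (not Meta's training code; temperature and mixing weight are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,   # [batch, vocab]
                      teacher_logits: torch.Tensor,   # [batch, vocab], e.g. from the 70B teacher
                      labels: torch.Tensor,           # [batch] hard next-token labels
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Mix hard-label cross-entropy with KL divergence toward the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # standard T^2 rescaling so both terms have comparable gradients
    return alpha * hard + (1.0 - alpha) * soft
```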
Llama-3.2-1B
Llama-3.2-3B
Llama-3.2-11B-Vision
Llama-3.2-90B-Vision
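For contrast with Llama 4's early fusion later in this document: Llama 3.2 Vision bolts vision onto a finished language model through adapter layers in which text hidden states cross-attend to image-encoder features. A minimal Flamingo-style gated cross-attention sketch (one common way to build such an adapter; dimensions and the gating detail are illustrative, not Meta's exact implementation):

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Lets text hidden states from a frozen LM attend to image features."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: starts as identity

    def forward(self, text_hidden: torch.Tensor,   # [batch, seq, d_model]
                image_feats: torch.Tensor          # [batch, patches, d_model]
                ) -> torch.Tensor:
        attended, _ = self.attn(text_hidden, image_feats, image_feats)
        # Gated residual: the pre-trained language model is undisturbed at init,
        # and the gate learns how much visual signal to mix in.
        return text_hidden + torch.tanh(self.gate) * attended
```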
Released December 6, 2024; context length 128K tokens (via techniques such as RoPE scaling); training data 15T tokens (a new mix of publicly available web data).
It is not a new model trained from scratch but a deeper post-training pass on Llama-3.1-70B. Through more advanced alignment techniques (such as rejection sampling fine-tuning, sketched below, and reinforcement learning for instruction following), it achieves performance close to Llama-3.1-405B at roughly 1/6 the inference cost.
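Rejection sampling fine-tuning boils down to best-of-k selection: sample several responses per prompt, keep the one a reward model scores highest, and run supervised fine-tuning on the winners. A minimal sketch in which `generate`, `score`, and `finetune` are hypothetical helper APIs, not real library calls:

```python
def rejection_sampling_round(model, reward_model, prompts, k: int = 4):
    """One round of rejection-sampling fine-tuning (hypothetical helper APIs)."""
    winners = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(k)]  # k samples per prompt
        best = max(candidates, key=lambda resp: reward_model.score(prompt, resp))
        winners.append((prompt, best))  # keep only the top-scoring response
    model.finetune(winners)  # supervised fine-tuning on the best-of-k pairs
    return model
```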
Llama 4 shifts entirely to MoE (Mixture-of-Experts, i.e. sparse models; a routing sketch follows) plus native multimodality (early fusion of images and video); it is no longer a dense series.
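What "sparse MoE" means in practice: a learned router sends each token through only a few expert feed-forward networks, so per-token compute tracks the active parameter count (17B for Scout and Maverick) rather than the total. A minimal top-k router sketch (expert count, dimensions, and top_k are illustrative, not Llama 4's actual configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: each token activates only top_k of num_experts FFNs."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 16, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)           # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)       # pick experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Production MoE layers add details such as load-balancing losses and shared experts; the sketch shows only the routing principle.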
Scout (distilled, like Maverick, from the Behemoth teacher model)
Official Name: meta-llama/Llama-4-Scout-17B-16E
Parameters: Total parameters ≈ 109B, active parameters 17B, 16 experts
Context Length: 10M tokens (Meta attributes this to the iRoPE architecture: interleaved attention layers without positional embeddings, plus inference-time attention temperature scaling)
Multimodal: supports image input natively (unlike Llama 3.2 Vision's adapter approach, vision is early-fused during pre-training; see the sketch after this list)
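"Early fusion" means image patch embeddings and text token embeddings are concatenated into one sequence that the same transformer backbone processes from the very start of pre-training, rather than vision being attached through adapters afterwards. A minimal sketch of the input construction (all shapes and sizes illustrative):

```python
import torch
import torch.nn as nn

class EarlyFusionEmbedder(nn.Module):
    """Project image patches and text tokens into one shared input sequence."""
    def __init__(self, vocab_size: int = 128_000, d_model: int = 512,
                 patch_dim: int = 3 * 14 * 14):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)  # image patches -> "tokens"

    def forward(self, text_ids: torch.Tensor,       # [seq_text] token ids
                image_patches: torch.Tensor         # [n_patches, patch_dim]
                ) -> torch.Tensor:
        text_tokens = self.text_embed(text_ids)
        image_tokens = self.patch_proj(image_patches)
        # One fused sequence: the same backbone attends across both modalities.
        return torch.cat([image_tokens, text_tokens], dim=0)
```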
Official Name: meta-llama/Llama-4-Maverick-17B-128E
Parameters: Total parameters ≈ 400B, active parameters 17B, 128 experts
Context Length: 1M tokens
Multimodal: image + short-video input (up to about 1 minute of 720p video)
Not yet released; still in training as of November 2025
Parameters: total parameters ≈ 2T, active parameters 288B, 16 experts
Serves as the "teacher" model for Scout and Maverick