Llama Models

This document does not cover Llama Guard 3 (a content-safety classification model) or Prompt Guard (a prompt-injection detection model); refer to their own documentation for details on these two safety auxiliary models.

Llama 3 INFO

All models in the Llama 3 series are dense (non-MoE) Transformers.

Llama 3

Initially released in April 2024, context length 8K tokens, training data 15T tokens (public data + synthetic data)

Llama 3.1

Released in July 2024, context length 128K tokens (extended via RoPE frequency rescaling), training data roughly 15T tokens (with broader multilingual coverage than Llama 3)
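
A minimal sketch of the frequency-dependent RoPE rescaling behind the 128K extension, following the rope_scaling values published in the Llama 3.1 config (scaling factor 8, low/high frequency factors 1 and 4, original context 8192); the helper name and standalone structure here are ours:

```python
import math

def llama3_rope_rescale(inv_freq, factor=8.0, low_freq_factor=1.0,
                        high_freq_factor=4.0, old_context_len=8192):
    """Rescale RoPE inverse frequencies for long-context use.

    Fast-rotating (short-wavelength) dimensions are kept as-is, slow-rotating
    dimensions are slowed down by `factor`, and the band in between is
    linearly interpolated between the two regimes.
    """
    low_wavelen = old_context_len / low_freq_factor    # 8192
    high_wavelen = old_context_len / high_freq_factor  # 2048
    rescaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_wavelen:        # high-frequency dims: unchanged
            rescaled.append(freq)
        elif wavelen > low_wavelen:       # low-frequency dims: scaled down
            rescaled.append(freq / factor)
        else:                             # smooth transition band
            smooth = (old_context_len / wavelen - low_freq_factor) \
                     / (high_freq_factor - low_freq_factor)
            rescaled.append((1 - smooth) * freq / factor + smooth * freq)
    return rescaled
```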

Llama 3.2

Released September 25, 2024 (quantized 1B/3B variants followed in October 2024), with a context length of 128K tokens and multimodal support. Maximum single-image resolution: 1120×1120 pixels.

The text-only models (1B/3B) were trained on up to 9 trillion tokens of publicly available data.

The vision models (11B/90B) used 6 billion image-text pairs: pre-training on large-scale noisy data, followed by refinement on medium-scale, higher-quality data.
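
A minimal usage sketch for the 11B vision model through Hugging Face transformers (the Mllama classes landed in transformers 4.45; the repo is gated, and the image path is a placeholder):

```python
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated: accept the license first
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image, prompt, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```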

The 1B/3B models used knowledge distillation during pre-training (using token-level logits from the Llama 3.1 8B and 70B models as targets).
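
Meta has not published the exact distillation objective, so the sketch below shows the standard recipe logit distillation usually follows: a temperature-softened KL term against the teacher's logits blended with ordinary next-token cross-entropy. The temperature and mixing weight are illustrative assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL (teacher logits) with hard-target cross-entropy.

    Shapes: logits (batch, seq, vocab), labels (batch, seq).
    `temperature` and `alpha` are illustrative, not Meta's values.
    """
    student = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    teacher = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # T^2 keeps soft-target gradient magnitudes comparable across temperatures
    soft = F.kl_div(student, teacher, reduction="batchmean") * temperature**2
    hard = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    return alpha * soft + (1.0 - alpha) * hard
```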

The 11B/90B models' post-training used synthetic data generated by Llama 3.1 plus over 3 million visual instruction-tuning examples.

Llama 3.3

Released December 6, 2024, with a context length of 128K tokens (using techniques such as RoPE scaling) and training data of 15T tokens (a new mix of publicly available web data).

It is not a completely new model trained from scratch, but an advanced post-training pass over Llama-3.1-70B. Through improved alignment techniques (such as fine-tuning on rejection-sampled data and reinforcement learning for instruction following), it achieves performance close to Llama 3.1 405B at roughly one-sixth of the inference cost.
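
A minimal sketch of running the instruct checkpoint through Hugging Face transformers (the repo is gated, and the full 70B model needs on the order of 140 GB of GPU memory in bf16):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # gated: accept the license first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "In one sentence, what is Llama 3.3?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```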

Llama 4 INFO

Llama 4 shifts entirely to MoE (Mixture-of-Experts, a sparse architecture) plus native multimodality (early fusion of image and video tokens); it is no longer a dense model.
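
A toy sketch of the token-choice routing reported for Llama 4's MoE layers: every token passes through a shared expert that is always active, plus one routed expert picked by a learned gate, so only a fraction of the total parameters are touched per token. All dimensions and routing details below are illustrative, not Meta's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=16):
        super().__init__()
        def make_ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
            )
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.shared_expert = make_ffn()                 # always active
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))

    def forward(self, x):                               # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # routing probabilities
        weight, idx = gate.max(dim=-1)                  # top-1 expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                             # tokens sent to expert e
            if mask.any():
                routed[mask] = weight[mask, None] * expert(x[mask])
        return self.shared_expert(x) + routed

x = torch.randn(10, 64)
print(ToyMoELayer()(x).shape)  # torch.Size([10, 64])
```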

Scout and Maverick are both distilled from Behemoth (see below).

Llama 4 Scout

Official Name: meta-llama/Llama-4-Scout-17B-16E

Parameters: Total parameters ≈ 109B, active parameters 17B, 16 experts

Context Length: 10M tokens (enabled by Meta's iRoPE architecture: interleaved attention layers without positional embeddings, plus inference-time temperature scaling of attention)

Multimodal: Supports image input; unlike Llama 3.2 Vision's bolt-on adapters, images are fused early during pre-training (a loading sketch follows this list)
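
A hedged loading sketch, assuming the Llama 4 integration that shipped in transformers 4.51 (Llama4ForConditionalGeneration plus a processor); the repo is gated, the image URL is a placeholder, and the ~109B weights realistically require multiple GPUs:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # gated: accept the license first
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # shards across available GPUs
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder
        {"type": "text", "text": "What is in this image?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```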

Llama 4 Maverick

Official Name: meta-llama/Llama-4-Maverick-17B-128E

Parameters: Total parameters ≈ 400B, active parameters 17B, 128 experts

Context Length: 1M tokens

Multimodal: Image + short video input (maximum 1 minute 720p)

Llama 4 Behemoth

Not yet released; still in training as of November 2025

Parameters: 288B active parameters, 16 experts, total parameters ≈ 2T

Serves as the “teacher” model for distilling Scout and Maverick