Multimodal AI · March 9, 2026 · 4 min read · via Hugging Face

Ulysses Sequence Parallelism: Million-Token Training

Hugging Face unveils Ulysses SP to train transformers on million-token contexts by sharding the sequence dimension across devices.

The Brieftide

March 9, 2026

TL;DR

01 Hugging Face unveils Ulysses SP to train transformers on million-token contexts by sharding the sequence dimension across devices.
02 Hugging Face released Ulysses Sequence Parallelism, a training technique that enables transformers to be trained with million-token contexts by sharding the sequence dimension across accelerators.
03 The release includes a reference implementation and examples intended to let labs and practitioners scale long-context training without redesigning model architectures.

Hugging Face released Ulysses Sequence Parallelism, a training technique that enables transformers to be trained with million-token contexts by sharding the sequence dimension across accelerators. The release includes a reference implementation and examples intended to let labs and practitioners scale long-context training without redesigning model architectures.

How it works

Ulysses Sequence Parallelism moves the unit of model parallelism from batch or parameter slices to the sequence axis. Instead of storing the full sequence of activations on every device, the method partitions the token sequence into contiguous shards and places each shard on a separate accelerator. Transformer weights remain replicated or sharded using conventional parameter-parallel tools, while activations and attention computation operate on the local sequence shard.
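The partitioning step can be illustrated with a minimal sketch. The helper name `shard_sequence` and the requirement that the sequence length divide evenly are assumptions made for illustration, not details taken from the release.

```python
def shard_sequence(token_ids, num_shards):
    """Split a token sequence into contiguous, equal-length shards, one per device."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if len(token_ids) % num_shards != 0:
        # Assumption: the caller pads the sequence to a multiple of num_shards first.
        raise ValueError("sequence length must be divisible by num_shards")
    shard_len = len(token_ids) // num_shards
    return [token_ids[r * shard_len:(r + 1) * shard_len] for r in range(num_shards)]


tokens = list(range(12))            # stand-in for 12 token ids
shards = shard_sequence(tokens, 4)  # 4 hypothetical devices
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each rank keeps only its own slice, so per-token tensors on a device scale with the shard length rather than the full sequence length.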

During the forward pass, each device computes the feedforward and other per-token operations for its local shard. Attention needs every token to see tokens held on other devices, so the implementation inserts communication steps around each attention layer; in the Ulysses design these are all-to-all collectives that temporarily give each device the full sequence for a subset of attention heads. Backpropagation mirrors the forward pattern, with gradients aggregated across devices using standard all-reduce operations. The result is that peak activation memory per device falls roughly in inverse proportion to the number of sequence shards, trading memory for extra communication.
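The cross-device exchange around attention can be simulated on plain Python lists. This sketch assumes the all-to-all layout described in the DeepSpeed-Ulysses work that gives the technique its name: before attention, each device trades "my tokens, all heads" for "all tokens, my heads", and trades back afterwards. The function names are hypothetical and no real communication library is involved.

```python
def all_to_all(send):
    """Simulate the collective: send[src][dst] is what rank src sends to rank dst.

    Returns recv, where recv[dst][src] is what rank dst received from rank src.
    """
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


def seq_to_head_layout(seq_sharded):
    """Go from 'my tokens, all heads' to 'all tokens, my heads'.

    seq_sharded[rank] is a list of tokens; each token is a list with one
    value per attention head.
    """
    world = len(seq_sharded)
    num_heads = len(seq_sharded[0][0])
    if num_heads % world != 0:
        raise ValueError("head count must be divisible by the number of shards")
    per_rank = num_heads // world
    send = [
        [[tok[dst * per_rank:(dst + 1) * per_rank] for tok in seq_sharded[src]]
         for dst in range(world)]
        for src in range(world)
    ]
    recv = all_to_all(send)
    # Chunks arrive in source-rank order, which is also sequence order.
    return [[tok for chunk in recv[dst] for tok in chunk] for dst in range(world)]


def head_to_seq_layout(head_sharded):
    """Inverse exchange: back to 'my tokens, all heads' after attention."""
    world = len(head_sharded)
    per_rank = len(head_sharded[0]) // world  # tokens per rank
    send = [
        [head_sharded[src][dst * per_rank:(dst + 1) * per_rank] for dst in range(world)]
        for src in range(world)
    ]
    recv = all_to_all(send)
    # For each local token, stitch the head slices back together in rank order.
    return [
        [[v for src in range(world) for v in recv[dst][src][t]] for t in range(per_rank)]
        for dst in range(world)
    ]


# 2 devices, 4 tokens, 4 heads; each value is labelled (token, head).
seq_sharded = [
    [[(t, h) for h in range(4)] for t in (0, 1)],  # rank 0 holds tokens 0-1
    [[(t, h) for h in range(4)] for t in (2, 3)],  # rank 1 holds tokens 2-3
]
head_sharded = seq_to_head_layout(seq_sharded)
print(head_sharded[0])  # every token, but only the heads assigned to rank 0
restored = head_to_seq_layout(head_sharded)
```

Because each device sees the whole sequence for its heads, an unmodified attention kernel can run locally between the two exchanges. The divisibility check reflects a constraint of this layout: the head count limits how many sequence shards can be used.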

The reference code integrates with common training stacks and is designed to interoperate with memory-efficient attention kernels and existing model-parallel primitives. Users can combine Ulysses SP with gradient accumulation and mixed precision to further lower memory pressure while keeping the model architecture unchanged.

Benchmarks and limitations

Initial examples show the technique enabling training runs with context lengths on the order of 10^6 tokens that would otherwise exceed per-GPU memory. Peak memory usage per device falls as the sequence axis is split across more devices, and the method preserves gradient values and standard optimizer behaviour.
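A back-of-envelope calculation shows why a million-token context strains a single device. The figures below are illustrative assumptions (a hidden size of 4096 and 2-byte bf16 values), not numbers from the release, and they count a single hidden-state tensor rather than a full layer's activations.

```python
GIB = 1024 ** 3


def hidden_state_bytes(seq_len, hidden_size, bytes_per_value=2, num_shards=1):
    """Bytes held by one device for a single [seq_len, hidden_size] activation tensor."""
    if seq_len % num_shards != 0:
        raise ValueError("seq_len must be divisible by num_shards")
    return (seq_len // num_shards) * hidden_size * bytes_per_value


seq_len = 1_048_576  # 2**20 tokens
hidden = 4096        # assumed model width
for shards in (1, 2, 4, 8):
    size = hidden_state_bytes(seq_len, hidden, num_shards=shards)
    print(f"{shards} shard(s): {size / GIB:.1f} GiB per device for one tensor")
```

Under these assumptions one unsharded tensor takes 8 GiB, and a transformer layer keeps several such tensors alive for the backward pass, so splitting the sequence eight ways brings each one down to 1 GiB.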

There are trade-offs. Network communication increases because devices must exchange activations around every attention layer, which raises training latency and can require faster interconnects to be practical at very large scale. The technique also shifts some engineering complexity onto the training loop: careful partitioning, attention windowing strategies, and attention-kernel support are needed to keep the communication volume manageable.
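The communication cost can be estimated with a rough sketch. It assumes an all-to-all exchange in which each device keeps 1/P of its local tensor and sends the rest, and four such exchanges per attention layer in the forward pass (query, key, value, and attention output) with full-width tensors, as in the multi-head case described for DeepSpeed-Ulysses. Grouped-query attention would move less data for keys and values. All names and sizes here are illustrative.

```python
GIB = 1024 ** 3


def all_to_all_bytes_sent(seq_len, hidden_size, num_shards, bytes_per_value=2):
    """Bytes one device sends in a single all-to-all over its local activation tensor."""
    if seq_len % num_shards or hidden_size % num_shards:
        raise ValueError("seq_len and hidden_size must be divisible by num_shards")
    local_bytes = (seq_len // num_shards) * hidden_size * bytes_per_value
    # One of the num_shards equal chunks stays on the device; the others go out.
    return local_bytes // num_shards * (num_shards - 1)


def forward_bytes_per_layer(seq_len, hidden_size, num_shards, exchanges=4):
    """Assumed four exchanges per attention layer: Q, K, V, and the attention output."""
    return exchanges * all_to_all_bytes_sent(seq_len, hidden_size, num_shards)


sent = forward_bytes_per_layer(1_048_576, 4096, 8)
print(f"{sent / GIB:.2f} GiB sent per device per attention layer (forward)")
```

With these assumed sizes each device sends 3.5 GiB per attention layer in the forward pass alone, which is why interconnect bandwidth becomes a deciding factor.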

Other constraints include optimizer state and embedding memory, which remain tied to model parameters and may still limit the minimum hardware footprint. Positional encoding strategies may need adaptation when contexts stretch to millions of tokens, and dataset preparation changes because training examples must be assembled into very long sequences or concatenated streams.
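The data-preparation change can be sketched as follows: documents are concatenated into one stream, cut into fixed-length training sequences, and each sequence shard is given global position ids so positional encodings stay consistent across devices. The helper names, the end-of-document id, and the choice to drop the ragged tail are assumptions for illustration.

```python
def pack_stream(documents, seq_len, eos_id):
    """Concatenate tokenized documents, separated by eos_id, into fixed-length sequences.

    Tokens left over after the last full sequence are dropped.
    """
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(eos_id)
    usable = len(stream) - len(stream) % seq_len
    return [stream[i:i + seq_len] for i in range(0, usable, seq_len)]


def shard_position_ids(seq_len, num_shards, rank):
    """Global position ids for the contiguous slice of the sequence held by `rank`."""
    shard_len = seq_len // num_shards
    start = rank * shard_len
    return list(range(start, start + shard_len))


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_stream(docs, seq_len=4, eos_id=0)
print(packed)
print(shard_position_ids(seq_len=8, num_shards=4, rank=2))
```

A shard that numbered its tokens from zero would give the model the wrong positions, so the offset by `rank * shard_len` is the part that matters here.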

Hugging Face published the implementation and examples to illustrate integration with transformer checkpoints and common acceleration libraries. The code provides recipes for trade-offs between the number of sequence shards, communication pattern, and achievable context length so teams can tune for their hardware.
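The tuning described above can be approximated with a small helper that picks the fewest sequence shards satisfying a per-device token budget. This is a hypothetical sketch: `max_tokens_per_device` stands in for a limit a team would measure on its own hardware, and the requirement that the shard count divide the attention-head count is an assumption carried over from the DeepSpeed-Ulysses all-to-all layout.

```python
def pick_num_shards(seq_len, num_heads, max_tokens_per_device, max_devices):
    """Return the smallest usable shard count, or None if no count fits.

    A count is usable when it divides both seq_len and num_heads and leaves
    at most max_tokens_per_device tokens on each device.
    """
    for shards in range(1, max_devices + 1):
        if seq_len % shards or num_heads % shards:
            continue
        if seq_len // shards <= max_tokens_per_device:
            return shards
    return None


choice = pick_num_shards(
    seq_len=1_048_576,
    num_heads=32,
    max_tokens_per_device=131_072,  # assumed, hardware-dependent budget
    max_devices=64,
)
print(choice)
```

Preferring the smallest shard count keeps communication lower, since every extra shard adds more cross-device traffic for the same sequence.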

Why it matters

Ulysses Sequence Parallelism makes training with million-token contexts practically attainable for groups that can provision multi-node accelerator clusters and fast interconnects, reducing the need for wholesale model redesign. That matters for research directions that depend on very long-range context, like long-form document modeling, multi-document retrieval pipelines, and some multimodal tasks where extended temporal context is helpful.

Figure: Ulysses SP data and compute layout

Primary source: Hugging Face (huggingface.co)
