
Transitive RL: divide-and-conquer value learning for off-policy RL

Transitive RL (TRL) uses a divide-and-conquer value update to scale off-policy RL to long-horizon tasks.

The Brieftide

TL;DR

  • Transitive RL (TRL) uses a divide-and-conquer value update that can reduce the number of Bellman recursions logarithmically rather than linearly.
  • TRL is an off-policy reinforcement learning algorithm aimed at long-horizon, goal-conditioned tasks.
  • Off-policy RL can use arbitrary data, including old experience and demonstrations, and is essential where data collection is expensive.

Transitive RL (TRL) is an off-policy reinforcement learning algorithm that applies a divide-and-conquer value-learning paradigm to scale to long-horizon, goal-conditioned tasks. The method comes from researchers at Berkeley AI Research; the source notes that the work was co-led with Aditya. It was evaluated on the hardest humanoidmaze and puzzle tasks in OGBench, using 1B-sized datasets and tasks that require skills spanning up to 3,000 environment steps.

The scaling problem TRL targets

Off-policy RL can use arbitrary data, including old experience and demonstrations, and is essential where data collection is expensive. The standard off-policy approach trains value functions with temporal difference (TD) learning, using the Bellman update Q(s,a) <- r + gamma max_{a'} Q(s',a'). The source identifies a core scaling failure mode for TD learning: errors in the bootstrapped next value propagate into the current value and accumulate over long horizons. A pragmatic mitigation is n-step TD, which uses Monte Carlo returns for the first n steps and bootstraps thereafter. This cuts the number of Bellman recursions only by a constant factor of n, adds variance, and introduces a hyperparameter n that must be tuned per task.
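To make the n-step mix concrete, here is a minimal sketch of an n-step TD target in Python. The function name and arguments are illustrative and not taken from the source; `bootstrap_value` stands in for max_{a'} Q(s_{t+n}, a') from whatever value network is being trained.

```python
def n_step_td_target(rewards, bootstrap_value, gamma, n):
    """Return the n-step TD target for one starting state.

    rewards:          observed rewards r_t, r_{t+1}, ... from the dataset
    bootstrap_value:  stand-in for max_{a'} Q(s_{t+n}, a') (assumed given)
    gamma:            discount factor
    n:                number of Monte Carlo steps before bootstrapping
    """
    if n < 1 or n > len(rewards):
        raise ValueError("n must be between 1 and len(rewards)")
    # Monte Carlo part: discounted sum of the first n observed rewards.
    monte_carlo_part = sum(gamma ** t * rewards[t] for t in range(n))
    # Bootstrapped part: the value estimate n steps ahead, discounted by gamma^n.
    return monte_carlo_part + gamma ** n * bootstrap_value
```

With n = 1 this reduces to the one-step update quoted above. A larger n leans more on observed returns and less on the bootstrapped estimate.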

TRL proposes a different route: a third paradigm in value learning, divide and conquer, that can reduce the number of Bellman recursions logarithmically rather than linearly. The approach is especially natural for goal-conditioned RL, which aims to learn a policy that reaches any state from any other state and therefore admits a shortest-path structure.
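A rough, idealized count shows the gap between linear and logarithmic. The sketch below assumes one recursion per step for one-step TD, one per n-step chunk for n-step TD, and one per halving for divide and conquer. This counting convention is a simplification for illustration, not a figure from the source.

```python
import math


def recursion_depths(horizon, n):
    """Idealized number of chained value updates needed to span `horizon` steps.

    Returns (one_step_td, n_step_td, divide_and_conquer).
    """
    if horizon < 1 or n < 1:
        raise ValueError("horizon and n must be positive")
    one_step_td = horizon                                  # one recursion per step
    n_step_td = math.ceil(horizon / n)                     # one recursion per n-step chunk
    divide_and_conquer = math.ceil(math.log2(horizon))     # one recursion per halving
    return one_step_td, n_step_td, divide_and_conquer
```

Under these assumptions, a 3,000-step horizon with n = 10 needs 3,000 chained updates for one-step TD, 300 for n-step TD, and 12 for divide and conquer.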

How TRL works in practice

Under deterministic dynamics, the temporal distance d*(s,g) satisfies a triangle inequality: d*(s,g) <= d*(s,w) + d*(w,g). TRL translates that idea into a transitive Bellman update for the sparse-reward value V(s,g):

  • V(s,g) = gamma^0 if s = g,
  • V(s,g) = gamma^1 if (s,g) is an environment transition edge,
  • otherwise V(s,g) <- max_{w in S} V(s,w) V(w,g).
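The three cases above can be sketched as a tabular procedure on a small deterministic graph. This is a toy illustration of the update rule only: it searches every state as a midpoint, which works just for tiny state spaces, and the function name and synchronous-sweep schedule are assumptions rather than the authors' implementation.

```python
import math


def transitive_value_iteration(num_states, edges, gamma=0.9, sweeps=None):
    """Tabular transitive update: V(s,g) <- max_w V(s,w) * V(w,g).

    edges is a list of directed (s, g) environment transitions.
    Unreachable pairs keep value 0.0.
    """
    n = num_states
    V = [[0.0] * n for _ in range(n)]
    for s, g in edges:
        V[s][g] = gamma            # one-step edges: gamma^1
    for s in range(n):
        V[s][s] = 1.0              # same state: gamma^0
    if sweeps is None:
        # Each sweep doubles the path length covered, so ~log2(n) sweeps suffice.
        sweeps = math.ceil(math.log2(max(n - 1, 1)))
    for _ in range(sweeps):
        # Synchronous sweep: every entry is recomputed from the previous table.
        # Taking w = s reproduces the old value, so base cases are preserved.
        V = [
            [max(V[s][w] * V[w][g] for w in range(n)) for g in range(n)]
            for s in range(n)
        ]
    return V
```

On a five-state chain 0 -> 1 -> 2 -> 3 -> 4, one sweep resolves pairs up to two steps apart and a second sweep resolves pairs up to four steps apart, giving V(0,4) = gamma^4. That doubling is the logarithmic behavior the update is designed for.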

The practical challenge is how to choose the midpoint w in large or continuous state spaces. TRL restricts the search for w to states that appear in the dataset and lie between s and g on the same trajectory, avoiding an exhaustive search over the full state space. Instead of a hard argmax over w, TRL uses a soft argmax via expectile regression: it minimizes an expectile loss (denoted ℓ^2_kappa in the source) on quantities of the form V(s_i,s_j) - \bar{V}(s_i,s_k) \bar{V}(s_k,s_j), where \bar{V} is a target value network and the expectation is taken over (s_i,s_k,s_j) tuples with i <= k <= j sampled from dataset trajectories. Restricting candidate midpoints to dataset states reduces the search cost, and expectile regression mitigates the overestimation that a plain max operator would cause. The authors call this algorithm Transitive RL, or TRL.
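A minimal sketch of the pieces described above follows, in plain Python. Several details are assumptions and not confirmed by the source: the loss uses the common asymmetric-squared-error form with u = target - prediction (the source writes the argument in the opposite order, so its sign convention may differ), indices are sampled uniformly, and all function names are hypothetical.

```python
import random


def expectile_loss(prediction, target, kappa):
    """Asymmetric squared error (assumed convention, u = target - prediction).

    With kappa > 0.5, under-predictions cost more than over-predictions,
    so the minimizer is pulled toward the larger targets (a soft max).
    """
    u = target - prediction
    weight = kappa if u > 0 else 1.0 - kappa
    return weight * u * u


def sample_midpoint_tuple(trajectory_length, rng):
    """Sample indices i <= k <= j from one trajectory (uniform sampling is an assumption)."""
    i, k, j = sorted(rng.randrange(trajectory_length) for _ in range(3))
    return i, k, j


def transitive_loss_sample(value_fn, target_value_fn, trajectory, kappa, rng):
    """One-sample loss: regress V(s_i, s_j) toward V_bar(s_i, s_k) * V_bar(s_k, s_j)."""
    i, k, j = sample_midpoint_tuple(len(trajectory), rng)
    s_i, s_k, s_j = trajectory[i], trajectory[k], trajectory[j]
    target = target_value_fn(s_i, s_k) * target_value_fn(s_k, s_j)
    return expectile_loss(value_fn(s_i, s_j), target, kappa)
```

In a real training loop the value functions would be neural networks and the loss would be averaged over a batch. Here they are plain callables so the structure of the update is visible.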

Evaluation and key results

TRL was evaluated on challenging offline, goal-conditioned benchmarks in OGBench. The evaluations focused on the hardest versions of humanoidmaze and puzzle tasks with 1B-sized datasets and horizons up to 3,000 environment steps, tasks that require combinatorially complex skills across long horizons. Against many strong baselines drawn from TD, Monte Carlo, quasimetric learning and related categories, TRL "achieves the best performance on most tasks," according to the source.

A highlighted result is that TRL matches the best individually tuned TD-n across tasks without needing to set the hyperparameter n. The source presents this as a key advantage: by recursively splitting trajectories, TRL handles long horizons without arbitrarily choosing the length of trajectory chunks.

The paper contains further experiments, analyses and ablations, and the authors acknowledge feedback from Kevin and Sergey.

Why it matters

TRL signals a concrete, practical pathway for off-policy value learning that avoids the primary scaling failure of TD bootstrapping. By reducing Bellman recursions logarithmically via recursive splitting, and by replacing a hard argmax with expectile regression over dataset midpoints, TRL can handle very long horizons without the added variance and per-task tuning of n that n-step TD brings. If the approach generalizes beyond goal-conditioned settings, it would address one of the central outstanding challenges in scalable off-policy RL.

What to watch

Whether TRL can be extended from deterministic, goal-conditioned settings to regular reward-based RL and to stochastic environments is the next major test flagged by the authors. Concrete signals to watch are follow-up results that apply TRL to reward-based problems, demonstrations handling stochastic dynamics, and proposed methods for choosing subgoal candidates beyond same-trajectory states.

Written by The Brieftide · Source: Berkeley AI Research
