CoT training in LLM agents: gains land in prompt actions

CoT training improves models' direct prompt-action predictions rather than widening CoT reasoning gains; masking some action-token supervision improves out-of-domain generalization.

The Brieftide

TL;DR

  • CoT training improves models' direct prompt-action predictions rather than widening CoT reasoning gains; masking some action-token supervision improves out-of-domain generalization.
  • Jingyu Liu and four co-authors submitted a paper to arXiv on 25 Jun 2026 (arXiv:2606.26935) that examines where chain-of-thought training actually helps in language-model agents.
  • The authors ask whether CoT training makes models better at changing actions via generated reasoning or simply better at predicting actions directly from the prompt.

Jingyu Liu and four co-authors submitted a paper to arXiv on 25 Jun 2026 (arXiv:2606.26935) that examines where chain-of-thought training actually helps in language-model agents. The authors ask whether CoT training makes models better at changing actions via generated reasoning or simply better at predicting actions directly from the prompt.

What did the paper test and find?

The paper compares two kinds of output: "prompt actions" (the next action predicted without chain-of-thought) and "CoT actions" (the action predicted after generating CoT). It finds that prompt-action quality rises substantially across checkpoints. The authors report that CoT remains useful, but CoT training does not widen CoT reasoning's relative advantage; instead it measurably improves the model's ability to predict actions directly from prompts.

The study frames its experiments around checkpoints of model training and measures action predictions both with and without verbalized reasoning. The key empirical patterns the authors highlight are: prompt-action quality improves substantially over training; the relative edge of CoT actions over prompt actions during environment interaction stays roughly constant; and later checkpoints revise actions in response to CoT less often, indicating increased reliance on prompt-derived signals.
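To make the comparison concrete, here is a minimal sketch of how per-checkpoint prompt-action and CoT-action accuracy, and the gap between them, could be computed. The `Step` record, its field names, and the toy data are assumptions made for illustration; they are not the paper's data format, metric definition, or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One interaction step (hypothetical record layout)."""
    checkpoint: int      # training checkpoint index
    prompt_action: str   # action predicted directly from the prompt
    cot_action: str      # action predicted after generating CoT
    reference: str       # action treated as correct for this step


def accuracy_by_checkpoint(steps):
    """Return {checkpoint: (prompt_acc, cot_acc, gap)}, gap = cot_acc - prompt_acc."""
    buckets = defaultdict(list)
    for step in steps:
        buckets[step.checkpoint].append(step)

    result = {}
    for checkpoint, group in sorted(buckets.items()):
        n = len(group)
        prompt_acc = sum(s.prompt_action == s.reference for s in group) / n
        cot_acc = sum(s.cot_action == s.reference for s in group) / n
        result[checkpoint] = (prompt_acc, cot_acc, cot_acc - prompt_acc)
    return result


# Invented toy data, chosen only to show the calculation.
steps = [
    Step(1, "click", "click", "click"),
    Step(1, "scroll", "type", "type"),
    Step(1, "scroll", "scroll", "type"),
    Step(1, "back", "click", "click"),
    Step(2, "click", "click", "click"),
    Step(2, "type", "type", "type"),
    Step(2, "scroll", "type", "type"),
    Step(2, "back", "click", "click"),
]

for checkpoint, (prompt_acc, cot_acc, gap) in accuracy_by_checkpoint(steps).items():
    print(checkpoint, prompt_acc, cot_acc, gap)
```

In this toy data both accuracies rise between the two checkpoints while the gap stays the same, which is the shape of the pattern the authors describe; the numbers themselves carry no evidential weight.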

How do CoT and prompt actions compare across checkpoints?

Across checkpoints the authors observe a substantial improvement in prompt-action quality, while the gap between CoT and prompt actions during interaction remains similar. Put simply, training with CoT raises baseline prompt-action performance rather than expanding the marginal benefit of producing CoT when acting.

The paper emphasizes two linked trends. First, models become better at predicting the correct action directly from the prompt as training progresses. Second, although CoT can still change or justify an action, its relative advantage over prompt-only predictions does not grow with CoT training. The authors interpret the decline in action revisions at later checkpoints as evidence the model increasingly trusts the prompt signal and less often updates its choice in response to generated reasoning.
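The revision trend can be expressed as two simple ratios: how often the CoT action differs from the prompt action, and how often such a revision lands on the correct action. The sketch below assumes each step is available as a `(prompt_action, cot_action, reference)` tuple; that layout and the function name `revision_stats` are hypothetical and may differ from how the paper measures revisions.

```python
def revision_stats(pairs):
    """Summarize how often CoT changes the action, and how often that helps.

    pairs: iterable of (prompt_action, cot_action, reference) tuples.
    Returns (revision_rate, helpful_share):
      revision_rate = share of steps where the CoT action differs from the prompt action
      helpful_share = share of those revisions where the CoT action is correct
    """
    pairs = list(pairs)
    revised = [(p, c, r) for p, c, r in pairs if p != c]
    revision_rate = len(revised) / len(pairs) if pairs else 0.0
    helpful = sum(1 for _, c, r in revised if c == r)
    helpful_share = helpful / len(revised) if revised else 0.0
    return revision_rate, helpful_share


# Invented examples for an earlier and a later checkpoint.
early = [
    ("scroll", "type", "type"),
    ("click", "click", "click"),
    ("back", "click", "click"),
    ("type", "scroll", "type"),
]
late = [
    ("click", "click", "click"),
    ("type", "type", "type"),
    ("scroll", "type", "type"),
    ("back", "back", "click"),
]

print(revision_stats(early))
print(revision_stats(late))
```

Reporting the two ratios separately keeps "the model revises less" distinct from "revisions are less useful", which a single accuracy number would blur together.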

What intervention did the authors test and what happened?

Motivated by those patterns, the authors selectively masked action-token supervision on a fraction of training examples and found this intervention improved out-of-domain generalization. Masking here means withholding direct action-token labels on some examples during training so the model cannot rely solely on supervised action prediction signals.

The paper presents this masking as a targeted change to the supervision signal: removing action-token guidance in some training instances improved robustness to new domains. The authors frame this as a practical lever that follows from their observation that CoT training raises prompt-action quality rather than enlarging CoT's decision-time advantage.
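One way to picture the intervention is as a per-token loss mask that zeroes out action tokens on a chosen subset of training examples. The sketch below is a plain-Python illustration under stated assumptions: the token role labels, the helper names, the masking fraction, and the choice to exclude prompt tokens from the loss are all hypothetical, and the paper's actual implementation may differ.

```python
import random


def choose_masked_examples(num_examples, fraction, seed=0):
    """Pick which example indices lose action-token supervision (hypothetical helper)."""
    rng = random.Random(seed)
    k = round(num_examples * fraction)
    return set(rng.sample(range(num_examples), k))


def loss_mask(roles, mask_actions):
    """Build a per-token mask: 1.0 means the token contributes to the loss.

    roles: per-token labels, each "prompt", "cot", or "action".
    Assumption: prompt tokens serve as context only and are never supervised.
    """
    mask = []
    for role in roles:
        if role == "prompt":
            mask.append(0.0)
        elif role == "action" and mask_actions:
            mask.append(0.0)   # action-token supervision withheld on this example
        else:
            mask.append(1.0)
    return mask


def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over supervised tokens only."""
    total = sum(loss * m for loss, m in zip(token_losses, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Invented per-token losses for one example: 2 prompt, 2 CoT, 1 action token.
roles = ["prompt", "prompt", "cot", "cot", "action"]
losses = [0.9, 0.8, 0.4, 0.2, 1.0]

masked_ids = choose_masked_examples(num_examples=10, fraction=0.3, seed=1)
print(sorted(masked_ids))
print(masked_mean_loss(losses, loss_mask(roles, mask_actions=False)))
print(masked_mean_loss(losses, loss_mask(roles, mask_actions=True)))
```

With the action token masked, only the CoT tokens drive the update for that example, so the gradient no longer directly rewards predicting the action label. In a framework such as PyTorch the same effect is commonly obtained by setting the masked positions' labels to the loss function's ignore index.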

Why it matters

If CoT training mainly boosts prompt-action prediction, then teams building LLM-based agents should reassess when and why they expect generated reasoning to change behavior. Models that become better at predicting actions from prompts could reduce the marginal value of producing verbose CoT at inference time, with trade-offs for interpretability, latency, and trust.

The masking result points to a simple training intervention that may improve robustness: in the authors' experiments, withholding some direct action supervision improved out-of-domain performance. That suggests supervision design, not just model scale or more CoT examples, can change how models use prompts versus generated reasoning.

What to watch

Look for follow-up work testing masking strategies across model sizes, domains, and interactive tasks, and for papers reporting quantitative measures of how often CoT revisions lead to correct actions. The arXiv submission to track is arXiv:2606.26935 (v1 submitted 25 Jun 2026) by Jingyu Liu, Zhiwen Wang, Yuxin Jing, Huanyu Zhou, and Yong Liu.

Comparison of prompt actions and CoT actions (authors' findings)

Item | Prompt actions | CoT actions | Authors' finding
Quality across checkpoints | Improves substantially | Improves | Prompt-action quality improves substantially across checkpoints
Relative advantage during interaction | Similar | Similar | CoT training does not widen the advantage of CoT reasoning
Likelihood of revising action at later checkpoints | Less likely to revise | Less likely to revise | Later checkpoints are less likely to revise the action in response to CoT, suggesting greater reliance on the prompt
Effect of masking action-token supervision | N/A (applied to some training examples) | N/A | Selective masking of action-token supervision on a fraction of training examples improves out-of-domain generalization

Written by The Brieftide · Source: arXiv
