CoT training in LLM agents: gains land in prompt actions
CoT training improves models' direct prompt-action predictions rather than widening the advantage of CoT reasoning; selectively masking action-token supervision improves out-of-domain generalization.
TL;DR
- CoT training improves models' direct prompt-action predictions rather than widening the advantage of CoT reasoning; selectively masking action-token supervision improves out-of-domain generalization.
- Jingyu Liu and four co-authors posted the paper to arXiv on 25 Jun 2026 (arXiv:2606.26935).
- The central question: does CoT training make models better at changing actions via generated reasoning, or simply better at predicting actions directly from the prompt?
Jingyu Liu and four co-authors submitted a paper to arXiv on 25 Jun 2026 (arXiv:2606.26935) that examines where chain-of-thought training actually helps in language-model agents. The authors ask whether CoT training makes models better at changing actions via generated reasoning or simply better at predicting actions directly from the prompt.
What did the paper test and find?
The paper directly compares two outputs: "prompt actions" (the next action predicted without chain-of-thought) and "CoT actions" (the action predicted after generating CoT). It finds that prompt-action quality rises substantially across checkpoints. The authors report that CoT remains useful, but CoT training does not widen CoT reasoning's relative advantage; instead, it measurably improves the model's ability to predict actions directly from prompts.
The study frames its experiments around checkpoints of model training and measures action predictions both with and without verbalized reasoning. The key empirical patterns the authors highlight are: prompt-action quality improves substantially over training; the relative edge of CoT actions over prompt actions during environment interaction stays roughly constant; and later checkpoints revise actions in response to CoT less often, indicating increased reliance on prompt-derived signals.
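The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes "quality" means agreement with a reference action, and the `Step` record, its field names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One decision point. All field names are hypothetical."""
    prompt_action: str  # action predicted directly from the prompt (no CoT)
    cot_action: str     # action predicted after generating CoT
    reference: str      # action treated as correct for scoring (an assumption)


def action_accuracy(steps, field):
    """Fraction of steps whose chosen action matches the reference."""
    if not steps:
        return 0.0
    return sum(getattr(s, field) == s.reference for s in steps) / len(steps)


def summarize(steps):
    """Prompt-action accuracy, CoT-action accuracy, and the gap between them."""
    prompt_acc = action_accuracy(steps, "prompt_action")
    cot_acc = action_accuracy(steps, "cot_action")
    return {"prompt_acc": prompt_acc, "cot_acc": cot_acc, "cot_gap": cot_acc - prompt_acc}


# Toy data for two hypothetical checkpoints (illustrative only, not paper results).
early = [
    Step("left", "right", "right"),
    Step("left", "left", "left"),
    Step("up", "down", "down"),
    Step("up", "left", "down"),
]
late = [
    Step("right", "right", "right"),
    Step("left", "left", "left"),
    Step("up", "down", "down"),
    Step("left", "down", "down"),
]

for name, steps in [("early", early), ("late", late)]:
    print(name, summarize(steps))
```

In this toy setup the prompt-action accuracy rises between checkpoints while `cot_gap` stays the same, which is the shape of the pattern the paper describes; the numbers themselves carry no meaning.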
How do CoT and prompt actions compare across checkpoints?
Across checkpoints the authors observe a substantial improvement in prompt-action quality, while the gap between CoT and prompt actions during interaction remains similar. Put simply, training with CoT raises baseline prompt-action performance rather than expanding the marginal benefit of producing CoT when acting.
The paper emphasizes two linked trends. First, models become better at predicting the correct action directly from the prompt as training progresses. Second, although CoT can still change or justify an action, its relative advantage over prompt-only predictions does not grow with CoT training. The authors interpret the decline in action revisions at later checkpoints as evidence the model increasingly trusts the prompt signal and less often updates its choice in response to generated reasoning.
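One way to make the "revision" trend concrete is to count how often the CoT action differs from the prompt action, and how often such a change lands on the correct action. The sketch below assumes that definition of a revision, which the brief does not confirm; the function name, tuple layout, and toy data are hypothetical.

```python
def revision_stats(pairs):
    """Compute revision statistics for one checkpoint.

    pairs: list of (prompt_action, cot_action, reference) tuples.
    Returns (revision_rate, helpful_share), where a "revision" is assumed
    to mean the CoT action differs from the prompt action.
    """
    revised = [(p, c, r) for p, c, r in pairs if p != c]
    revision_rate = len(revised) / len(pairs) if pairs else 0.0
    # Share of revisions that end on the reference action.
    helpful = sum(c == r for _, c, r in revised)
    helpful_share = helpful / len(revised) if revised else 0.0
    return revision_rate, helpful_share


# Toy data (illustrative only, not paper results).
early_pairs = [
    ("left", "right", "right"),  # revised, helpful
    ("up", "left", "down"),      # revised, not helpful
    ("left", "left", "left"),    # kept
    ("down", "down", "down"),    # kept
]
late_pairs = [
    ("up", "down", "down"),      # revised, helpful
    ("left", "left", "left"),    # kept
    ("right", "right", "right"), # kept
    ("down", "down", "down"),    # kept
]

print("early:", revision_stats(early_pairs))
print("late:", revision_stats(late_pairs))
```

A falling `revision_rate` across checkpoints would correspond to the decline in action revisions the authors describe; `helpful_share` is an extra diagnostic added here for illustration.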
What intervention did the authors test and what happened?
Motivated by those patterns, the authors selectively masked action-token supervision on a fraction of training examples and found this intervention improved out-of-domain generalization. Masking here means withholding direct action-token labels on some examples during training so the model cannot rely solely on supervised action prediction signals.
The paper presents this masking as a targeted change to the supervision signal: removing action-token guidance from some training instances is meant to keep the model from leaning solely on direct action prediction, and the reported result is better robustness to new domains. The authors frame this as a practical lever that follows from their observation that CoT training raises prompt-action quality rather than enlarging CoT's decision-time advantage.
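A rough sketch of what such masking could look like in a training pipeline follows. It is an assumption-laden illustration, not the authors' method: the example layout (`labels`, `is_action`), the `mask_fraction` parameter, and the random selection of examples are all hypothetical. The only external convention used is `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so masked positions would contribute no loss.

```python
import random

# Default ignore_index of torch.nn.CrossEntropyLoss; positions with this
# label are skipped when the loss is computed.
IGNORE_INDEX = -100


def mask_action_labels(examples, mask_fraction, seed=0):
    """Withhold action-token labels on a fraction of training examples.

    examples: list of dicts with
        "labels":    list[int]  target token ids
        "is_action": list[bool] True where the token belongs to the action
    Both keys are hypothetical names. Returns new dicts; inputs are not mutated.
    """
    rng = random.Random(seed)
    n_mask = round(len(examples) * mask_fraction)
    chosen = set(rng.sample(range(len(examples)), n_mask))

    masked = []
    for i, ex in enumerate(examples):
        labels = list(ex["labels"])
        if i in chosen:
            # Keep supervision on reasoning tokens, drop it on action tokens.
            labels = [IGNORE_INDEX if a else t for t, a in zip(labels, ex["is_action"])]
        masked.append({**ex, "labels": labels})
    return masked


# Toy batch: three reasoning tokens followed by one action token.
batch = [
    {"labels": [11, 12, 13, 99], "is_action": [False, False, False, True]}
    for _ in range(4)
]
for ex in mask_action_labels(batch, mask_fraction=0.5):
    print(ex["labels"])
```

The fraction to mask and the rule for choosing examples are left open here; the brief says only that masking was applied to "a fraction of training examples".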
Why it matters
If CoT training mainly boosts prompt-action prediction, then teams building LLM-based agents should reassess when and why they expect generated reasoning to change behavior. Models that become better at predicting actions from prompts could reduce the marginal value of producing verbose CoT at inference time, with trade-offs for interpretability, latency, and trust.
The masking result points to a simple training intervention that may improve robustness: withholding some direct action supervision helps out-of-domain performance. That suggests supervision design, not just model scale or more CoT examples, can change how models use prompts versus generated reasoning.
What to watch
Look for follow-up work testing masking strategies across model sizes, domains, and interactive tasks, and for papers reporting quantitative measures of how often CoT revisions lead to correct actions. The arXiv submission to track is arXiv:2606.26935 (v1 submitted 25 Jun 2026) by Jingyu Liu, Zhiwen Wang, Yuxin Jing, Huanyu Zhou, and Yong Liu.
| Item | Prompt actions | CoT actions | Finding |
|---|---|---|---|
| Quality across checkpoints | Improves substantially | Improves | Prompt-action quality improves substantially across checkpoints |
| Relative advantage during interaction | Similar | Similar | CoT training does not widen the advantage of CoT reasoning |
| Likelihood of revising action at later checkpoints | Less likely to revise | Less likely to revise | Later checkpoints are less likely to revise the action in response to CoT, suggesting greater reliance on the prompt |
| Effect of masking action-token supervision | N/A (applied to some training examples) | N/A | Selective masking of action-token supervision on a fraction of training examples improves out-of-domain generalization |
Written by The Brieftide · Source: arXiv