Semi-CoT: Semi-supervised Chain-of-Thought Learning Study
Semi-CoT reuses unlabeled questions to create pseudo-CoTs; an entropy gate picks low-entropy chains.
TL;DR
- Semi-CoT reuses unlabeled questions to create pseudo-CoTs; an entropy gate picks low-entropy chains.
- Semi-CoT, a semi-supervised Chain-of-Thought learning framework by Hongyang He, Jiuming Liu and Victor Sanchez, was submitted on 1 Jul 2026.
- The framework treats chain-of-thought traces not merely as inference-time prompts but as semi-supervised signals, extending a self-training view of CoT into pseudo-supervision.
Semi-CoT, a semi-supervised Chain-of-Thought learning framework by Hongyang He, Jiuming Liu and Victor Sanchez, was submitted on 1 Jul 2026. The method samples multiple pseudo reasoning chains for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy chains as pseudo-CoT demonstrations for training students.
How does Semi-CoT work?
Semi-CoT constructs pseudo reasoning supervision from unlabeled questions by sampling multiple pseudo-CoTs, computing an answer-level semantic entropy, and keeping low-entropy chains as reliable demonstrations. The framework treats chain-of-thought traces not merely as inference-time prompts but as semi-supervised signals, extending a self-training view of CoT into pseudo-supervision.
The pipeline is simple: for each unlabeled question Semi-CoT generates multiple candidate reasoning chains, measures the semantic entropy at the answer level to estimate consensus, and selects those chains that fall below an entropy gate as pseudo-CoTs. Those selected chains are then reused as demonstrations for student models.
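The gating step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the gate value of 0.6, and the toy chains are all hypothetical. It also approximates "semantic" equivalence by exact string match on the final answer and keeps only the chains that reach the majority answer, both of which are assumptions the brief does not confirm.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over final answers.

    Assumption: answers are grouped by exact string match, a crude stand-in
    for the semantic clustering the paper's entropy estimate may use.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_pseudo_cots(chains, gate):
    """Return the chains kept as pseudo-CoT demonstrations for one question.

    chains: list of (reasoning_text, final_answer) pairs sampled for the question.
    gate:   hypothetical entropy threshold; questions above it are discarded.
    """
    answers = [answer for _, answer in chains]
    if not answers or answer_entropy(answers) > gate:
        return []
    # Assumption: keep only the chains that agree with the majority answer.
    majority, _ = Counter(answers).most_common(1)[0]
    return [(text, answer) for text, answer in chains if answer == majority]


# Toy example: three of four sampled chains agree on the answer "12".
sampled = [
    ("chain A ...", "12"),
    ("chain B ...", "12"),
    ("chain C ...", "12"),
    ("chain D ...", "15"),
]
kept = select_pseudo_cots(sampled, gate=0.6)
print(len(kept), "chains kept")
```

A lower gate keeps fewer questions but makes the retained pseudo answers more likely to reflect a strong consensus among the sampled chains; the right threshold would need tuning per benchmark.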
How well does Semi-CoT perform on benchmarks?
Pilot experiments on four benchmarks produced mixed results: pseudo-answer precision ranged from 91.36% to 100%, SVAMP and GSM8K saw small gains, AQuA experienced negative transfer, and MultiArith hit a ceiling.
The paper lists AQuA, SVAMP, GSM8K and MultiArith as the evaluation suites. On SVAMP and GSM8K Semi-CoT yielded modest improvements, suggesting some benefit from the added pseudo-supervision. By contrast, AQuA showed negative transfer, meaning performance declined when using the selected pseudo-CoTs, and MultiArith reached a ceiling where Semi-CoT did not improve results further.
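One plausible reading of pseudo-answer precision is the share of gated questions whose pseudo answer matches the gold answer. The sketch below assumes that definition, which the brief does not spell out; the function name and the toy question IDs and answers are hypothetical and are not results from the paper.

```python
def pseudo_answer_precision(pseudo_answers, gold_answers):
    """Fraction of selected pseudo answers that match the gold answer.

    pseudo_answers: dict mapping question id -> pseudo answer, covering only
                    the questions that passed the entropy gate.
    gold_answers:   dict mapping question id -> reference answer.
    Assumption: a match is an exact string comparison.
    """
    if not pseudo_answers:
        raise ValueError("no pseudo answers were selected")
    correct = sum(
        1 for qid, answer in pseudo_answers.items()
        if gold_answers.get(qid) == answer
    )
    return correct / len(pseudo_answers)


# Toy data for illustration only.
pseudo = {"q1": "12", "q2": "7", "q3": "40", "q4": "3"}
gold = {"q1": "12", "q2": "7", "q3": "42", "q4": "3"}
print(f"{pseudo_answer_precision(pseudo, gold):.2%}")
```

Because the measure only covers questions that survive the gate, a high value says nothing about how many unlabeled questions were discarded along the way.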
Why it matters
Semi-CoT demonstrates that unlabeled questions can supply high-precision pseudo reasoning signals, with selected pseudo-CoTs achieving precision between 91.36% and 100%. That matters because chain-of-thought is typically used only as an inference-time prompt; reusing generated chains as training supervision could reduce the need for expensive human-annotated reasoning traces.
At the same time the mixed benchmark outcomes underline limits: reliable selection and effective student training remain necessary. The authors note that while the entropy gate finds high-precision pseudo-CoTs, translating those signals into consistent across-the-board gains requires stronger demonstration selection or improvements in how students are trained on pseudo-supervision.
What to watch
Watch for follow-up work that delivers stronger demonstration selection methods or revised student training regimes. The paper flags those two levers explicitly as the paths needed to make unlabeled-question pseudo-supervision broadly effective.
The technical report offers a concise proof of concept: unlabeled questions can be a source of pseudo-CoTs, but converting high pseudo-answer precision into consistent task gains is the next technical milestone to check.
| Benchmark | Outcome | Pseudo-answer precision |
|---|---|---|
| AQuA | negative transfer | 91.36%–100% (overall range reported) |
| SVAMP | small gains | 91.36%–100% (overall range reported) |
| GSM8K | small gains | 91.36%–100% (overall range reported) |
| MultiArith | reached a ceiling | 91.36%–100% (overall range reported) |
Written by The Brieftide · Source: arXiv