LLM Calibration with C3RL and CAS: Adaptive Test-Time Scaling

C3RL trains LLMs to express well-calibrated confidence; CAS uses that verbalized confidence to cut inference budget by up to 12.33×.

The Brieftide

TL;DR

  • C3RL trains LLMs to express well-calibrated confidence; CAS uses that verbalized confidence to cut inference budget by up to 12.33×.
  • C3RL, a reinforcement learning algorithm introduced by Xuqing Yang, Yi Yuan, Shanzhe Lei and Xuhong Wang, trains large language models to pair correctness with calibrated confidence.
  • The paper, submitted to arXiv on 2 July 2026, shows C3RL improves calibration without sacrificing accuracy and powers CAS, an inference strategy that reduces inference budget by up to 12.33 times.

C3RL, a reinforcement learning algorithm introduced by Xuqing Yang, Yi Yuan, Shanzhe Lei and Xuhong Wang, trains large language models to pair correctness with calibrated confidence. The paper, submitted to arXiv on 2 July 2026, shows C3RL improves calibration without sacrificing accuracy and powers CAS, an inference strategy that reduces inference budget by up to 12.33 times.

What are C3RL and CAS?

C3RL stands for Correctness and Confidence Calibration Reinforcement Learning, an RL algorithm that explicitly rewards models for correctness, calibration and dataset-informed reference accuracy. The paper argues that prevailing RL reward designs typically prioritize response correctness and neglect calibration, which leaves models overconfident when they are uncertain. CAS stands for Confidence-based Adaptive Test Time Scaling, an inference-time strategy that uses the well-calibrated, verbalized confidence produced by C3RL to allocate computational resources adaptively.

C3RL integrates three reward components: a correctness reward, a calibration reward, and a dataset-informed reference accuracy reward. CAS consumes the model’s verbalized confidence and adjusts how much computation is spent per query, rather than applying a fixed ensemble or fixed number of samples.
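To make the three-part reward concrete, here is a minimal Python sketch. It is not the paper's formula, which the abstract does not spell out. The weights, the Brier-style calibration term, the distance-to-reference term, and every name in the block (RewardWeights, combined_reward, reference_accuracy) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not report any values.
    correctness: float = 1.0
    calibration: float = 1.0
    reference: float = 0.5


def combined_reward(is_correct: bool, confidence: float,
                    reference_accuracy: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Illustrative three-part reward (an assumption, not the paper's formula)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if is_correct else 0.0

    # Correctness term: 1 for a right answer, 0 for a wrong one.
    r_correct = outcome
    # Calibration term (Brier-style): highest when the stated confidence
    # matches the actual outcome.
    r_calib = 1.0 - (confidence - outcome) ** 2
    # Reference term (assumed form): keeps confidence near the
    # dataset-level accuracy supplied by the caller.
    r_ref = 1.0 - abs(confidence - reference_accuracy)

    return (w.correctness * r_correct
            + w.calibration * r_calib
            + w.reference * r_ref)
```

Under these assumed terms, a wrong answer given with confidence 0.3 scores higher than the same wrong answer given with confidence 0.9, which is the behavior a calibration-aware reward is meant to encourage.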

How well do they perform?

C3RL and CAS were evaluated across eight text and multimodal datasets, and the authors report improved calibration while keeping accuracy intact. The paper states C3RL "enhances calibration without sacrificing accuracy, outperforming the current state-of-the-art method in both performance and calibration metrics." Using the verbalized confidence from C3RL, CAS "surpasses majority voting on both in-domain and out-of-domain datasets while reducing the inference budget by up to 12.33 times."
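The abstract does not name the calibration metrics it uses. One widely used metric is expected calibration error (ECE), which bins predictions by stated confidence and averages the gap between confidence and accuracy in each bin. The sketch below assumes equal-width bins; the function name and the default of 10 bins are illustrative choices, not details from the paper.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if is_correct else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by its share of all predictions.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.9 on four answers and gets three right has an ECE of about 0.15 under this definition; lower is better, and 0 means stated confidence matches observed accuracy in every bin.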

The study therefore ties a training-side change to a concrete inference-side saving: better-calibrated verbalized confidence enables fewer votes or less ensemble computation while maintaining or improving final decisions. The experiments span text and multimodal settings, indicating the approach was tested outside a single task type.

Why it matters

Poor calibration creates a false sense of certainty: models that answer confidently but incorrectly can cause downstream systems to act on wrong information. By building calibration into the reward signal, C3RL pushes the model to learn not just what is likely correct but how certain it should be about that answer. That permits an inference policy, CAS, which devotes extra compute only when the model signals low confidence, improving resource efficiency. The paper’s claim of up to a 12.33× reduction in inference budget directly connects improved calibration to tangible operational savings.
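The idea of spending extra compute only on low-confidence queries can be sketched as follows. This is a hypothetical policy, not the CAS algorithm itself, whose scheduling rule the abstract does not describe. The confidence threshold, the sample cap, the confidence-weighted vote, and the names adaptive_answer and scripted_sampler are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple

# Hypothetical interface: one model call returns
# (answer, verbalized confidence in [0, 1]).
Sampler = Callable[[str], Tuple[str, float]]


def adaptive_answer(query: str, sample: Sampler,
                    threshold: float = 0.8,
                    max_samples: int = 8) -> Tuple[str, int]:
    """Return (answer, number of model calls used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")

    answer, confidence = sample(query)
    # Confident first answer: stop after a single call.
    if confidence >= threshold or max_samples == 1:
        return answer, 1

    # Low confidence: draw the remaining samples and take a
    # confidence-weighted vote (an assumed aggregation rule).
    votes = defaultdict(float)
    votes[answer] += confidence
    for _ in range(max_samples - 1):
        extra_answer, extra_confidence = sample(query)
        votes[extra_answer] += extra_confidence
    best = max(votes, key=votes.get)
    return best, max_samples


def scripted_sampler(outputs: Iterable[Tuple[str, float]]) -> Sampler:
    """Stand-in for a model call that replays fixed outputs, for demos only."""
    it = iter(outputs)
    return lambda _query: next(it)
```

With a scripted first answer at confidence 0.95, the function returns after one call; with a first answer at 0.4 it uses the full sample cap. Savings of this kind depend on the confidence signal being trustworthy, which is why the training-side calibration matters.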

This is most relevant for deployments where both reliability and compute cost count: a model that signals uncertainty honestly can reduce wasted ensemble runs or human review while keeping the accuracy levels achieved by conventional RL-trained models.

What to watch

The authors say the code, data and models will be released, which will let others verify the calibration and budget claims and reproduce CAS scheduling policies. Watch for those releases and for independent benchmarks that compare C3RL against named baseline methods, since the paper reports outperforming the current state-of-the-art but does not list those methods in the abstract.

The paper is available on arXiv as arXiv:2607.01612, submitted 2 July 2026, and the next concrete signal will be the public code and model release the authors promise. If external replication confirms the 12.33× inference reduction and calibration gains across diverse datasets, practitioners may adopt confidence-driven adaptive inference as a standard efficiency tactic.

Written by The Brieftide · Source: arXiv
