Coding Agents · July 3, 2026 · 4 min read

SkillCoach: Self-Evolving Rubrics for LLM Agent Skill-Use

A rubric framework that derives skill-grounded process rubrics from real rollouts and evaluates trajectories across four dimensions.

The Brieftide · July 3, 2026

TL;DR

01 A rubric framework that derives skill-grounded process rubrics from real rollouts and evaluates trajectories across four dimensions.
02 Jiayin Zhu and six coauthors submitted SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use, to arXiv on 2 Jul 2026.
03 The paper describes a system that derives skill-grounded process rubrics from real rollouts and keeps the external verifier as a separate outcome signal.

What is SkillCoach and how does it work?

SkillCoach is a framework that automatically derives and evolves process rubrics from real agent rollouts, then uses those rubrics to evaluate and supervise skill use. It ingests trajectories produced by agents using skills and produces skill-grounded rubrics that judge process quality rather than only final verifier success.

The paper positions skills as a reusable operational layer for LLM agents, encoding standard operating procedures, domain rules, tool workflows, scripts and validation routines. SkillCoach targets the practical problem that overlapping or distractor skills in large repositories make reliable skill selection and composition difficult. To address that, SkillCoach evaluates trajectories along four explicit dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It deliberately keeps an external verifier as a separate outcome signal so process quality can be distinguished from accidental task success.
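One way to picture that separation is a record that stores the four process dimensions next to, but not merged with, the verifier outcome. The sketch below is illustrative only: the class names, field names and the 0-to-1 scale are assumptions made for this example, not structures taken from the paper.

```python
from dataclasses import dataclass, asdict


# Hypothetical record types. The field names mirror the four dimensions
# named in the article; the 0.0-1.0 scale is an assumption.
@dataclass(frozen=True)
class ProcessScores:
    skill_selection: float
    skill_following: float
    skill_composition: float
    skill_grounded_reflection: float

    def mean(self) -> float:
        """Unweighted average of the four dimension scores."""
        values = list(asdict(self).values())
        return sum(values) / len(values)


@dataclass(frozen=True)
class TrajectoryEvaluation:
    trajectory_id: str
    process: ProcessScores
    verifier_passed: bool  # outcome signal, stored apart from process scores


evaluation = TrajectoryEvaluation(
    trajectory_id="demo-001",
    process=ProcessScores(1.0, 0.5, 0.5, 0.0),
    verifier_passed=True,
)
print(evaluation.process.mean(), evaluation.verifier_passed)  # 0.5 True
```

Keeping `verifier_passed` out of the averaged score means a passing outcome cannot inflate a weak process score, which is the distinction the framework is described as preserving.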

How does SkillCoach evaluate agentic skill-use?

SkillCoach derives skill-grounded process rubrics from actual rollouts and scores trajectories on four dimensions, then uses the evolved rubrics as a source of process supervision. The framework judges whether agents picked the right skills, adhered to each skill's required steps, composed skills into correct workflows and produced reflections tied to skills, while treating final verifier success as an independent signal.
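To make "adhered to each skill's required steps" concrete, here is a toy, rule-based stand-in for a skill-following check. It is not the paper's scoring method, which the abstract does not spell out; the function name, the step names and the greedy in-order matching rule are all assumptions chosen for illustration.

```python
from typing import List


def step_coverage(required: List[str], observed: List[str]) -> float:
    """Fraction of required steps found in `observed`, matched greedily in order."""
    if not required:
        return 1.0
    matched = 0
    position = 0  # next index in `observed` still available for matching
    for step in required:
        try:
            position = observed.index(step, position) + 1
        except ValueError:
            # Step is missing, or appears only before the current position.
            continue
        matched += 1
    return matched / len(required)


# Hypothetical skill with three required steps; the agent never validates.
required_steps = ["read_spec", "run_script", "validate_output"]
observed_steps = ["read_spec", "run_script", "retry", "run_script"]
coverage = step_coverage(required_steps, observed_steps)
print(round(coverage, 2))  # 0.67
```

A check like this would score the trajectory below 1.0 even if the final answer happened to be correct, because the validation step was skipped.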

Final verifier success can mask failures, for example when an agent reaches a correct outcome through trial and error, wrong skill choices or skipped steps, so SkillCoach separates process evaluation from the external verifier. The authors report that the evolved rubrics substantially improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering for enhancing agentic skill-use. Because the rubrics are driven by the skill repository and real rollouts, they can evolve as agents and skill sets change, turning evaluation traces into training supervision for selecting high-quality trajectories.
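The idea of failures hidden by final accuracy can be sketched as a two-by-two classification of process quality against verifier outcome. Everything here is hypothetical: the 0.6 threshold, the category labels and the sample scores are made up for illustration and do not come from the paper.

```python
from collections import Counter


def classify(process_score: float, verifier_passed: bool, threshold: float = 0.6) -> str:
    """Place a rollout in one of four cells: process quality x outcome."""
    sound = process_score >= threshold
    if verifier_passed and sound:
        return "sound success"
    if verifier_passed:
        return "hidden process failure"  # right answer, questionable path
    if sound:
        return "sound process, failed outcome"
    return "failure"


# Made-up (process_score, verifier_passed) pairs, for illustration only.
samples = [(0.9, True), (0.2, True), (0.8, False), (0.1, False), (0.4, True)]
counts = Counter(classify(score, passed) for score, passed in samples)
print(counts["hidden process failure"])  # 2
```

An outcome-only metric would count three of these five rollouts as successes, while the process view flags two of those three as suspect.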

Why it matters

Process-level evaluation addresses a practical blind spot in agent training and assessment: final outcomes do not reveal how an agent arrived at a result. SkillCoach makes that distinction explicit by scoring how skills were selected, followed and combined, and by keeping verifier outcome separate. That shift matters for teams that maintain large skill libraries, because it surfaces systematic failures such as use of distractor skills, omitted checks or incorrect workflow composition that standard outcome-only metrics miss.

By turning evolved rubrics into process supervision, SkillCoach also offers a path to better training data selection. The paper claims these rubrics provide stronger supervision signals than filtering trajectories solely by outcome, which could improve the quality of training trajectories used to teach agents how to use skills reliably.
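The contrast between outcome-only filtering and rubric-based selection can be sketched as two filters over the same pool of rollouts. This is a minimal illustration under stated assumptions: the `Rollout` type, the 0.7 cutoff and the choice to require both a verifier pass and a minimum process score are this example's own, and the paper may combine the signals differently.

```python
from typing import List, NamedTuple, Optional


class Rollout(NamedTuple):
    rollout_id: str
    process_score: float  # hypothetical aggregate rubric score in [0, 1]
    verifier_passed: bool


def outcome_only_filter(rollouts: List[Rollout]) -> List[Rollout]:
    """Keep every rollout the external verifier accepted."""
    return [r for r in rollouts if r.verifier_passed]


def process_aware_filter(
    rollouts: List[Rollout],
    min_process_score: float = 0.7,
    top_k: Optional[int] = None,
) -> List[Rollout]:
    """Keep verifier-passing rollouts that also clear a process-score cutoff."""
    kept = [
        r for r in rollouts
        if r.verifier_passed and r.process_score >= min_process_score
    ]
    kept.sort(key=lambda r: r.process_score, reverse=True)
    return kept if top_k is None else kept[:top_k]


# Made-up rollouts, for illustration only.
pool = [
    Rollout("r1", 0.90, True),
    Rollout("r2", 0.30, True),   # passes the verifier despite a weak process
    Rollout("r3", 0.80, False),
    Rollout("r4", 0.75, True),
]
print([r.rollout_id for r in outcome_only_filter(pool)])   # ['r1', 'r2', 'r4']
print([r.rollout_id for r in process_aware_filter(pool)])  # ['r1', 'r4']
```

In this toy pool the outcome-only filter admits `r2` into the training set, while the process-aware filter drops it.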

What to watch

Watch for public code, datasets or reproduction studies that apply SkillCoach rubrics to different skill repositories and task domains, and for comparisons that quantify how much process-driven supervision improves downstream agent behavior compared with outcome-only filtering. Also check whether other teams adopt the four-dimension rubric (skill selection, skill following, skill composition and skill-grounded reflection) as a standard for agent process evaluation.

Authors and submission details: the paper is authored by Jiayin Zhu, Kelong Mao, Yudong Guo, Dengbo He, Sulong Xu, Simiu Gu and Yutao Yue, and was submitted to arXiv on 2 Jul 2026. The arXiv submission file metadata lists a package size of 25,156 KB.

Figure: SkillCoach concept map

Written by The Brieftide · Source: arXiv

