
RLVR: Proof-of-Concept for Tool-Use Agents on Atlassian APIs

RLVR training raised average rewards for Qwen3 models on Jira and Confluence v2 synthetic workflows.

The Brieftide

TL;DR

  • RLVR training raised average rewards for Qwen3 models on Jira and Confluence v2 synthetic workflows.
  • The authors created a suite of five synthetic environments that emulate the Jira REST v3 and Confluence v2 APIs at schema fidelity, and a reward pipeline that reads only the tool-call trace.
  • The environments are presented as a proof of concept for outcome-optimised small models aimed at niche enterprise endpoints.

Karthikeya Aditya Vissa, Sankalp Mane, Ananya Mantravadi, Harshit Rajgarhia and Abhishek Mukherji submitted a paper on 1 Jul 2026 showing that Reinforcement Learning with Verifiable Rewards, or RLVR, can improve small-model outcomes inside enterprise-style APIs. They built five synthetic environments that emulate the Jira REST v3 and Confluence v2 schemas and scored prompted Qwen3-1.7B and Qwen3.5-4B models, computing rewards entirely from tool-call traces with no live API, learned judge, or human labels.

What did the authors build?

They created a suite of five synthetic environments that emulate the Jira REST v3 and Confluence v2 APIs at schema fidelity, and a reward pipeline that reads only the tool-call trace. Rewards were verifiable by construction: the paper computes them from the sequence of tool calls rather than from a learned judge or human labels, enabling training inside the target environment without a live production API.
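To make "reads only the tool-call trace" concrete, here is a minimal sketch of what such a checker could look like for a page-creation scenario. The `ToolCall` type, the `create_page` tool name, the argument names and the equal-weight scoring are all assumptions for illustration; the brief does not describe the paper's actual checker logic.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical trace format)."""
    name: str
    arguments: dict[str, Any]


def page_creation_reward(
    trace: list[ToolCall], expected_space: str, expected_title: str
) -> float:
    """Score a trace in [0, 1] using only the recorded calls.

    No live API, learned judge, or human label is consulted.
    """
    creates = [call for call in trace if call.name == "create_page"]
    if not creates:
        return 0.0  # the agent never attempted the target action
    final = creates[-1]  # judge the last attempt
    checks = [
        final.arguments.get("spaceId") == expected_space,
        final.arguments.get("title") == expected_title,
        bool(final.arguments.get("body")),  # some non-empty body was sent
    ]
    return sum(checks) / len(checks)


good_trace = [
    ToolCall("create_page", {"spaceId": "42", "title": "Runbook", "body": "<p>Steps</p>"})
]
print(page_creation_reward(good_trace, expected_space="42", expected_title="Runbook"))
```

Because the score is a pure function of the trace, the same checker can serve both evaluation and training, which matches the brief's description of rewards that are verifiable by construction.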

The environments are presented as a proof of concept for outcome-optimised small models aimed at niche enterprise endpoints. The authors position the work as preliminary and explicitly call out limitations around reward design and scale.

How does RLVR affect agent performance?

On the four scenarios whose rewards are non-degenerate, the RL-trained policy lifts average reward from a prompted-4B baseline range of 0.35–0.92 to 0.95–1.00, with the largest single gain on Confluence page creation (0.35 -> 1.00). The paper reports that the prompted Qwen3-1.7B and Qwen3.5-4B models were scored with the same checkers that drive GRPO training.

One of the five scenarios, ticket-transition, has a saturating reward shape that the prompted 4B already maxes out. That saturation is left as a boundary condition in the experiments: RLVR yields large gains where the reward signal admits improvement, and little or no gain where the baseline is already at the reward ceiling.
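One way to see why a saturated reward leaves little room for RL is to look at the group-relative advantage commonly used in GRPO, where each sampled completion's reward is normalised against the mean and standard deviation of its group. The brief does not say which GRPO variant or group size the authors used, so the function below is a generic sketch with made-up reward values, not the paper's code or its stated explanation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each reward against its own group (mean and population std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group where some rollouts succeed and some fail:
mixed = group_relative_advantages([0.0, 1.0, 1.0, 0.0])

# Illustrative group where every rollout already earns the maximum reward:
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)      # successes get positive advantages, failures negative
print(saturated)  # every advantage is zero, so there is nothing to reinforce
```

Under this reading, a scenario the prompted model already maxes out yields identical rewards across rollouts, so the policy update has no signal to separate better from worse behaviour.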

How did the reward and training setup differ from standard approaches?

The key difference is the reward signal. Instead of training with next-token likelihood or relying on a learned reward model, rewards are verifiable and derived solely from tool-call traces inside the synthetic environment. The authors deliberately avoid a live API, a learned judge, and human labels. That makes the reward both objective and reproducible for the endpoints they model, but it requires hand-crafting checkers for each scenario.
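Avoiding a live API means the environment itself has to accept or reject calls the way the real schema would, while recording everything the agent does. The sketch below shows the general idea for an issue-creation call. The class name `FakeJira`, the response shape and the reduced set of required fields are assumptions for illustration; this is not the authors' emulator and not a complete rendering of the Jira REST v3 schema.

```python
from typing import Any


class FakeJira:
    """In-memory stand-in for an issue-creation endpoint; makes no network calls."""

    def __init__(self) -> None:
        self.trace: list[dict[str, Any]] = []  # every call, valid or not
        self._issues: dict[str, dict[str, Any]] = {}

    def create_issue(self, fields: dict[str, Any]) -> dict[str, Any]:
        # Record the call first so that failed attempts also appear in the trace.
        self.trace.append({"tool": "create_issue", "arguments": {"fields": fields}})

        errors: dict[str, str] = {}
        project = fields.get("project")
        if not isinstance(project, dict) or "key" not in project:
            errors["project"] = "project.key is required"
        if not fields.get("summary"):
            errors["summary"] = "summary is required"
        issuetype = fields.get("issuetype")
        if not isinstance(issuetype, dict) or "name" not in issuetype:
            errors["issuetype"] = "issuetype.name is required"
        if errors:
            return {"status": 400, "errors": errors}

        key = f"{project['key']}-{len(self._issues) + 1}"
        self._issues[key] = fields
        return {"status": 201, "key": key}


env = FakeJira()
ok = env.create_issue(
    {"project": {"key": "OPS"}, "summary": "Rotate keys", "issuetype": {"name": "Task"}}
)
bad = env.create_issue({"summary": "Missing project and type"})
print(ok, bad["status"], len(env.trace))
```

A reward checker can then read `env.trace` after the episode ends, which keeps scoring independent of any production system.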

This setup allowed direct policy optimisation in the target workflow: the paper applies RLVR directly in the emulated Jira and Confluence contexts and measures outcome-level success rather than token-level proxies.

Why it matters

The paper's results suggest that outcome-focused training inside a target environment can turn a small language model into a far more reliable tool user for narrow enterprise APIs, at least in emulation. The reported lift from a 4B-baseline range of 0.35–0.92 up to 0.95–1.00 on non-degenerate scenarios indicates that policy optimisation against verifiable rewards can narrow the gap between token prediction and correct API actions.

The authors also warn of clear limits. They write that "hand-crafting verifiable rewards does not scale beyond the handful of endpoints reported here," which highlights the tradeoff: stronger, verifiable supervision for specific workflows versus the engineering cost of creating those checkers at scale.

What to watch

Whether verifiable reward checkers can be generalised or automated beyond a handful of endpoints is the next concrete test for RLVR. Also worth watching: attempts to move from synthetic, schema-fidelity environments to live API deployments, and whether reward-shape issues like the ticket-transition saturation can be diagnosed or redesigned to permit further gains.

References and key facts pulled from the paper: five synthetic environments emulating Jira REST v3 and Confluence v2; experiments scored prompted Qwen3-1.7B and Qwen3.5-4B; RL-trained policy lifts average reward on four non-degenerate scenarios from a 4B-baseline range of 0.35–0.92 to 0.95–1.00; largest single gain Confluence page creation (0.35 -> 1.00); one scenario (ticket-transition) is saturated by the prompted 4B.

Baseline vs RL-trained average rewards (from paper)

| Item | Baseline (prompted 4B) | RL-trained |
| --- | --- | --- |
| Four non-degenerate scenarios (aggregate) | 0.35–0.92 | 0.95–1.00 |
| Confluence page creation | 0.35 | 1.00 |
| Ticket-transition | saturating reward; prompted 4B already maxes out | saturating (no RL gain reported) |
| Environments (count and targets) | five synthetic (Jira REST v3 & Confluence v2) | n/a |

Written by The Brieftide · Source: arXiv
