Benchmarks & Evals · June 19, 2026 · 5 min read

DeXposure-Claw: Agentic System for DeFi Risk Supervision

DeXposure-Claw routes LLM decisions through forecasts, deterministic monitors and confidence gates; DeXposure-Bench scores tickets against a regulator-aligned absolute-loss ground truth.

The Brieftide · June 19, 2026

TL;DR

01 DeXposure-Claw routes LLM decisions through forecasts, deterministic monitors and confidence gates; DeXposure-Bench scores tickets against a regulator-aligned absolute-loss ground truth.
02 DeXposure-Claw is an agentic supervision system for decentralized finance risk, introduced on 17 June 2026 by Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen and Fengxiang He (arXiv:2606.19501).
03 The system routes language-model decisions through structured forecasts, deterministic monitors and safety gates, then emits auditable supervisory tickets with rationales.

How does DeXposure-Claw work?

The paper describes DeXposure-Claw as a three-part pipeline. (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks. (2) Deterministic monitors and stress scenarios convert those forecasts into typed alerts, attribution signals and scenario evidence. (3) Data-health and confidence gates constrain escalation before the system issues "auditable supervisory tickets with rationales." This structured routing is intended to prevent LLM agents from over-reading weak evidence and recommending high-stakes interventions.
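The forecast, monitor and gate stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the `monitor` and `gate` functions, the `exposure_spike` label and every threshold are hypothetical, and the forecast is passed in as a plain number rather than produced by DeXposure-FM. The LLM step is omitted; the sketch only shows how a deterministic check and two gates could bound what gets escalated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    protocol: str    # hypothetical identifier of the monitored protocol
    kind: str        # typed alert label (illustrative)
    severity: float  # relative forecast growth, capped at 1.0


@dataclass
class Ticket:
    protocol: str
    action: str      # "escalate" or "monitor"
    rationale: str   # human-readable reason, kept for the audit trail


def monitor(protocol: str, current: float, forecast: float,
            threshold: float = 0.25) -> Optional[Alert]:
    """Deterministic monitor: alert when forecast exposure grows past a threshold."""
    if current <= 0:
        return None  # no meaningful baseline to compare against
    growth = (forecast - current) / current
    if growth > threshold:
        return Alert(protocol, "exposure_spike", min(growth, 1.0))
    return None


def gate(alert: Alert, data_health: float, confidence: float,
         min_health: float = 0.8, min_confidence: float = 0.7) -> Ticket:
    """Escalate only if both gates pass; otherwise downgrade and record why."""
    if data_health < min_health:
        reason = f"{alert.kind}: data health {data_health:.2f} below {min_health:.2f}"
        return Ticket(alert.protocol, "monitor", reason)
    if confidence < min_confidence:
        reason = f"{alert.kind}: confidence {confidence:.2f} below {min_confidence:.2f}"
        return Ticket(alert.protocol, "monitor", reason)
    reason = f"{alert.kind}: severity {alert.severity:.2f}, both gates passed"
    return Ticket(alert.protocol, "escalate", reason)


# Usage with made-up numbers: a 40% forecast increase triggers an alert,
# but weak data health keeps the ticket at "monitor".
alert = monitor("protocol-A", current=100.0, forecast=140.0)
if alert is not None:
    print(gate(alert, data_health=0.5, confidence=0.9))
    print(gate(alert, data_health=0.9, confidence=0.9))
```

Every ticket carries a rationale string whether or not it escalates, which mirrors the auditable-ticket idea in the paper's description.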

How is the system evaluated?

Evaluation uses DeXposure-Bench, a six-axis harness introduced in the same paper. Its decision axis scores generated supervisory tickets against a regulator-aligned absolute-loss ground truth and reports an explicit false-intervention rate. The authors report that experiments on five years of weekly real data support the system, and they link to the code from the paper.
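This brief does not give the paper's exact definition of the false-intervention rate, so the following Python sketch rests on two assumptions: an intervention counts as warranted when the realized absolute loss meets a chosen threshold, and the rate is the share of escalated tickets that were not warranted. The function names, the threshold and the sample numbers are all hypothetical.

```python
from typing import List, Sequence


def warranted_from_loss(losses: Sequence[float], threshold: float) -> List[bool]:
    """Assumed ground truth: intervention is warranted if |loss| >= threshold."""
    return [abs(loss) >= threshold for loss in losses]


def false_intervention_rate(escalated: Sequence[bool],
                            warranted: Sequence[bool]) -> float:
    """Share of escalated tickets that the ground truth marks as unwarranted."""
    if len(escalated) != len(warranted):
        raise ValueError("escalated and warranted must have the same length")
    n_escalated = sum(1 for e in escalated if e)
    if n_escalated == 0:
        return 0.0  # no escalations, so no false interventions
    false_escalations = sum(1 for e, w in zip(escalated, warranted) if e and not w)
    return false_escalations / n_escalated


# Usage with made-up numbers: four weekly tickets, two of them escalated.
losses = [5.0, 0.2, 3.1, 0.0]           # hypothetical realized losses
escalated = [True, True, False, False]  # hypothetical ticket decisions
warranted = warranted_from_loss(losses, threshold=1.0)
print(false_intervention_rate(escalated, warranted))  # 0.5
```

Under a different definition, for example dividing by the number of unwarranted cases instead of the number of escalations, the same tickets would yield a different figure, so the denominator should be checked against the paper before comparing numbers.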

Why it matters

The authors argue that general-purpose LLM agents can over-interpret weak signals and push for unnecessary interventions, and that existing evaluations do not provide regulator-aligned measures of false alarms. DeXposure-Claw addresses both gaps by grounding agent actions in forecasted network exposures and by introducing an evaluation metric that explicitly measures false-intervention rate. For supervisors, that combination aims to reduce inappropriate escalations while keeping a verifiable audit trail of why a ticket was opened.

What to watch

Look for external replication using the provided code and for others applying DeXposure-Bench's six-axis framework and decision-axis metrics to different datasets. The next concrete signal will be whether independent teams can reproduce the paper's results on the five years of weekly real data the authors cite and whether the false-intervention rate metric gains traction in regulator-aligned evaluations.

References and provenance: DeXposure-Claw: An Agentic System for DeFi Risk Supervision, Aijie Shu et al., arXiv:2606.19501, submitted 17 Jun 2026. The paper names DeXposure-FM, deterministic monitors and stress scenarios, data-health and confidence gates, DeXposure-Bench, a six-axis evaluation harness, a decision axis scoring against an absolute-loss ground truth, and experiments on five years of weekly real data. Code is linked in the paper.

Figure: DeXposure-Claw system architecture

Written by The Brieftide · Source: arXiv

