RIFT-Bench: Dynamic Red-teaming for Agentic AI Systems
A graph-driven methodology with automated Discovery and Scanning phases.
TL;DR
- 01A graph-driven methodology with automated Discovery and Scanning phases.
- 02It runs in two automated phases, Discovery and Scanning, and the authors demonstrate the pipeline across 45 agentic systems.
- 03RIFT-Bench is a unified evaluation framework that represents agentic AI systems with a novel hierarchical graph representation to enable comparison across heterogeneous architectures.
RIFT-Bench, presented on arXiv as arXiv:2606.23927 and submitted 22 Jun 2026 by Yarin Yerushalmi Levi and seven co-authors, is a graph representation-driven methodology for dynamic red-teaming of agentic AI systems. It runs in two automated phases, Discovery and Scanning, and the authors demonstrate the pipeline across 45 agentic systems.
What is RIFT-Bench?
RIFT-Bench is a unified evaluation framework that represents agentic AI systems with a novel hierarchical graph representation to enable comparison across heterogeneous architectures. The paper describes it as a methodology that extracts system structure and then evaluates the system itself using dynamically adaptable adversarial probes, rather than tying tests to a single implementation or domain.
The authors position RIFT-Bench as a way to move beyond security evaluations that are bound to particular deployments. Its hierarchical representation underpins automated analysis and allows the same pipeline to operate over varied agentic designs.
How does RIFT-Bench evaluate agentic systems?
RIFT-Bench operates in two automated phases: Discovery, which extracts the system structure, and Scanning, which deploys adaptive adversarial attacks and produces a comprehensive evaluation report. Discovery finds the elements and relationships in the target agentic architecture, and Scanning leverages a broad set of dynamically adaptable adversarial probes across diverse attack vectors and objectives to test the assembled graph.
The pipeline is described as producing a report on the examined system itself and also supporting direct evaluation of mitigation strategies. The approach, the authors write, generalizes effectively to heterogeneous agentic architectures, which they demonstrate by running RIFT-Bench on 45 agentic systems spanning a diverse range of implementations.
Why it matters
Agentic AI systems, powered by large language models, introduce attack vectors beyond those of traditional LLM vulnerabilities, and existing security evaluations are often tied to specific implementations or domains. RIFT-Bench addresses that gap by offering a single, graph-driven methodology that can both discover system structure and scan for adversarial weaknesses across different architectures. By including mitigation evaluation in the same pipeline, the method aims to shorten the path from vulnerability discovery to remediation in agentic contexts.
The demonstration across 45 systems is a concrete step toward scaled evaluation; it shows the authors tested the pipeline on a nontrivial set of heterogeneous implementations rather than a single reference agent.
What to watch
Whether independent teams reproduce RIFT-Bench on other agentic systems and extend its library of adversarial probes will be a key next signal. Also watch for published code, datasets, or evaluation artifacts tied to the arXiv entry that would enable broader community adoption and cross-study comparisons.
Paper details: the preprint is titled "RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems," arXiv:2606.23927, submitted 22 Jun 2026, authored by Yarin Yerushalmi Levi, Roy Betser, Amit Giloni, Lidor Erez, Itay Gershon, Oren Rachmil, Sindhu Padakandla, and Roman Vainshtein.
Written by The Brieftide · Source: arXiv
The Brieftide Daily · 06:00
Briefs like this one, in your inbox every morning.
Continue reading
More in Benchmarks & EvalsBIM-Edit: Benchmarking LLMs for IFC-based BIM Editing
BIM-Edit evaluates LLMs on 324 IFC editing tasks across 11 real models and 36 synthetic scenes; the top model averages 49.5%.
ORAgentBench benchmark: LLM agents on 107 OR tasks pass rates
ORAgentBench packages 107 execution-grounded operations research tasks; best agent passed 35.51% overall and 20.59% of hard tasks.
LLM Agents: Predictive Validity vs Static Leaderboards
Dhaval C. Patel et al. aggregate fourteen implementation studies and seven prior benchmarks and propose ranking by predictive validity.
SafeClawBench: benchmark separating semantic, audit, sandbox harm
A 600-task staged benchmark measures semantic acceptance, audit-visible evidence.