AgentFinVQA: Deployable Multi-Agent Chart QA, 71.24% on FinMME
A multi-agent pipeline that records every step into a Model Evaluation Packet.
TL;DR
- A multi-agent pipeline that records every step into a Model Evaluation Packet.
- Aravind Narayanan and Shaina Raza posted AgentFinVQA to arXiv on 18 June 2026.
- The pipeline produces an auditable trace of intermediate steps so practitioners can inspect how an answer was produced before acting on it.
Aravind Narayanan and Shaina Raza posted AgentFinVQA to arXiv on 18 June 2026. AgentFinVQA is a deployable multi-agent pipeline for financial chart question answering that records each step in a per-sample Model Evaluation Packet and achieves 71.24% exact accuracy on the FinMME benchmark with a proprietary backbone.
What is AgentFinVQA?
AgentFinVQA is a multi-agent pipeline that decomposes each chart question into planning, OCR, legend grounding, visual inspection and verification, and saves a traceable Model Evaluation Packet for every sample. The pipeline produces an auditable trace of intermediate steps so practitioners can inspect how an answer was produced before acting on it.
The authors describe the pipeline as suitable for regulated settings where institutions often cannot send client data to external model providers. The MEP captures the decisions and outputs of planning, extraction and verification stages so downstream reviewers or auditors can follow the exact steps that produced a response.
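To make the idea of a per-sample packet concrete, here is a minimal sketch of how a pipeline could log each stage into one record. The stage names follow the list above; the class names, field names and the `run_pipeline` helper are hypothetical and are not taken from the authors' released code.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Callable, Dict, List, Optional

# Stage order as described in the article; all identifiers are illustrative.
STAGES = ("planning", "ocr", "legend_grounding", "visual_inspection", "verification")


@dataclass
class StageRecord:
    stage: str
    output: Any


@dataclass
class ModelEvaluationPacket:
    """Hypothetical per-sample trace: one record per pipeline stage."""
    sample_id: str
    question: str
    steps: List[StageRecord] = field(default_factory=list)
    final_answer: Optional[str] = None

    def record(self, stage: str, output: Any) -> None:
        self.steps.append(StageRecord(stage, output))

    def to_json(self) -> str:
        # asdict() recurses into the nested StageRecord entries.
        return json.dumps(asdict(self), indent=2)


def run_pipeline(
    sample_id: str,
    question: str,
    agents: Dict[str, Callable[[str, Dict[str, Any]], Any]],
) -> ModelEvaluationPacket:
    """Run each stage in order and log its output into the packet."""
    mep = ModelEvaluationPacket(sample_id, question)
    context: Dict[str, Any] = {}
    for stage in STAGES:
        output = agents[stage](question, context)
        context[stage] = output  # later stages can read earlier outputs
        mep.record(stage, output)
    # Assumption: the verification stage's output is the final answer.
    mep.final_answer = str(context["verification"])
    return mep
```

Each agent here is just a callable taking the question and the outputs so far, so a reviewer reading the serialized packet sees the stages in the order they ran.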
How does it perform on benchmarks?
With the proprietary Gemini-3 Flash backbone, AgentFinVQA reaches 71.24% exact accuracy versus 63.56% for a zero-shot baseline on the same backbone, a gain of 7.68 percentage points (McNemar p ≈ 1.1 × 10^-16). With an open-weights backbone served locally, Qwen3.6-27B-FP8, the pipeline still gains 4.84 percentage points over its matched baseline.
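For readers unfamiliar with McNemar's test: it compares two systems on the same samples using only the discordant pairs, the samples where exactly one system is correct. The sketch below implements the exact two-sided (binomial) variant with the standard library. The article does not report the discordant counts or which variant the authors used, so the function takes the counts as inputs and does not attempt to reproduce the reported p-value.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: samples the baseline got right and the pipeline got wrong
    c: samples the pipeline got right and the baseline got wrong
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# The headline gain is a plain difference of the two reported accuracies.
gain_pp = round(71.24 - 63.56, 2)
```

Samples that both systems answer correctly, or both answer incorrectly, do not enter the statistic at all, which is why the test suits paired comparisons on a shared benchmark.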
The pipeline also includes a verifier whose verdict serves as a confidence signal: samples the verifier marked as confirmed reached 68.2% exact accuracy, while those marked as revised reached 55.6% exact accuracy. The authors propose routing confirmed answers to automated workflows and revised answers to human reviewers for a human-in-the-loop review process.
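The routing rule the authors propose can be sketched in a few lines. The verdict labels "confirmed" and "revised" come from the article; the queue names, the record layout and the choice to send any unexpected verdict to human review are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def route(verdict: str) -> str:
    """Map a verifier verdict to a (hypothetical) downstream queue."""
    if verdict == "confirmed":
        return "automated"
    # "revised" and any unrecognized verdict go to a person (assumption).
    return "human_review"


def triage(samples: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group sample ids by the queue their verdict routes to."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for sample in samples:
        queues[route(sample["verdict"])].append(sample["id"])
    return dict(queues)
```

Defaulting unknown verdicts to human review is the conservative choice: a malformed verifier output never reaches an automated workflow by accident.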
The paper’s error analysis finds that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures, and that these are the categories the verifier detects least often. The authors release their code to support reproducible evaluation on FinMME.
Why it matters
Financial chart QA in regulated environments requires more than raw accuracy: institutions need to know when to trust an answer, and they often must keep data on-premise. AgentFinVQA targets both constraints by combining stepwise audit logs (MEPs) with a configuration that can be deployed on-premise using open-weights models. The reported gains with both a proprietary backbone (+7.68 percentage points) and a locally served Qwen3.6-27B-FP8 (+4.84 percentage points) suggest that auditable, on-premise chart QA can retain a majority of the improvement while keeping data in-house.
The gap between confirmed and revised accuracy offers a practical triage lever. With 68.2% exact accuracy on confirmed answers versus 55.6% on revised answers, the system can reduce the review load by sending higher-confidence outputs directly into downstream processes and reserving human attention for the rest.
What to watch
Watch for follow-up work that reduces question misunderstanding, legend confusion and extraction errors and that raises the verifier’s detection rate for those failure modes. Also track community use of the released code on FinMME to see whether the open-weights Qwen3.6-27B-FP8 setup reproduces the reported +4.84 percentage point gain in varied, on-premise deployments.
Authors: Aravind Narayanan and Shaina Raza. Posted to arXiv on 18 June 2026. The paper and associated code are available for reproducible evaluation.
| Item | AgentFinVQA (Gemini-3 Flash) | Matched zero-shot baseline (Gemini-3 Flash) | AgentFinVQA (Qwen3.6-27B-FP8) |
|---|---|---|---|
| Exact accuracy (%) | 71.24 | 63.56 | not provided |
| Improvement over baseline (percentage points) | +7.68 | — | +4.84 |
| McNemar p-value (AgentFinVQA vs baseline) | ≈ 1.1 × 10^-16 | — | — |
| Verifier confirmed vs revised exact accuracy (%) | 68.2 (confirmed) / 55.6 (revised) | — | — |
| Code and reproducibility | Released | — | — |
Written by The Brieftide · Source: arXiv