
AgentFinVQA: Deployable Multi-Agent Chart QA, 71.24% on FinMME

A multi-agent pipeline that records every step into a Model Evaluation Packet.

The Brieftide

TL;DR

  • A multi-agent pipeline that records every step into a Model Evaluation Packet.
  • Aravind Narayanan and Shaina Raza posted AgentFinVQA to arXiv on 18 June 2026.
  • The pipeline produces an auditable trace of intermediate steps so practitioners can inspect how an answer was produced before acting on it.

Aravind Narayanan and Shaina Raza posted AgentFinVQA to arXiv on 18 June 2026. AgentFinVQA is a deployable multi-agent pipeline for financial chart question answering that records each step in a per-sample Model Evaluation Packet and achieves 71.24% exact accuracy on the FinMME benchmark with a proprietary backbone.

What is AgentFinVQA?

AgentFinVQA is a multi-agent pipeline that decomposes each chart question into planning, OCR, legend grounding, visual inspection and verification, and saves a traceable Model Evaluation Packet for every sample. The pipeline produces an auditable trace of intermediate steps so practitioners can inspect how an answer was produced before acting on it.

The authors describe the pipeline as suitable for regulated settings where institutions often cannot send client data to external model providers. The MEP captures the decisions and outputs of planning, extraction and verification stages so downstream reviewers or auditors can follow the exact steps that produced a response.
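To make the idea of a per-sample packet concrete, here is a minimal sketch of what such a record could look like. It is an illustration only: the class names, field names and JSON layout are assumptions, not the schema used in the paper. Only the stage names come from the description above.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class StageRecord:
    # One pipeline stage and what it produced (hypothetical fields).
    stage: str
    output: str


@dataclass
class ModelEvaluationPacket:
    # Hypothetical per-sample packet; the paper's real schema may differ.
    sample_id: str
    question: str
    stages: list = field(default_factory=list)
    final_answer: str = ""
    verifier_verdict: str = ""  # e.g. "confirmed" or "revised"

    def record(self, stage: str, output: str) -> None:
        # Append in call order so the trace mirrors execution order.
        self.stages.append(StageRecord(stage, output))

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, so stages serialize too.
        return json.dumps(asdict(self), indent=2)


mep = ModelEvaluationPacket(sample_id="demo-001", question="placeholder question")
for name in ["planning", "ocr", "legend_grounding", "visual_inspection", "verification"]:
    mep.record(name, f"<{name} output placeholder>")
mep.final_answer = "placeholder answer"
mep.verifier_verdict = "confirmed"
print(mep.to_json())
```

Storing one such JSON document per sample would let a reviewer replay the stages in order without re-running the model.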

How does it perform on benchmarks?

With the proprietary Gemini-3 Flash backbone, AgentFinVQA reaches 71.24% exact accuracy versus 63.56% for a zero-shot baseline on the same backbone, a gain of 7.68 percentage points (McNemar p ≈ 1.1 × 10^-16). With an open-weights backbone served locally, Qwen3.6-27B-FP8, the pipeline still gains 4.84 percentage points over its matched baseline.
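For readers who want to check paired comparisons like this on their own runs, the sketch below shows the percentage-point arithmetic and a two-sided exact McNemar test. The article does not report the discordant-pair counts behind the p-value, so the counts in the example are made up, and the paper may have used a different variant of the test.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: samples only the pipeline answered correctly
    c: samples only the baseline answered correctly
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Percentage-point gain from the two reported accuracies.
gain_pp = round(71.24 - 63.56, 2)
print(gain_pp)

# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(10, 2))
```

Only samples where the two systems disagree enter the test, which is why a paired benchmark can yield a very small p-value even for a single-digit accuracy gain.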

The pipeline also includes a verifier whose verdict serves as a confidence signal: samples the verifier marked as confirmed reached 68.2% exact accuracy, while those marked as revised reached 55.6% exact accuracy. The authors propose routing confirmed answers to automated workflows and revised answers to human reviewers for a human-in-the-loop review process.
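The routing rule the authors propose can be sketched in a few lines. The verdict labels "confirmed" and "revised" come from the article. The queue names and the choice to send any other verdict to a human are assumptions made for this example.

```python
def route(verdict: str) -> str:
    """Map a verifier verdict to a destination queue (queue names are hypothetical)."""
    if verdict == "confirmed":
        return "automated_workflow"
    # "revised" and any unrecognised verdict fall back to human review,
    # a conservative default chosen for this sketch.
    return "human_review"


for v in ["confirmed", "revised", "unknown"]:
    print(v, "->", route(v))
```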

The paper’s error analysis finds that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures, and that these are the categories the verifier detects least often. The authors release their code to support reproducible evaluation on FinMME.

Why it matters

Financial chart QA in regulated environments requires more than raw accuracy: institutions need to know when to trust an answer, and they often must keep data on-premise. AgentFinVQA targets both constraints by combining stepwise audit logs (MEPs) with an on-premise deployable configuration using open-weights models. The reported gains with both a proprietary backbone and a locally served Qwen3.6-27B-FP8 suggest that auditable, on-premise chart QA can preserve most of the performance improvement while keeping data in-house.

The gap between the verifier’s confirmed and revised accuracy offers a practical triage lever. With 68.2% exact accuracy on confirmed answers versus 55.6% on revised answers, a 12.6-point gap, the system can reduce review load by sending confirmed outputs directly into downstream processes and reserving human attention for revised ones.
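How much review load such triage saves depends on what share of answers the verifier confirms, a figure the article does not give. The sketch below shows one way to compute per-verdict share and accuracy from evaluation records. The record format and the toy data are invented for the example.

```python
from collections import defaultdict


def triage_summary(records):
    """Summarise (verdict, is_correct) pairs into share and accuracy per verdict."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for verdict, ok in records:
        totals[verdict] += 1
        correct[verdict] += int(ok)
    n = sum(totals.values())
    return {
        v: {"share": totals[v] / n, "accuracy": correct[v] / totals[v]}
        for v in totals
    }


# Toy data, not results from the paper.
toy = (
    [("confirmed", True)] * 3
    + [("confirmed", False)]
    + [("revised", True)]
    + [("revised", False)]
)
summary = triage_summary(toy)
for verdict, stats in summary.items():
    print(verdict, stats)
```

The share of "revised" samples is the fraction that would reach human reviewers under the proposed routing.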

What to watch

Watch for follow-up work that reduces question misunderstanding, legend confusion and extraction errors and that raises the verifier’s detection rate for those failure modes. Also track community use of the released code on FinMME to see whether the open-weights Qwen3.6-27B-FP8 setup reproduces the reported +4.84 percentage point gain in varied, on-premise deployments.

Authors: Aravind Narayanan and Shaina Raza. Posted to arXiv on 18 June 2026. The paper and associated code are available for reproducible evaluation.

AgentFinVQA benchmark and verification numbers from the paper

Exact accuracy (%): 71.24 (AgentFinVQA, Gemini-3 Flash); 63.56 (matched zero-shot baseline); not provided (Qwen3.6-27B-FP8)
Improvement over baseline (percentage points): +7.68 (Gemini-3 Flash); +4.84 (Qwen3.6-27B-FP8)
McNemar p-value (AgentFinVQA vs baseline): ≈ 1.1 × 10^-16
Verifier confirmed vs revised exact accuracy (%): 68.2 (confirmed) / 55.6 (revised)
Code and reproducibility: Released

Written by The Brieftide · Source: arXiv
