InvestPhilBench v0.6: Benchmark for LLM Investment Procedure
v0.6 supplies 118 verified investment principle cards, 25 framework cards and 243 QA items plus an automated scoring suite called BASP.
TL;DR
- v0.6 supplies 118 verified investment principle cards, 25 framework cards and 243 QA items plus an automated scoring suite called BASP.
- InvestPhilBench is a multi-layer dynamic benchmark designed to test procedural reasoning across eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8).
- The authors present the benchmark as primarily a methodology and dataset contribution in this release.
InvestPhilBench v0.6, published on 24 Jun 2026 by Mingguang Chen and Bo Qu, delivers a multi-layer benchmark and tooling to evaluate how well large language models reconstruct and apply expert investment decision procedures. The release includes 118 primary-source-verified investment principle cards, 25 decision framework cards with explicit topology metadata, and 243 QA questions (197 development / 46 held-out test), and it ships with code, data, and evaluation tools.
What is InvestPhilBench and what does v0.6 include?
InvestPhilBench is a multi-layer dynamic benchmark designed to test procedural reasoning across eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8). The v0.6 release bundles 118 verified principle cards, 25 decision framework cards, and 243 QA questions split into 197 dev and 46 held-out test items, plus a Benchmark Automated Scoring Pipeline and failure-mode tooling.
The authors present the benchmark as primarily a methodology and dataset contribution in this release. The paper runs a preliminary empirical study and notes that benchmark, data, and code are available, while a de-confounded multi-model leaderboard and a full three-condition run are scheduled as v1.0 deliverables.
How does the scoring and diagnostics work?
InvestPhilBench evaluates models with the Benchmark Automated Scoring Pipeline (BASP), which combines five algorithmic metrics: OGRS, KCCS, SAP@k, IVP, and CKCA. BASP is paired with a Failure Mode Detection Protocol that encodes computable rules for six failure modes. For gate-level reasoning, the suite also exposes Gate Reconstruction Accuracy (GRA), a per-gate metric for questions that have gold reasoning programs.
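To make the idea of a per-gate metric concrete, here is a minimal sketch of how a gate-level accuracy could be computed. It is an illustration only, not the paper's definition of Gate Reconstruction Accuracy: the representation of a gold reasoning program as a mapping from gate name to expected outcome, the gate names, and the function `gate_reconstruction_accuracy` are all assumptions made for this example.

```python
from typing import Dict


def gate_reconstruction_accuracy(gold: Dict[str, str],
                                 predicted: Dict[str, str]) -> float:
    """Fraction of gold gates whose predicted outcome matches the gold outcome.

    Hypothetical formulation: a gate missing from the prediction counts as wrong.
    """
    if not gold:
        raise ValueError("gold program has no gates")
    correct = sum(
        1 for gate, outcome in gold.items() if predicted.get(gate) == outcome
    )
    return correct / len(gold)


# Hypothetical gold reasoning program with four decision gates.
gold_program = {
    "moat_check": "pass",
    "valuation_check": "fail",
    "leverage_check": "pass",
    "management_check": "pass",
}

# Hypothetical model reconstruction: one wrong outcome, one gate skipped.
model_program = {
    "moat_check": "pass",
    "valuation_check": "pass",
    "leverage_check": "pass",
}

gra = gate_reconstruction_accuracy(gold_program, model_program)
print(f"GRA = {gra:.2f}")
```

Under this assumed formulation, an answer can read fluently and still lose credit for every gate it skips or gets wrong, which is the kind of gap a per-gate score is meant to surface.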
The paper reports concrete results from a sanity wave on 188 development questions: BASP composite scores split sharply by provider tier (0.906 versus 0.438), and the frontier composite reaches 0.932 for Claude at L4. Gate Reconstruction Accuracy (GRA), by contrast, still shows a procedural gap: frontier GRA is approximately 0.77 at L4 and ranges between 0.57 and 0.62 at L7. On a 100-item expert-annotated gold set, the BASP composite tracks the human reference at Pearson r = 0.72 with MAE = 0.10. The SAP@3 sub-metric shows the weakest agreement, and the failure-mode detector is sensitive but over-flags.
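The two agreement statistics quoted for the gold set, Pearson r and mean absolute error (MAE), can be computed as in the sketch below. The score lists are invented placeholders, not data from the paper, and the helper names `pearson_r` and `mean_absolute_error` are chosen for this example.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(x: Sequence[float], y: Sequence[float]) -> float:
    """Average absolute difference between paired scores."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Placeholder scores for illustration only (not from the paper).
automated_scores = [0.90, 0.40, 0.75, 0.60]
human_scores = [0.80, 0.50, 0.70, 0.70]

r = pearson_r(automated_scores, human_scores)
mae = mean_absolute_error(automated_scores, human_scores)
print(f"Pearson r = {r:.2f}, MAE = {mae:.4f}")
```

The two numbers answer different questions: r measures whether the automated score ranks items the way the human reference does, while MAE measures how far the scores sit from the reference on the same scale, so a metric can correlate well and still be systematically offset.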
Why does this matter?
InvestPhilBench exposes a divergence between fluent, persuasive model outputs and true procedural fidelity. The BASP composite can saturate near the frontier, as seen with Claude L4 at 0.932, while Gate Reconstruction Accuracy shows lower numbers (L4 ~0.77, L7 0.57 to 0.62), indicating that high-scoring prose can mask procedural errors. That gap matters for anyone relying on LLMs as investment research assistants because procedural missteps, not just surface fluency, drive incorrect decisions in rule-based or framework-driven domains.
What are the limits and next steps?
v0.6 is explicitly a benchmark-and-methodology release. The authors flag that the four-model sanity wave and the provider-tier split produce mixed-judge numbers and confounded upper bounds. The paper lists v1.0 deliverables: a de-confounded multi-model leaderboard and a full three-condition run that will offer a cleaner, comparative evaluation across models.
What to watch
Watch for the v1.0 release, which the authors identify as the milestone that will provide the de-confounded multi-model leaderboard and the full three-condition run. Those runs will test whether the BASP composite and GRA converge when judged under unified, de-confounded conditions.
References and provenance: the paper is arXiv:2606.25984, submitted 24 Jun 2026; it runs 57 pages with 6 figures and 26 tables and lists benchmark, data, and code as released material.
Written by The Brieftide · Source: arXiv