Verifiable Agent Framework for Open-Web Data Collection
Bo Chen's framework has the LLM emit typed JSON collector configurations instead of scraper code and, in experiments on 138 tasks, yields a reusable, deterministic, and verifiable execution path.
TL;DR
- Bo Chen's framework has the LLM emit typed JSON collector configurations instead of scraper code and, in experiments on 138 tasks, yields a reusable, deterministic, and verifiable execution path.
- Bo Chen proposes a constrained, verifiable agent framework for open-web data collection in a paper submitted 25 Jun 2026.
- The framework shifts LLM output from free-form code to typed JSON collector configurations and enforces constraints across template, utility, source, field, and execution layers.
Bo Chen proposes a constrained, verifiable agent framework for open-web data collection in a paper submitted 25 Jun 2026. The framework replaces free-form LLM-generated scrapers with typed JSON collector configurations, adds multiple execution and quality constraints, and targets repeated scheduled collection with deterministic runs.
What does the framework do?
The framework shifts LLM output from free-form code to typed JSON collector configurations and enforces constraints across template, utility, source, field, and execution layers. It defines a six-type collector taxonomy, applies template and utility-function constraints, executes collectors as static Airflow DAGs, performs rule-based quality checking, and uses structured feedback correction to refine collectors before reuse.
The core design choice is to trade hand-written or free-form LLM-generated code for constrained, machine-verifiable artifacts. The constraints cover source, field, and execution concerns that go beyond the initial natural-language description.
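To make the idea of a typed, machine-verifiable artifact concrete, here is a minimal Python sketch of validating an LLM-produced JSON collector configuration before it is accepted. Everything in it is an illustrative assumption rather than the paper's schema: the key names (collector_type, source_url, fields, schedule), the field types, and the three collector type names are hypothetical placeholders, since this brief does not list the paper's six types.

```python
import json
from dataclasses import dataclass

# Hypothetical placeholder names; the paper's actual six-type taxonomy is not listed here.
ALLOWED_TYPES = {"static_page", "paginated_list", "json_api"}
# Assumed set of field types for illustration.
ALLOWED_FIELD_TYPES = {"string", "number", "date"}


@dataclass(frozen=True)
class FieldSpec:
    name: str
    selector: str
    type: str


@dataclass(frozen=True)
class CollectorConfig:
    collector_type: str
    source_url: str
    fields: tuple
    schedule: str


def parse_collector(raw: str) -> CollectorConfig:
    """Parse a JSON configuration and reject anything outside the allowed shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("configuration must be a JSON object")

    errors = []
    # Type-level constraint: only known collector types are accepted.
    if data.get("collector_type") not in ALLOWED_TYPES:
        errors.append("collector_type: unknown type")
    # Source-level constraint: the source must be an http(s) URL.
    if not str(data.get("source_url", "")).startswith(("http://", "https://")):
        errors.append("source_url: must be an http(s) URL")
    # Field-level constraints: every field needs a name, a selector, and a known type.
    raw_fields = data.get("fields") or []
    if not raw_fields:
        errors.append("fields: at least one field is required")
    for i, field in enumerate(raw_fields):
        if not isinstance(field, dict):
            errors.append(f"fields[{i}]: must be an object")
            continue
        missing = [key for key in ("name", "selector", "type") if not field.get(key)]
        if missing:
            errors.append(f"fields[{i}]: missing {', '.join(missing)}")
        elif field["type"] not in ALLOWED_FIELD_TYPES:
            errors.append(f"fields[{i}].type: unsupported type {field['type']!r}")
    # Execution-level constraint: a schedule must be stated explicitly.
    if not data.get("schedule"):
        errors.append("schedule: required")

    if errors:
        raise ValueError("; ".join(errors))
    return CollectorConfig(
        collector_type=data["collector_type"],
        source_url=data["source_url"],
        fields=tuple(FieldSpec(f["name"], f["selector"], f["type"]) for f in raw_fields),
        schedule=data["schedule"],
    )
```

The point of the sketch is that a configuration either passes every check or is rejected with a list of specific faults, which is harder to guarantee for free-form generated code.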
How was it evaluated and what were the results?
The paper reports experiments on 138 tasks that tested the taxonomy and configuration approach and a focused evaluation on 80 independently source-verified tasks that measured runtime characteristics. Across those 80 source-verified tasks the framework ran with zero execution-stage LLM tokens and achieved the lowest average wall-clock time, according to the paper.
The experiments confirmed that the six-type taxonomy supports description-based requirement typing and that stable instantiation requires completing source, field, and execution constraints beyond the initial description. The authors characterize the result as a tradeoff: moderate one-shot quality in exchange for a reusable, deterministic, and verifiable execution path suited to repeated scheduled collection.
Why did the authors constrain outputs instead of generating code?
Constrained, typed configurations allow deterministic execution and verifiability where free-form code fails because of dependency errors, broken selectors, schema mismatches, and heterogeneous page structures. The paper identifies these as common reasons that web scrapers generated directly from natural-language requirements fail. By moving to typed JSON collector configurations and static Airflow DAG execution, the framework eliminates execution-stage LLM tokens and produces a repeatable execution path.
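As a rough illustration of what an execution path with no model call can look like, the following Python sketch runs a fixed configuration against page text using only the standard library. It is not the paper's executor: the config keys row_pattern and fields, the regex-based extraction, and the CASTERS table are all assumptions made for this example, and the real framework runs collectors as Airflow DAGs.

```python
import re

# Assumed mapping from declared field types to Python converters.
CASTERS = {"string": str, "number": float}


def run_collector(config: dict, page_text: str) -> list:
    """Extract typed records from page text using only the stored configuration."""
    pattern = re.compile(config["row_pattern"])
    records = []
    for match in pattern.finditer(page_text):
        record = {}
        for name, type_name in config["fields"].items():
            # Each declared field must be a named group in the row pattern.
            record[name] = CASTERS[type_name](match.group(name))
        records.append(record)
    return records


# Hypothetical configuration and page content.
config = {
    "row_pattern": r"(?P<name>\w+): \$(?P<price>[\d.]+)",
    "fields": {"name": "string", "price": "number"},
}
page = "Widget: $3.50\nGadget: $12.00\n"
records = run_collector(config, page)
```

Because the function depends only on the configuration and the page text, running it again on the same inputs gives the same records, which is the property that makes scheduled reruns cheap to audit.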
The framework also layers rule-based quality checks and structured feedback correction so that configuration faults can be detected and corrected before running at scale. That combination aims to reduce the operational cost and variability associated with one-off LLM-generated scrapers.
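One way to picture rule-based checking with structured feedback is sketched below in Python. The specific rules (a non-empty result and a minimum fill rate per required field), the 0.8 threshold, and the Feedback record shape are assumptions chosen for illustration; the brief does not describe the paper's actual rules or feedback format.

```python
from dataclasses import dataclass, asdict


@dataclass
class Feedback:
    field: str   # which field the rule applies to, or "*" for the whole result
    rule: str    # name of the rule that failed
    detail: str  # human- and machine-readable explanation


def check_records(records, required, min_fill_rate=0.8):
    """Apply simple rules to collected records and return structured feedback."""
    if not records:
        return [Feedback("*", "non_empty", "collector returned no records")]
    feedback = []
    for name in required:
        # Count records where the field is present and not blank.
        filled = sum(1 for r in records if r.get(name) not in (None, ""))
        if filled / len(records) < min_fill_rate:
            feedback.append(
                Feedback(name, "fill_rate", f"{filled}/{len(records)} records have a value")
            )
    return feedback


# Hypothetical collected records: "price" is mostly missing.
rows = [{"title": "A", "price": "3"}, {"title": "B", "price": ""}, {"title": "C"}]
issues = check_records(rows, required=["title", "price"])
payload = [asdict(item) for item in issues]
```

Returning records rather than free text means the result can be serialized, as in payload above, and passed back to whatever step revises the configuration.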
Why it matters
Verifiable, deterministic collection matters when data pipelines run on schedules and require auditability. The framework provides a path to reduce runtime uncertainty by producing typed artifacts and static DAGs rather than code that must be repaired at runtime. The experiment on 80 independently source-verified tasks, in which runs used zero execution-stage LLM tokens and the system achieved the lowest average wall-clock time, illustrates the framework's potential to lower recurring cost and increase reliability for repeated open-web data collection.
What to watch
Look for follow-up work or released artifacts that apply the six-type collector taxonomy and the template and utility-function constraints to broader site families or more diverse page structures. Adoption signals would include public code, Airflow DAG examples, or replication of the reported 80-task zero execution-stage LLM token runs.
Additional details: the paper is 15 pages with one figure, is available as arXiv:2607.00035, and the DOI link is https://doi.org/10.48550/arXiv.2607.00035. The author is Bo Chen.
Written by The Brieftide · Source: arXiv