Datasette Agent: improving SQL system prompts with DSPy
Simon Willison used DSPy and Claude Fable 5 on 2nd July 2026 to evaluate and refine Datasette Agent's read-only SQL system prompts.
TL;DR
- Datasette Agent has been evaluated and tweaked using the DSPy framework to improve its read-only SQL system prompts, Simon Willison wrote on 2nd July 2026.
- The project runs DSPy agents against an in-process Datasette and scores their behaviour against a gold-standard, auto-generated dataset using custom metrics.
How did Willison use DSPy to test Datasette Agent?
Willison used a DSPy-based harness in which DSPy agents invoke Datasette Agent’s actual tool implementations and prompts against a live in-process Datasette. The agents' results are then scored against an auto-generated gold-standard dataset. To launch the experiment, he installed the latest Datasette alpha, datasette-agent, and dspy, then ran an asynchronous research task in Claude Code for web using Claude Fable 5.
The harness executes the real prompts and tools rather than mocks, and collects traces that are scored with bespoke metrics. Willison notes that because DSPy runs the same prompts and tool calls Datasette Agent uses in production, the results can be compared directly with the gold-standard answers.
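The source does not show the bespoke metrics themselves. As a rough illustration only, the sketch below scores a predicted SQL query by comparing its rows with those of a gold-standard query. It follows the (example, prediction, trace=None) shape that DSPy metric functions conventionally take, but everything else is an assumption made for this example: the names result_match_metric and run_query, the gold_sql and sql fields, the books table, and the use of plain sqlite3 and SimpleNamespace objects in place of a real Datasette instance and DSPy objects.

```python
import sqlite3
from types import SimpleNamespace


def run_query(conn, sql):
    """Run a query and return its rows sorted for order-insensitive comparison.

    Returns None if SQLite rejects the query (for example, an unknown column).
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)


def result_match_metric(example, prediction, trace=None):
    """Hypothetical metric: 1.0 if the predicted SQL returns the gold rows, else 0.0."""
    gold = run_query(example.conn, example.gold_sql)
    predicted = run_query(example.conn, prediction.sql)
    if gold is None or predicted is None:
        return 0.0
    return 1.0 if gold == predicted else 0.0


# Hypothetical demo database standing in for the in-process Datasette.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER)")
demo_conn.executemany(
    "INSERT INTO books (title, pages) VALUES (?, ?)",
    [("A", 120), ("B", 340), ("C", 90)],
)

example = SimpleNamespace(
    conn=demo_conn,
    question="How many books have more than 100 pages?",
    gold_sql="SELECT count(*) FROM books WHERE pages > 100",
)
# Different SQL text, same answer: should score 1.0.
good = SimpleNamespace(sql="SELECT count(*) FROM books WHERE pages >= 101")
# Guessed column name that does not exist: should score 0.0.
bad = SimpleNamespace(sql="SELECT count(*) FROM books WHERE page_count > 100")

good_score = result_match_metric(example, good)
bad_score = result_match_metric(example, bad)
```

Comparing result rows rather than SQL text means two differently written but equivalent queries both pass, which is one plausible reason to prefer a custom metric over string matching.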
What concrete changes did DSPy and Fable identify?
DSPy testing (orchestrated by Claude Fable 5) exercised models including GPT-4.1 mini and GPT-4.1 nano and surfaced specific prompt weaknesses to address. One concrete example: the prompt’s schema listing gave only table names, and the advice "don't call describe_table if you already have the information" led models to guess column names (page_count, o.order_id, first_name) and fall into error-retry loops in the baseline traces.
Fable recommended either including column names in the prompt's schema listing or softening that advice. Willison describes this as one of several promising directions uncovered by the experiments, and he highlights the schema-listing issue as particularly useful feedback for improving the system prompt.
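To make the first recommendation concrete, here is a minimal sketch of a schema listing that includes column names, built with SQLite's PRAGMA table_info. It is not Datasette Agent's actual prompt-building code, which the source does not show; the function name schema_listing, the output format, and the demo tables are all assumptions for illustration.

```python
import sqlite3


def schema_listing(conn, include_columns=True):
    """Hypothetical helper: list tables, optionally with column names and types."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    lines = []
    for table in tables:
        if not include_columns:
            lines.append(table)
            continue
        # Escape embedded double quotes so the identifier is quoted safely.
        quoted = table.replace('"', '""')
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        column_text = ", ".join(f"{col[1]} {col[2]}".strip() for col in columns)
        lines.append(f"{table}({column_text})")
    return "\n".join(lines)


# Hypothetical demo schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

names_only = schema_listing(conn, include_columns=False)
with_columns = schema_listing(conn, include_columns=True)
```

With only names_only in the prompt, a model told to avoid describe_table has nothing to go on except guesses such as page_count; with_columns gives it the real name, pages, at the cost of a longer prompt.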
Why does this matter?
Improving system prompts reduces unnecessary tool calls, spurious column-name guesses, and error-retry loops that waste model tokens and produce incorrect answers. By running DSPy agents against the real Datasette Agent implementation and scoring them against a generated gold-standard dataset, this work gives prompt authors a repeatable way to find and fix practical failures in SQL question answering. That should make read-only SQL responses more reliable and predictable for end users.
What did the experiment actually run?
Willison writes that Fable chose to test using GPT-4.1 mini and GPT-4.1 nano, and that those runs led to the specific prompt-change suggestions above. The test workflow combined the DSPy harness, the datasette-agent implementation, and a live in-process Datasette, producing traces that revealed behaviours such as column guessing and retry loops.
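The source does not describe the trace format, so the following is only a sketch of how column guessing and retry loops might be counted. It assumes a hypothetical trace shape (a list of dicts with tool, sql, and error keys) and a hypothetical tool name, execute_sql. The error strings are produced by running guessed queries against a real in-memory SQLite database rather than written by hand.

```python
import sqlite3


def record_sql_call(conn, sql, trace):
    """Run a query and append a trace step that records any SQLite error message."""
    step = {"tool": "execute_sql", "sql": sql, "error": None}  # hypothetical shape
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        step["error"] = str(exc)
    trace.append(step)
    return step


def retry_stats(trace):
    """Count unknown-column errors and the longest run of consecutive failed calls."""
    bad_column_guesses = 0
    longest_error_run = 0
    current_run = 0
    for step in trace:
        if step["error"]:
            current_run += 1
            longest_error_run = max(longest_error_run, current_run)
            if "no such column" in step["error"]:
                bad_column_guesses += 1
        else:
            current_run = 0
    return {
        "calls": len(trace),
        "bad_column_guesses": bad_column_guesses,
        "longest_error_run": longest_error_run,
    }


# Hypothetical demo: two wrong guesses before the real column name is used.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER)")

trace = []
for sql in [
    "SELECT title FROM books WHERE page_count > 100",
    "SELECT title FROM books WHERE num_pages > 100",
    "SELECT title FROM books WHERE pages > 100",
]:
    record_sql_call(conn, sql, trace)

stats = retry_stats(trace)
```

Numbers like these could be compared across prompt variants, for example before and after adding column names to the schema listing.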
What to watch
Watch the datasette-agent project for updates to the system prompt, particularly changes that add column names to schema listings or alter the guidance about calling describe_table. Also look for follow-up posts or experiments using DSPy and Claude Fable 5 that apply the same harness to other agent behaviours or prompt variants.
Written by The Brieftide · Source: Simon Willison