Probably raises $9M seed to build more reliable AI data tool
Probably secured $9 million from Andreessen Horowitz to build a validator-backed data science tool that aims to stop LLM hallucinations.
TL;DR
- Probably secured $9 million from Andreessen Horowitz to build a validator-backed data science tool that aims to stop LLM hallucinations.
- The company says its system aims for the kind of 99.99% accuracy common in deterministic systems and uses a harness that checks model answers against source data.
- The founder, Peter Elias, describes the arrangement as a "data science mech suit," where the validator bounces back mismatches and the LLM learns to produce answers that pass the check.
Probably raised $9 million in seed funding from Andreessen Horowitz to build a validator-backed data science tool designed to prevent LLM hallucinations and factual errors. The company says its system targets the 99.99% accuracy typical of deterministic systems, using a harness that checks model answers against source data.
How does Probably’s system work?
Probably runs an LLM first pass, then routes that output through a deterministic validator that rejects any answer that does not match the dataset. It also trains the model against the validator so the whole loop becomes more reliable. The founder, Peter Elias, describes the arrangement as a "data science mech suit," where the validator bounces back mismatches and the LLM learns to produce answers that pass the check.
The company pairs that harness with an audit trail and citations for every result. The validator is designed to be deterministic, so when the model’s answer does not match the dataset, the system rejects it rather than delivering a potentially hallucinated response to the user. Probably optimizes the combined system for speed and accuracy, rather than relying on a single, very large model to be correct on its own.
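To make the loop concrete, here is a minimal Python sketch of a generate-then-validate harness. It is an illustration under stated assumptions, not Probably's implementation: the `SALES` dataset, the `flaky_model` stand-in for an LLM, and every function name are hypothetical. The toy validator simply recomputes the answer from the data, which a real system would presumably not do so directly, and the step where the model is trained against the validator is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical toy dataset standing in for the "source data".
SALES = [
    {"region": "north", "revenue": 120},
    {"region": "south", "revenue": 80},
    {"region": "north", "revenue": 50},
]


@dataclass
class Result:
    answer: Optional[int]      # None means every candidate was rejected
    citation: Optional[str]    # which rows support the accepted answer
    audit_trail: list = field(default_factory=list)


def ground_truth(region: str) -> int:
    """Deterministic computation: the same input always gives the same output."""
    return sum(row["revenue"] for row in SALES if row["region"] == region)


def validate(region: str, candidate: int) -> bool:
    """Accept a candidate only if it matches the dataset exactly."""
    return candidate == ground_truth(region)


def answer_with_validation(
    region: str,
    model: Callable[[str, int], int],
    max_attempts: int = 3,
) -> Result:
    """Ask the model, check each candidate, and log every attempt."""
    trail = []
    for attempt in range(1, max_attempts + 1):
        candidate = model(region, attempt)
        accepted = validate(region, candidate)
        trail.append(
            {"attempt": attempt, "candidate": candidate, "accepted": accepted}
        )
        if accepted:
            rows = [i for i, row in enumerate(SALES) if row["region"] == region]
            return Result(candidate, f"SALES rows {rows}", trail)
    # Nothing passed the check, so no answer is delivered to the user.
    return Result(None, None, trail)


def flaky_model(region: str, attempt: int) -> int:
    """Stand-in for an LLM: wrong on the first attempt, correct afterwards."""
    truth = ground_truth(region)
    return truth + 7 if attempt == 1 else truth


result = answer_with_validation("north", flaky_model)
print(result.answer, result.citation, len(result.audit_trail))
```

In this sketch the first candidate is bounced, the second passes, and the rejected attempt stays in the audit trail. A model that never produces a matching answer yields no answer at all rather than an unverified one.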
What is the product and where does it run?
Probably’s first product is a data science tool that produces quick answers from complex datasets, with each result accompanied by a citation and an audit trail. The startup says the tool runs on a model that is "four classes weaker than the frontier models," which lets it operate on local hardware such as a desktop instead of requiring a data center.
Running on smaller models reduces token costs because the system does not need the most powerful models to pass the validator. That design choice, Elias says, stems from harness engineering: "What we learned building this was that the better your harness engineering is, the weaker the model can be." Probably also says the same engine can extend beyond data science to precision-sensitive use cases like accounting or medical services.
Why does this matter?
Reducing hallucinations matters because errors persist even in advanced LLMs and can be costly in precision-sensitive workflows. Probably is targeting the 99.99% accuracy bar associated with deterministic systems, and it is pursuing that bar by changing the system architecture rather than only increasing model size. If the approach scales, customers could cut both error rates and the token costs that come with repeatedly correcting model outputs.
The company’s emphasis on a validator-trained loop highlights a design trade-off many teams face: invest in bigger models or invest in engineering that constrains and verifies outputs. Peter Elias argues the latter lets organizations run on smaller, cheaper models while keeping mistakes out of user-facing answers.
What to watch
Watch whether Probably can demonstrate sustained 99.99% accuracy in real-world deployments beyond controlled datasets, and whether customers adopt local deployments to reduce token costs. Also watch whether major AI labs attempt similar harness engineering; Elias says he finds it "really interesting that the big AI labs have not even attempted to do this," suggesting a divergence in incentives between model makers and teams prioritizing precision.
Written by The Brieftide · Source: TechCrunch