Bridgewater's fine-tuned Qwen3-235B outperforms GPT, Claude
Bridgewater and Thinking Machines fine-tuned Qwen3-235B to 84.7% accuracy in internal tests and say it runs nearly 14 times cheaper than the best frontier model they tested.
TL;DR
- Bridgewater and Thinking Machines fine-tuned Qwen3-235B to 84.7% accuracy in internal tests and say it runs nearly 14 times cheaper than the best frontier model they tested.
- They trained an open model on proprietary, expert-labeled examples and used a staged labeling loop to reduce expert time.
- Where the model and original label disagreed, Bridgewater investors corrected only those disputed cases, concentrating expensive expert effort on the highest-value examples.
Bridgewater and Thinking Machines Lab fine-tuned an open-weight model, Qwen3-235B, and in the teams' internal evaluation the model reached 84.7 percent accuracy on six investor-oriented financial-document tasks while costing nearly 14 times less to run than the best frontier model they tested.
How did they get a model to understand investors' judgment?
They trained an open model on proprietary, expert-labeled examples and used a staged labeling loop to reduce expert time. The team built the fine-tune on top of Qwen3-235B using the Tinker platform, first collecting labels from outside contractors, then training a first model on those imperfect labels and having that model re-evaluate the same documents. Where the model and original label disagreed, Bridgewater investors corrected only those disputed cases, concentrating expensive expert effort on the highest-value examples.
The report says that training with those corrected examples produced the 84.7 percent accuracy figure, a step above the 78.2 percent accuracy the authors measured for the best tested frontier model.
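The staged labeling loop described above can be sketched roughly as follows. This is a minimal illustration, not the teams' pipeline: the `Example` class, `find_disputed`, `build_training_set`, and the toy `predict` and `expert_review` callables are all hypothetical names, and the sketch assumes "disagreement" means the first-pass model's label differs from the contractor's label.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    doc_id: str
    text: str
    contractor_label: str  # first-pass label from outside contractors


def find_disputed(examples: List[Example],
                  predict: Callable[[str], str]) -> List[Example]:
    """Return examples where the first-pass model disagrees with the contractor."""
    return [ex for ex in examples if predict(ex.text) != ex.contractor_label]


def build_training_set(examples: List[Example],
                       disputed: List[Example],
                       expert_review: Callable[[Example], str]) -> List[Tuple[str, str]]:
    """Use expert labels for disputed cases and contractor labels everywhere else."""
    disputed_ids = {ex.doc_id for ex in disputed}
    training_set = []
    for ex in examples:
        if ex.doc_id in disputed_ids:
            label = expert_review(ex)    # expensive expert time, disputed cases only
        else:
            label = ex.contractor_label  # model and contractor agree: keep the label
        training_set.append((ex.text, label))
    return training_set


# Toy stand-ins (assumptions): a real setup would call the first-pass
# fine-tuned model and route disputed documents to human reviewers.
def toy_predict(text: str) -> str:
    return "relevant and interesting" if "rate" in text else "irrelevant"


def toy_expert(ex: Example) -> str:
    return "irrelevant"


examples = [
    Example("d1", "Central bank hints at rate cut", "relevant and interesting"),
    Example("d2", "Company picnic announced", "relevant but uninteresting"),
    Example("d3", "Sports scores roundup", "irrelevant"),
]

disputed = find_disputed(examples, toy_predict)
training_set = build_training_set(examples, disputed, toy_expert)
print(f"{len(disputed)} of {len(examples)} examples sent to experts")
```

In this toy run only one of three documents reaches the expert, which is the point of the design: expert effort scales with the number of disagreements, not with the size of the dataset.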
How did the fine-tuned model compare to GPT, Claude and other frontier models?
In the teams’ tests, basic prompts left frontier models at about 50 percent accuracy; carefully written expert instructions and a three-tier rating system raised those models into the mid-70s, still short of the authors’ 80 percent threshold for trustworthy deployment. The fine-tuned Qwen3-235B reached 84.7 percent in the internal evaluation versus 78.2 percent for the best frontier model the teams tested. The report also notes that GPT 5.4 costs 43 percent more than GPT 5.2 while being only marginally more accurate, illustrating diminishing per-dollar improvements among the largest proprietary models.
The six tasks the teams used mirror routine investor judgments: for example, deciding whether a news article is relevant to a company executive, or whether a central bank document signals future rate direction. The teams implemented a three-tier label taxonomy: "relevant and interesting," "relevant but uninteresting," and "irrelevant." The authors credit that taxonomy and the investor corrections for the better performance.
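As a rough illustration, the three-tier taxonomy and the 80 percent deployment threshold could be encoded as below. The `Rating` enum, `accuracy`, and `meets_threshold` are hypothetical names, and exact-match accuracy is an assumption; the report's actual scoring method is not described here.

```python
from enum import Enum
from typing import Sequence


class Rating(Enum):
    RELEVANT_INTERESTING = "relevant and interesting"
    RELEVANT_UNINTERESTING = "relevant but uninteresting"
    IRRELEVANT = "irrelevant"


DEPLOYMENT_THRESHOLD = 0.80  # the authors' stated bar for trustworthy deployment


def accuracy(predictions: Sequence[Rating], gold: Sequence[Rating]) -> float:
    """Fraction of predictions that exactly match the expert (gold) rating."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def meets_threshold(acc: float, threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """True if an accuracy score clears the deployment threshold."""
    return acc >= threshold


# Reported figures from the article, checked against the threshold.
print(meets_threshold(0.847))  # fine-tuned Qwen3-235B
print(meets_threshold(0.782))  # best frontier model tested
```

Under this reading, the fine-tuned model clears the bar and the best frontier model does not, which matches the article's framing of the 80 percent threshold.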
Why it matters
The result shows proprietary corporate data and internal human expertise can still outstrip generic frontier models when turned into training signal. Bridgewater and Thinking Machines present a concrete example: a fine-tuned, open-weight model that the teams say hits 84.7 percent accuracy on investor-oriented tasks and costs nearly 14 times less to run than the best frontier model they tested. That implies firms with valuable private data can build domain-specialized systems without handing sensitive material to big labs, and that cost-performance trade-offs still favor targeted fine-tuning in some cases.
What are the methodological limits of the claim?
The numbers come from the collaborators' own internal evaluation. The report acknowledges that this is not an independent comparison and that both companies have a commercial interest in the result. The measurement choices matter: basic prompts for frontier models produced about 50 percent accuracy, while expert-written instructions raised their scores into the mid-70s but did not reach the teams’ 80 percent deployment threshold.
What to watch
Look for independent replications or third-party benchmarks that apply the same six investor tasks and the three-tier label scheme. Also watch whether other firms deploy similar staged-labeling pipelines and whether open-weight fine-tunes on proprietary data close the gap in other high-sensitivity domains.
| Item | Accuracy | Cost |
|---|---|---|
| Fine-tuned Qwen3-235B (Bridgewater / Thinking Machines) | 84.7% | nearly 14x cheaper |
| Best frontier model (with expert prompt) | 78.2% | baseline |
| Frontier models, naive prompt (Gemini / Claude / GPT variants) | about 50% | n/a |
| Frontier models with expert instructions (general) | mid-70s | n/a |
| GPT 5.4 vs 5.2 | marginally more accurate | 43% more expensive |
Written by The Brieftide · Source: The Decoder