Subquadratic's SubQ: sparse-attention claim, 12M-token context
Subquadratic says SubQ runs with a 12 million token window, is 56× faster than FlashAttention and scored 98% on long-context retrieval.
TL;DR
- Subquadratic says SubQ runs with a 12 million token window, is 56× faster than FlashAttention and scored 98% on long-context retrieval.
- The company has released third-party results from Appen that back several headline claims about speed, cost and long-context retrieval.
- SubQ replaces dense attention, the quadratic multiplication step inside transformers, with a sparse-attention mechanism that dynamically selects which token relationships to compute.
Subquadratic, a Miami-based AI startup that came out of stealth mode last month, says it has built SubQ, a large language model that abandons dense attention in favor of sparse attention and can handle context windows up to 12 million tokens. The company has released third-party results from Appen that back several headline claims about speed, cost and long-context retrieval.
How does SubQ differ from today’s transformers?
SubQ replaces dense attention, the quadratic multiplication step inside transformers, with a sparse-attention mechanism that dynamically selects which token relationships to compute. Dense attention multiplies every token with every other token, producing a quadratic growth in operations as text length increases; Subquadratic says sparse attention cuts that dramatically by selecting a subset of relationships on the fly. The firm says the selection is dynamic and computed differently for each input, and it declined to publish the exact selection method, calling it “the secret sauce.”
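To see why the scaling difference matters at these lengths, the sketch below counts query-key score computations for dense attention and for a sparse scheme with a fixed per-token budget. It is a rough illustration only: Subquadratic has not published its selection method, so the fixed budget (`keys_per_query`, set here to a hypothetical 4,096) and both function names are assumptions, and the cost of choosing which keys to keep is not counted.

```python
def dense_pair_count(n_tokens: int) -> int:
    """Score computations when every token is compared with every token."""
    return n_tokens * n_tokens


def sparse_pair_count(n_tokens: int, keys_per_query: int) -> int:
    """Score computations when each token attends to at most
    `keys_per_query` others (a hypothetical fixed budget, not SubQ's method)."""
    return n_tokens * min(keys_per_query, n_tokens)


if __name__ == "__main__":
    budget = 4096  # assumed value, chosen only for illustration
    for n in (1_000_000, 6_000_000, 12_000_000):
        dense = dense_pair_count(n)
        sparse = sparse_pair_count(n, budget)
        print(f"{n:>12,} tokens: dense={dense:.3e} sparse={sparse:.3e} "
              f"ratio={dense / sparse:,.1f}x")
```

Under these assumed numbers, doubling the context from six million to 12 million tokens quadruples the dense count but only doubles the sparse one.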
What did independent tests measure and report?
Appen ran a suite of tests and reported several striking numbers: it found SubQ 56 times faster than models using FlashAttention in a straight speed test, SubQ scored 89.7% on LiveCodeBench for competitive coding problems, and SubQ scored 98% on a needle-in-a-haystack long-context retrieval test with context windows of six million and 12 million tokens. Appen’s director of generative AI research, Jeanine Sinanan-Singh, said, "That was really exciting to me, it validated their architecture." Subquadratic also showed a demo in which SubQ reasoned across 400 documents in seconds, a workload that another service failed to load.
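For readers unfamiliar with the format, a needle-in-a-haystack test plants one fact inside a long stretch of filler text and checks whether the model can retrieve it. The sketch below shows the general shape of such a test and is not Appen's harness, which the source does not describe. Every name in it is hypothetical, and `toy_answer_fn` is a plain string search standing in for a real model call.

```python
import random


def build_haystack(n_filler: int, needle: str, depth: float, seed: int = 0) -> str:
    """Return filler text with `needle` inserted at a relative depth in [0, 1]."""
    rng = random.Random(seed)
    filler = [f"Filler sentence number {rng.randint(0, 10**6)}."
              for _ in range(n_filler)]
    position = int(depth * n_filler)
    return " ".join(filler[:position] + [needle] + filler[position:])


def retrieval_accuracy(answer_fn, depths, n_filler: int = 1000) -> float:
    """Fraction of trials in which answer_fn returns the planted secret."""
    hits = 0
    for i, depth in enumerate(depths):
        secret = f"code-{i:04d}"
        needle = f"The secret passphrase is {secret}."
        context = build_haystack(n_filler, needle, depth, seed=i)
        if secret in answer_fn(context, "What is the secret passphrase?"):
            hits += 1
    return hits / len(depths)


def toy_answer_fn(context: str, question: str) -> str:
    """Stand-in for a model call: locates the needle by string search."""
    marker = "The secret passphrase is "
    start = context.find(marker)
    if start == -1:
        return ""
    end = context.find(".", start + len(marker))
    return context[start + len(marker):end]


if __name__ == "__main__":
    print(retrieval_accuracy(toy_answer_fn, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

A real evaluation would replace `toy_answer_fn` with a call to the model under test and scale the filler to the target context length, which at millions of tokens is the hard part.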
What about cost and scale claims?
Subquadratic says SubQ can be far cheaper to run for certain data-heavy tests. The firm’s CEO Justin Dangel said running Anthropic’s Opus 4.6 through Nvidia’s RULER 128 cost $2,600, while running SubQ cost $8. The model’s context window tops out at 12 million tokens; Subquadratic contrasts that with most top models today, which it says have context windows of one million tokens. The company also claims tens of thousands of signups for early access, including more than 500 enterprise customers, though it has given access to only a small number because of limited resources.
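Taken at face value, those figures work out as follows. The variable names are illustrative, and the inputs are the company's own claims rather than independent measurements.

```python
# Figures as quoted by Subquadratic (claims, not independently verified here)
opus_cost_usd = 2600        # Opus 4.6 on the RULER 128 run
subq_cost_usd = 8           # SubQ on the same run
subq_context = 12_000_000   # SubQ's stated maximum context, in tokens
typical_context = 1_000_000 # what the company says most top models offer

cost_ratio = opus_cost_usd / subq_cost_usd
context_ratio = subq_context / typical_context

print(f"Claimed cost ratio: {cost_ratio:.0f}x cheaper")
print(f"Claimed context ratio: {context_ratio:.0f}x longer")
```

That is a claimed 325-fold cost difference on one benchmark run and a 12-fold difference in maximum context length.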
Why it matters
If SubQ’s numbers hold up under broad outside use, the model would change trade-offs between context length, speed and cost for tasks that need to ingest very large data sets, such as searching or reasoning across hundreds of documents or entire code bases. Subquadratic positions SubQ as tailored to coding and large-scale retrieval tasks rather than as a drop-in replacement across every benchmark. Skeptics note important caveats: the firm bootstrapped SubQ from weights of a version of the Chinese open-source model Qwen rather than training entirely from scratch, and benchmarks run under specific conditions do not fully capture real-world failure modes.
What to watch
Watch for wider access and third-party evaluations beyond Appen, including independent users running diverse real-world workloads and security audits of the reused Qwen weights. A confirmatory signal would be more public access to end-to-end tests that reproduce Appen’s 56× speed claim and the 98% retrieval score at six million and 12 million token contexts.
Subquadratic’s early results have shifted skepticism toward scrutiny: the company has posted third-party benchmarks that support many of its claims, but final judgment will come from hands-on trials by outside teams and broader testing across tasks and domains.
| Item | SubQ | Comparison |
|---|---|---|
| Speed vs FlashAttention | 56× faster (Appen speed test) | FlashAttention baseline (Appen) |
| Context window | Up to 12,000,000 tokens (Subquadratic claim) | Most top models: 1,000,000 tokens (Subquadratic claim) |
| LiveCodeBench (coding) | 89.7% (Appen) | Frontier-level, in same ballpark as top coding models (Appen) |
| RULER 128 cost example | $8 to run SubQ (Subquadratic claim) | Anthropic Opus 4.6: $2,600 (Subquadratic claim) |
| Needle-in-a-haystack retrieval | 98% at 6M and 12M token contexts (Appen) | Few models tested at that scale (Appen) |
Written by The Brieftide · Source: MIT Technology Review