
Subquadratic's SubQ: sparse-attention claim, 12M-token context

Subquadratic says SubQ runs with a 12-million-token context window, is 56× faster than FlashAttention, and scored 98% on long-context retrieval.

The Brieftide

TL;DR

  • 01 Subquadratic says SubQ runs with a 12-million-token context window, is 56× faster than FlashAttention, and scored 98% on long-context retrieval.
  • 02 The company has released third-party results from Appen that back several headline claims about speed, cost and long-context retrieval.
  • 03 SubQ replaces dense attention, the quadratic multiplication step inside transformers, with a sparse-attention mechanism that dynamically selects which token relationships to compute.

Subquadratic, a Miami-based AI startup that came out of stealth mode last month, says it has built SubQ, a large language model that abandons dense attention in favor of sparse attention and can handle context windows up to 12 million tokens. The company has released third-party results from Appen that back several headline claims about speed, cost and long-context retrieval.

How does SubQ differ from today’s transformers?

SubQ replaces dense attention, the step inside transformers whose cost grows quadratically with input length, with a sparse-attention mechanism that dynamically selects which token relationships to compute. Dense attention scores every token against every other token, so doubling the text length roughly quadruples the attention operations; Subquadratic says sparse attention cuts that dramatically by selecting a subset of relationships on the fly. The firm says the selection is dynamic and computed differently for each input, and it declined to publish the exact selection method, calling it “the secret sauce.”
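To make the scaling argument concrete, here is a minimal Python sketch that counts the token pairs dense attention must score and compares that with a sparse scheme that gives each token a fixed budget of keys. It is not SubQ's method, which the company has not published. The budget of 4,096 keys per token and the `local_window` selection rule are assumptions chosen only for illustration; SubQ's selection is described as dynamic, whereas this placeholder is a fixed pattern.

```python
import math


def dense_pairs(n_tokens):
    """Query-key pairs scored by dense attention: every token against every token."""
    return n_tokens * n_tokens


def sparse_pairs(n_tokens, keys_per_query):
    """Pairs scored when each token attends to at most `keys_per_query` selected keys."""
    return n_tokens * min(keys_per_query, n_tokens)


def local_window(i, width):
    """Placeholder selection rule (an assumption, not SubQ's): the `width`
    most recent positions up to and including position i."""
    return list(range(max(0, i - width + 1), i + 1))


def attend(query, keys, values, selected):
    """Scaled dot-product attention for one query, restricted to the key
    indices in `selected`. Passing every index reproduces dense attention."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[j])) / scale for j in selected]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * values[j][c] for w, j in zip(weights, selected)) for c in range(dim)]


if __name__ == "__main__":
    budget = 4096  # illustrative key budget per token, not a SubQ figure
    for n in (128_000, 1_000_000, 12_000_000):
        d, s = dense_pairs(n), sparse_pairs(n, budget)
        print(f"{n:>12,} tokens: dense {d:.2e} pairs, sparse {s:.2e} pairs, ratio {d / s:,.0f}x")
```

With a fixed budget, the sparse count grows linearly with input length while the dense count grows with its square, so the gap widens as the context gets longer. The hard part, and the part Subquadratic keeps private, is choosing which pairs to keep without scoring all of them first.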

What did independent tests measure and report?

Appen ran a suite of tests and reported several striking numbers. In a direct speed test, it found SubQ 56 times faster than models using FlashAttention. SubQ scored 89.7% on LiveCodeBench, a benchmark of competitive coding problems, and 98% on a needle-in-a-haystack long-context retrieval test with context windows of six million and 12 million tokens. Appen’s director of generative AI research, Jeanine Sinanan-Singh, said, "That was really exciting to me, it validated their architecture." Subquadratic also showed a demo in which SubQ reasoned across 400 documents in seconds, a workload that another service failed to load.
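For readers unfamiliar with the format, a needle-in-a-haystack test generally hides one fact inside a long stretch of filler text and checks whether the model can retrieve it. The sketch below shows that general idea in Python; it is not Appen's protocol, which the source does not describe. The filler sentence, the "magic number" needle, and the `toy_answer_fn` stand-in (a plain string search, not a language model) are all hypothetical.

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog."


def build_haystack(n_sentences, needle, depth):
    """Return filler text with `needle` inserted at a relative `depth`
    (0.0 places it at the start, 1.0 at the end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def retrieval_accuracy(answer_fn, n_sentences, depths, seed=0):
    """Fraction of depths at which `answer_fn(context, question)` returns
    text containing the hidden six-digit number."""
    rng = random.Random(seed)
    hits = 0
    for depth in depths:
        secret = str(rng.randrange(100_000, 1_000_000))
        context = build_haystack(n_sentences, f"The magic number is {secret}.", depth)
        answer = answer_fn(context, "What is the magic number?")
        hits += secret in answer
    return hits / len(depths)


def toy_answer_fn(context, question):
    """Hypothetical stand-in for a model call: finds the needle by string search."""
    marker = "The magic number is "
    start = context.find(marker)
    if start == -1:
        return ""
    start += len(marker)
    return context[start:start + 6]


if __name__ == "__main__":
    depths = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(retrieval_accuracy(toy_answer_fn, 1000, depths))
```

In a real evaluation, `answer_fn` would wrap a call to the model under test and the haystack would be scaled to the target context length, so a 98% score would mean the model found the hidden fact in 98% of trials.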

What about cost and scale claims?

Subquadratic says SubQ can be far cheaper to run for certain data-heavy tests. The firm’s CEO Justin Dangel said running Anthropic’s Opus 4.6 through Nvidia’s RULER 128 cost $2,600, while running SubQ cost $8. The model’s context window tops out at 12 million tokens; Subquadratic contrasts that with most top models today, which it says have context windows of one million tokens. The company also claims tens of thousands of signups for early access, including more than 500 enterprise customers, though it has given access to only a small number because of limited resources.
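The size of the claimed gaps is easier to judge as ratios. The short Python sketch below does that arithmetic using only the figures quoted above, all of which are Subquadratic's own claims rather than independently verified numbers; the `ratio` helper is just an illustrative name.

```python
def ratio(baseline, candidate):
    """How many times larger `baseline` is than `candidate`."""
    return baseline / candidate


# Figures as quoted by Subquadratic (company claims, not verified here).
opus_cost_usd = 2600
subq_cost_usd = 8
subq_window_tokens = 12_000_000
typical_window_tokens = 1_000_000

cost_ratio = ratio(opus_cost_usd, subq_cost_usd)
window_ratio = ratio(subq_window_tokens, typical_window_tokens)

if __name__ == "__main__":
    print(f"Claimed cost gap on the benchmark run: {cost_ratio:.0f}x")
    print(f"Claimed context-window gap: {window_ratio:.0f}x")
```

On the company's numbers, the benchmark run was 325 times cheaper and the context window is 12 times larger than the one-million-token figure it cites for most top models.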

Why it matters

If SubQ’s numbers hold up under broad outside use, the model would change trade-offs between context length, speed and cost for tasks that need to ingest very large data sets, such as searching or reasoning across hundreds of documents or entire code bases. Subquadratic positions SubQ as tailored to coding and large-scale retrieval tasks rather than as a drop-in replacement across every benchmark. Skeptics note important caveats: the firm bootstrapped SubQ from weights of a version of the Chinese open-source model Qwen rather than training entirely from scratch, and benchmarks run under specific conditions do not fully capture real-world failure modes.

What to watch

Watch for wider access and third-party evaluations beyond Appen, including independent users running diverse real-world workloads and security audits of the reused Qwen weights. A confirmatory signal would be more public access to end-to-end tests that reproduce Appen’s 56× speed claim and the 98% retrieval score at six million and 12 million token contexts.

Subquadratic’s early results have shifted skepticism toward scrutiny: the company has posted third-party benchmarks that support many of its claims, but final judgment will come from hands-on trials by outside teams and broader testing across tasks and domains.

SubQ vs typical top models (selected metrics from source)

| Item | SubQ | Typical top models |
| --- | --- | --- |
| Speed vs FlashAttention | 56× faster (Appen speed test) | FlashAttention baseline (Appen) |
| Context window | Up to 12,000,000 tokens (Subquadratic claim) | Most top models: 1,000,000 tokens (Subquadratic claim) |
| LiveCodeBench (coding) | 89.7% (Appen) | Frontier-level, in same ballpark as top coding models (Appen) |
| RULER 128 cost example | $8 to run SubQ (Subquadratic claim) | Anthropic Opus 4.6: $2,600 (Subquadratic claim) |
| Needle-in-a-haystack retrieval | 98% at 6M and 12M token contexts (Appen) | Few models tested at that scale (Appen) |

Written by The Brieftide · Source: MIT Technology Review
