June 19, 2026 · 5 min read

TOTEN: Ontological tokenization for Brazilian Portuguese notation

TOTEN replaces statistical subword splits with a knowledge-based ontology to preserve units and numbers.

The Brieftide · June 19, 2026

TL;DR

01 TOTEN replaces statistical subword splits with a knowledge-based ontology to preserve units and numbers.
02 TOTEN, a knowledge-based ontological tokenization framework, was submitted to arXiv on 17 Jun 2026.
03 On dimensional equivalence, TOTEN shows statistical parity with Pint, the dimensional oracle from which it inherits dimensional authority.

TOTEN, a knowledge-based ontological tokenization framework, was submitted to arXiv on 17 Jun 2026. The authors formalize TOTEN as the triple <O, classify, {inst_tau}> and evaluate it on an internal benchmark, EngQuant (N=800), and on four external Brazilian Portuguese corpora (N=1771 eligible cases), reporting statistically significant improvements over eight state-of-the-art baselines.

What is TOTEN and how does it work?

TOTEN is a declarative tokenization system that replaces statistical derivation with ontology-guided classification: the paper formalizes it as <O, classify, {inst_tau}> where O is an ontology of engineering entities, classify maps raw text into typed regions, and the instantiator family yields a self-descriptive structured representation. The system couples deterministic classification with three external oracles: Pint for dimensional analysis, the Unicode Character Database for typographic properties, and RSLP for Portuguese morphology. The ontology encodes types, structural principles, composition relations, and preservable invariants so that physical quantities, units, numeric forms, and symbolic expressions are treated as coherent, atomic tokens rather than arbitrary subword fragments.
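To make the <O, classify, {inst_tau}> triple concrete, here is a minimal Python sketch of the idea, not the authors' implementation. The one-entry ontology, the unit list, the regular expressions, and the names Region, inst_quantity, inst_text, and tokenize are all assumptions made for illustration. The external oracles (Pint, the Unicode Character Database, RSLP) are left out.

```python
import re
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical miniature ontology O: a single entity type with a hand-picked unit list.
# pt-BR numbers use '.' as the thousands separator and ',' as the decimal separator.
NUMBER = r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?"
UNIT = r"kPa|MPa|mm|cm|kg|m|N"
ONTOLOGY = {
    "quantity": re.compile(rf"(?P<number>{NUMBER})\s?(?P<unit>{UNIT})\b"),
}


@dataclass(frozen=True)
class Region:
    type: str
    start: int
    end: int
    text: str


def classify(raw: str) -> list[Region]:
    """Map raw text to typed regions; anything unmatched becomes a 'text' region."""
    regions, cursor = [], 0
    for m in ONTOLOGY["quantity"].finditer(raw):
        if m.start() > cursor:
            regions.append(Region("text", cursor, m.start(), raw[cursor:m.start()]))
        regions.append(Region("quantity", m.start(), m.end(), m.group()))
        cursor = m.end()
    if cursor < len(raw):
        regions.append(Region("text", cursor, len(raw), raw[cursor:]))
    return regions


def inst_quantity(region: Region) -> dict:
    """Instantiator for quantities: keep number and unit together as one token."""
    m = ONTOLOGY["quantity"].fullmatch(region.text)
    value = Decimal(m["number"].replace(".", "").replace(",", "."))
    return {"type": "quantity", "value": value, "unit": m["unit"], "surface": region.text}


def inst_text(region: Region) -> dict:
    """Instantiator for everything the ontology does not claim."""
    return {"type": "text", "surface": region.text}


# The instantiator family {inst_tau}, indexed by region type.
INSTANTIATORS = {"quantity": inst_quantity, "text": inst_text}


def tokenize(raw: str) -> list[dict]:
    return [INSTANTIATORS[r.type](r) for r in classify(raw)]


if __name__ == "__main__":
    for token in tokenize("A viga suporta 1.250,5 kPa em 3 m."):
        print(token)
```

In this sketch each quantity comes back as a single record that carries its own type, value, and unit, which is the sense in which the output is self-descriptive. Joining the surface fields reproduces the input text exactly.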

How does TOTEN perform versus existing tokenizers and detectors?

TOTEN attains unit ontological atomicity in all contrasts against eight state-of-the-art baselines and posts substantially higher numerical reconstruction scores: on the internal EngQuant benchmark (N=800) TOTEN reaches numerical reconstruction of 0.780 versus 0.340 for the best baseline. Across four external Brazilian Portuguese corpora (N=1771 eligible cases) TOTEN yields numerical reconstruction in the range 0.775 to 0.904, compared with 0.627 to 0.703 for the best baseline, Quantulum3. The authors report these differences as statistically significant using McNemar tests with Holm correction, and they show a Spearman correlation between internal and external rankings to support the benchmark's concurrent validity.
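For readers unfamiliar with the statistics named here, the following Python sketch shows one common way to run paired McNemar tests with Holm correction. It uses the exact binomial form of McNemar's test, which is an assumption: the brief does not say which variant the authors used. The discordant counts and baseline names are invented for illustration and are not results from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases where system A is correct and system B is wrong
    c: cases where system A is wrong and system B is correct
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


def holm(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank), then enforce monotonicity.
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted


if __name__ == "__main__":
    # Illustrative discordant counts (b, c) for three hypothetical contrasts; not from the paper.
    contrasts = {"baseline_a": (40, 5), "baseline_b": (22, 9), "baseline_c": (12, 10)}
    raw = [mcnemar_exact(b, c) for b, c in contrasts.values()]
    for name, p_raw, p_adj in zip(contrasts, raw, holm(raw)):
        print(f"{name}: raw p={p_raw:.4g}, Holm-adjusted p={p_adj:.4g}")
```

McNemar's test looks only at the cases where the two systems disagree, which suits a comparison of tokenizers run on the same items. Holm's procedure then controls the family-wise error rate across the multiple baseline contrasts.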

How robust and verifiable is TOTEN's design?

Robustness is built into the design: the paper defines four intrinsic properties that the system can verify by construction (ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction) and evaluates them on EngQuant and the external corpora. On dimensional equivalence, TOTEN shows statistical parity with Pint, the dimensional oracle from which it inherits dimensional authority. Detection recall is reported separately to distinguish coverage from conditional atomicity, and the authors contrast TOTEN against eight baselines to separate detection performance from the system's ontological guarantees.
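As a rough illustration of what typographic robustness can involve, the Python sketch below uses the standard library's unicodedata module, which exposes the Unicode Character Database. Treating NFKC normalization as the folding step is an assumption, since the brief does not describe how TOTEN consults the database. The names normalize_unit and same_unit are hypothetical.

```python
import unicodedata


def normalize_unit(surface: str) -> str:
    """Fold compatibility variants with NFKC, e.g. MICRO SIGN to GREEK SMALL LETTER MU."""
    return unicodedata.normalize("NFKC", surface)


def same_unit(a: str, b: str) -> bool:
    """Treat two unit spellings as equal if they agree after NFKC folding."""
    return normalize_unit(a) == normalize_unit(b)


if __name__ == "__main__":
    # Look-alike characters that commonly appear in unit symbols.
    for ch in ("\u00b5", "\u03bc", "\u2126", "\u03a9"):
        folded = normalize_unit(ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")
    print(same_unit("\u00b5m", "\u03bcm"))
```

One caveat with this approach: NFKC also folds superscript digits, so "m²" becomes "m2". A real system would need to capture exponents before folding or handle them separately.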

Why it matters

TOTEN addresses a practical mismatch between statistical tokenizers and technical text: Byte-Pair Encoding and similar methods fragment physical quantities, units, and symbolic expressions into lexically arbitrary subwords, which breaks downstream tasks that need coherent numeric and unit semantics. By returning self-descriptive, typed regions grounded in an engineering ontology and external oracles, TOTEN preserves numeric and dimensional structure. That preservation yields measurable gains on both a physically validated internal benchmark and on real-world Portuguese corpora, implying cleaner inputs for extraction, conversion, and engineering NLP pipelines.

What to watch

Follow whether external teams reproduce the reported numerical reconstruction ranges (0.775 to 0.904) on additional Portuguese technical corpora and whether TOTEN’s ontological approach is integrated into tokenizers used in engineering document pipelines. A concrete next milestone will be public release of the EngQuant benchmark and the TOTEN instantiators to enable independent replication and broader evaluation against multilingual technical texts.

TOTEN vs best baseline: reported metrics

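Item	TOTEN (EngQuant, N=800)	Best baseline (EngQuant)	TOTEN (external corpora, N=1771)	Best baseline (external corpora)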
Numerical reconstruction	0.780	0.340	0.775–0.904	0.627–0.703
Unit ontological atomicity	achieved in all contrasts	not achieved (baseline)	achieved in all contrasts	not achieved (baseline)
Dimensional equivalence with Pint	statistical parity with Pint	not reported	statistical parity with Pint	not reported

Written by The Brieftide · Source: arXiv
