TOTEN: Ontological tokenization for Brazilian Portuguese notation

TOTEN replaces statistical subword splits with a knowledge-based ontology to preserve units and numbers.

The Brieftide

TL;DR

  • TOTEN replaces statistical subword splits with a knowledge-based ontology to preserve units and numbers.
  • TOTEN, a knowledge-based ontological tokenization framework, was submitted to arXiv on 17 Jun 2026.
  • On dimensional equivalence, TOTEN shows statistical parity with Pint, the dimensional oracle from which it inherits dimensional authority.

TOTEN, a knowledge-based ontological tokenization framework, was submitted to arXiv on 17 Jun 2026. The authors present TOTEN as the triple <O, classify, {inst_tau}> and evaluate it on an internal benchmark, EngQuant (N=800), and on four external Brazilian Portuguese corpora (N=1771 eligible cases). They report statistically significant improvements over eight state-of-the-art baselines.

What is TOTEN and how does it work?

TOTEN is a declarative tokenization system that replaces statistical derivation with ontology-guided classification. The paper formalizes it as <O, classify, {inst_tau}>, where O is an ontology of engineering entities, classify maps raw text into typed regions, and the instantiator family {inst_tau} yields a self-descriptive structured representation. The system couples deterministic classification with three external oracles: Pint for dimensional analysis, the Unicode Character Database for typographic properties, and RSLP for Portuguese morphology. The ontology encodes types, structural principles, composition relations, and preservable invariants, so that physical quantities, units, numeric forms, and symbolic expressions are treated as coherent, atomic tokens rather than arbitrary subword fragments.
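To make the triple concrete, here is a minimal sketch in Python, not the authors' implementation. The unit list, the regular expressions, and every name in it (`ONTOLOGY`, `Region`, `classify`, `INSTANTIATORS`, `tokenize`) are assumptions chosen for illustration. It uses only the standard library and does not call Pint or RSLP.

```python
import re
from dataclasses import dataclass

# Hypothetical unit inventory; the paper's ontology is far richer.
UNITS = ("kPa", "MPa", "kN", "mm", "cm", "kg", "m")
NUM = r"\d+(?:,\d+)?"  # integer or decimal-comma number, e.g. 12,5

# O: toy ontology mapping each type to a recognizer, in priority order.
ONTOLOGY = {
    "quantity": re.compile(NUM + r"\s?(?:" + "|".join(UNITS) + r")\b"),
    "number": re.compile(NUM),
    "word": re.compile(r"\w+|\S"),  # fallback: a word or one symbol
}


@dataclass(frozen=True)
class Region:
    kind: str   # ontology type assigned by classify
    text: str   # surface form, kept whole
    start: int  # offset in the raw text


def classify(text):
    """Map raw text to typed regions; the first matching type wins."""
    regions, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in ONTOLOGY.items():
            m = pattern.match(text, pos)
            if m:
                regions.append(Region(kind, m.group(), m.start()))
                pos = m.end()
                break
    return regions


def to_float(surface):
    """Read a decimal-comma number such as '12,5'."""
    return float(surface.replace(",", "."))


def inst_quantity(region):
    """Instantiator for quantities: split magnitude from unit."""
    magnitude, unit = re.fullmatch("(" + NUM + r")\s?(.+)", region.text).groups()
    return {"type": "quantity", "magnitude": to_float(magnitude), "unit": unit}


# {inst_tau}: one instantiator per ontology type.
INSTANTIATORS = {
    "quantity": inst_quantity,
    "number": lambda r: {"type": "number", "value": to_float(r.text)},
    "word": lambda r: {"type": "word", "text": r.text},
}


def tokenize(text):
    """Compose classify with the instantiator for each region's type."""
    return [INSTANTIATORS[r.kind](r) for r in classify(text)]


for token in tokenize("A viga suporta 12,5 kN em 3 m de vão."):
    print(token)
```

In this sketch "12,5 kN" and "3 m" each come out as a single structured token, which is the behavior the article calls atomicity. A statistical subword tokenizer gives no such guarantee.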

How does TOTEN perform versus existing tokenizers and detectors?

TOTEN attains unit ontological atomicity in all contrasts against eight state-of-the-art baselines and posts substantially higher numerical reconstruction scores: on the internal EngQuant benchmark (N=800) TOTEN reaches numerical reconstruction of 0.780 versus 0.340 for the best baseline. Across four external Brazilian Portuguese corpora (N=1771 eligible cases) TOTEN yields numerical reconstruction in the range 0.775 to 0.904, compared with 0.627 to 0.703 for the best baseline, Quantulum3. The authors report these differences as statistically significant using McNemar tests with Holm correction, and they show a Spearman correlation between internal and external rankings to support the benchmark's concurrent validity.
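For readers unfamiliar with the statistics named here, the sketch below shows one common way to run an exact McNemar test on paired outcomes and then apply Holm's step-down correction across several contrasts. It is an illustration only. The discordant counts are invented, the baseline labels are placeholders, and the paper may use a different variant of the test.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases system A got right and system B got wrong
    c: cases system B got right and system A got wrong
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(p_values):
    """Holm step-down adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank),
        # then enforce monotonicity with a running maximum.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Hypothetical discordant counts (b, c) for three contrasts; not the paper's data.
contrasts = {"baseline_A": (40, 5), "baseline_B": (22, 9), "baseline_C": (12, 10)}
raw = [mcnemar_exact(b, c) for b, c in contrasts.values()]
for name, p_raw, p_adj in zip(contrasts, raw, holm(raw)):
    print(f"{name}: raw p = {p_raw:.4g}, Holm-adjusted p = {p_adj:.4g}")
```

McNemar's test fits this setting because every system is scored on the same cases, so only the cases where two systems disagree carry information. Holm's correction then controls the family-wise error rate across the eight baseline contrasts.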

How robust and verifiable is TOTEN's design?

Robustness is built into the design. The paper defines four intrinsic properties that the system can verify by construction (ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction) and evaluates them on EngQuant and the external corpora. On dimensional equivalence, TOTEN shows statistical parity with Pint, the dimensional oracle from which it inherits dimensional authority. Detection recall is reported separately to distinguish coverage from conditional atomicity, and the authors contrast TOTEN against eight baselines to separate detection performance from the system's ontological guarantees.
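Two of these properties lend themselves to a short illustration. The sketch below is one plausible reading of typographic robustness and numerical reconstruction, not the paper's definitions. It uses Python's standard `unicodedata` module, which exposes the Unicode Character Database; the function names are hypothetical.

```python
import unicodedata


def normalize_typography(text):
    """Fold compatibility variants (micro sign, superscript digits) with NFKC."""
    return unicodedata.normalize("NFKC", text)


def reconstruct_number(surface):
    """Parse Brazilian notation: '.' groups thousands, ',' marks decimals."""
    return float(surface.replace(".", "").replace(",", "."))


# U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU look identical in print.
micro, mu = "\u00b5", "\u03bc"
print(unicodedata.name(micro), "|", unicodedata.name(mu))
print(normalize_typography(micro + "m") == mu + "m")

# A superscript two folds to a plain digit under NFKC.
print(normalize_typography("m\u00b2"))

# Brazilian Portuguese writes one thousand two hundred thirty-four
# point fifty-six as 1.234,56.
print(reconstruct_number("1.234,56"))
```

NFKC is a blunt tool: it turns "m²" into "m2" and discards the fact that the 2 was an exponent. A production system would more likely map superscripts to explicit exponents before handing the unit to a dimensional oracle.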

Why it matters

TOTEN addresses a practical mismatch between statistical tokenizers and technical text: Byte-Pair Encoding and similar methods fragment physical quantities, units, and symbolic expressions into lexically arbitrary subwords, which breaks downstream tasks that need coherent numeric and unit semantics. By returning self-descriptive, typed regions grounded in an engineering ontology and external oracles, TOTEN preserves numeric and dimensional structure. That preservation yields measurable gains on both a physically validated internal benchmark and on real-world Portuguese corpora, implying cleaner inputs for extraction, conversion, and engineering NLP pipelines.

What to watch

Follow whether external teams reproduce the reported numerical reconstruction ranges (0.775 to 0.904) on additional Portuguese technical corpora and whether TOTEN’s ontological approach is integrated into tokenizers used in engineering document pipelines. A concrete next milestone will be public release of the EngQuant benchmark and the TOTEN instantiators to enable independent replication and broader evaluation against multilingual technical texts.

TOTEN vs best baseline: reported metrics
Item | TOTEN (EngQuant, N=800) | Best baseline (EngQuant) | TOTEN (external corpora, N=1771) | Best baseline (external corpora)
Numerical reconstruction | 0.780 | 0.340 | 0.775–0.904 | 0.627–0.703
Unit ontological atomicity | achieved in all contrasts | not achieved | achieved in all contrasts | not achieved
Dimensional equivalence with Pint | statistical parity with Pint | not reported | statistical parity with Pint | not reported
Written by The Brieftide · Source: arXiv