Stanford EDGAR Filings Dataset: SEFD-v1 152B-token release

Open reconstruction of SEC filings yields a 152B-token public snapshot and an 18.5M-filing archive estimated at 550B tokens.

TL;DR

  • Open reconstruction of SEC filings yields a 152B-token public snapshot and an 18.5M-filing archive estimated at 550B tokens.
  • The Stanford EDGAR Filings Dataset (SEFD) reconstructs U.S. SEC filings into layout-faithful MultiMarkdown and publishes SEFD-v1, a 152B-token public snapshot intended for long-context pretraining and evaluation.

The Stanford EDGAR Filings Dataset (SEFD) reconstructs U.S. SEC filings into layout-faithful MultiMarkdown and publishes SEFD-v1, a 152B-token public snapshot intended for long-context pretraining and evaluation. The authors also describe a larger archive of 18.5 million filings estimated at 550B tokens and report less than 0.1% overlap with Common Crawl-derived corpora.

What is SEFD and what does it contain?

SEFD is an open reconstruction of SEC filings formatted as layout-faithful MultiMarkdown for financial language modeling and evaluation. The corpus makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding.

The initial public snapshot, SEFD-v1, is described as a 152B-token dataset. The paper additionally provides corpus-level analyses of a larger archive of 18.5M filings, which the authors estimate at 550B tokens. The authors emphasize token efficiency and low duplication relative to common web corpora, reporting less than 0.1% overlap with Common Crawl-derived datasets.
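To put those headline figures in proportion, the short Python sketch below derives two rough ratios from them: the share of the estimated archive covered by SEFD-v1, and the average filing length implied by the archive estimate. The constant and function names are illustrative, and the results are only as reliable as the 550B-token estimate itself.

```python
# Figures quoted in the brief; the archive token count is the authors' estimate.
SNAPSHOT_TOKENS = 152_000_000_000      # SEFD-v1 public snapshot
ARCHIVE_TOKENS_EST = 550_000_000_000   # estimated size of the full archive
ARCHIVE_FILINGS = 18_500_000           # filings in the full archive


def scale_summary(snapshot_tokens: int, archive_tokens: int, archive_filings: int) -> dict:
    """Derive two rough ratios from the headline numbers (illustrative helper)."""
    return {
        "snapshot_share": snapshot_tokens / archive_tokens,
        "avg_tokens_per_filing": archive_tokens / archive_filings,
    }


summary = scale_summary(SNAPSHOT_TOKENS, ARCHIVE_TOKENS_EST, ARCHIVE_FILINGS)
print(f"Snapshot share of estimated archive: {summary['snapshot_share']:.1%}")
print(f"Average tokens per filing (archive): {summary['avg_tokens_per_filing']:,.0f}")
```

On these numbers the snapshot is about 27.6% of the estimated archive, and the archive averages roughly 29,700 tokens per filing, which is consistent with the long-context framing.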

How was the dataset built and what problems does it address?

SEFD reconstructs filings into a layout-preserving MultiMarkdown representation to supply clean long-context documents for model pretraining and downstream evaluation. The authors argue that high-quality public web corpora are increasingly exhausted, and that existing long-context corpora are often proprietary, costly, synthetically generated, or concentrated in narrow domains.

To address that gap, the dataset pipeline extracts audited statements, disclosures, ownership reports, accounting notes, and event filings from EDGAR and converts them into a format intended to preserve document layout while remaining token-efficient and model-ready. The paper frames SEFD as usable both for pretraining and for document-level evaluation tasks in finance and compliance.
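As a rough illustration of what a layout-preserving yet compact table representation can look like, the Python sketch below renders rows as a pipe table with alignment markers, the table syntax MultiMarkdown supports. This is not the authors' pipeline: the helper `to_mmd_table` is hypothetical, and the figures in the example are invented.

```python
from typing import Sequence


def to_mmd_table(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a table in pipe-table syntax.

    Hypothetical helper for illustration; not part of any SEFD tooling.
    """
    if any(len(row) != len(header) for row in rows):
        raise ValueError("every row must have as many cells as the header")
    # Assumption: the first column holds line-item labels (left-aligned)
    # and the remaining columns hold numbers (right-aligned).
    separator = [":---"] + ["---:"] * (len(header) - 1)
    lines = [header, separator, *rows]
    return "\n".join("| " + " | ".join(cells) + " |" for cells in lines)


# Invented example values, not taken from any filing.
table = to_mmd_table(
    ["Line item", "FY2023", "FY2022"],
    [["Revenue", "1,200", "1,050"], ["Net income", "310", "275"]],
)
print(table)
```

The row and column structure survives as plain text at the cost of only a few separator characters per row, which is the kind of trade-off between layout fidelity and token count that the paragraph above describes.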

What benchmarks come with SEFD?

The authors introduce two SEFD-derived benchmarks, EDGAR-Forecast and EDGAR-OCR, alongside the corpus. EDGAR-Forecast evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR evaluates transcription of complex financial tables.

Both benchmarks are positioned as evaluation tools that leverage SEFD's layout-faithful reconstructions. EDGAR-Forecast targets numerical forecasting grounded in filings, while EDGAR-OCR targets transcription accuracy for tables that are often challenging for text-only corpora.
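The brief does not say how either benchmark is scored, so the Python sketch below only illustrates the two ideas under stated assumptions: forecasts are evaluated on filings dated after a model's knowledge cutoff using mean absolute percentage error, and table transcription is scored by exact cell match. All function names, field names, and data here are hypothetical and are not the benchmarks' actual metrics or formats.

```python
from datetime import date


def post_cutoff(filings: list[dict], cutoff: date) -> list[dict]:
    """Keep only filings dated strictly after the model's knowledge cutoff."""
    return [f for f in filings if f["filed"] > cutoff]


def mean_abs_pct_error(predicted: list[float], actual: list[float]) -> float:
    """Mean absolute percentage error; assumes no actual value is zero."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


def cell_accuracy(predicted: list[list[str]], reference: list[list[str]]) -> float:
    """Share of reference cells reproduced exactly, ignoring surrounding whitespace."""
    total = sum(len(row) for row in reference)
    if total == 0:
        raise ValueError("reference table has no cells")
    correct = 0
    for r, ref_row in enumerate(reference):
        pred_row = predicted[r] if r < len(predicted) else []
        for c, ref_cell in enumerate(ref_row):
            if c < len(pred_row) and pred_row[c].strip() == ref_cell.strip():
                correct += 1
    return correct / total


# Invented example data, for illustration only.
filings = [
    {"id": "a", "filed": date(2024, 3, 1), "actual": 100.0},
    {"id": "b", "filed": date(2025, 2, 1), "actual": 200.0},
    {"id": "c", "filed": date(2025, 5, 1), "actual": 50.0},
]
eval_set = post_cutoff(filings, cutoff=date(2024, 12, 31))
forecasts = {"b": 220.0, "c": 45.0}  # hypothetical model outputs
mape = mean_abs_pct_error(
    [forecasts[f["id"]] for f in eval_set],
    [f["actual"] for f in eval_set],
)
acc = cell_accuracy(
    [["Revenue", "1,200"], ["Net income", "31O"]],  # letter O misread for zero
    [["Revenue", "1,200"], ["Net income", "310"]],
)
print(f"MAPE on post-cutoff filings: {mape:.1%}; table cell accuracy: {acc:.0%}")
```

Filtering by filing date is what keeps the forecast targets out of the model's training data, and exact cell matching shows why a single misread character in a financial table counts as a full error.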

Why it matters

Clean, long-context documents are scarce for training and evaluating large language models, especially in finance, where precise layout and tabular data matter. SEFD supplies a sizeable, open, domain-specific corpus: a 152B-token public snapshot, a larger archive estimated at 550B tokens, and benchmarks designed for forecasting and table transcription. The reported overlap of less than 0.1% with Common Crawl-derived corpora suggests that most of the text is new relative to common web data, which matters for researchers seeking fresh training signal and for evaluators testing generalization beyond web text.

Researchers and practitioners focused on financial reasoning, forecasting, compliance, and document understanding gain a usable, layout-aware dataset and concrete evaluation tasks in EDGAR-Forecast and EDGAR-OCR. The archive scale and the token-efficiency claim make SEFD relevant to teams building long-context pretraining mixes or fine-tuning models on financial documents.

What to watch

Watch for adoption of EDGAR-Forecast and EDGAR-OCR as community evaluation tasks, and for whether model builders incorporate the 152B-token SEFD-v1 snapshot into pretraining mixes. Also watch for public releases or updates tied to the 18.5M-filing archive, and for empirical results comparing models trained on SEFD data against those trained on existing long-context corpora.

Authors and submission details: the paper is authored by Nick Bettencourt, Xiaowei Ding, and Kay Giesecke, and the arXiv submission history shows a first submission on 16 Jun 2026 and a revision on 17 Jun 2026.

Written by The Brieftide · Source: arXiv
