Multimodal AI · June 27, 2026 · 5 min read

Synthetic clinical notes from LLMs: a 70-patient longitudinal dataset

William Poulett publishes a modular LLM pipeline and a synthetic dataset of 70 patients.

The Brieftide · June 27, 2026

TL;DR

01 William Poulett publishes a modular LLM pipeline and a synthetic dataset of 70 patients.
02 The pipeline, published on 25 June 2026, uses large language models to generate longitudinal clinical notes.
03 Validation and augmentation components run after generation to check the notes and to increase realism and diversity.

On 25 June 2026, William Poulett published a modular pipeline that uses large language models to generate longitudinal clinical notes, along with a synthetic dataset. The dataset covers 70 synthetic patients, each with 20–50 clinical notes spanning a full hospital journey, and the pipeline is designed to prioritise internal consistency across records.

How does the pipeline work?

The pipeline runs in three stages: structured patient generation, semi-structured patient journey simulation, and unstructured clinical note generation with large language models. The system first generates structured patient data, then simulates the sequence of events that makes up a patient journey, and finally has an LLM write the unstructured clinical notes. LLM-based validation and augmentation components run after generation to improve faithfulness, realism, and diversity.

The modular design allows each stage to be adjusted independently. Structured patient generation supplies demographics and clinical variables that feed the journey simulation. The journey simulation produces encounter sequences and semi-structured artifacts that the note-generation LLM turns into free-text clinical documentation. The authors add LLM-based validators and augmenters to reduce inconsistencies and expand stylistic variation in the notes.
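The staged design described above can be illustrated with a minimal Python sketch. It is not the paper's code: every name here (`Patient`, `Event`, `Note`, `run_pipeline`, `echo_llm`) is hypothetical, the conditions and event types are made-up placeholders, and the validator is a simple rule-based stand-in for the LLM-based validators the paper describes. The point is only to show how each stage can be swapped independently because it consumes the previous stage's output.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for any text-in, text-out model call.
LLM = Callable[[str], str]


@dataclass
class Patient:
    patient_id: str
    age: int
    condition: str


@dataclass
class Event:
    day: int
    kind: str


@dataclass
class Note:
    patient_id: str
    day: int
    kind: str
    text: str


def generate_patient(rng: random.Random, idx: int) -> Patient:
    """Stage 1: structured patient data (placeholder values, not from the paper)."""
    condition = rng.choice(["pneumonia", "hip fracture"])
    return Patient(f"P{idx:03d}", rng.randint(18, 90), condition)


def simulate_journey(rng: random.Random, n_events: int) -> List[Event]:
    """Stage 2: a semi-structured encounter sequence from admission to discharge."""
    kinds = ["ward_round", "nursing_note", "lab_review"]
    middle = [Event(day, rng.choice(kinds)) for day in range(1, n_events - 1)]
    return [Event(0, "admission"), *middle, Event(n_events - 1, "discharge")]


def write_note(patient: Patient, event: Event, llm: LLM) -> Note:
    """Stage 3: ask the model for free text grounded in the structured inputs."""
    prompt = (
        f"Write a {event.kind} note for a {patient.age}-year-old "
        f"with {patient.condition}, hospital day {event.day}."
    )
    return Note(patient.patient_id, event.day, event.kind, llm(prompt))


def validate(patient: Patient, note: Note) -> bool:
    """Post-generation check; a rule-based stand-in for an LLM-based validator."""
    return patient.condition in note.text


def run_pipeline(llm: LLM, n_patients: int, seed: int = 0) -> List[Note]:
    rng = random.Random(seed)
    notes: List[Note] = []
    for idx in range(n_patients):
        patient = generate_patient(rng, idx)
        journey = simulate_journey(rng, rng.randint(20, 50))
        for event in journey:
            note = write_note(patient, event, llm)
            if validate(patient, note):  # keep only notes that pass the check
                notes.append(note)
    return notes


def echo_llm(prompt: str) -> str:
    """Toy model that echoes the prompt, so the sketch runs without any API."""
    return "NOTE: " + prompt


notes = run_pipeline(echo_llm, n_patients=2)
print(len(notes), notes[0].kind, notes[-1].kind)
```

Replacing `echo_llm` with a real model call, or `validate` with a model-based check, would leave the other stages untouched, which is the property the modular design is meant to provide.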

What does the dataset contain?

The dataset contains 70 synthetic patients with 20–50 clinical notes each, covering a full hospital journey. According to the paper, it is released at multiple levels of validation so that users can choose a balance between realism and scalability for their use case.

Notes vary in writing style, structure, and clinical detail, reflecting the pipeline’s aim to capture variation across clinical documentation. The authors position the dataset for development and evaluation of clinical AI systems, listing use cases including summarisation tools, coding models, and decision support systems. The submission to arXiv includes the paper PDF and associated materials; the arXiv record gives the submission date as 25 Jun 2026.
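As a rough illustration of how a user might sanity-check the dataset shape described in this section, the Python sketch below counts notes per patient at one validation level and flags patients outside the stated 20–50 range. The record schema (`patient_id`, `validation_level`, `text`) and the level names `"basic"` and `"full"` are assumptions made for the example; the brief does not describe the actual file format or level labels.

```python
from collections import Counter
from typing import Dict, Iterable, List, TypedDict


class NoteRecord(TypedDict):
    # Hypothetical schema; the real dataset layout may differ.
    patient_id: str
    validation_level: str
    text: str


def notes_per_patient(records: Iterable[NoteRecord], level: str) -> Dict[str, int]:
    """Count notes per patient among records at the chosen validation level."""
    counts = Counter(
        r["patient_id"] for r in records if r["validation_level"] == level
    )
    return dict(counts)


def out_of_range(counts: Dict[str, int], low: int = 20, high: int = 50) -> List[str]:
    """Return patient IDs whose note count falls outside [low, high]."""
    return sorted(pid for pid, n in counts.items() if not low <= n <= high)


# Toy records: P001 has 20 notes at the "basic" level, P002 has only 5.
records: List[NoteRecord] = (
    [{"patient_id": "P001", "validation_level": "basic", "text": "..."}] * 20
    + [{"patient_id": "P002", "validation_level": "basic", "text": "..."}] * 5
    + [{"patient_id": "P002", "validation_level": "full", "text": "..."}] * 3
)

counts = notes_per_patient(records, level="basic")
print(counts)                # {'P001': 20, 'P002': 5}
print(out_of_range(counts))  # ['P002']
```

A check like this is mainly useful when comparing validation levels, since stricter validation could plausibly drop notes and change the per-patient counts.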

Why it matters

Synthetic clinical documentation is a practical route around patient privacy limits for developers and researchers who need realistic data to train and test models. By focusing on longitudinal consistency and multi-level validation, the pipeline addresses two common weaknesses of synthetic medical text: disconnected events across encounters and a lack of realistic stylistic variation. That combination makes the dataset well suited to tasks that require coherent histories, such as summarisation across multiple encounters or models that rely on temporal context.

The work also delegates specific roles to LLMs beyond raw text generation: the paper describes using models for validation and augmentation, not just note drafting. That approach treats LLMs as tools in a larger synthetic-data workflow rather than as single-step generators, which could influence how future synthetic clinical datasets are produced.

What to watch

Watch for external adoption of the dataset and pipeline in papers or toolkits focused on clinical summarisation, coding, or decision support, and for comparisons between models trained on these synthetic notes versus real-world clinical corpora. Also look for how groups use the multiple validation levels: whether they prioritise realism or scale for different evaluation tasks.

The paper and dataset are available via arXiv (arXiv:2606.26879) as of the 25 June 2026 submission.

Figure: Pipeline stages for generating longitudinal synthetic clinical notes

Written by The Brieftide · Source: arXiv
