Multimodal AI · June 26, 2026 · 4 min read

Generative Retrieval MO-DiT+HPPO: arXiv paper and results

MO-DiT+HPPO pairs a Diffusion Transformer with metric-ordered sequence training and hybrid-policy preference optimization for continuous generative retrieval.

The Brieftide · June 26, 2026

TL;DR

01 MO-DiT+HPPO pairs a Diffusion Transformer with metric-ordered sequence training and hybrid-policy preference optimization for continuous generative retrieval.
02 The work targets what the authors call "pattern-preserving attribute retrieval," where returned items must both satisfy a target attribute and stay within a fine-grained seed pattern.
03 MO-DiT+HPPO is a staged pipeline that reads sequences of item embeddings and generates query embeddings for nearest-neighbor search.

Chenghao Liu and 10 co-authors posted a paper to arXiv on 25 Jun 2026 (arXiv:2606.26899) that introduces MO-DiT+HPPO, a staged framework for continuous generative retrieval built around a Diffusion Transformer and a hybrid preference-optimization procedure. The work targets what the authors call "pattern-preserving attribute retrieval," where returned items must both satisfy a target attribute and stay within a fine-grained seed pattern.

What is MO-DiT+HPPO and how does it work?

MO-DiT+HPPO is a staged continuous generative retrieval pipeline that reads sequences of item embeddings and generates query embeddings for nearest-neighbor search. The framework includes raw-sequence pretraining, multi-domain metric-ordered continuation pretraining, tail-centroid fine-tuning, and a final Hybrid-Policy Preference Optimization (HPPO) stage.
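To make the "generate a query embedding, then run nearest-neighbor search" step concrete, here is a minimal sketch in plain Python. It is not the paper's code: `placeholder_generator` is a hypothetical stand-in for the trained model that simply averages the seed embeddings, and the catalog, item ids, and cosine similarity are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, catalog, k):
    """Return the ids of the k catalog items most similar to the query."""
    ranked = sorted(
        catalog,
        key=lambda item_id: cosine(query, catalog[item_id]),
        reverse=True,
    )
    return ranked[:k]


def placeholder_generator(seed_embeddings):
    """Hypothetical stand-in for the generative model: averages the seeds."""
    dim = len(seed_embeddings[0])
    count = len(seed_embeddings)
    return [sum(e[i] for e in seed_embeddings) / count for i in range(dim)]


# Toy catalog of item embeddings (made-up values).
catalog = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.7, 0.7],
}

seeds = [catalog["a"], catalog["b"]]
query = placeholder_generator(seeds)  # query embedding in item space
result = top_k(query, catalog, k=2)   # nearest neighbors of that query
```

In the framework the article describes, the averaging function would be replaced by the trained model's output; the search step afterwards stays an ordinary nearest-neighbor lookup.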

Metric-ordered training converts sparse online retrieval labels into in-pattern trajectories ordered from low to high predicted attribute density, teaching a single model the metric-improvement direction across domains. HPPO aligns the generated query distribution with the online objective by labeling a hybrid candidate pool with the online intersection metric and applying reference-anchored preference optimization. A Pareto pair filter keeps only winner pairs that do not lower same-pattern purity, aiming to raise the attribute metric without sacrificing pattern fidelity.
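The metric-ordering idea can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the authors' procedure: the field names `pattern` and `density`, the `min_len` cutoff, and the toy values are all hypothetical, and `density` stands in for the score a metric predictor would assign.

```python
from collections import defaultdict


def metric_ordered_trajectories(items, min_len=2):
    """Group items by pattern and order each group from low to high density.

    Each item is a dict with hypothetical keys: "id", "pattern", "density".
    Patterns with fewer than min_len items are skipped because a single
    item gives no direction to learn from.
    """
    by_pattern = defaultdict(list)
    for item in items:
        by_pattern[item["pattern"]].append(item)

    trajectories = {}
    for pattern, members in by_pattern.items():
        if len(members) < min_len:
            continue
        ordered = sorted(members, key=lambda it: it["density"])
        trajectories[pattern] = [it["id"] for it in ordered]
    return trajectories


# Toy items with made-up predicted attribute densities.
items = [
    {"id": "a", "pattern": "p1", "density": 0.8},
    {"id": "b", "pattern": "p1", "density": 0.2},
    {"id": "c", "pattern": "p1", "density": 0.5},
    {"id": "d", "pattern": "p2", "density": 0.9},
    {"id": "e", "pattern": "p2", "density": 0.1},
    {"id": "f", "pattern": "p3", "density": 0.4},
]

trajectories = metric_ordered_trajectories(items)
```

Every trajectory stays inside one pattern and ends at its highest-density item, so a sequence model trained to continue these sequences sees the same "toward higher density" direction in every domain.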

How did the paper evaluate performance and what were the results?

The authors evaluated MO-DiT+HPPO across four attribute domains under item- and pattern-holdout protocols, giving eight domain-split cells, and measured improvement in the intersection metric. Metric-ordered DiT improved the intersection metric over a pretrained generative retriever, and HPPO improved it further, producing significant gains on seven of the eight cells and a marginal tie on the hardest split.
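The article does not spell out how the intersection metric is computed. As a rough illustration only, the sketch below assumes it is the share of retrieved items that both belong to the seed pattern and carry the target attribute, and that same-pattern purity is the share that belong to the seed pattern. The function name, field names, and toy data are hypothetical.

```python
def purity_and_intersection(retrieved, seed_pattern, target_attribute):
    """Return (same-pattern purity, intersection score) for a result list.

    Assumed definitions, not taken from the paper:
      purity       = fraction of results in the seed pattern
      intersection = fraction of results in the seed pattern AND
                     carrying the target attribute
    """
    if not retrieved:
        return 0.0, 0.0
    in_pattern = [it for it in retrieved if it["pattern"] == seed_pattern]
    both = [it for it in in_pattern if target_attribute in it["attributes"]]
    total = len(retrieved)
    return len(in_pattern) / total, len(both) / total


# Toy retrieval result (made-up items).
retrieved = [
    {"id": "a", "pattern": "p1", "attributes": {"red"}},
    {"id": "b", "pattern": "p1", "attributes": set()},
    {"id": "c", "pattern": "p2", "attributes": {"red"}},
    {"id": "d", "pattern": "p1", "attributes": {"red", "large"}},
]

purity, intersection = purity_and_intersection(retrieved, "p1", "red")
```

Under these assumed definitions, item "c" has the attribute but drifts out of pattern, and item "b" stays in pattern but lacks the attribute, so neither counts toward the intersection score.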

The paper also reports ablations and validations to trace the source of gains: metric-predictor validation, order ablations, CPT/SFT comparisons, and a candidate-policy ablation. Those experiments, the authors say, show where the improvements come from within the staged training and HPPO pipeline.

Why does this matter?

Pattern-preserving attribute retrieval describes a common production need where naive averaging or global attribute search fails: averaging seeds preserves pattern but yields low attribute scores, while global attribute retrieval drifts to unrelated patterns. MO-DiT+HPPO directly addresses the two-way tension by training a generative retriever to move along in-pattern trajectories toward higher attribute density and then aligning generation with the actual online metric. If reproduced and adopted, that approach could change how systems balance pattern fidelity against attribute targeting in recommendation and retrieval settings.

What to watch next?

Look for code and data releases linked from the arXiv entry or the authors' pages, plus replication on external datasets and live A/B evaluations that measure the online intersection metric. The paper lists several internal ablations; confirming those in independent implementations will show whether the seven-of-eight experimental improvements generalize beyond the reported domains.

Paper and provenance: "Generative Retrieval via Diffusion Transformer with Metric-Ordered Sequence Training and Hybrid-Policy Preference Optimization," Chenghao Liu, Yu Zhang, Zhongtao Jiang, Kun Xu, Zhenwei An, Renzhi Wang, Zhao Wang, Jiachen Zhang, Yuxiao Zhang, Kun Xu, Songfang Huang, arXiv:2606.26899, submitted 25 Jun 2026. DOI: https://doi.org/10.48550/arXiv.2606.26899.

MO-DiT+HPPO training and optimization stages

01
Raw-sequence pretraining
Initial training on raw item-embedding sequences to teach a generative retriever to read and produce embeddings.
02
Multi-domain metric-ordered continuation pretraining
Convert sparse retrieval labels into in-pattern trajectories ordered from low to high predicted attribute density.
03
Tail-centroid fine-tuning
Fine-tune the model to focus on tail-centroid representations within the target pattern.
04
Hybrid-Policy Preference Optimization (HPPO)
Label a hybrid candidate pool with the online intersection metric, apply reference-anchored preference optimization, and filter with a Pareto pair filter to preserve same-pattern purity.
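A minimal sketch of the last stage follows, assuming each candidate query has already been scored. The Pareto rule here keeps a (winner, loser) pair only when the winner has a strictly higher metric and no lower purity. The article does not give the loss, so a DPO-style objective is swapped in as one familiar form of reference-anchored preference optimization; the log-probability inputs, `beta`, and all field names are hypothetical.

```python
import math
from itertools import permutations


def pareto_pairs(candidates):
    """Keep (winner, loser) id pairs where the winner raises the metric
    without lowering same-pattern purity."""
    pairs = []
    for winner, loser in permutations(candidates, 2):
        higher_metric = winner["metric"] > loser["metric"]
        no_purity_loss = winner["purity"] >= loser["purity"]
        if higher_metric and no_purity_loss:
            pairs.append((winner["id"], loser["id"]))
    return pairs


def reference_anchored_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss (an assumption, not the paper's stated objective).

    Computes -log(sigmoid(margin)), where margin measures how much more the
    policy prefers the winner over the loser than the reference model does.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy candidate pool with made-up scores.
candidates = [
    {"id": "q1", "metric": 0.40, "purity": 0.90},
    {"id": "q2", "metric": 0.55, "purity": 0.92},
    {"id": "q3", "metric": 0.70, "purity": 0.60},
]

pairs = pareto_pairs(candidates)
neutral = reference_anchored_loss(-1.0, -1.0, -1.0, -1.0)
improved = reference_anchored_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)
```

In this toy pool, q3 has the best metric but the worst purity, so it never appears as a winner. That is the behavior the filter is meant to enforce: metric gains bought by drifting out of pattern produce no training signal.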

Written by The Brieftide · Source: arXiv
