UltraFlux: a native 4K text-to-image DiT trained on MultiAspect-4K-1M
A Flux-based DiT trained natively at 4096 on MultiAspect-4K-1M (1M images), using a new Resonance 2D RoPE positional encoding.
TL;DR
- A Flux-based DiT trained natively at 4096 on MultiAspect-4K-1M (1M images), using a new Resonance 2D RoPE positional encoding.
- UltraFlux, a Flux-based diffusion transformer, is trained natively at 4K on a new 1M-image corpus and targets high-quality text-to-image generation across diverse aspect ratios.
- The dataset side provides resolution- and AR-aware sampling via MultiAspect-4K-1M, enabling training and evaluation across diverse aspect ratios.
UltraFlux, a Flux-based diffusion transformer, is trained natively at 4K on a new 1M-image corpus and targets high-quality text-to-image generation across diverse aspect ratios. The paper, submitted 22 Nov 2025 by Tian Ye, Song Fei and Lei Zhu, introduces the MultiAspect-4K-1M dataset and a set of model and objective changes that the authors say raise fidelity, aesthetics and alignment at 4096.
What is UltraFlux?
UltraFlux is a Flux-based DiT trained natively at 4K (4096) on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. The project couples dataset design and model changes specifically to address failures the authors attribute to positional encoding, VAE compression, and optimization when scaling DiTs to native 4K across wide, square and tall aspect ratios.
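As a rough illustration of what resolution- and AR-aware sampling can look like, the sketch below groups image records into aspect-ratio buckets and draws each batch from a single bucket. It is not the authors' pipeline: the bucket ratios, the `min_pixels` floor, the record fields (`width`, `height`) and all function names are assumptions made for this example.

```python
import math
import random
from collections import defaultdict

# Hypothetical aspect-ratio buckets (width / height); not taken from the paper.
AR_BUCKETS = {"tall": 9 / 16, "square": 1.0, "wide": 16 / 9}


def nearest_bucket(width, height):
    """Return the name of the bucket whose ratio is closest in log space."""
    ratio = width / height
    return min(AR_BUCKETS, key=lambda name: abs(math.log(ratio / AR_BUCKETS[name])))


def build_index(records, min_pixels=3840 * 2160):
    """Group records by bucket, dropping any below an assumed pixel-count floor."""
    index = defaultdict(list)
    for rec in records:
        if rec["width"] * rec["height"] >= min_pixels:
            index[nearest_bucket(rec["width"], rec["height"])].append(rec)
    return index


def sample_batch(index, batch_size, rng):
    """Pick one bucket uniformly, then draw the whole batch from that bucket."""
    bucket = rng.choice(sorted(index))
    return bucket, [rng.choice(index[bucket]) for _ in range(batch_size)]


if __name__ == "__main__":
    records = [
        {"width": 4096, "height": 4096},
        {"width": 5120, "height": 2880},
        {"width": 2160, "height": 3840},
        {"width": 1024, "height": 1024},  # below the floor, so it is dropped
    ]
    index = build_index(records)
    print(sample_batch(index, 2, random.Random(0)))
```

Drawing every image in a batch from one bucket keeps tensor shapes uniform within the batch, which is a common reason for bucketing in multi-aspect-ratio training. Whether MultiAspect-4K-1M is sampled this way is not stated in the source.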
How does UltraFlux work?
UltraFlux combines four concrete model and training components with its dataset: (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy concentrating high-aesthetic supervision on high-noise steps governed by the model prior. The dataset side provides resolution- and AR-aware sampling via MultiAspect-4K-1M, enabling training and evaluation across diverse aspect ratios.
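To make the positional-encoding component more concrete, here is a minimal sketch of plain axial 2D RoPE, the general family of techniques that the name Resonance 2D RoPE points to. It does not reproduce the paper's Resonance or YaRN details: the `scale` argument is only a uniform position-interpolation stand-in, whereas YaRN-style methods adjust frequencies per band, and every function name here is hypothetical.

```python
import math


def rope_angles(pos, dim, base=10000.0, scale=1.0):
    """Rotation angles for one axis: (pos / scale) times each inverse frequency."""
    half = dim // 2
    return [(pos / scale) * base ** (-2.0 * i / dim) for i in range(half)]


def rotate_pairs(vec, angles):
    """Rotate consecutive (even, odd) channel pairs of vec by the given angles."""
    out = []
    for i, a in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(a) - y * math.sin(a),
                x * math.sin(a) + y * math.cos(a)]
    return out


def rope_2d(vec, row, col, scale=1.0):
    """Axial 2D RoPE: the first half of the channels encodes the row index,
    the second half the column index. len(vec) must be divisible by 4."""
    half = len(vec) // 2
    return (rotate_pairs(vec[:half], rope_angles(row, half, scale=scale))
            + rotate_pairs(vec[half:], rope_angles(col, half, scale=scale)))


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]
    k = [0.9, 0.8, -0.7, 0.6, 0.5, -0.4, 0.3, 0.2]
    # Both pairs of positions share the same (row, col) offset of (2, 3).
    print(dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2)))
    print(dot(rope_2d(q, 10, 20), rope_2d(k, 8, 17)))
```

The two printed values agree up to floating-point error, because the rotated query-key dot product depends only on the row and column offsets between the two positions. That relative-position property is what makes the choice of frequencies matter when the token grid grows to 4K sizes and varies in aspect ratio.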
What evidence do the authors present for quality gains?
On the Aesthetic-Eval at 4096 benchmark and in multi-AR 4K settings, the paper states that UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics. The authors also report that, when paired with an LLM prompt refiner, UltraFlux matches or surpasses the proprietary Seedream 4.0 on those measures. The dataset scale cited in the paper is a 1M-image 4K corpus.
Why it matters
Training diffusion transformers natively at 4096 exposes interacting failure modes the authors say cannot be solved in isolation: positional encoding, VAE reconstruction and optimization all degrade quality at 4K and across aspect ratios. UltraFlux makes those interactions explicit and proposes matched dataset and model fixes. That approach matters for anyone needing native 4K generation across nonstandard aspect ratios, because it targets the three core bottlenecks the paper identifies rather than relying on upscaling or ad hoc fixes.
What to watch
Look for the project's code and project page linked from the submission, and for external reproduction on public 4K benchmarks. The next confirmatory signals will be independent runs on Aesthetic-Eval at 4096 and community comparisons to proprietary Seedream 4.0 using the same LLM prompt refiner setup.
References and provenance: paper "UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios", arXiv:2511.18050, submitted 22 Nov 2025, authors Tian Ye, Song Fei, Lei Zhu. Key dataset name: MultiAspect-4K-1M (1M-image 4K corpus).
| Item | Resolution | Training data | Method | Reported result |
|---|---|---|---|---|
| UltraFlux | native 4K (4096) | MultiAspect-4K-1M (1M-image 4K corpus with multi-AR coverage, bilingual captions, VLM/IQA metadata) | Resonance 2D RoPE with YaRN; VAE post-training; SNR-Aware Huber Wavelet objective; Stage-wise Aesthetic Curriculum Learning | Outperforms strong open-source baselines on fidelity, aesthetic, and alignment; with an LLM prompt refiner matches or surpasses proprietary Seedream 4.0 on Aesthetic-Eval at 4096 |
| Strong open-source baselines | various | various | baseline DiT approaches (not specified in paper excerpt) | Reportedly outperformed by UltraFlux on Aesthetic-Eval at 4096 and multi-AR 4K settings |
| Seedream 4.0 (proprietary) | proprietary (not specified) | proprietary (not specified) | proprietary model | UltraFlux with LLM prompt refiner matches or surpasses Seedream 4.0 (per paper) |
Written by The Brieftide · Source: arXiv