Cross-Modal Representation Alignment for Time-to-Event Modeling
A foundation model framework aligns CT imaging and longitudinal EHR with four fusion strategies.
TL;DR
- A foundation model framework aligns CT imaging and longitudinal EHR with four fusion strategies.
- Across tasks and institutions, multimodal fusion improved predictive performance measured by concordance index.
- The paper frames these findings as a systematic analysis of how different fusion approaches behave under modality imbalance, and positions the framework as generalizable across tasks and institutions.
Zhemin Zhang and nine co-authors submitted "Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling" on 13 Jun 2026, presenting a foundation model-driven framework to align CT imaging and longitudinal electronic health record (EHR) data for time-to-event (TTE) prediction.
The paper evaluates two clinical TTE tasks — pulmonary embolism mortality and cardiovascular disease outcomes — on large multi-institutional cohorts and tests four principled fusion strategies to bring image and EHR representations into a shared latent space.
Methods and experimental setup
The authors encode CT and EHR modalities independently using domain-specific foundation models, then align the resulting representations through four fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. They apply the framework to two tasks with explicit cohort sizes: pulmonary embolism (PE) mortality (train N=3,099; internal N=1,098; external N=435) and cardiovascular disease (CVD) outcomes (train N=2,951; internal N=837; external N=682).
The paper compares unimodal baselines against multimodal fusion, including settings where the modalities "contribute comparably," and tests variants of contrastive multimodal fusion, among them experiments that use CLMBR representations for the EHR modality.
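To make the idea of contrastive alignment concrete, here is a minimal sketch of one common formulation, a symmetric InfoNCE loss of the kind used in CLIP-style training. This brief does not say which contrastive objective the paper uses, so the loss form, the function name `contrastive_alignment_loss`, and the `temperature` value are all assumptions for illustration. Each row of `image_emb` and `ehr_emb` is taken to be the embedding of the same patient from the two modalities.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_alignment_loss(image_emb, ehr_emb, temperature=0.1):
    # Hypothetical helper: symmetric InfoNCE over a batch of paired embeddings.
    # Matching image/EHR pairs (same row index) are pulled together; all other
    # pairings in the batch act as negatives.
    img = [_normalize(v) for v in image_emb]
    ehr = [_normalize(v) for v in ehr_emb]
    logits = [
        [sum(a * b for a, b in zip(u, w)) / temperature for w in ehr]
        for u in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # EHR-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    image = [[1.0, 0.0], [0.0, 1.0]]
    aligned = contrastive_alignment_loss(image, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_alignment_loss(image, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

In this toy batch the loss is near zero when each patient's two embeddings point the same way and large when the pairing is swapped, which is the pressure that moves the two encoders toward a shared latent space.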
Results
Across tasks and institutions, multimodal fusion improved predictive performance as measured by the concordance index. According to the authors, fusion "consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably."
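For readers unfamiliar with the metric, the following is a minimal sketch of Harrell's concordance index, the standard pairwise definition for right-censored data. It is an illustration only: the brief does not state which C-index variant the paper computes, the function name and toy data are hypothetical, and pairs with tied event times are simply skipped here as a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    # Fraction of comparable patient pairs in which the patient with the
    # shorter observed time was assigned the higher predicted risk.
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Comparable only if the times differ and the earlier time is an
        # observed event (not a censoring time).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the third patient is censored (event flag 0).
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.4, 0.7, 0.1]
    print(concordance_index(times, events, risks))
```

On the toy data, five pairs are comparable and four are ranked correctly, giving 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.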
The authors report that contrastive multimodal fusion, particularly when paired with CLMBR representations, provided the most consistent and statistically robust improvements, with the effect most pronounced for PE mortality prediction. For the major adverse cardiovascular events (MACE) experiments, cross-attention using one-hot inputs achieved the highest internal performance, while image-guided co-attention delivered the best external performance.
The paper frames these findings as a systematic analysis of how different fusion approaches behave under modality imbalance, and positions the framework as generalizable across tasks and institutions.
Why it matters
Clinical TTE prediction commonly combines imaging and longitudinal EHR data, but modality imbalance and distribution shift undermine simple fusion recipes. This paper shows that fusion is not uniformly beneficial: the choice of alignment strategy changes which cohort and which external site see gains. That implies model designers must evaluate fusion methods against the specific task and data balance they face rather than rely on a single fusion default.
The authors' use of foundation models for modality-specific encoding and their multi-institutional evaluation make the claim of cross-site generalization testable rather than theoretical. The 1.5–5.4% concordance improvements quantify the potential upside of task-aware multimodal alignment in TTE settings.
What to watch
Watch for follow-up work that publishes per-method concordance numbers across internal and external splits and for replication on different imaging modalities or EHR representations. The next critical milestone will be demonstrations of consistent external gains across more institutions and clinical settings, which would validate the paper's claim that task-aware multimodal alignment supports scalable clinical deployment.
Written by The Brieftide · Source: arXiv