Multimodal AI · June 16, 2026 · 4 min read

Cross-Modal Representation Alignment for Time-to-Event Modeling

A foundation model framework aligns CT imaging and longitudinal EHR with four fusion strategies.

The Brieftide · June 16, 2026

TL;DR

01 A foundation model framework aligns CT imaging and longitudinal EHR with four fusion strategies.
02 Across tasks and institutions, multimodal fusion improved predictive performance measured by concordance index.
03 The paper frames these findings as a systematic analysis of how different fusion approaches behave under modality imbalance, and positions the framework as generalizable across tasks and institutions.

Zhemin Zhang and nine co-authors submitted "Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling" on 13 Jun 2026, presenting a foundation model-driven framework to align CT imaging and longitudinal electronic health record (EHR) data for time-to-event (TTE) prediction.

The paper evaluates two clinical TTE tasks — pulmonary embolism mortality and cardiovascular disease outcomes — on large multi-institutional cohorts and tests four principled fusion strategies to bring image and EHR representations into a shared latent space.

Methods and experimental setup

The authors encode CT and EHR modalities independently using domain-specific foundation models, then align the resulting representations through four fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. They apply the framework to two tasks with explicit cohort sizes: pulmonary embolism (PE) mortality (train N=3,099; internal N=1,098; external N=435) and cardiovascular disease (CVD) outcomes (train N=2,951; internal N=837; external N=682).
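To make the idea of contrastive alignment concrete, here is a minimal, dependency-free sketch of a symmetric InfoNCE-style loss of the kind popularized by CLIP. It is an illustration, not the authors' implementation: the function names, the temperature of 0.07, and the assumption that both modalities have already been projected to the same dimension are all hypothetical choices made for this example.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_alignment_loss(img_emb, ehr_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be the same patient."""
    img = [_normalize(v) for v in img_emb]
    ehr = [_normalize(v) for v in ehr_emb]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(u, v)) / temperature for v in ehr]
        for u in img
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-EHR and EHR-to-image directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))

# Toy batch of two patients with 2-dimensional embeddings.
img_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(img_batch, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_alignment_loss(img_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when each patient's CT embedding is closest to that same patient's EHR embedding and large when the pairing is scrambled, which is the property that pulls the two modalities toward a shared latent space.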

The paper compares unimodal baselines against multimodal fusion when modalities "contribute comparably," and tests variants of contrastive multimodal fusion, including experiments using CLMBR representations for the EHR modality.

Results

Across tasks and institutions, multimodal fusion improved predictive performance measured by concordance index. Fusion "consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably."
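For readers unfamiliar with the metric, the concordance index can be sketched as the share of comparable patient pairs that a model ranks in the right order. The version below is a simplified Harrell's C in plain Python; the function name and the toy data are made up for illustration, pairs with identical observed times are simply skipped, and the paper's exact estimator may differ.

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Simplified Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times  -- observed follow-up time per patient
    events -- 1 if the event was observed, 0 if the patient was censored
    risks  -- model risk score (higher means an earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `first` has the shorter observed time.
        first, second = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the
        # earlier patient actually had the event (was not censored).
        if times[first] == times[second] or not events[first]:
            continue
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort of four patients; the third is censored.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.7, 0.8, 0.1],
)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so gains of a few points on this scale can be meaningful.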

The authors report that contrastive multimodal fusion, particularly when paired with CLMBR representations, provided the most consistent and statistically robust improvements, with a particular impact on PE mortality prediction. In the major adverse cardiovascular events (MACE) experiments, cross-attention using one-hot inputs achieved the highest internal performance, while image-guided co-attention delivered the best external performance.
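As a rough illustration of what cross-attention between modalities involves, the sketch below implements single-head scaled dot-product attention in plain Python, with tokens from one modality acting as queries over tokens from the other. Everything here is an assumption for illustration: the function names, the single head, the identity projection matrices, and the toy token values are hypothetical, and this summary does not specify which modality supplies the queries in the paper's variants.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, w_q, w_k, w_v):
    """Let each query token attend over the context tokens of the other modality."""
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    # Each row of weights is a distribution over the context tokens.
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical CT tokens
ehr_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical EHR tokens
fused, attn = cross_attention(image_tokens, ehr_tokens, identity, identity, identity)
print(len(fused), len(attn[0]))
```

Under this reading, co-attention would amount to running the same operation in both directions, once with image tokens as queries and once with EHR tokens as queries, and combining the two outputs.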

The paper frames these findings as a systematic analysis of how different fusion approaches behave under modality imbalance, and positions the framework as generalizable across tasks and institutions.

Why it matters

Clinical TTE prediction commonly combines imaging and longitudinal EHR data, but modality imbalance and distribution shift undermine simple fusion recipes. This paper shows fusion is not uniformly beneficial: the best-performing alignment strategy differs by task and between internal and external cohorts. That implies model designers must evaluate fusion methods against the specific task and data balance they face, not rely on a single fusion default.

The authors' use of foundation models for modality-specific encoding and their multi-institutional evaluation make the claim of cross-site generalization testable rather than theoretical. The 1.5–5.4% concordance improvements quantify the potential upside of task-aware multimodal alignment in TTE settings.

What to watch

Watch for follow-up work that publishes per-method concordance numbers across internal and external splits and for replication on different imaging modalities or EHR representations. The next critical milestone will be demonstrations of consistent external gains across more institutions and clinical settings, which would validate the paper's claim that task-aware multimodal alignment supports scalable clinical deployment.

Written by The Brieftide · Source: arXiv