[2602.19367] Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces
Summary
This paper asks whether representations from time series, vision, and language models converge to a shared structure, characterizing the geometry of their embedding spaces and the conditions under which post-hoc contrastive alignment succeeds, with implications for multimodal systems.
Why It Matters
Understanding how representations from different data modalities align is crucial for advancing multimodal AI systems. This research adds to the foundational knowledge of how diverse forms of data can be integrated into a shared representation space, which is essential for applications that must synthesize heterogeneous information types.
Key Takeaways
- The Platonic Representation Hypothesis suggests that learned representations converge across modalities, but this study finds that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling (see the alignment-measurement sketch after this list).
- Post-hoc alignment, via projection heads trained with contrastive learning over frozen encoders, improves cross-modal alignment; in particular, time series align more strongly with visual data than with text.
- Alignment in contrastive representation spaces improves with model size, but the improvement is asymmetric across modality pairs.
- Richer textual descriptions enhance alignment only up to a threshold, beyond which no further gains are observed.
- The findings have implications for developing multimodal systems that incorporate non-conventional data types.
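To make the near-orthogonality finding concrete, below is a minimal sketch of one standard way to quantify how much geometric structure two frozen representation spaces share: linear Centered Kernel Alignment (CKA). The excerpt does not name the paper's alignment metric, so the metric choice, the encoder names, and the paired-batch setup here are assumptions for illustration only.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between paired representations
    x (n x d1) and y (n x d2). Values near 0 indicate near-orthogonal
    (unrelated) geometry; values near 1 indicate strongly shared structure."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature column
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.norm(x.T @ y, p="fro") ** 2   # HSIC-style cross term
    self_x = torch.norm(x.T @ x, p="fro")
    self_y = torch.norm(y.T @ y, p="fro")
    return cross / (self_x * self_y)

# Hypothetical usage with embeddings of the *same* n samples from two
# frozen encoders (names are placeholders, not the paper's models):
#   z_ts  = ts_encoder(series_batch)    # (n, d_ts)
#   z_img = vision_encoder(img_batch)   # (n, d_img)
#   print(linear_cka(z_ts, z_img))      # low score ~ near-orthogonal spaces
```

CKA is convenient here because it compares spaces of different dimensionality and is invariant to rotations and isotropic scaling, so it isolates shared geometry rather than superficial coordinate choices.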
Paper Details
arXiv:2602.19367 [cs]
Subjects: Computer Science > Artificial Intelligence
Submitted on 22 Feb 2026
Title: Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces
Authors: Pratham Yashwante, Rose Yu
Abstract: The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effectiv...
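The abstract describes training projection heads over frozen encoders with contrastive learning. Below is a minimal PyTorch sketch of that general recipe under common assumptions: a CLIP-style symmetric InfoNCE loss, unit-normalized embeddings, and a small MLP head per modality. The head architecture, dimensions, and temperature are illustrative defaults, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small trainable MLP mapping a frozen encoder's output into a
    shared embedding space (dimensions are illustrative)."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: a[i] and b[i] form a positive pair; every other
    row in the batch serves as a negative."""
    logits = a @ b.T / temperature                      # (n, n) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched indices
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Hypothetical training step: encoders stay frozen, only heads update.
head_ts, head_img = ProjectionHead(512), ProjectionHead(768)
z_ts, z_img = torch.randn(32, 512), torch.randn(32, 768)  # stand-in features
loss = info_nce(head_ts(z_ts), head_img(z_img))
loss.backward()  # gradients flow into the projection heads only
```

Because the encoders are frozen, only the lightweight heads are optimized, which is what makes this a post-hoc alignment method: the original unimodal representations are left untouched, and the heads learn a rotation-plus-reshaping of each space into a common contrastive geometry.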