[2511.05541] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Summary
The paper introduces Temporal Sparse Autoencoders (T-SAEs), which improve the interpretability of language models by exploiting the sequential nature of language to recover coherent semantic concepts.
Why It Matters
Interpretability is crucial for understanding model decisions. T-SAEs address a limitation of existing dictionary-learning methods by incorporating the temporal structure of language, potentially making the internal representations of language models more transparent.
Key Takeaways
- T-SAEs improve the interpretability of language models by leveraging temporal structures.
- The model disentangles semantic from syntactic features in a self-supervised manner.
- T-SAEs recover smoother and more coherent semantic concepts without explicit semantic signals.
- The approach shows promise across multiple datasets and models, enhancing unsupervised interpretability.
- This research opens new pathways for understanding AI decision-making processes.
Computer Science > Computation and Language
arXiv:2511.05541 (cs)
[Submitted on 30 Oct 2025 (v1), last revised 25 Feb 2026 (this version, v2)]
Authors: Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon
Abstract: Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they often only recover token-specific, noisy, or highly local concepts. We argue that this limitation stems from neglecting the temporal structure of language, where semantic content typically evolves smoothly over sequences. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacri...
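To make the idea concrete, here is a minimal sketch of a temporal-consistency penalty of the kind the abstract describes: it rewards features whose activations change slowly across adjacent tokens. The function name, the split into "semantic" feature indices, and the exact squared-difference form are all assumptions for illustration, not the authors' implementation (the paper's actual objective is a contrastive loss).

```python
import numpy as np

def temporal_consistency_loss(z, semantic_idx):
    """Hypothetical smoothness penalty on SAE codes.

    z: (T, F) array of sparse feature activations for T consecutive tokens.
    semantic_idx: indices of features treated as high-level/semantic.

    Returns the mean squared change of semantic activations between
    adjacent tokens; minimizing it encourages those features to stay
    consistently active over a sequence, as described in the abstract.
    """
    sem = z[:, semantic_idx]            # (T, S) semantic activations only
    diffs = sem[1:] - sem[:-1]          # adjacent-token differences
    return float(np.mean(diffs ** 2))   # smoothness penalty

# Toy comparison: slowly varying vs. rapidly flickering activations.
t = np.linspace(0, 1, 50)
smooth = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
noisy = np.random.default_rng(0).standard_normal(smooth.shape)
print(temporal_consistency_loss(smooth, [0, 1]))  # small
print(temporal_consistency_loss(noisy, [0, 1]))   # large
```

In a full T-SAE, a term like this would be added to the usual reconstruction and sparsity losses, applied only to the designated semantic features so that syntactic, token-local features remain free to change every token.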