[2505.03530] A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
Computer Science > Machine Learning

arXiv:2505.03530 (cs)

[Submitted on 6 May 2025 (v1), last revised 5 Apr 2026 (this version, v3)]

Title: A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders

Authors: Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy

Abstract: Understanding how generative models represent and transform data is a foundational problem in deep learning interpretability. While mechanistic interpretability of discriminative architectures has yielded substantial insights, relatively little work has addressed variational autoencoders (VAEs). This paper presents the first general-purpose multi-level causal intervention framework for mechanistic interpretability of VAEs. The framework comprises four manipulation types: input manipulation, latent-space perturbation, activation patching, and causal mediation analysis. We also define three new quantitative metrics capturing properties not measured by existing disentanglement metrics alone: Causal Effect Strength (CES), intervention specificity, and circuit modularity. We conduct the largest empirical study to date of VAE causal mechanisms across six architectures (standard VAE, beta-VAE, FactorVAE, beta-TC-VAE, DIP-VAE-II, and VQ-VAE) and five benchmarks (dSprites, 3DShapes, MPI3D, CelebA, and...
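The abstract names latent-space perturbation as one of the four intervention types and Causal Effect Strength (CES) as one of the proposed metrics. The paper's exact definitions are not given on this page, so the sketch below is only a minimal illustration of the general idea, not the authors' method: a do-style intervention clamps one latent coordinate to a fixed value before decoding, and a hypothetical CES proxy measures the mean absolute change in the decoder's output. The toy linear `decoder`, the function names, and the effect measure are all assumptions for illustration.

```python
import numpy as np

def decoder(z, W):
    # Toy stand-in for a VAE decoder: maps a latent vector z to an output vector.
    return np.tanh(W @ z)

def latent_intervention(z, dim, value):
    # Do-style latent-space perturbation: clamp one latent coordinate
    # to a fixed value, leaving all other coordinates unchanged.
    z_int = z.copy()
    z_int[dim] = value
    return z_int

def causal_effect_strength(z, W, dim, value):
    # Hypothetical CES proxy (not the paper's definition): mean absolute
    # change in decoder output caused by intervening on one latent dimension.
    base = decoder(z, W)
    intervened = decoder(latent_intervention(z, dim, value), W)
    return float(np.mean(np.abs(intervened - base)))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # toy decoder weights
z = rng.normal(size=4)        # a sampled latent code
ces = causal_effect_strength(z, W, dim=0, value=3.0)
print(ces)
```

A real implementation would decode with a trained VAE and average the effect over many samples; a dimension with large CES would then be a candidate causal handle for the corresponding generative factor.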