[2602.10058] Evaluating Disentangled Representations for Controllable Music Generation
Summary
This paper uses a probing-based framework to evaluate the disentangled representations that music audio models rely on for controllable synthesis. It finds inconsistencies between the intended and actual semantics of the embeddings.
Why It Matters
Understanding disentangled representations is crucial for advancing controllable music generation techniques. This research highlights the limitations of current methods, prompting a reevaluation of strategies to enhance the semantic clarity and usability of music generation models.
Key Takeaways
- Recent music generation approaches rely on disentangled representations, often labeled structure and timbre or local and global, to enable controllable synthesis.
- The evaluated models show inconsistencies between the intended and actual semantics of their embeddings.
- The study evaluates models using a probing-based framework across four axes: informativeness, equivariance, invariance, and disentanglement.
- Insights gained may inform future strategies for improving music generation.
- A re-examination of controllability approaches is necessary for better outcomes.
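As a rough illustration of what a probing-based evaluation can look like, the sketch below trains a small linear probe on each embedding and checks which attributes it can predict. It is not the paper's protocol: the embedding names (`structure`, `timbre`), the attributes (`pitch`, `instrument`), the synthetic data, and all function names are assumptions made for this example. High accuracy on the intended attribute suggests informativeness, while near-chance accuracy on the other attribute suggests the two factors are not entangled.

```python
import numpy as np


def fit_linear_probe(X, y, n_classes, l2=1e-3):
    """Fit a linear probe: ridge regression onto one-hot class targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add a bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Accuracy of the probe's argmax prediction."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))


def make_toy_embeddings(n=2000, dim=16, n_classes=4, seed=0):
    """Synthetic stand-in for model embeddings (hypothetical, for illustration).

    The 'structure' embedding depends only on pitch and the 'timbre'
    embedding only on instrument, i.e. an idealized disentangled model.
    """
    rng = np.random.default_rng(seed)
    pitch = rng.integers(0, n_classes, size=n)
    instrument = rng.integers(0, n_classes, size=n)
    pitch_dirs = 3.0 * rng.normal(size=(n_classes, dim))
    instr_dirs = 3.0 * rng.normal(size=(n_classes, dim))
    structure = pitch_dirs[pitch] + 0.3 * rng.normal(size=(n, dim))
    timbre = instr_dirs[instrument] + 0.3 * rng.normal(size=(n, dim))
    return structure, timbre, pitch, instrument


def probe_grid(embeddings, labels, n_classes, train_frac=0.7):
    """Probe every embedding for every attribute on a held-out split."""
    results = {}
    for emb_name, X in embeddings.items():
        n_train = int(train_frac * X.shape[0])
        for attr_name, y in labels.items():
            W = fit_linear_probe(X[:n_train], y[:n_train], n_classes)
            results[(emb_name, attr_name)] = probe_accuracy(
                W, X[n_train:], y[n_train:]
            )
    return results


if __name__ == "__main__":
    s, t, pitch, inst = make_toy_embeddings()
    grid = probe_grid(
        {"structure": s, "timbre": t},
        {"pitch": pitch, "instrument": inst},
        n_classes=4,
    )
    for key, acc in grid.items():
        print(key, round(acc, 3))
```

With real models, the toy embeddings would be replaced by embeddings extracted from audio. A mismatch between the intended and actual semantics would show up as off-diagonal accuracies well above chance.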
Computer Science > Sound
arXiv:2602.10058 (cs)
[Submitted on 10 Feb 2026 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: Evaluating Disentangled Representations for Controllable Music Generation
Authors: Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora
Abstract: Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of pr...
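The invariance and equivariance axes named in the abstract are assessed under controlled transformations. One way such checks could be set up is sketched below, under stated assumptions: the transformation is a hypothetical pitch shift in semitones, invariance is measured as cosine similarity before and after the transformation, and equivariance is measured as how well the shift amount can be linearly predicted from the change in the embedding. These definitions, the toy data, and the function names are illustrative choices, not the metrics used in the paper.

```python
import numpy as np


def invariance_score(z_orig, z_trans):
    """Mean cosine similarity between embeddings before and after a transform."""
    num = np.sum(z_orig * z_trans, axis=1)
    den = np.linalg.norm(z_orig, axis=1) * np.linalg.norm(z_trans, axis=1)
    return float(np.mean(num / den))


def equivariance_r2(z_orig, z_trans, params):
    """In-sample R^2 of a least-squares fit that predicts the transformation
    parameter from the embedding difference z_trans - z_orig."""
    D = z_trans - z_orig
    Db = np.hstack([D, np.ones((D.shape[0], 1))])  # add a bias column
    coef, *_ = np.linalg.lstsq(Db, params, rcond=None)
    pred = Db @ coef
    ss_res = np.sum((params - pred) ** 2)
    ss_tot = np.sum((params - params.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def make_toy_pairs(n=500, dim=16, seed=0):
    """Synthetic embedding pairs (hypothetical, for illustration only).

    z_inv ignores the pitch shift; z_equi moves along a fixed direction
    in proportion to the shift.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, dim))
    shift = rng.integers(-6, 7, size=n).astype(float)  # semitones
    direction = rng.normal(size=dim)
    z_inv = z + 0.01 * rng.normal(size=(n, dim))
    z_equi = z + shift[:, None] * direction + 0.05 * rng.normal(size=(n, dim))
    return z, z_inv, z_equi, shift


if __name__ == "__main__":
    z, z_inv, z_equi, shift = make_toy_pairs()
    print("invariance (invariant emb):", invariance_score(z, z_inv))
    print("invariance (equivariant emb):", invariance_score(z, z_equi))
    print("equivariance R^2 (equivariant emb):", equivariance_r2(z, z_equi, shift))
    print("equivariance R^2 (invariant emb):", equivariance_r2(z, z_inv, shift))
```

In this toy setup, an embedding intended to capture timbre would be expected to score high on invariance to pitch shifts, while one intended to capture structure would be expected to respond predictably to them. A held-out split would give a stricter equivariance estimate than the in-sample fit used here.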