[2602.21750] From Words to Amino Acids: Does the Curse of Depth Persist?
Summary
This paper investigates depth inefficiency in protein language models (PLMs), finding that later layers contribute less to output predictions, mirroring results reported for large language models (LLMs).
Why It Matters
Understanding depth inefficiency in PLMs is crucial for improving model architectures and training methods, which can enhance performance in protein engineering and design. This research builds on existing knowledge from LLMs, providing insights that could lead to more efficient AI models in bioinformatics.
Key Takeaways
- PLMs exhibit depth inefficiency, where deeper layers contribute less to predictions.
- The study analyzes six popular PLMs spanning three training objectives: autoregressive, masked, and diffusion.
- Findings suggest that later layers mainly refine outputs rather than add new information.
- Depth inefficiency becomes more pronounced as model depth increases.
- Results motivate future research on more efficient model architectures.
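To make the "later layers contribute less" finding concrete, here is a minimal toy sketch of one common way layer contributions can be quantified: the relative norm of each layer's update to the residual stream, ||h_{l+1} - h_l|| / ||h_l||. This is an illustrative assumption, not the paper's exact methodology; the decay schedule and dimensions below are invented for demonstration.

```python
import numpy as np

# Toy residual stream: each "layer" adds an update to the hidden state.
# We assume (for illustration) that deeper layers produce smaller updates,
# mimicking the depth-inefficiency pattern the paper describes.
rng = np.random.default_rng(0)
d, n_layers = 64, 12          # hypothetical hidden size and depth
h = rng.normal(size=d)         # initial hidden state
contributions = []

for layer in range(n_layers):
    scale = 1.0 / (1 + layer)              # assumed decay with depth
    update = scale * rng.normal(size=d)    # this layer's residual update
    # Relative contribution: how much this layer changes the hidden state.
    contributions.append(np.linalg.norm(update) / np.linalg.norm(h))
    h = h + update                          # residual connection

# Early layers dominate; later layers mostly refine the representation.
print([round(c, 3) for c in contributions])
```

Under this toy setup, the printed contributions shrink with depth, which is the qualitative signature the paper reports for PLMs.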
arXiv:2602.21750 [cs] (Computer Science > Machine Learning)
Submitted on 25 Feb 2026
Title: From Words to Amino Acids: Does the Curse of Depth Persist?
Authors: Aleena Siji, Amir Mohammad Karimi Mamaghan, Ferdinand Kapl, Tobias Höppe, Emmanouil Angelis, Andrea Dittadi, Maurice Brenner, Michael Heinzinger, Karl Henrik Johansson, Kaitlin Maile, Johannes von Oswald, Stefan Bauer
Abstract: Protein language models (PLMs) have become widely adopted as general-purpose models, demonstrating strong performance in protein engineering and de novo design. Like large language models (LLMs), they are typically trained as deep transformers with next-token or masked-token prediction objectives on massive sequence corpora and are scaled by increasing model depth. Recent work on autoregressive LLMs has identified the Curse of Depth: later layers contribute little to the final output predictions. These findings naturally raise the question of whether a similar depth inefficiency also appears in PLMs, where many widely used models are not autoregressive, and some are multimodal, accepting both protein sequence and structure as input. In this work, we present a depth analysis of six popular PLMs across model families and scales, spanning three training objectives, namely autoregressive, masked, and diffusion, and quantify how layer contributi...