[2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Summary
This article explores the relationship between vocabulary activation and self-referential processing in large language models, introducing the Pull Methodology to enhance introspective language generation.
Why It Matters
Understanding how language models process self-referential information is crucial for improving their interpretability and reliability. This research offers insights into the internal workings of models like Llama 3.1 and Qwen 2.5-32B, which could inform future AI development and applications in natural language processing.
Key Takeaways
- Self-referential vocabulary in models can reflect internal computation.
- The Pull Methodology enhances self-examination in language models.
- Different models can develop unique introspective vocabularies based on activation metrics.
- Activation dynamics are distinct between self-referential and descriptive processing.
- Findings may improve the reliability of self-reports in transformer models.
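The activation-dynamics signals behind these takeaways (autocorrelation and variability, per the abstract below) can be sketched as simple summary statistics over a scalar activation trace. The function names and the lag-1 formulation here are illustrative assumptions, not the paper's exact metrics:

```python
import numpy as np

def lag1_autocorr(trace):
    """Lag-1 autocorrelation of a 1-D activation trace (mean-centred).

    Values near 1 indicate slowly varying, persistent dynamics; values
    near 0 indicate noise-like dynamics.
    """
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def variability(trace):
    """Standard deviation of the trace: one simple variability measure."""
    return float(np.std(np.asarray(trace, dtype=float)))
```

A smoothly drifting trace scores high under `lag1_autocorr`, while a rapidly alternating one scores negative, which is the kind of contrast the paper correlates with "loop" versus "shimmer" vocabulary.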
Computer Science > Computation and Language
arXiv:2602.11358 (cs)
[Submitted on 11 Feb 2026 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Authors: Zachary Pedram Dadfar
Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in ...
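The abstract's direction-finding and steering steps can be sketched with a common difference-of-means approach. The paper's exact extraction method is not given here, so the functions below (`self_reference_direction`, `steer`) and the Gram-Schmidt orthogonalisation against a refusal direction are assumptions for illustration, chosen to match the orthogonality property the abstract reports:

```python
import numpy as np

def self_reference_direction(self_ref_acts, descriptive_acts, refusal_dir=None):
    """Difference-of-means direction separating self-referential from
    descriptive activations (rows are per-prompt activation vectors).

    If a known refusal direction is supplied, its component is projected
    out so the result is orthogonal to it.
    """
    d = np.mean(self_ref_acts, axis=0) - np.mean(descriptive_acts, axis=0)
    if refusal_dir is not None:
        r = refusal_dir / np.linalg.norm(refusal_dir)
        d = d - np.dot(d, r) * r  # remove the component along the refusal direction
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha=1.0):
    """Activation steering: add a scaled unit direction to a hidden state."""
    return hidden + alpha * direction
```

In a real intervention, `steer` would be applied to the residual stream at the layer where the direction was found (here, early in the network, at 6.25% of model depth), with `alpha` controlling the strength of the push toward introspective output.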