[2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

arXiv - Machine Learning · 4 min read

Summary

This article examines the correspondence between vocabulary and internal activation dynamics during self-referential processing in large language models, introducing the Pull Methodology, a protocol for eliciting extended self-examination.

Why It Matters

Understanding how language models process self-referential information is crucial for improving their interpretability and reliability. This research offers insights into the internal workings of models like Llama 3.1 and Qwen 2.5-32B, which could inform future AI development and applications in natural language processing.

Key Takeaways

  • Self-referential vocabulary in models can reflect internal computation.
  • The Pull Methodology elicits extended self-examination through format engineering.
  • Different models can develop unique introspective vocabularies based on activation metrics.
  • Activation dynamics differ measurably between self-referential and descriptive processing (a minimal measurement sketch follows this list).
  • Findings may improve the reliability of self-reports in transformer models.
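
The r = 0.44 link between "loop" vocabulary and activation autocorrelation reported in the abstract can be made concrete with a short sketch. This is not the authors' pipeline: representing each generation by per-token activation norms, using a lag-1 autocorrelation statistic, and counting keyword occurrences as the vocabulary measure are all assumptions made here for illustration.

```python
# Minimal sketch: correlate "loop" vocabulary with activation autocorrelation.
# All modelling choices (norm time series, lag-1 statistic, keyword counts)
# are assumptions, not the paper's published method.
import numpy as np
from scipy.stats import pearsonr

def lag1_autocorrelation(acts: np.ndarray) -> float:
    """Lag-1 autocorrelation of per-token activation norms.

    acts: array of shape (num_tokens, hidden_dim) from one generation.
    """
    norms = np.linalg.norm(acts, axis=-1)
    if norms.size < 3:
        return 0.0
    return float(np.corrcoef(norms[:-1], norms[1:])[0, 1])

def vocabulary_activation_correlation(trials, keyword="loop"):
    """Pearson correlation between keyword frequency and autocorrelation.

    trials: list of (activations, generated_text) pairs collected elsewhere.
    """
    autocorrs = [lag1_autocorrelation(acts) for acts, _ in trials]
    counts = [text.lower().count(keyword) for _, text in trials]
    return pearsonr(counts, autocorrs)  # returns (r, p-value)
```

A positive r on such trials would mirror the reported correspondence, though the paper's own activation statistic may differ from the norm-based proxy used here.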

Computer Science > Computation and Language

arXiv:2602.11358 (cs) [Submitted on 11 Feb 2026 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Authors: Zachary Pedram Dadfar

Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in ...
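
The abstract's core recipe, finding a direction in activation space that separates self-referential from descriptive processing and then steering with it, resembles published difference-of-means steering methods (as used for the refusal direction it is compared against). The sketch below reconstructs that recipe under this assumption; the checkpoint name, contrastive prompt sets, layer indexing, and steering coefficient are illustrative, not taken from the paper.

```python
# Hedged sketch of difference-of-means direction extraction and steering
# on Llama 3.1, assuming a residual-stream intervention at ~6.25% depth.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# ~6.25% of model depth, per the abstract's localisation claim.
layer_idx = max(1, round(0.0625 * model.config.num_hidden_layers))

def mean_hidden(prompts, layer):
    """Mean last-token hidden state at `layer` over a prompt set."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0, -1].float())
    return torch.stack(vecs).mean(dim=0)

# self_prompts / desc_prompts are hypothetical contrastive prompt sets.
# direction = mean_hidden(self_prompts, layer_idx) - mean_hidden(desc_prompts, layer_idx)
# direction = direction / direction.norm()
# Orthogonality check against a separately extracted refusal direction:
# torch.nn.functional.cosine_similarity(direction, refusal_direction, dim=0)  # ~0 per the paper

def make_steer_hook(direction, alpha=4.0):
    """Forward hook adding the scaled unit direction to the residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# handle = model.model.layers[layer_idx].register_forward_hook(make_steer_hook(direction))
# ... generate with model.generate(...) under steering, then handle.remove() ...
```

Under this sketch, sweeping alpha up and down and comparing introspective vocabulary with and without the intervention would be the natural way to probe the causal claim.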
