[2604.02608] Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Computer Science > Machine Learning
arXiv:2604.02608 (cs)
[Submitted on 3 Apr 2026]

Title: Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Authors: Mohammed Suhail B Nadaf

Abstract: Function vectors (FVs) -- mean-difference directions extracted from in-context learning demonstrations -- can steer large language model behavior when added to the residual stream. We hypothesized that FV steering failures reflect an absence of task-relevant information: the logit lens would fail alongside steering. We were wrong. In the most comprehensive cross-template FV transfer study to date -- 4,032 pairs across 12 tasks, 6 models from 3 families (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-v0.3; base and instruction-tuned), and 8 templates per task -- we find the opposite dissociation: FV steering succeeds even when the logit lens cannot decode the correct answer at any layer. This steerability-without-decodability pattern is universal: steering exceeds logit lens accuracy for every task on every model, with gaps as large as -0.91. Only 3 of 72 task-model instances show the predicted decodable-without-steerable pattern, all in Mistral. FV vocabulary projection reveals that FVs achieving over 0.90 steering accuracy still project to incoherent token distributions, indicating FVs encode computational instructions rath...
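The three operations the abstract contrasts -- extracting a mean-difference function vector, adding it to the residual stream to steer, and projecting it through the unembedding matrix (the logit lens) -- can be sketched with toy tensors. This is a minimal illustration, not the paper's implementation: the random matrices stand in for a real transformer's hidden states and unembedding, and all names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 100

# Toy stand-ins for residual-stream activations at some layer:
# one batch collected with in-context demonstrations, one without.
h_icl = rng.normal(size=(32, d_model))   # hidden states with ICL demos
h_zero = rng.normal(size=(32, d_model))  # hidden states without demos

# Function vector: the mean-difference direction between conditions.
fv = h_icl.mean(axis=0) - h_zero.mean(axis=0)

# Steering: add the FV to a residual-stream activation during a forward pass.
h = rng.normal(size=d_model)
h_steered = h + fv

# Logit lens: project an activation through the (toy) unembedding matrix
# to get a token distribution; the paper asks whether the FV itself
# decodes to a coherent distribution this way.
W_U = rng.normal(size=(d_model, vocab))

def logit_lens(x):
    logits = x @ W_U
    z = np.exp(logits - logits.max())
    return z / z.sum()

p_fv = logit_lens(fv)  # token distribution the FV projects to
```

The paper's finding, in these terms, is that adding `fv` can change model behavior in the intended way even when `p_fv` (and the logit-lens readout at every layer) assigns no coherent mass to the correct answer tokens.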