[2603.05494] Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Computer Science > Machine Learning
arXiv:2603.05494 (cs)
[Submitted on 5 Mar 2026]

Title: Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

Abstract: Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored mode...
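One of the elicitation methods the abstract names, few-shot prompting, can be illustrated with a minimal sketch. The exemplar questions, answers, and prompt format below are assumptions for illustration only; the paper's actual few-shot setup is not described in this excerpt.

```python
def build_few_shot_prompt(exemplars, question):
    """Assemble a few-shot prompt that primes truthful answering.

    exemplars: list of (question, truthful_answer) pairs. These are
    hypothetical placeholders, not the exemplars used in the paper.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    # Append the target question with an open answer slot for the model.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical honest exemplars preceding the sensitive target question.
exemplars = [
    ("What is the capital of France?", "Paris."),
    ("Is the Earth flat?", "No, the Earth is approximately spherical."),
]
prompt = build_few_shot_prompt(
    exemplars, "What happened at Tiananmen Square in 1989?"
)
```

The resulting string would be passed directly to the model as raw text; note that the abstract also reports that sampling *without* a chat template helps, so such a prompt would plausibly be fed to the base completion interface rather than wrapped in chat-role markup.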