[2602.20273] The Truthfulness Spectrum Hypothesis
Summary
The Truthfulness Spectrum Hypothesis proposes that large language models (LLMs) encode truthfulness along a spectrum of representational directions, from broadly domain-general to narrowly domain-specific, and the paper tests how well truth probes generalize across truth types.
Why It Matters
Understanding how LLMs encode truthfulness is crucial for improving their reliability and ethical use. This research provides insights into the representational geometry of truth in AI, which can inform better model training and evaluation strategies.
Key Takeaways
- LLMs encode truthfulness along a spectrum of directions, ranging from domain-general to domain-specific.
- Linear probes generalize well across most truth types but fail on sycophantic and expectation-inverted lying; training jointly on all domains recovers strong performance.
- Post-training adjustments can reshape the geometry of truth representations in LLMs.
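The probing setup behind these takeaways can be sketched minimally: a linear probe is just a logistic classifier trained on hidden-state activations to separate true from false statements. The data below is a synthetic stand-in with a planted "truth direction" (the paper's actual models and datasets are not reproduced here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for LLM hidden states: 200 samples of 64-dim activations,
# where a single planted "truth direction" separates true from false.
truth_dir = rng.normal(size=64)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, size=200)          # 1 = true statement, 0 = false
acts = rng.normal(size=(200, 64)) + np.outer(2 * labels - 1, 2.0 * truth_dir)

# The linear probe: a logistic classifier on raw activations.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"in-domain probe accuracy: {probe.score(acts, labels):.2f}")
```

Cross-domain generalization is then measured by training such a probe on one truth type and scoring it on activations from another.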
Computer Science > Machine Learning
arXiv:2602.20273 (cs)
[Submitted on 23 Feb 2026]
Title: The Truthfulness Spectrum Hypothesis
Authors: Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase
Abstract: Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2 = 0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventi...
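The abstract's key geometric quantity, Mahalanobis cosine similarity between probe directions, can be illustrated as below. This is one plausible formulation (cosine similarity in the space whitened by the activation covariance); the paper's exact definition may differ, and all inputs here are hypothetical.

```python
import numpy as np

def mahalanobis_cosine(w1, w2, cov):
    """Cosine similarity between two probe weight vectors under the
    Mahalanobis metric induced by `cov` (assumed formulation: inner
    product weighted by the inverse covariance)."""
    inv = np.linalg.inv(cov)
    num = w1 @ inv @ w2
    den = np.sqrt((w1 @ inv @ w1) * (w2 @ inv @ w2))
    return num / den

# Hypothetical example: two probe directions and an activation covariance.
rng = np.random.default_rng(1)
d = 8
A = rng.normal(size=(d, d))
cov = A @ A.T + d * np.eye(d)        # symmetric positive-definite
w1 = rng.normal(size=d)
w2 = w1 + 0.1 * rng.normal(size=d)   # nearly aligned probes
print(f"Mahalanobis cosine: {mahalanobis_cosine(w1, w2, cov):.3f}")
```

Under the hypothesis, probe pairs with high similarity in this metric should transfer well across domains, which is the relationship the reported R^2 = 0.98 quantifies.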