[2602.20273] The Truthfulness Spectrum Hypothesis
Summary
The Truthfulness Spectrum Hypothesis proposes that large language models (LLMs) encode truthfulness along a spectrum of representational directions, from broadly domain-general to narrowly domain-specific, and the paper tests how well truth probes generalize across truth types.
Why It Matters
Understanding how LLMs encode truthfulness is crucial for improving their reliability and ethical use. This research provides insights into the representational geometry of truth in AI, which can inform better model training and evaluation strategies.
Key Takeaways
- LLMs encode truthfulness along a spectrum of directions, ranging from domain-general to domain-specific.
- Linear probes generalize well across most truth types but fail on sycophantic and expectation-inverted lying; training jointly on all domains recovers strong performance.
- Post-training adjustments can reshape the geometry of truth representations in LLMs.
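The probing setup behind these takeaways can be sketched minimally: a linear probe is just a logistic classifier trained on hidden-state activations to separate true from false statements. The data below is a synthetic stand-in with a planted "truth direction" (the paper's actual models and datasets are not reproduced here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for LLM hidden states: 200 samples of 64-dim activations,
# where a single planted "truth direction" separates true from false.
truth_dir = rng.normal(size=64)
truth_dir /= np.linalg.norm(truth_dir)
labels = rng.integers(0, 2, size=200)          # 1 = true statement, 0 = false
acts = rng.normal(size=(200, 64)) + np.outer(2 * labels - 1, 2.0 * truth_dir)

# The linear probe: a logistic classifier on raw activations.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"in-domain probe accuracy: {probe.score(acts, labels):.2f}")
```

Cross-domain generalization is then measured by training such a probe on one truth type and scoring it on activations from another.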
Computer Science > Machine Learning
arXiv:2602.20273 (cs)
[Submitted on 23 Feb 2026]
Title: The Truthfulness Spectrum Hypothesis
Authors: Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase
Abstract: Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2 = 0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventi...
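The abstract's key geometric quantity, Mahalanobis cosine similarity between probe directions, can be illustrated as below. This is one plausible formulation (cosine similarity in the space whitened by the activation covariance); the paper's exact definition may differ, and all inputs here are hypothetical.

```python
import numpy as np

def mahalanobis_cosine(w1, w2, cov):
    """Cosine similarity between two probe weight vectors under the
    Mahalanobis metric induced by `cov` (assumed formulation: inner
    product weighted by the inverse covariance)."""
    inv = np.linalg.inv(cov)
    num = w1 @ inv @ w2
    den = np.sqrt((w1 @ inv @ w1) * (w2 @ inv @ w2))
    return num / den

# Hypothetical example: two probe directions and an activation covariance.
rng = np.random.default_rng(1)
d = 8
A = rng.normal(size=(d, d))
cov = A @ A.T + d * np.eye(d)        # symmetric positive-definite
w1 = rng.normal(size=d)
w2 = w1 + 0.1 * rng.normal(size=d)   # nearly aligned probes
print(f"Mahalanobis cosine: {mahalanobis_cosine(w1, w2, cov):.3f}")
```

Under the hypothesis, probe pairs with high similarity in this metric should transfer well across domains, which is the relationship the reported R^2 = 0.98 quantifies.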