From this point forward, we concentrate on our most effective strategy: a classifier trained on intermediate LLM representations at the exact answer tokens, known as a 'probing classifier' (Belinkov, 2021). This approach also lets us examine what these representations reveal about LLMs.
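To make the setup concrete, the following is a minimal sketch, not the authors' implementation, of training such a probe on hidden states taken at the answer tokens. The model name, layer index, mean-pooling over the answer span, and the assumption that the answer-token positions are known in advance are all illustrative choices.

```python
# Sketch: probing classifier on intermediate hidden states at the exact answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed model; any causal LM works
LAYER = 16                                # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def answer_token_feature(prompt: str, answer_span: tuple[int, int]) -> torch.Tensor:
    """Hidden state at the exact-answer token positions (assumed known), mean-pooled."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    hidden = out.hidden_states[LAYER][0]      # (seq_len, hidden_dim) at the chosen layer
    start, end = answer_span                  # token indices covering the exact answer
    return hidden[start:end].mean(dim=0)

def train_probe(features, labels):
    """Fit a linear probe: features from answer_token_feature, labels 1 = correct answer."""
    X = torch.stack(features).float().numpy() # (n_examples, hidden_dim)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, labels)
    return probe
```

In this sketch the probe is a simple logistic regression over a single layer's representation; the layer and pooling strategy would in practice be selected on held-out data.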
Our demonstration that a trained probing classifier can predict errors suggests that LLMs encode information related to their own truthfulness. However, we find that probing classifiers do not generalize across different tasks (Section 4). Instead, generalization occurs only within tasks requiring similar skills (e.g., factual retrieval), indicating "skill-specific" truthfulness features. For tasks involving different skills, e.g., sentiment analysis, these classifiers are no better, and in some cases worse, than logit-based uncertainty predictors, challenging the idea of a "universal truthfulness" encoding proposed in previous work (Marks & Tegmark, 2023; Slobodkin et al., 2023). Instead, our results indicate that LLMs encode multiple, distinct notions of truth. This also means that probing is not a simple fix for hallucination: different types of untruthfulness are associated with different internal states.
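For reference, a logit-based uncertainty predictor of the kind compared against above can be approximated by the model's own log-probability of its answer tokens. The sketch below is an illustrative baseline, not the specific predictor used in this work; the mean aggregation over answer tokens and the function signature are assumptions.

```python
# Sketch: logit-based uncertainty score for a generated answer.
import torch

def logit_confidence(model, tok, prompt: str, answer_span: tuple[int, int]) -> float:
    """Mean log-probability the model assigns to its own answer tokens (higher = more confident).
    Assumes a HF causal LM/tokenizer and that the answer does not start at position 0."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]                 # (seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    start, end = answer_span
    ids = inputs["input_ids"][0]
    # logits at position i predict token i+1, so shift by one
    token_lps = log_probs[start - 1:end - 1].gather(1, ids[start:end].unsqueeze(1))
    return token_lps.mean().item()
```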