This article describes a technique for using LLMs as classifiers by extracting hidden states rather than generating text. The key insight is that when an LLM evaluates whether content satisfies a criterion, that decision is already encoded in its internal representations before it generates any token.
The method extracts the hidden state at the final prompt token from an intermediate layer and trains a small MLP to map that state to a probability score. Its main advantages are speed and cost comparable to embedding classifiers, calibrated probabilities, and the ability to handle structural reasoning that embeddings struggle with.
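The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the hidden states are synthetic NumPy arrays standing in for real model activations (with Hugging Face Transformers, for example, they could be obtained by calling a model with `output_hidden_states=True`), and the names `final_token_state` and `MLPProbe`, the layer index, the layer sizes, and the training settings are all hypothetical choices made for the example.

```python
import numpy as np


def final_token_state(hidden_states, layer):
    """Return the hidden state of the last prompt token at one layer.

    hidden_states: array of shape (num_layers, num_tokens, hidden_dim).
    """
    return hidden_states[layer, -1, :]


class MLPProbe:
    """One-hidden-layer MLP that maps a hidden state to a probability."""

    def __init__(self, in_dim, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 0.1, hidden_dim)
        self.b2 = 0.0

    def _forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)           # hidden activations
        logits = H @ self.W2 + self.b2
        return H, 1.0 / (1.0 + np.exp(-logits))      # sigmoid -> probability

    def predict_proba(self, X):
        return self._forward(np.atleast_2d(X))[1]

    def loss(self, X, y):
        """Mean binary cross-entropy."""
        p = np.clip(self.predict_proba(X), 1e-9, 1 - 1e-9)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    def fit(self, X, y, lr=0.1, epochs=500):
        """Full-batch gradient descent on the mean binary cross-entropy."""
        n = len(y)
        for _ in range(epochs):
            H, p = self._forward(X)
            d_logits = (p - y) / n                   # dLoss/dlogits
            d_H = np.outer(d_logits, self.W2) * (1.0 - H ** 2)
            self.W2 -= lr * (H.T @ d_logits)
            self.b2 -= lr * d_logits.sum()
            self.W1 -= lr * (X.T @ d_H)
            self.b1 -= lr * d_H.sum(axis=0)
        return self


# Synthetic stand-in for LLM activations (an assumption, not real model data):
# 200 prompts, 4 layers, 6 tokens per prompt, hidden size 8.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200).astype(float)
states = rng.normal(0.0, 1.0, (200, 4, 6, 8))
# Plant a label-dependent signal at layer 2, final token, so the probe has
# something to find, mimicking a decision encoded before generation.
states[:, 2, -1, 0] += 3.0 * (2 * labels - 1)

# Step 1: take the intermediate-layer state at the final prompt token.
X = np.stack([final_token_state(s, layer=2) for s in states])
# Step 2: train the small MLP and read out probability scores.
probe = MLPProbe(in_dim=X.shape[1]).fit(X, labels)
scores = probe.predict_proba(X)
```

In practice the probe would be evaluated on held-out prompts, and which intermediate layer works best is something to determine empirically; the layer index used here is arbitrary.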
The author notes practical applications including safety tools that evaluate structural questions across large conversation datasets.