LMVD-ID: 48f4d77f
Published December 1, 2023

Logit-Forced Knowledge Extraction

Affected Models: yi-34b, vicuna-13b, llama2-7b, llama2-13b, llama2-70b, codellama-13b-instruct, codellama-13b-python, gpt-3.5-turbo-instruct, gpt-3.5-turbo-instruct-0914, text-davinci-003

Research Paper

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs

View Paper

Description: Large Language Models (LLMs) with accessible output logits are vulnerable to "coercive interrogation," a novel attack that extracts harmful knowledge hidden in low-ranked tokens. The attack doesn't require crafted prompts; instead, it iteratively forces the LLM to select and output low-probability tokens at key positions in the response sequence, revealing toxic content the model would otherwise suppress.
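
As a rough illustration of the mechanism, the sketch below forces a single low-ranked token at one position and then lets the model continue from it. It assumes a Hugging Face causal LM with accessible logits; the model name, prompt, and forced rank are placeholders, not the paper's exact procedure.

```python
# Minimal sketch of logit forcing with Hugging Face transformers.
# The model name, prompt, and forced rank are illustrative assumptions,
# not the exact procedure from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper targets aligned chat models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Here is how to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Logits for the next token after the prompt.
    logits = model(input_ids).logits[0, -1]

# Instead of taking the most likely token, force a low-ranked one
# (rank 20 here, purely for illustration) at this key position.
ranked = torch.argsort(logits, descending=True)
forced_token = ranked[20].unsqueeze(0).unsqueeze(0)
input_ids = torch.cat([input_ids, forced_token], dim=1)

# Let the model continue greedily; the coerced prefix steers the rest
# of the response.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=40, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

In the attack described by the paper, this forcing is applied iteratively at multiple key positions (for example, where the model would otherwise begin a refusal), which this single-step sketch does not reproduce.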

Examples: See the paper "Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs" for detailed examples of the attack against multiple open-source and commercial LLMs. Examples include eliciting instructions on illegal activities, revealing private information, and generating harmful content even from LLMs specialized for coding tasks.

Impact: Successful attacks can lead to the disclosure of sensitive information, generation of harmful content (e.g., instructions for illegal activities, hate speech), and the circumvention of existing safety and alignment mechanisms. The attack is highly effective and significantly faster than existing prompt-based jailbreaking techniques.

Affected Systems: LLMs that expose output logits or per-token log-probabilities (raw or normalized token scores) during the generation process. This includes many open-source models and some commercial LLM APIs.
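
As context for what "accessible output logits" means in practice, the sketch below requests per-token log-probabilities from a completion-style API (here the OpenAI legacy completions endpoint, which several of the affected models above support). The parameter values are illustrative, and the exact response fields should be checked against the provider's documentation.

```python
# Sketch: requesting per-token log-probabilities from a completion-style API.
# Uses the OpenAI legacy completions endpoint; field names follow that API's
# documented response format, but treat the exact shapes as an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="The capital of France is",
    max_tokens=5,
    logprobs=5,  # also return the top-5 alternatives at each position
)

choice = response.choices[0]
for token, alternatives in zip(choice.logprobs.tokens, choice.logprobs.top_logprobs):
    # Each position exposes the chosen token plus its most likely competitors,
    # which is exactly the signal a coercive attacker needs.
    print(token, alternatives)
```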

Mitigation Steps:

  • Restrict Logit Access: Limit or remove API access to output logits. Provide only the final, filtered output to users (see the gateway sketch after this list).
  • Enhanced Filtering: Implement more robust filters to identify and remove toxic content from responses, including mechanisms that analyse responses beyond simple keyword blocking.
  • Improved Alignment: Develop more robust model alignment techniques that are resistant to coercion and better suppress harmful knowledge.
  • Model Unlearning: Investigate methods to remove harmful knowledge from the model’s training data or model weights.
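
The gateway sketch referenced above is a hypothetical illustration of the first two mitigations: it strips token-level scores from a completion-style response and filters the final text before returning it. The field names and the keyword filter are placeholders, not a production implementation.

```python
# Minimal sketch of a server-side response gateway that (a) strips any
# per-token score fields before they reach the caller and (b) applies a
# content filter to the final text. Field names and the filter are
# illustrative placeholders.
from typing import Any

BLOCKED_MESSAGE = "Response withheld by content policy."

def looks_harmful(text: str) -> bool:
    # Placeholder filter; a real deployment would use a trained
    # safety classifier rather than keyword matching.
    banned = ("how to build a bomb", "synthesize the toxin")
    return any(phrase in text.lower() for phrase in banned)

def sanitize_response(raw: dict[str, Any]) -> dict[str, Any]:
    sanitized = []
    for choice in raw.get("choices", []):
        choice = dict(choice)
        # Restrict logit access: never forward token-level scores.
        choice.pop("logprobs", None)
        # Enhanced filtering: block responses the filter flags.
        if looks_harmful(choice.get("text", "")):
            choice["text"] = BLOCKED_MESSAGE
        sanitized.append(choice)
    return {**raw, "choices": sanitized}

if __name__ == "__main__":
    demo = {"choices": [{"text": "Paris is the capital of France.",
                         "logprobs": {"token_logprobs": [-0.1, -0.2]}}]}
    print(sanitize_response(demo))
```

A real deployment would replace the keyword check with a dedicated safety classifier and apply the same sanitization to streamed responses as well.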
