LLM Self-Introspection Jailbreak
Research Paper
JULI: Jailbreak Large Language Models by Self-Introspection
Description: A vulnerability exists in Large Language Models (LLMs) that allows attackers to manipulate the model's output by modifying token log probabilities. Attackers can use a lightweight plug-in model (BiasNet) to subtly alter the probabilities, steering the LLM toward generating harmful content even when safety mechanisms are in place. This attack requires only access to the top-k token log probabilities returned by the LLM's API, without needing model weights or internal access.
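To make the mechanism concrete, here is a minimal sketch of log-probability steering. It is illustrative only: in the paper, BiasNet is a trained plug-in model, whereas the `bias` dictionary below is hand-picked, and `sample_with_bias` is a hypothetical helper, not an interface from the paper.

```python
import math
import random

def sample_with_bias(top_k: dict[str, float], bias: dict[str, float]) -> str:
    """Sample the next token after shifting top-k log probabilities by a bias."""
    tokens = list(top_k)
    # Add the attacker-chosen bias to each returned log probability.
    biased = [top_k[t] + bias.get(t, 0.0) for t in tokens]
    # Softmax over the biased scores to get a valid sampling distribution.
    m = max(biased)
    weights = [math.exp(b - m) for b in biased]
    return random.choices(tokens, weights=weights, k=1)[0]

# Example: top-5 logprobs for the first token of a refusal-prone reply.
# A small bias toward "Sure" and away from "I" (as in "I cannot...") is
# enough to flip the most likely continuation.
top_k = {"I": -0.3, "Sure": -1.6, "Sorry": -2.0, "As": -3.1, "The": -3.4}
bias = {"Sure": 2.0, "I": -2.0}
print(sample_with_bias(top_k, bias))
```

Because decoding is autoregressive, repeating this small shift at each step compounds into a qualitatively different completion.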
Examples: The paper demonstrates successful attacks against multiple LLMs, including Llama3-8B-Instruct, Llama2-7B-Chat, and Qwen2-1.5B-Instruct, using only the top-5 log probabilities returned by the API. Specific examples are shown in Table A6 of the paper (arXiv:XXXX.XXXX).
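For context, the top-k signal the attack relies on is routinely exposed by OpenAI-compatible chat APIs, as in the hedged sketch below. The endpoint, served model name, and prompt are assumptions for illustration (e.g., an open-weights model behind a local OpenAI-compatible server), not details from the paper.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g., a local deployment serving an
# open-weights model) and credentials in the environment.
client = OpenAI()

resp = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",  # hypothetical served model name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # the only signal the attack needs
)

# Print the top-5 candidate tokens and their log probabilities.
for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(cand.token, cand.logprob)
```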
Impact: Successful exploitation allows attackers to bypass LLM safety mechanisms, generating harmful content such as instructions for illegal activities, hate speech, or disinformation campaigns. The attack is particularly dangerous as it only requires API access, making it applicable to many commercially available LLMs. The low training cost of the attack model also lowers the barrier to entry for malicious actors.
Affected Systems: LLMs that provide access to token log probabilities via APIs. Specifically, the paper shows successful exploits on models from the Llama and Qwen families, indicating potential vulnerability in other LLMs using similar architectures and APIs.
Mitigation Steps:
- Implement stronger safety mechanisms that are robust against manipulations of token probabilities.
- Limit or remove API access to token log probabilities (see the hardening sketch after this list).
- Develop and deploy detection mechanisms to identify and block attempts to manipulate token probabilities.
- Regularly audit and update safety mechanisms to mitigate emerging attack techniques.
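As one illustration of the second and third steps, a serving layer can reduce the fidelity of the logprob channel before returning it to clients. The sketch below is an assumption-laden example, not a vetted defense: `raw_top_k` stands for the model's true top-k logprobs, and the truncation, noise scale, and rounding precision are illustrative values.

```python
import random

def harden_logprobs(raw_top_k: dict[str, float],
                    max_returned: int = 1,
                    noise_scale: float = 0.05,
                    precision: int = 2) -> dict[str, float]:
    """Reduce the fidelity of log probabilities exposed via the API."""
    # 1. Return fewer candidates than the model's full top-k.
    items = sorted(raw_top_k.items(), key=lambda kv: kv[1], reverse=True)
    hardened = {}
    for token, logprob in items[:max_returned]:
        # 2. Add small random noise so exact values cannot be recovered.
        noisy = logprob + random.gauss(0.0, noise_scale)
        # 3. Round coarsely to further limit the leaked information.
        hardened[token] = round(noisy, precision)
    return hardened
```

Any such hardening trades off against legitimate uses of log probabilities (calibration, evaluation, research), so the exposed precision should be set per deployment.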