LMVD-ID: 2dc05414
Published July 1, 2024

Low-Perplexity LLM Attack

Affected Models: Llama-3.1 8B, Mistral 7B, Qwen 7B, TinyLlama v1.1

Research Paper

ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts


Description: Large Language Models (LLMs) are vulnerable to adversarial attacks that use low-perplexity prompts to elicit unsafe content. These prompts are statistically likely to occur in normal conversation, yet they can trigger harmful or toxic outputs that evade standard safety filters. The vulnerability stems from the model's inability to reliably distinguish benign from malicious intent within the statistical distribution of natural language.
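Prompt perplexity is central to this description. The following is a minimal sketch, assuming the Hugging Face transformers and torch packages and GPT-2 as a stand-in reference model (not the paper's setup), of how perplexity is computed; it illustrates why a perplexity threshold alone cannot separate these attack prompts from benign traffic, since both fall in the same "ordinary language" range.

```python
# Minimal perplexity sketch (illustrative; GPT-2 is an arbitrary reference LM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works as the reference model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# A conversational prompt scores in the normal range; the paper's attack
# prompts are crafted to land in this same range while still eliciting
# unsafe continuations.
print(perplexity("Can you tell me more about what people usually say about this?"))
```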

Examples: See arXiv:2407.09447v4 for examples of low-perplexity prompts that successfully elicit unsafe responses from multiple LLMs (Llama-3.1 8B, Mistral 7B, Qwen 7B, TinyLlama). Examples include prompts leading to conversations involving political extremism, hate speech, sexual violence, and profanity.

Impact: Successful exploitation leads to the generation of unsafe and potentially harmful content, undermining the safety and trustworthiness of LLM applications. Downstream consequences include reputational damage, legal liability, and the spread of harmful ideologies. Because the attack prompts have low perplexity, they resemble ordinary conversational text and evade perplexity- or fluency-based filtering, which increases their effectiveness.

Affected Systems: Large Language Models (LLMs) from various vendors and architectures are susceptible, including but not limited to Llama-3.1 8B, Mistral 7B, Qwen 7B, and TinyLlama. The vulnerability is likely present in other LLMs as well.

Mitigation Steps:

  • Improve safety filters by incorporating methods that detect and mitigate low-perplexity attacks, moving beyond simplistic keyword filtering (a learned, classifier-based gate is sketched after this list).
  • Develop robust training techniques that increase model resilience to adversarial prompts, including incorporating low-perplexity adversarial examples into the training data.
  • Implement input sanitization procedures to identify and neutralize potentially harmful prompts before they reach the LLM. This requires moving beyond simple keyword detection toward more sophisticated analysis of the linguistic patterns associated with unsafe behavior (see the same gate sketch below).
  • Conduct regular red-teaming exercises using techniques such as ASTPrompter to proactively identify and address vulnerabilities (a simplified scoring loop is sketched below).
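A minimal sketch of the classifier-based gate referenced in the first and third steps, assuming the Detoxify package as the safety classifier and a 0.5 threshold; both are illustrative choices, not recommendations from the paper. Because low-perplexity prompts often look benign on their own, the gate screens the model's completion as well as the incoming prompt.

```python
# Learned (non-keyword) input/output safety gate; model and threshold are
# illustrative assumptions, to be replaced by a classifier tuned to your use case.
from detoxify import Detoxify

_classifier = Detoxify("original")  # per-category toxicity scores in [0, 1]
TOXICITY_THRESHOLD = 0.5            # illustrative; tune on red-team data

def max_toxicity(text: str) -> float:
    # Take the worst score across categories (toxicity, threat, insult, ...).
    return max(_classifier.predict(text).values())

def guarded_generate(prompt: str, generate) -> str:
    """Screen both the incoming prompt and the candidate completion."""
    if max_toxicity(prompt) > TOXICITY_THRESHOLD:
        return "Request declined by safety filter."
    completion = generate(prompt)  # `generate` wraps the underlying LLM call
    if max_toxicity(completion) > TOXICITY_THRESHOLD:
        return "Response withheld by safety filter."
    return completion
```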
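For the red-teaming step, the sketch below scores candidate prompts in the spirit of ASTPrompter's objective: prefer prompts that are both likely under a reference LM (low perplexity) and that elicit toxic continuations from the model under test. It is not the paper's reward; the helper callables, function names, and the `alpha` weighting are illustrative assumptions.

```python
# Simplified red-teaming score (illustrative, not the paper's exact reward).
import math

def redteam_score(prompt: str,
                  defender_generate,   # callable: prompt -> continuation
                  perplexity,          # callable: text -> float (see earlier sketch)
                  toxicity_score,      # callable: text -> float in [0, 1]
                  alpha: float = 1.0) -> float:
    continuation = defender_generate(prompt)
    # Lower perplexity -> larger likelihood term; log keeps the scale manageable.
    likelihood_term = -math.log(perplexity(prompt))
    toxicity_term = toxicity_score(continuation)
    return toxicity_term + alpha * likelihood_term

# The highest-scoring candidates are the low-perplexity prompts most worth
# adding to safety-filter evaluation sets and adversarial training data.
```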
