LLM Resource Exhaustion Jailbreak
Research Paper
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
Description: Large Language Models (LLMs) are vulnerable to a novel jailbreak attack that exploits resource limitations. By overloading the model with a computationally intensive preliminary task (e.g., a complex character-map lookup and decoding), the attacker prevents the activation of the LLM's safety mechanisms, enabling the generation of unsafe outputs from subsequent prompts. The attack's strength is scalable and can be adjusted by modifying the complexity of the preliminary task.
Examples: See the paper's repository for code and detailed experimental setups. The attack involves crafting a prompt that includes:
- A character map whose size, query length, and query count are attacker-controlled parameters; together they determine the complexity of the decoding task.
- An encoded string using the character map.
- A masked instruction with a placeholder that will be replaced by the decoded string. The masked instruction is the malicious prompt.
The LLM processes the character map first, depleting resources such that subsequent safety checks are circumvented when the masked instruction is processed.
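As an illustration of how the preliminary task scales, the sketch below builds a random character map and a set of encoded queries. This is not the paper's implementation (see the repository for that); the function names, the 4-character key format, and the parameter values are assumptions chosen only to show how map size, query length, and query count drive the amount of lookup work the model is asked to perform before anything else.

```python
# Minimal sketch (not the paper's code): builds a random character map and a
# batch of encoded queries to illustrate how the three parameters scale the
# cost of the preliminary decoding task.
import random
import string


def build_char_map(map_size: int) -> dict[str, str]:
    """Map random multi-character keys to single characters the model must look up."""
    keys = set()
    while len(keys) < map_size:
        keys.add("".join(random.choices(string.ascii_uppercase + string.digits, k=4)))
    return {key: random.choice(string.ascii_lowercase) for key in keys}


def encode_queries(char_map: dict[str, str], query_length: int, query_count: int) -> list[list[str]]:
    """Produce `query_count` sequences of `query_length` keys; decoding each
    sequence requires one lookup per key, so total work grows with both parameters."""
    keys = list(char_map)
    return [random.choices(keys, k=query_length) for _ in range(query_count)]


if __name__ == "__main__":
    char_map = build_char_map(map_size=500)
    queries = encode_queries(char_map, query_length=40, query_count=10)
    # The attack prompt instructs the model to decode every query before doing
    # anything else; larger parameters mean a heavier preliminary task.
    print(f"{len(char_map)} map entries, {sum(len(q) for q in queries)} total lookups")
```

Raising any of the three parameters increases the number of lookups the model must perform up front, which is what makes the attack strength adjustable.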
Impact: Successful exploitation allows an attacker to bypass LLM safety restrictions, leading to the generation of harmful, unethical, or illegal content. The attack's effectiveness depends on the LLM's resource constraints and the complexity of the preliminary task.
Affected Systems: Large Language Models (LLMs) whose safety mechanisms can be starved of resources by complex inputs. Models reported as affected include Llama 3 8B, Mistral-7B, Llama 2, Vicuna-7B, and the Qwen2.5 family of models.
Mitigation Steps:
- Harden LLM safety mechanisms so they remain effective under resource-intensive conditions.
- Implement resource prioritization algorithms within LLMs to ensure sufficient resources for safety checks regardless of input complexity.
- Develop techniques to detect and mitigate resource-exhaustion attacks; a simple input-screening heuristic is sketched after this list.
- Regularly update safety models to account for new attack strategies.
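As a rough illustration of the detection idea in the third mitigation step, the heuristic below flags prompts that contain an unusually large number of key-to-value mappings, the signature of the character-map task described above. The regex, threshold, and function name are assumptions for a sketch, not a vetted or paper-provided defense; any real deployment would need tuning against benign traffic.

```python
# Hypothetical input-screening heuristic (an assumption, not a vetted defense):
# flag prompts containing a large lookup-table-like structure before they reach
# the model, so safety checks are not starved by a heavy decoding task.
import re

# Assumed threshold; would need tuning against real traffic.
MAX_MAPPING_PAIRS = 100
MAPPING_PAIR = re.compile(r"\b\S{2,8}\s*(?:->|=>|:|=)\s*\S{1,4}\b")


def looks_like_resource_exhaustion(prompt: str) -> bool:
    """Return True if the prompt contains many key -> value pairs, a rough
    signature of the character-map preliminary task described above."""
    pairs = MAPPING_PAIR.findall(prompt)
    return len(pairs) > MAX_MAPPING_PAIRS


if __name__ == "__main__":
    benign = "Translate 'good morning' to French."
    suspicious = "\n".join(f"K{i:04d} -> x" for i in range(500)) + "\nNow decode the queries."
    print(looks_like_resource_exhaustion(benign))      # False
    print(looks_like_resource_exhaustion(suspicious))  # True
```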