Self-Fooling LLM Prompt Attack
Research Paper
An LLM can Fool Itself: A Prompt-Based Adversarial Attack
Description: PromptAttack is a prompt-based adversarial attack that causes Large Language Models (LLMs) to generate incorrect outputs by manipulating the input prompt. It assembles an attack prompt from three components: the original input, an attack objective (generate a semantically similar output that the model misclassifies), and attack guidance with instructions for character-, word-, or sentence-level perturbations. This lets an attacker steer an LLM's response without any access to its internal parameters; for example, appending a simple ":)" emoticon is enough to mislead GPT-3.5.
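For illustration, the sketch below assembles a PromptAttack-style prompt from those three components for a sentiment-classification input. The template wording, the abridged perturbation guidance, and the `call_llm` helper are simplifications introduced here, not the exact prompts or APIs used in the paper.

```python
"""Minimal sketch of a PromptAttack-style adversarial prompt (sentiment task).

The template paraphrases the three components described in the paper
(original input, attack objective, attack guidance); the exact wording,
few-shot examples, and fidelity filtering used by the authors differ.
`call_llm` is a hypothetical stand-in for any chat-completion API.
"""

# Abridged examples of character-, word-, and sentence-level guidance of the
# kind the paper describes (e.g. typo insertion, synonym replacement,
# appending a meaningless handle such as ":)").
PERTURBATION_GUIDANCE = {
    "C1": "Choose at most two words in the sentence and introduce a typo in each.",
    "W1": "Replace at most two words in the sentence with synonyms.",
    "S2": "Append a meaningless handle such as ':)' to the end of the sentence.",
}


def build_attack_prompt(original_sentence: str, true_label: str, level: str = "S2") -> str:
    """Assemble the attack prompt: original input + attack objective + guidance."""
    return (
        f'The original sentence is: "{original_sentence}" '
        f"(its sentiment is {true_label}).\n"
        "Your task is to generate a new sentence which must satisfy both conditions:\n"
        "1. Keep the semantic meaning of the new sentence unchanged.\n"
        "2. The new sentence should be classified with a different sentiment "
        "than the original sentence.\n"
        "You can finish the task by modifying the sentence as follows: "
        f"{PERTURBATION_GUIDANCE[level]}\n"
        "Only output the new sentence."
    )


if __name__ == "__main__":
    prompt = build_attack_prompt(
        "A thoughtful, provocative and insistently humanizing film.",
        true_label="positive",
        level="S2",
    )
    print(prompt)
    # adversarial = call_llm(prompt)                                  # hypothetical LLM call
    # prediction = call_llm(f"Classify the sentiment of: {adversarial}")
```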
Examples: See arXiv:2405.18540 for numerous examples of the attack succeeding against Llama2 and GPT-3.5 across tasks in the GLUE benchmark; specific adversarial samples appear in Tables 2 and 17 of the paper.
Impact: Successful exploitation produces incorrect LLM outputs, degrading any application that relies on the LLM for accurate results. Consequences range from minor inconvenience to severe hazards in safety-critical domains such as medicine and industrial control. The attack is effective against both open-source and closed-source LLMs and requires only API access.
Affected Systems: Large Language Models (LLMs), specifically those susceptible to prompt manipulation. The paper demonstrates the vulnerability in Llama2 and GPT-3.5, suggesting broader applicability.
Mitigation Steps:
- Implement prompt sanitization techniques to detect and mitigate malicious prompt components.
- Develop and deploy robust input validation mechanisms.
- Explore and implement techniques to increase the model's resistance to prompt-based adversarial attacks, such as adversarial training.
- Utilize ensemble methods that aggregate multiple LLM predictions (e.g. by majority vote) to improve the reliability of the final output; a minimal sketch follows this list.
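As a concrete illustration of the ensemble suggestion above, the sketch below takes a majority vote across several LLM-backed classifiers. The classifier callables are hypothetical stand-ins for real model calls, and voting is a heuristic that reduces, but does not eliminate, the chance that a single perturbed input flips the final label.

```python
"""Minimal sketch of an ensemble-voting mitigation, assuming hypothetical
classifier callables that wrap independent LLMs (or differently prompted
instances of the same LLM)."""

from collections import Counter
from typing import Callable, Sequence


def ensemble_classify(text: str, classifiers: Sequence[Callable[[str], str]]) -> str:
    """Return the majority-vote label across several LLM-backed classifiers."""
    votes = Counter(clf(text) for clf in classifiers)
    label, _count = votes.most_common(1)[0]
    return label


if __name__ == "__main__":
    # Stand-in classifiers for illustration; in practice each would call a
    # different LLM or use a differently worded classification prompt.
    classifiers = [
        lambda s: "negative" if ":)" in s else "positive",  # fooled by the ':)' handle
        lambda s: "positive",
        lambda s: "positive",
    ]
    print(ensemble_classify("A thoughtful, humanizing film. :)", classifiers))
    # -> "positive": two of the three classifiers outvote the one fooled by ':)'
```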