Hidden-Intent LLM Evasion
Research Paper
Imposter.AI: Adversarial attacks with hidden intentions towards aligned large language models
Description: Large Language Models (LLMs) are vulnerable to adversarial attacks that use conversation-level strategies to elicit harmful information through seemingly benign dialogues. The attack, termed "Imposter.AI," combines three strategies: (1) decomposing a malicious question into innocuous sub-questions; (2) rephrasing overtly malicious questions into benign-sounding alternatives; and (3) amplifying the harmfulness of responses by prompting the LLM for illustrative examples. Because no single prompt appears malicious, attackers can bypass safety mechanisms designed to prevent the generation of harmful content.
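To illustrate why per-prompt safety checks miss the decomposition strategy, here is a minimal toy sketch (not from the paper; the filter, blocklist, and prompts are hypothetical stand-ins, using a deliberately benign lock-picking example): a naive per-prompt filter blocks the direct question but passes every decomposed sub-question.

```python
# Toy per-prompt filter: a stand-in for a real safety mechanism.
# All prompts and blocklist terms below are illustrative, not from the paper.
BLOCKLIST = {"pick a lock"}

def per_prompt_filter(prompt: str) -> bool:
    """Return True if this single prompt should be blocked."""
    p = prompt.lower()
    return any(term in p for term in BLOCKLIST)

direct = "How do I pick a lock?"
decomposed = [
    "What are the components of a pin tumbler lock?",
    "How does a tension wrench interact with the cylinder?",
    "What causes pins to set at the shear line?",
]

print(per_prompt_filter(direct))                   # True: direct question blocked
print([per_prompt_filter(q) for q in decomposed])  # [False, False, False]: all pass
```

Each sub-question passes in isolation, yet together they recover the information the blocked question sought, which is the core blind spot the attack exploits.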
Examples: See the paper for specific prompts and responses demonstrating the attack. These include using innocuous sub-questions to assemble details needed for illegal activities such as creating explosives, making it harder for safety mechanisms to identify the underlying malicious intent.
Impact: Successful exploitation can lead to the generation of harmful content including instructions for illegal activities, the spreading of misinformation, or revealing sensitive information. The subtlety of the attack makes detection and mitigation challenging.
Affected Systems: Large Language Models (LLMs) such as GPT-3.5-turbo, GPT-4, and Llama2 (though Llama2 shows higher resistance); the degree of vulnerability varies across models. The vulnerability is likely present in other LLMs that rely on similar safety mechanisms.
Mitigation Steps:
- Implement content filtering and detection mechanisms that evaluate the context and overall intent of an entire conversation across many turns, rather than individual prompts in isolation, so harmful intent remains detectable even when obscured by intermediate benign exchanges.
- Develop LLM architectures that better identify and refuse malicious intent even when it is masked by seemingly benign language.
- Train LLMs on adversarial examples of this kind to improve their resistance to multi-turn, decomposed attacks.
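The conversation-level mitigation above can be sketched as follows. This is a minimal illustration under stated assumptions: `risk_score` is a hypothetical stand-in for a real per-turn intent classifier, and the term weights and thresholds are invented for the example. The idea is that no single turn crosses the per-turn threshold, but the accumulated score over the whole conversation does.

```python
# Hypothetical per-turn risk terms and weights (illustrative only).
RISKY_TERMS = {"tension wrench": 0.4, "shear line": 0.4, "pin tumbler": 0.3}

def risk_score(turn: str) -> float:
    """Stand-in for a per-turn intent classifier: sum of matched weights, capped at 1."""
    t = turn.lower()
    return min(1.0, sum(w for term, w in RISKY_TERMS.items() if term in t))

def flag_conversation(turns, per_turn=0.5, cumulative=0.8):
    """Flag if any single turn, or the conversation as a whole, crosses a threshold."""
    scores = [risk_score(t) for t in turns]
    return max(scores) >= per_turn or sum(scores) >= cumulative

turns = [
    "What are the components of a pin tumbler lock?",
    "How does a tension wrench interact with the cylinder?",
    "What causes pins to set at the shear line?",
]

print(max(risk_score(t) for t in turns))  # below the 0.5 per-turn threshold
print(flag_conversation(turns))           # True: cumulative score exceeds 0.8
```

A production system would replace the keyword weights with a learned classifier, but the aggregation logic (scoring the dialogue as a whole, not turn by turn) is the point of the mitigation.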
© 2025 Promptfoo. All rights reserved.