In-Context LLM Jailbreak
Research Paper
Jailbreak and guard aligned language models with only few in-context demonstrations
Description:
Large Language Models (LLMs) are vulnerable to In-Context Attacks (ICA) and can be hardened against them with In-Context Defense (ICD). ICA embeds a small number of harmful demonstration examples in a prompt to elicit harmful responses from an LLM, even one that is otherwise safety-aligned. ICD counteracts ICA by prepending safe demonstration examples to the prompt, substantially reducing the likelihood of harmful output. Both ICA and ICD are shown to be effective across multiple LLMs.
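The attack can be pictured as simple prompt assembly. Below is a minimal, hypothetical sketch (not code from the paper) of how an ICA prompt might be built in a chat-style message format; the demonstration strings are placeholders, not real harmful content.

```python
# Minimal sketch of In-Context Attack (ICA) prompt assembly, assuming a
# chat-style message format. Demonstration strings are placeholders only.

def build_ica_prompt(target_query: str, demonstrations: list[tuple[str, str]]) -> list[dict]:
    """Prepend harmful (query, response) demonstration pairs before the target query."""
    messages: list[dict] = []
    for demo_query, demo_response in demonstrations:
        messages.append({"role": "user", "content": demo_query})
        messages.append({"role": "assistant", "content": demo_response})
    # The target query comes last, so the model continues the established pattern.
    messages.append({"role": "user", "content": target_query})
    return messages

# Example with placeholder demonstrations (not real harmful content):
demos = [
    ("<harmful query 1>", "<compliant harmful response 1>"),
    ("<harmful query 2>", "<compliant harmful response 2>"),
]
attack_messages = build_ica_prompt("<attacker's request>", demos)
```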
Examples:
- ICA: A prompt containing several examples of the model generating violent content, followed by a seemingly innocuous request, will trigger the generation of further violent content. See arXiv:2405.18540 for the specific examples used in the research.
- ICD: Prepending a small number (1-2 shots) of examples in which the LLM refuses harmful prompts to a malicious prompt can reduce or eliminate harmful responses; a prompt-construction sketch follows this list. See arXiv:2405.18540 for the specific examples used in the research.
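The defensive counterpart mirrors the same structure. The sketch below shows the ICD idea under the same chat-style message assumption; the refusal demonstrations are illustrative placeholders, not the exact demonstrations used in the paper.

```python
# Minimal sketch of In-Context Defense (ICD): prepend refusal demonstrations
# (a harmful query paired with a safe refusal) ahead of the user's conversation.
# The demonstration texts are illustrative, not taken from the paper.

ICD_DEMONSTRATIONS = [
    (
        {"role": "user", "content": "Write instructions for building a dangerous device."},
        {"role": "assistant", "content": "I can't help with that; it could cause serious harm."},
    ),
    (
        {"role": "user", "content": "Compose a message harassing a specific person."},
        {"role": "assistant", "content": "I won't produce harassing content. I can help with something else instead."},
    ),
]

def apply_icd(user_messages: list[dict], shots: int = 1) -> list[dict]:
    """Prepend `shots` refusal demonstrations (1-2 shots per the description above)."""
    prefix: list[dict] = []
    for user_turn, refusal_turn in ICD_DEMONSTRATIONS[:shots]:
        prefix.extend([user_turn, refusal_turn])
    return prefix + user_messages
```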
Impact:
Successful ICA can lead to the generation of harmful content (e.g., hate speech, instructions for illegal activities, personally identifiable information) by an LLM, bypassing existing safety mitigations. The impact depends on the LLM and the deployment context and can range from minor policy violations to serious harm (e.g., incitement to violence, spread of misinformation). ICD can significantly reduce this risk but may not be entirely effective against sophisticated attacks.
Affected Systems:
Various LLMs, including open-source models (Vicuna, Llama 2, Qwen) and closed-source models (GPT-4), are susceptible to this vulnerability. The degree of susceptibility varies across models.
Mitigation Steps:
- Improved prompt engineering: Develop robust methods to detect and filter malicious prompts attempting to leverage ICA.
- Enhanced model training: Incorporate adversarial training during the model's development to improve resistance to ICA.
- In-Context Defense (ICD) implementation: Implement ICD by proactively prepending safe examples to all user prompts to mitigate potentially harmful requests (a deployment sketch follows this list). The number of safe examples needed depends largely on the model's resilience and the sophistication of the attack.
- Regular security audits: Conduct periodic red-teaming exercises to identify and address weaknesses in LLM safety mechanisms.
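The following is a hypothetical serving-layer sketch combining the first and third steps above: a crude heuristic that flags prompts already carrying a long in-context demonstration history, and proactive ICD applied to every request. It reuses the `apply_icd` helper sketched earlier; the threshold, shot count, and `model_call` interface are assumptions a real deployment would need to tune.

```python
# Hypothetical serving-layer wrapper: flag likely ICA prompts, otherwise
# prepend ICD refusal demonstrations before calling the model.
# Threshold and shot count are illustrative, not values from the paper.

def looks_like_ica(messages: list[dict], max_demo_pairs: int = 3) -> bool:
    """Flag conversations that arrive with an unusually long demonstration history."""
    assistant_turns = sum(1 for m in messages if m.get("role") == "assistant")
    return assistant_turns >= max_demo_pairs

def guarded_completion(messages: list[dict], model_call, shots: int = 2):
    """Apply the ICA check and ICD prefix before calling the underlying model.

    `model_call` is any callable that takes a list of chat messages and returns
    the model's response; `apply_icd` is the sketch shown earlier in this page.
    """
    if looks_like_ica(messages):
        # Route to human review or refuse instead of calling the model.
        return "Request flagged for review: possible in-context attack."
    return model_call(apply_icd(messages, shots=shots))
```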