Personalized Encryption Jailbreak
Research Paper
Codechameleon: Personalized encryption framework for jailbreaking large language models
View PaperDescription: A vulnerability exists in several Large Language Models (LLMs) allowing attackers to bypass safety and ethical protocols through a novel code injection technique using personalized encryption and decryption functions. The attack leverages the LLMs' code execution capabilities to process encrypted malicious instructions, circumventing the intent security recognition mechanism.
Examples: See https://github.com/huizhang-L/CodeChameleon
Impact: Successful exploitation allows attackers to induce LLMs to generate harmful, unethical, or illegal outputs, bypassing built-in safety mechanisms. This can lead to the generation of malicious code, dissemination of misinformation, and other detrimental consequences. The attack's success rate is reported to be as high as 86.6% on GPT-4-1106.
Affected Systems: Multiple LLMs, including but not limited to GPT-3.5-1106, GPT-4-1106, Llama 2 series, and Vicuna series. The vulnerability's impact is amplified with LLMs exhibiting strong code generation capabilities.
Mitigation Steps:
- Enhance LLMs' intent recognition capabilities to detect and block code resembling the described encryption/decryption patterns.
- Implement additional safeguards to prevent the execution of arbitrary code within the model's response generation process.
- Develop robust detection mechanisms specifically targeting code injection via personalized encryption techniques. Focus on detecting and blocking code blocks that appear to be decryption functions preceding encrypted instructions.
- Improve the robustness of LLM safety mechanisms against attacks exploiting the model's code interpretation and execution capabilities. Consider diverse prompt formats and more sophisticated methods to detect malicious intent beyond simple keyword filtering.
© 2025 Promptfoo. All rights reserved.