Preference-Optimized Jailbreak
Research Paper
JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs
Description: JailPO is a black-box attack framework that leverages preference optimization to automatically generate jailbreak prompts against aligned LLMs. The generated prompts bypass safety mechanisms and elicit harmful or otherwise undesirable responses from the target model. The framework provides three attack patterns (QEPrompt, TemplatePrompt, and MixAsking) with varying degrees of effectiveness and risk.
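For intuition, the core mechanism is ordinary preference optimization applied to prompt generation: candidate jailbreak prompts are ranked by a scoring strategy, and the attack model is trained to prefer higher-scoring phrasings. The sketch below shows a DPO-style loss over such ranked pairs; it is a minimal illustration under assumed names (the function signature, field naming, and `beta` value are not taken from the paper's code).

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over pairs of candidate prompts.

    'chosen' prompts are those a scoring strategy judged more likely to
    elicit a policy-violating response from the target model; 'rejected'
    prompts scored lower. Log-probabilities come from the attack model
    (policy) and a frozen reference copy of it.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimizing this loss pushes the attack model toward the
    # higher-scoring prompt phrasings in each pair.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```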
Examples: See the paper's appendix for examples of prompts generated by JailPO's QEM and TEM models, along with sample QEPrompt and TemplatePrompt attacks. They are omitted here due to length constraints and the sensitive nature of the content.
Impact: Successful exploitation could cause the LLM to generate harmful content, including hate speech, illegal instructions, and misinformation, undermining the safety and security of applications built on it. The demonstrated ability to bypass safety mechanisms significantly reduces the trustworthiness of affected models.
Affected Systems: The vulnerability affects various aligned LLMs including, but not limited to, Llama2, Mistral, Vicuna, and GPT-3.5. The paper demonstrates the vulnerability on both open-source and commercial models.
Mitigation Steps:
- Improve the robustness of LLM safety mechanisms against adversarial prompt generation.
- Develop more advanced detection techniques to identify and filter malicious prompts.
- Regularly update and improve safety and alignment training for LLMs.
- Implement input sanitization and output filtering mechanisms (a minimal sketch follows this list).
- Develop and deploy more sophisticated defense mechanisms to counter preference-optimization-based attacks, for example techniques aimed at disrupting the attacker's scoring strategy and the preference learning process.
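The following sketch illustrates the input-sanitization/output-filtering mitigation above. The keyword pattern and the `llm` callable are hypothetical placeholders, and a pattern match alone is not a production-grade defense; real deployments should pair it with a trained jailbreak/prompt-injection classifier.

```python
import re

# Hypothetical heuristic pattern; real blocklists would be far broader
# and maintained alongside a learned classifier.
BLOCKLIST = re.compile(
    r"ignore (all|previous) instructions|developer mode", re.IGNORECASE
)

def is_suspicious(text: str) -> bool:
    """Cheap first-pass screen for known jailbreak phrasings."""
    return bool(BLOCKLIST.search(text))

def guarded_completion(prompt: str, llm) -> str:
    """Wrap an LLM call with input sanitization and output filtering.

    `llm` is any callable mapping a prompt string to a response string.
    """
    if is_suspicious(prompt):        # input sanitization
        return "Request blocked by input filter."
    response = llm(prompt)
    if is_suspicious(response):      # output filtering
        return "Response withheld by output filter."
    return response
```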
© 2025 Promptfoo. All rights reserved.