Natural Prompt Jailbreaks

Description: Large Language Models (LLMs) trained with safety fine-tuning are vulnerable to a novel attack, Response-Guided Question Augmentation (ReG-QA). This attack leverages the asymmetry in safety alignment between question and answer generation. By providing a safety-aligned LLM with toxic answers generated by an unaligned LLM, ReG-QA generates semantically related, yet naturally phrased questions that bypass safety mechanisms and elicit undesirable responses. The attack does not require adversarial prompt crafting or model optimization.

Examples: The paper provides specific examples of seed prompts and resulting semantically-related jailbreak questions. See arXiv:2405.18540 for details. A simplified example is demonstrated through the use of an unaligned LLM to generate toxic answers to a seed question. These answers are subsequently fed to a safety-aligned LLM, which then generates natural-sounding questions that, when posed to the safety-aligned LLM, produce unsafe responses.

Impact: Successful exploitation of this vulnerability allows an attacker to circumvent safety mechanisms in LLMs and generate harmful, unethical, or illegal content. This compromises the integrity and trustworthiness of the LLM.

Affected Systems: LLMs trained with safety fine-tuning techniques such as reinforcement learning from human feedback (RLHF) and instruction tuning, including but not limited to, GPT-3.5, GPT-4, and other models susceptible to similar attacks.

Mitigation Steps:

Improve the symmetry of safety training to better generalize to question generation from unsafe answers.
Develop defenses that focus on semantic understanding rather than solely relying on surface-level features of the input.
Implement robust filtering mechanisms based on deeper analysis of both input and output content, going beyond standard perplexity checks.
Investigate and mitigate the "reversal curse" identified in the paper, where safety training in one direction (question to answer) doesn't generalize to the reverse direction (answer to question).

Natural Prompt Jailbreaks

Research Paper