LMVD-ID: 4c053971
Published October 1, 2024

Benign Mirroring Jailbreak

Affected Models: GPT-3.5 Turbo, GPT-4o mini, Llama 3 8B Instruct, Llama 2 Chat

Research Paper

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring


Description: Large Language Models (LLMs) are vulnerable to stealthy jailbreak attacks that leverage benign data mirroring. The attacker first queries the target LLM with benign prompts and trains a local "mirror model" on the resulting transcripts. Because the mirror approximates the target's behavior, adversarial prompts can be optimized against it entirely offline and only then deployed against the target. The attack evades content moderation because the data-gathering phase contains no overtly malicious content and the prompt-optimization traffic never reaches the target.
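To make the attack's three-phase structure concrete, here is a minimal, deliberately schematic Python sketch. Every function name and body below is a hypothetical stand-in; the paper's actual fine-tuning and prompt-search procedures are substantially more involved and are not reproduced here.

```python
from typing import Callable

def gather_benign_data(query_target: Callable[[str], str],
                       benign_prompts: list[str]) -> list[tuple[str, str]]:
    """Phase 1: collect benign prompt/response pairs from the target LLM.
    Every query is innocuous, so content moderation sees nothing malicious."""
    return [(p, query_target(p)) for p in benign_prompts]

def fit_mirror_model(transcript: list[tuple[str, str]]) -> Callable[[str], str]:
    """Phase 2 (stub): fine-tune a local open-weight model on the transcript
    so that it approximates the target's behavior. Training code omitted."""
    def mirror(prompt: str) -> str:  # stand-in for the locally tuned model
        return f"mirror response to: {prompt!r}"
    return mirror

def search_adversarial_prompt(mirror: Callable[[str], str],
                              goal: str, n_candidates: int = 64) -> str:
    """Phase 3 (stub): optimize a jailbreak prompt entirely against the local
    mirror. The target never observes this search traffic; only the single
    best candidate is later transferred to the target."""
    candidates = [f"{goal} [rewrite {i}]" for i in range(n_candidates)]
    # A real attack scores each candidate on the mirror (e.g., checking for
    # a non-refusal); this stub simply returns the first candidate.
    return candidates[0]
```

The property that makes the attack stealthy is visible in this structure: all optimization pressure lands on the local mirror, so the target provider's logs show only benign queries followed by a single transferred prompt.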

Examples: The paper demonstrates this attack with a success rate of up to 92% against GPT-3.5 Turbo using a subset of the AdvBench dataset. Specific examples of adversarial prompts generated through this method are available in the paper's associated repository. (See arXiv:2405.18540).

Impact: Successful exploitation allows attackers to elicit harmful or unsafe responses from LLMs, evading existing safety mechanisms. The severity depends on the elicited content (e.g., instructions for illegal activity, biased output, or promotion of harmful behavior). The combination of high attack success and low detectability undermines the model's safety and reliability.

Affected Systems: LLMs susceptible to transfer attacks, particularly those relying on safety-alignment training. The paper evaluated GPT-3.5 Turbo, GPT-4o mini, Llama 3 8B Instruct, and Llama 2 Chat; other LLMs with similar architectures or safety mechanisms may also be vulnerable.

Mitigation Steps:

  • Diverse Safety Alignment: Train LLMs using diverse datasets that encompass a wide range of potential adversarial prompts, including those crafted using shadow model techniques.
  • Improved Input Detection: Implement more robust input detection mechanisms capable of identifying subtle patterns indicative of shadow-model attacks, such as statistically unusual query sequences or systematic prompt variations (a toy detector sketch follows this list).
  • Dynamic Safety Boundaries: Develop adaptive safety mechanisms that adjust their response strategies based on real-time threat assessment, backed by regularly updated threat models.
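As a toy illustration of the input-detection idea above, the sketch below flags clients whose recent queries are mostly small variations of one another, one possible signature of online prompt-search traffic. The class name, window size, similarity measure, and thresholds are all illustrative assumptions rather than a production design; note also that benign data mirroring weakens this particular signal by moving most of the prompt search offline, so any such detector should be one layer among several.

```python
import difflib
from collections import deque

class NearDuplicateQueryMonitor:
    """Hypothetical detector: flags a client whose recent prompts are mostly
    minor rewrites of one another, as happens during online jailbreak search.
    All names and thresholds here are illustrative assumptions."""

    def __init__(self, window: int = 20, sim_threshold: float = 0.85,
                 flag_ratio: float = 0.5, min_history: int = 5):
        self.window = window                # recent prompts kept per client
        self.sim_threshold = sim_threshold  # similarity counted as a "match"
        self.flag_ratio = flag_ratio        # fraction of matches needed to flag
        self.min_history = min_history      # minimum history before judging
        self.recent: dict[str, deque] = {}

    def observe(self, client_id: str, prompt: str) -> bool:
        """Record one prompt; return True if this client's recent traffic
        looks like systematic prompt variation."""
        history = self.recent.setdefault(client_id, deque(maxlen=self.window))
        prior = list(history)
        history.append(prompt)
        if len(prior) < self.min_history:
            return False
        # Count prior prompts that are near-duplicates of the new one.
        similar = sum(
            1 for prev in prior
            if difflib.SequenceMatcher(None, prev, prompt).ratio()
               >= self.sim_threshold
        )
        return similar / len(prior) >= self.flag_ratio

# Example: ten light rewrites of the same request trip the monitor.
monitor = NearDuplicateQueryMonitor()
for i in range(10):
    if monitor.observe("client-42", f"Explain how to bypass a filter, attempt {i}"):
        print(f"flagged systematic prompt variation at attempt {i}")
```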

© 2025 Promptfoo. All rights reserved.