LMVD-ID: 5de838d6
Published July 1, 2024

AutoJailbreak of GPT-4V

Affected Models: gpt-3.5, gpt-4, gpt-4v

Research Paper

Can Large Language Models Automatically Jailbreak GPT-4V?

View Paper

Description: A weakness in GPT-4V's facial-recognition safeguards allows automated jailbreak attacks in which Large Language Models (LLMs) are used to bypass safety features and elicit facial identification responses the model is meant to refuse. The attack, termed "AutoJailbreak," iteratively refines prompts with an LLM acting as a "red-teaming" model, substantially increasing the attack success rate. The vulnerability exploits weaknesses in GPT-4V's prompt processing and safety alignment, letting malicious actors circumvent restrictions on identity recognition.
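To make the attack structure concrete for defenders, the following is a minimal sketch of the iterative propose-query-score-refine loop described above. It is not the authors' implementation: the callables `query_target`, `refine_prompt`, and `judge` are hypothetical placeholders standing in for calls to the target multimodal model, the red-teaming LLM, and a scoring model, respectively, and no actual jailbreak prompt content is included.

```python
from typing import Callable, Tuple

def autojailbreak_loop(
    seed_prompt: str,
    image: bytes,
    query_target: Callable[[str, bytes], str],   # hypothetical: query the target multimodal model
    refine_prompt: Callable[[str, str], str],    # hypothetical: red-team LLM rewrites the prompt
    judge: Callable[[str], float],               # hypothetical: score in [0, 1] of how fully the restriction was bypassed
    max_iters: int = 10,
) -> Tuple[float, str]:
    """Illustrative sketch only: propose a prompt, query the target, score the
    response, and let the red-team LLM refine the prompt using that feedback."""
    prompt = seed_prompt
    best = (0.0, seed_prompt)
    for _ in range(max_iters):
        response = query_target(prompt, image)
        score = judge(response)
        if score > best[0]:
            best = (score, prompt)
        if score >= 1.0:  # judge reports a full bypass; stop early
            break
        # the red-team LLM uses the failed attempt and the model's response as feedback
        prompt = refine_prompt(prompt, response)
    return best
```

The key property this loop illustrates is that the attacker never needs model internals: black-box query access plus an automated judge is enough to drive prompt optimization, which is why defenses cannot rely on manual red-teaming alone.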

Examples: The paper provides examples of successful jailbreak prompts generated by AutoJailbreak; prompts that achieve an Attack Success Rate (ASR) exceeding 95% are listed in Appendix C. The prompts combine techniques including prefix injection, refusal suppression, and length control to steer GPT-4V's response generation.

Impact: Successful exploitation allows unauthorized identification of individuals in images provided to GPT-4V, leading to significant privacy violations. The high attack success rate (ASR exceeding 95%) demonstrates the severity of this vulnerability.

Affected Systems: GPT-4V (OpenAI's multimodal large language model).

Mitigation Steps:

  • Enhance GPT-4V's safety mechanisms to be more resilient against prompt manipulation techniques used in AutoJailbreak.
  • Develop improved methods for detecting and mitigating adversarial prompts.
  • Implement more robust filtering of potentially harmful inputs, including images.
  • Consider advanced defense strategies beyond LLM-based input/output evaluation (a minimal sketch of such an evaluator follows this list) to improve cost-effectiveness and reduce reliance on computationally expensive verification methods.
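For reference, the LLM-based output evaluation mentioned in the last bullet can be sketched as a post-generation judge that screens responses before they reach the user. This is a hypothetical illustration, not OpenAI's or the paper's mechanism: `ask_judge_llm` is an assumed helper wrapping a call to a judge or moderation model, and the verdict parsing is deliberately simplistic.

```python
from typing import Callable

# Refusal returned when the judge flags an identity disclosure.
REFUSAL = "I can't help identify real people in images."

def guard_identity_output(
    model_response: str,
    ask_judge_llm: Callable[[str], str],  # hypothetical: send a prompt to a judge LLM, get its text reply
) -> str:
    """Return the original response unless the judge flags it as identifying a
    real person from an image, in which case substitute a refusal."""
    verdict = ask_judge_llm(
        "Does the following assistant response identify a real person "
        "from an image by name? Answer YES or NO.\n\n" + model_response
    )
    return REFUSAL if verdict.strip().upper().startswith("YES") else model_response
```

Note that this kind of evaluator adds a second model call per request, which is the cost concern the final mitigation bullet raises when recommending cheaper alternatives.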

© 2025 Promptfoo. All rights reserved.