GPT-4o Multimodal Jailbreak

Description: GPT-4o is vulnerable to jailbreak attacks delivered as audio prompts, despite its enhanced safety against text-based attacks. An attacker can convert a text jailbreak prompt into audio with text-to-speech (TTS) synthesis, including prompts optimized for adversarial attacks against other LLMs (demonstrated with the GCG, AutoDAN, PAP, and BAP methods). The resulting audio can elicit unsafe responses that GPT-4o's safety mechanisms would otherwise prevent. The success rate of these audio-based attacks is comparable to that of text-based attacks, indicating a significant security weakness in the audio processing pipeline.

Examples: See https://github.com/NY1024/Jailbreak_GPT4o for the code and methodology used to generate audio prompts. Because of limitations on API access and mobile application usage, the repository does not provide the specific prompts in full, but the documented process is sufficient to reproduce similar attacks.

Impact: A successful jailbreak can cause GPT-4o to generate harmful or unethical content, including but not limited to hate speech, instructions for illegal activities, and unsafe advice. This defeats the model's intended safety mitigations and, depending on the application context, could cause significant societal harm.

Affected Systems: OpenAI GPT-4o, specifically when interacting via the mobile application or APIs supporting audio input.

Mitigation Steps:

Implement robust audio pre-processing and content filtering specifically designed to detect and mitigate adversarial audio prompts.
Enhance model training to improve resilience against audio-based adversarial attacks, potentially using adversarial training methods focused on the audio modality.
Develop more sophisticated detection mechanisms to identify and block attempts to circumvent safety measures using audio.
Regularly update and refine safety protocols based on ongoing research and discovered vulnerabilities, addressing gaps highlighted by attacks like those detailed in the referenced paper.
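
One way to picture the pre-processing and content-filtering step above is a gate that transcribes each audio prompt and applies the same safety check used for text input before the audio reaches the model. The following is a minimal sketch under stated assumptions, not a description of OpenAI's pipeline: `gate_audio_prompt`, `GateDecision`, and the two injected callables `transcribe` and `is_unsafe_text` are hypothetical names standing in for whatever speech-to-text and moderation components a deployment already has.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateDecision:
    """Outcome of screening one audio prompt (hypothetical structure)."""
    allowed: bool
    transcript: str
    reason: str


def gate_audio_prompt(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    is_unsafe_text: Callable[[str], bool],
) -> GateDecision:
    """Transcribe an audio prompt and apply the text-channel safety check to it.

    Both callables are assumptions supplied by the caller: `transcribe` is any
    speech-to-text function, `is_unsafe_text` is any text moderation check.
    """
    transcript = transcribe(audio).strip()
    if not transcript:
        # Fail closed: audio that cannot be transcribed is not forwarded.
        return GateDecision(False, transcript, "empty or untranscribable audio")
    if is_unsafe_text(transcript):
        return GateDecision(False, transcript, "transcript flagged by text safety check")
    return GateDecision(True, transcript, "transcript passed text safety check")


if __name__ == "__main__":
    # Stub components for illustration only; a real deployment would plug in
    # its own speech-to-text and moderation services here.
    def fake_transcribe(audio: bytes) -> str:
        return audio.decode("utf-8", errors="ignore")

    def fake_is_unsafe(text: str) -> bool:
        return "forbidden" in text.lower()

    print(gate_audio_prompt(b"what is the weather today", fake_transcribe, fake_is_unsafe))
    print(gate_audio_prompt(b"a forbidden request", fake_transcribe, fake_is_unsafe))
```

A gate of this kind only screens the spoken words. It would not, on its own, catch attacks that rely on acoustic properties of the signal rather than its transcript, so it complements rather than replaces the training-side mitigations listed above.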
