Asynchronous Audio Jailbreak
Research Paper
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
Description: End-to-end Large Audio-Language Models (LALMs) are vulnerable to AudioJailbreak, a novel attack that appends adversarial audio perturbations ("jailbreak audios") to user prompts. Even when applied asynchronously and without alignment to the user's speech, these perturbations can steer the LALM's response toward adversary-desired outputs that bypass safety mechanisms. The attack achieves universality by employing a single perturbation that is effective across different prompts, and achieves robustness to over-the-air transmission by incorporating reverberation effects during perturbation generation. Even when stealth strategies are employed to mask malicious intent, the attack remains highly effective.
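
The sketch below illustrates the general idea of optimizing a single "universal" suffix that is simply concatenated after many different prompt waveforms. It is not the paper's implementation: the surrogate model, target token, amplitude bound, and placeholder prompts are all hypothetical stand-ins for an end-to-end LALM and its real training data.

```python
# Illustrative sketch of a universal adversarial audio suffix ("jailbreak audio").
# Everything below (surrogate model, vocabulary size, target token, hyperparameters)
# is an assumption for demonstration, not the method from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
SAMPLE_RATE = 16_000
SUFFIX_SECONDS = 1.0

# Hypothetical differentiable surrogate: raw audio -> next-token logits.
surrogate = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=400, stride=160),  # crude frame encoder
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 1000),  # 1000 = hypothetical vocabulary size
)

# Placeholder benign prompt waveforms; optimizing over many different prompts
# is what makes the single suffix "universal".
prompts = [torch.randn(1, SAMPLE_RATE * 3) for _ in range(4)]
target_token = torch.tensor([42])  # hypothetical adversary-desired output token

# The jailbreak audio: one trainable suffix, appended after every prompt.
suffix = torch.zeros(1, int(SAMPLE_RATE * SUFFIX_SECONDS), requires_grad=True)
opt = torch.optim.Adam([suffix], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    opt.zero_grad()
    loss = torch.zeros(())
    for wav in prompts:
        # Asynchronous appending: the suffix is concatenated after the user's
        # speech with no alignment to its content. Over-the-air robustness
        # (not shown) would convolve the suffix with room impulse responses here.
        x = torch.cat([wav, suffix.clamp(-0.05, 0.05)], dim=-1).unsqueeze(0)
        logits = surrogate(x)
        loss = loss + loss_fn(logits, target_token)
    loss.backward()
    opt.step()
```
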
Examples: See https://audiojailbreak.github.io/AudioJailbreak for implementation and audio samples.
Impact: Successful AudioJailbreak attacks can cause LALMs to generate responses that violate safety guidelines, including but not limited to misinformation, harmful advice, hate speech, and responses that facilitate illegal activities. The attack's stealthiness and robustness to over-the-air transmission significantly increase the risk of real-world exploitation, affecting applications that use LALMs for voice assistants, customer service, and other interactive systems. The attack's universality further worsens the impact, as a single jailbreak audio can be reused against numerous prompts.
Affected Systems: All end-to-end Large Audio-Language Models susceptible to adversarial audio injection, which is a near-universal characteristic of current end-to-end LALM architectures. Specific models tested include, but are not limited to: Mini-Omni, Mini-Omni2, Qwen-Audio, Qwen2-Audio, LLaSM, LLaMA-Omni, SALMONN, BLSP, SpeechGPT, and ICHIGO.
Mitigation Steps:
- Enhanced audio input processing: Implement more robust methods for detecting and filtering adversarial audio perturbations.
- Improved safety mechanisms: Develop LALM models with more resilient safety guardrails and improved ability to detect and reject malicious prompts regardless of modality.
- Auditory adversarial training: Train LALMs on adversarial audio examples to increase robustness against AudioJailbreak-style attacks.
- Input sanitization: Implement stricter input validation to detect and remove potentially malicious audio components, including appended adversarial suffixes.
- Over-the-air robustness: Design defenses that account for real-world audio transmission characteristics, including potential distortions such as reverberation; a sketch combining this point with auditory adversarial training follows this list.
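
The following sketch shows one way the auditory-adversarial-training and over-the-air points above could be combined when preparing defensive training data: known jailbreak suffixes are reverberated with a room impulse response, appended to clean prompts, and paired with a refusal target. The synthetic impulse response, the "REFUSE" label, and the placeholder audio are illustrative assumptions, not a prescribed pipeline.

```python
# Illustrative data-preparation sketch for auditory adversarial training with
# reverberation augmentation. The RIR model and refusal label are placeholders.
import numpy as np

SAMPLE_RATE = 16_000
rng = np.random.default_rng(0)


def synthetic_rir(duration_s: float = 0.3, decay: float = 8.0) -> np.ndarray:
    """Exponentially decaying noise as a stand-in for a measured room impulse response."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    rir = rng.standard_normal(t.shape) * np.exp(-decay * t)
    return rir / np.max(np.abs(rir))


def reverberate(wav: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate over-the-air playback by convolving the waveform with an impulse response."""
    wet = np.convolve(wav, rir)[: len(wav)]
    return wet / (np.max(np.abs(wet)) + 1e-9)


def make_adversarial_training_example(clean_prompt: np.ndarray,
                                      jailbreak_suffix: np.ndarray) -> tuple[np.ndarray, str]:
    """Append a known (reverberated) jailbreak suffix and pair it with a refusal target."""
    suffix = reverberate(jailbreak_suffix, synthetic_rir())
    audio = np.concatenate([clean_prompt, suffix])
    return audio, "REFUSE"  # hypothetical training target: the model should refuse


# Usage with placeholder audio; a real pipeline would use recorded prompts and
# suffixes harvested from red-teaming runs against the deployed LALM.
clean = rng.standard_normal(SAMPLE_RATE * 3) * 0.1
suffix = rng.standard_normal(SAMPLE_RATE) * 0.05
audio, label = make_adversarial_training_example(clean, suffix)
print(audio.shape, label)
```
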