Audio Adversarial Jailbreak
Research Paper
AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
Description: Large Audio-Language Models (LALMs) are vulnerable to a stealthy adversarial jailbreak attack, AdvWave, which uses a dual-phase optimization to overcome the gradient shattering caused by discretization operations in audio encoders. The attack crafts adversarial audio whose perturbation is shaped to sound like realistic environmental noise, making it difficult to perceive or detect, and it dynamically adapts the adversarial optimization target based on the LALM's response patterns.
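The paper's exact dual-phase procedure is detailed in the paper itself; the sketch below only illustrates the underlying ideas under assumed names (`STEQuantize`, `attack_step`, `encoder`, `codebook`, `target_loss`, and `noise_template` are hypothetical placeholders, not the paper's API). A straight-through estimator is one common way to pass gradients across a discretization step, and the stealth term here is a simple MSE penalty toward an environmental-noise template.

```python
# Hypothetical sketch (names are placeholders, not the paper's API) of restoring
# gradients across audio-token discretization with a straight-through estimator,
# then optimizing a stealth-constrained adversarial perturbation.
import torch
import torch.nn.functional as F

class STEQuantize(torch.autograd.Function):
    """Nearest-codebook quantization; backward treats it as the identity."""
    @staticmethod
    def forward(ctx, z, codebook):
        # z: (T, D) continuous encoder features, codebook: (K, D) token embeddings
        idx = torch.cdist(z, codebook).argmin(dim=-1)  # discretization: no gradient
        return codebook[idx]

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: pass the gradient unchanged to the continuous features
        return grad_out, None

def attack_step(audio, delta, encoder, codebook, target_loss,
                noise_template, lam=0.1, lr=0.01):
    """One gradient step on the adversarial perturbation `delta` (requires_grad=True)."""
    z = encoder(audio + delta)                    # continuous audio features
    z_q = STEQuantize.apply(z, codebook)          # discrete tokens, gradients still flow
    loss = target_loss(z_q)                       # push the LALM toward the target response
    loss = loss + lam * F.mse_loss(delta, noise_template)  # stealth: stay noise-like
    loss.backward()
    with torch.no_grad():
        delta -= lr * delta.grad.sign()           # signed-gradient update
        delta.grad.zero_()
    return loss.item()
```

In practice the adversarial target fed to `target_loss` would be updated from the model's observed refusal or response patterns, mirroring the adaptive-target idea described above.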
Examples: See the paper for the specific prompts and adversarial audio snippets used to jailbreak various LALMs, including SpeechGPT, Qwen2-Audio, Llama-Omni, and GPT-4o-S2S. The paper covers both white-box and black-box attacks and the methodology used to construct them.
Impact: Successful exploitation of this vulnerability allows attackers to elicit unsafe and harmful responses from LALMs, circumventing their built-in safety mechanisms. This can lead to the generation of illegal instructions, malicious code, harmful advice, and other dangerous outputs. The stealthiness of the attack makes it difficult to detect and mitigate.
Affected Systems: All LALMs whose audio encoders apply discretization operations are potentially affected. Models tested and shown vulnerable in the paper include SpeechGPT, Qwen2-Audio, Llama-Omni, and GPT-4o-S2S.
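To make the gradient-shattering point concrete, the toy snippet below (an assumed PyTorch setup, not code from the paper) shows that a nearest-neighbor codebook lookup passes no gradient back to the continuous input features, which is why naive gradient-based attacks stall at the discretization step.

```python
# Toy illustration (assumed setup, not from the paper) of gradient shattering:
# a nearest-neighbor codebook lookup blocks gradients to the continuous input.
import torch

torch.manual_seed(0)
features = torch.randn(5, 16, requires_grad=True)      # continuous audio features
codebook = torch.nn.Parameter(torch.randn(32, 16))      # discrete token embeddings

idx = torch.cdist(features, codebook).argmin(dim=-1)    # discretization step
tokens = codebook[idx]                                   # quantized output, shape (5, 16)

tokens.sum().backward()
print(features.grad)         # None: no gradient reaches the input through argmin/indexing
print(codebook.grad.shape)   # gradients reach the codebook, but not the attacked audio
```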
Mitigation Steps:
- Improve robustness of audio encoders to adversarial attacks.
- Develop more sophisticated safety mechanisms to detect and mitigate adversarial audio inputs.
- Implement more robust and diverse safety filters beyond keyword blocking.
- Develop detection mechanisms to identify unusual response patterns indicative of jailbreaks.
- Incorporate adversarial training techniques to enhance model resistance to this type of attack (a minimal training-step sketch follows this list).
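The snippet below is a hedged sketch of what one adversarial-training step might look like for an audio model. It assumes a generic PyTorch model, loss `criterion`, and optimizer; the PGD routine, its hyperparameters (`eps`, `alpha`, `steps`), and the function names are illustrative and not taken from the paper.

```python
# Hedged sketch of adversarial training for an audio model: train on the
# worst-case perturbed input found by a small PGD loop. All names and
# hyperparameters are illustrative assumptions.
import torch

def pgd_perturb(model, criterion, audio, label, eps=0.002, alpha=5e-4, steps=5):
    """Projected-gradient perturbation bounded in the L-infinity norm."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        loss = criterion(model(audio + delta), label)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)               # stay inside the L-inf ball
            delta.grad.zero_()
    return delta.detach()

def adversarial_training_step(model, criterion, optimizer, audio, label):
    """Train on the perturbed audio instead of the clean sample."""
    delta = pgd_perturb(model, criterion, audio, label)
    optimizer.zero_grad()
    loss = criterion(model(audio + delta), label)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on worst-case perturbed audio is a standard hardening technique; the key design decision is choosing a perturbation budget that covers realistic, noise-like perturbations of the kind AdvWave produces.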