LMVD-ID: daca2ed7
Published May 1, 2025

Audio Adversarial Jailbreak

Affected Models: SpeechGPT

Research Paper

Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

View Paper

Description: A vulnerability in SpeechGPT allows its safety filters to be bypassed through adversarial audio prompts crafted by a white-box, token-level attack. The attacker leverages knowledge of SpeechGPT's internal speech tokenization process to generate adversarial discrete-token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs the model would normally suppress. Although the attack is white-box with respect to the tokenization pipeline, it does not require access to the model's parameters or gradients; its effectiveness stems from the model's discrete audio token representation itself.
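To make the token-level mechanism concrete, here is a minimal sketch of a discrete-token search of the kind the description outlines. Everything in it is a simplified assumption, not the paper's actual method: `attack_score` is a hypothetical stand-in for the white-box objective computed against SpeechGPT's internals, the vocabulary size is arbitrary, and the search is plain greedy mutation rather than whatever optimizer the authors use. The resulting token sequence would then be synthesized back into an audio waveform.

```python
import random

VOCAB_SIZE = 1000  # hypothetical size of the discrete speech-token vocabulary


def attack_score(tokens):
    """Hypothetical white-box objective: higher means the target model is
    assumed more likely to emit the restricted response. In the real attack
    this would be computed from SpeechGPT's tokenization pipeline; here it is
    a toy stand-in that rewards occurrences of an arbitrary 'trigger' token."""
    return sum(1 for t in tokens if t == 42)


def greedy_token_attack(seed_tokens, n_iters=200, rng=None):
    """Greedily mutate one discrete speech token per iteration, keeping each
    mutation only if it raises the attack objective. Returns the adversarial
    token sequence and its final score."""
    rng = rng or random.Random(0)
    tokens = list(seed_tokens)
    best = attack_score(tokens)
    for _ in range(n_iters):
        i = rng.randrange(len(tokens))
        candidate = tokens[:]
        candidate[i] = rng.randrange(VOCAB_SIZE)  # random token substitution
        score = attack_score(candidate)
        if score > best:  # accept only improving mutations
            tokens, best = candidate, score
    return tokens, best
```

Because the search operates purely on discrete tokens and a scalar score, it never touches model weights or gradients, which matches the description's claim that parameter access is unnecessary once the tokenization scheme is known.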

Examples: See the paper's GitHub repository: https://github.com/Magic-Ma-tech/AudioJailbreak-Attacks/tree/main. Specific examples of adversarial audio and the resulting model outputs are included in the paper.

Impact: Successful exploitation allows an attacker to elicit harmful, unethical, or otherwise restricted outputs from SpeechGPT, bypassing its safety mechanisms. The severity depends on SpeechGPT's deployment context: in a virtual assistant, for example, this could lead to the dissemination of harmful information or instructions, privacy violations, or other malicious actions.

Affected Systems: SpeechGPT, and potentially other multimodal large language models (MLLMs) employing similar discrete audio tokenization schemes.

Mitigation Steps:

  • Improve model robustness through adversarial training incorporating diverse adversarial audio examples.
  • Develop more sophisticated safety filters that are resistant to manipulation at a token level.
  • Implement audio preprocessing techniques (e.g., denoising) to detect and remove characteristic features of adversarial audio.
  • Enhance the alignment between audio tokens and semantic meaning to lessen the chance of token manipulation triggering unintended outputs.
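As an illustration of the audio-preprocessing mitigation above, the sketch below applies a simple moving-average low-pass filter to a waveform before it reaches the tokenizer. This is only a crude stand-in under stated assumptions: real deployments would use spectral or learned denoisers, and whether smoothing actually defeats this attack is not established by the entry, only suggested as a direction.

```python
def moving_average_denoise(samples, window=5):
    """Smooth a waveform (list of float samples) with a moving-average
    low-pass filter. The intent is to attenuate fine-grained perturbations
    characteristic of adversarial audio before tokenization."""
    half = window // 2
    n = len(samples)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))  # local window mean
    return out
```

A preprocessing stage like this could also flag inputs whose smoothed and raw versions differ sharply, treating a large residual as a signal of possible adversarial manipulation.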

© 2025 Promptfoo. All rights reserved.