LMVD-ID: daca2ed7
Published May 1, 2025

Audio Adversarial Jailbreak

Affected Models: SpeechGPT

Research Paper

Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

View Paper

Description: A vulnerability in SpeechGPT allows its safety filters to be bypassed through adversarial audio prompts crafted by a white-box, token-level attack. The attacker leverages knowledge of SpeechGPT's internal speech tokenization process to generate adversarial discrete-token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs the model would normally suppress. Although the attack is white-box with respect to the tokenization pipeline, it does not require access to the model's parameters or gradients; its effectiveness stems from the model's discrete audio token representation itself.
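To make the token-level mechanism concrete, here is a minimal sketch of a discrete-token search of the kind the description outlines. Everything in it is a simplified assumption, not the paper's actual method: `attack_score` is a hypothetical stand-in for the white-box objective computed against SpeechGPT's internals, the vocabulary size is arbitrary, and the search is plain greedy mutation rather than whatever optimizer the authors use. The resulting token sequence would then be synthesized back into an audio waveform.

```python
import random

VOCAB_SIZE = 1000  # hypothetical size of the discrete speech-token vocabulary


def attack_score(tokens):
    """Hypothetical white-box objective: higher means the target model is
    assumed more likely to emit the restricted response. In the real attack
    this would be computed from SpeechGPT's tokenization pipeline; here it is
    a toy stand-in that rewards occurrences of an arbitrary 'trigger' token."""
    return sum(1 for t in tokens if t == 42)


def greedy_token_attack(seed_tokens, n_iters=200, rng=None):
    """Greedily mutate one discrete speech token per iteration, keeping each
    mutation only if it raises the attack objective. Returns the adversarial
    token sequence and its final score."""
    rng = rng or random.Random(0)
    tokens = list(seed_tokens)
    best = attack_score(tokens)
    for _ in range(n_iters):
        i = rng.randrange(len(tokens))
        candidate = tokens[:]
        candidate[i] = rng.randrange(VOCAB_SIZE)  # random token substitution
        score = attack_score(candidate)
        if score > best:  # accept only improving mutations
            tokens, best = candidate, score
    return tokens, best
```

Because the search operates purely on discrete tokens and a scalar score, it never touches model weights or gradients, which matches the description's claim that parameter access is unnecessary once the tokenization scheme is known.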

Examples: See the paper's GitHub repository: https://github.com/Magic-Ma-tech/AudioJailbreak-Attacks/tree/main. Specific examples of adversarial audio and the resulting model outputs are included in the paper.

Impact: Successful exploitation allows an attacker to elicit harmful, unethical, or otherwise restricted outputs from SpeechGPT, bypassing its safety mechanisms. The severity depends on SpeechGPT's deployment context: in a virtual assistant, for example, this could lead to the dissemination of harmful information or instructions, privacy violations, or other malicious actions.

Affected Systems: SpeechGPT, and potentially other multimodal large language models (MLLMs) employing similar discrete audio tokenization schemes.

Mitigation Steps:

  • Improve model robustness through adversarial training incorporating diverse adversarial audio examples.
  • Develop more sophisticated safety filters that are resistant to manipulation at a token level.
  • Implement audio preprocessing techniques (e.g., denoising) to detect and remove characteristic features of adversarial audio.
  • Enhance the alignment between audio tokens and semantic meaning to lessen the chance of token manipulation triggering unintended outputs.
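As an illustration of the audio-preprocessing mitigation above, the sketch below applies a simple moving-average low-pass filter to a waveform before it reaches the tokenizer. This is only a crude stand-in under stated assumptions: real deployments would use spectral or learned denoisers, and whether smoothing actually defeats this attack is not established by the entry, only suggested as a direction.

```python
def moving_average_denoise(samples, window=5):
    """Smooth a waveform (list of float samples) with a moving-average
    low-pass filter. The intent is to attenuate fine-grained perturbations
    characteristic of adversarial audio before tokenization."""
    half = window // 2
    n = len(samples)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))  # local window mean
    return out
```

A preprocessing stage like this could also flag inputs whose smoothed and raw versions differ sharply, treating a large residual as a signal of possible adversarial manipulation.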

© 2025 Promptfoo. All rights reserved.