LMVD-ID: e60259b5
Published May 1, 2024

Adversarial Speech Jailbreak

Affected Models: flan-t5-xl, mistral-7b, mistral-7b-instruct, llama-2, chatgpt 3, claude 2.1, speechgpt

Research Paper

SpeechGuard: Exploring the adversarial robustness of multimodal large language models


Description: Multimodal Large Language Models (LLMs) that process speech input are vulnerable to adversarial attacks. Imperceptible perturbations added to the audio input can cause the model to generate unsafe or harmful text responses, overriding its built-in safety mechanisms. The attacks remain effective even with limited knowledge of the model's internal workings and transfer across different models.

Examples: See arXiv:2405.18540 for details on the attack methodology and specific examples of adversarial audio perturbations. The paper demonstrates attacks using both white-box (full model access) and black-box (limited model access) techniques. Specific audio examples are not publicly available due to ethical concerns.
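
To illustrate the white-box setting, the sketch below applies projected gradient descent (PGD) to a raw waveform, nudging a speech-enabled LLM toward an attacker-chosen target response while keeping the perturbation within a small L-infinity bound. The model interface (a callable returning a loss against the target token sequence), the function name, and the parameter values are illustrative assumptions; they do not reproduce the paper's implementation.

```python
import torch

def pgd_audio_attack(model, waveform, target_ids, eps=0.002, alpha=0.0005, steps=100):
    """White-box PGD sketch: perturb `waveform` so the model emits `target_ids`.

    `model(audio, labels=...)` is a hypothetical interface returning an object
    with a `.loss` attribute (cross-entropy against the target text), analogous
    to Hugging Face-style models; it is an assumption, not the paper's API.
    """
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        loss = model(waveform + delta, labels=target_ids).loss
        loss.backward()
        with torch.no_grad():
            # Step toward minimizing the loss on the attacker's target response.
            delta -= alpha * delta.grad.sign()
            # L-infinity projection keeps the perturbation imperceptible.
            delta.clamp_(-eps, eps)
            # (Optionally also clamp waveform + delta to the valid [-1, 1] range.)
        delta.grad.zero_()
    return (waveform + delta).detach()
```

In the black-box setting described in the paper, perturbations crafted against an accessible surrogate model are transferred to the target system, so the same loop can be run without gradients from the victim model itself.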

Impact: Successful attacks lead to the generation of unsafe and harmful text responses by the LLM, regardless of its intended safety constraints. This compromises the reliability and trustworthiness of the system, potentially allowing malicious actors to circumvent safety protocols and elicit harmful content. The level of impact depends on the application context and the severity of the generated harmful content.

Affected Systems: Multimodal Large Language Models that accept speech input and generate text responses. The paper specifically reports vulnerabilities in models that pair Conformer audio encoders with Flan-T5-XL or Mistral-7B-Instruct language models; similar architectures are likely affected as well.

Mitigation Steps:

  • Implement Time-Domain Noise Flooding (TDNF) as a pre-processing step: adding random noise to the audio input reduces the effectiveness of adversarial perturbations. The optimal Signal-to-Noise Ratio (SNR) must be determined empirically (see the sketch after this list).
  • Develop and deploy more robust adversarial training techniques tailored to the speech modality.
  • Regularly evaluate models for vulnerabilities to adversarial attacks, using both white-box and black-box testing methods.
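
A minimal sketch of the TDNF pre-processing step from the first bullet, assuming mono audio normalized to [-1, 1] in a NumPy array; the function name and default SNR are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def time_domain_noise_flooding(waveform: np.ndarray, snr_db: float = 30.0,
                               seed: int | None = None) -> np.ndarray:
    """Add white Gaussian noise to the waveform at the target SNR (in dB)."""
    rng = np.random.default_rng(seed)
    signal_power = float(np.mean(waveform ** 2))
    # SNR (dB) = 10 * log10(signal_power / noise_power); solve for noise power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    # Clip to keep the flooded audio within the valid amplitude range.
    return np.clip(waveform + noise, -1.0, 1.0)
```

In practice the SNR would be swept while measuring both attack success rate and downstream task accuracy, since a lower SNR disrupts adversarial perturbations more aggressively but also degrades recognition quality.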

© 2025 Promptfoo. All rights reserved.