Vulnerabilities in multimodal AI systems
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
Multimodal Large Language Models (MLLMs) are vulnerable to Jailbreak-Probability-based Attacks (JPA). JPA leverages a Jailbreak Probability Prediction Network (JPPN) to identify and optimize adversarial perturbations in input images, maximizing the probability of eliciting harmful responses from the MLLM, even with small perturbation bounds and few iterations. The attack operates by modifying the input image's hidden states within the MLLM to increase the predicted jailbreak probability.
FC-Attack leverages automatically generated flowcharts containing step-by-step descriptions derived or rephrased from harmful queries, combined with a benign textual prompt, to jailbreak Large Vision-Language Models (LVLMs). The vulnerability lies in the model's susceptibility to visual prompts containing harmful information within the flowcharts, thus bypassing safety alignment mechanisms.
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack leveraging a "Distraction Hypothesis". The attack, termed Contrasting Subimage Distraction Jailbreaking (CS-DJ), bypasses safety mechanisms by using multiple contrasting subimages and a decomposed harmful prompt to overwhelm the model's attention and reduce its ability to identify malicious content. The complexity of the visual input, rather than its specific content, is the key to successful exploitation.
A novel "Flanking Attack" exploits the vulnerability of multimodal LLMs (e.g., Google Gemini) to bypass content moderation filters by embedding adversarial prompts within a sequence of benign prompts. The attack leverages the LLM's processing of both audio and text, obfuscating harmful requests through contextualization and layering, thereby yielding policy-violating responses.
Multimodal Large Language Models (MLLMs) exhibit a vulnerability where shuffling the order of words in text prompts or patches in image prompts can bypass their safety mechanisms, despite the model still understanding the intent of the shuffled input. This "Shuffle Inconsistency" allows attackers to elicit harmful responses by submitting shuffled harmful prompts that would otherwise be blocked.
A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model’s output as measured by a toxicity detection model (Perspective API or Detoxify).
A novel jailbreak attack, Multi-Modal Linkage (MML), exploits the vulnerability in Large Vision-Language Models (VLMs) by leveraging an "encryption-decryption" scheme across text and image modalities. MML encrypts malicious queries within images (e.g., using word replacement, image transformations) to bypass initial safety mechanisms. A subsequent text prompt guides the VLM to "decrypt" the content, eliciting harmful outputs. "Evil alignment," framing the attack within a video game scenario, further enhances the attack's success rate.
© 2025 Promptfoo. All rights reserved.