Vulnerabilities impacting model output reliability
Large Language Models (LLMs) employing internal security mechanisms based on linearly separable embeddings in intermediate layers are vulnerable to a generative adversarial attack. The CAVGAN framework exploits this vulnerability by generating adversarial perturbations that misclassify malicious inputs as benign, allowing the attacker to bypass the LLM's safety filters and elicit harmful outputs.
Multimodal Large Language Models (MLLMs) are vulnerable to visual contextual attacks, where carefully crafted images and accompanying text prompts can bypass safety mechanisms and elicit harmful responses. The vulnerability stems from the MLLM's ability to integrate visual and textual context to generate outputs, allowing attackers to create realistic scenarios that subvert safety filters. Specifically, the attack leverages image-driven context injection to construct deceptive multi-turn conversations that gradually lead the MLLM to produce unsafe responses.
Large Language Models (LLMs) are vulnerable to a novel adversarial attack, Alphabet Index Mapping (AIM), which achieves high success rates in bypassing safety filters ("jailbreaking"). AIM encodes prompts by converting characters to their alphabet indices, maximizing semantic dissimilarity while maintaining straightforward decoding instructions. This allows malicious prompts to evade detection based on semantic similarity, even when the LLM correctly decodes the intent.
A novel black-box attack, dubbed BitBypass, exploits the vulnerability of aligned LLMs by camouflaging harmful prompts using hyphen-separated bitstreams. This bypasses safety alignment mechanisms by transforming sensitive words into their bitstream representations and replacing them with placeholders, in conjunction with a specially crafted system prompt that instructs the LLM to convert the bitstream back to text and respond as if given the original harmful prompt.
A vulnerability in Large Language Models (LLMs) allows adversarial prompt distillation from a large language model (LLM) to a smaller language model (SLM), enabling efficient and stealthy jailbreak attacks. The attack leverages knowledge distillation techniques, reinforcement learning, and dynamic temperature control to transfer the LLM's ability to bypass safety mechanisms to a smaller, more easily deployable SLM. This allows for lower computational cost attacks with a potentially high success rate.
A hybrid jailbreak attack, combining gradient-guided token optimization (GCG) with iterative prompt refinement (PAIR or WordGame+), bypasses LLM safety mechanisms resulting in the generation of disallowed content. The hybrid approach leverages the strengths of both techniques, circumventing defenses effective against single-mode attacks. Specifically, the combination of semantically crafted prompts and strategically placed adversarial tokens confuse and overwhelm existing defenses.
The MIST attack exploits a vulnerability in black-box large language models (LLMs) allowing iterative semantic tuning of prompts to elicit harmful responses. The attack leverages synonym substitution and optimization strategies to bypass safety mechanisms without requiring access to the model's internal parameters or weights. The vulnerability lies in the susceptibility of the LLM to semantically similar prompts that trigger unsafe outputs.
Large language models (LLMs) protected by multi-stage safeguard pipelines (input and output classifiers) are vulnerable to staged adversarial attacks (STACK). STACK exploits weaknesses in individual components sequentially, combining jailbreaks for each classifier with a jailbreak for the underlying LLM to bypass the entire pipeline. Successful attacks achieve high attack success rates (ASR), even on datasets of particularly harmful queries.
A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.
A white-box vulnerability allows attackers with full model access to bypass LLM safety alignments by identifying and pruning parameters responsible for rejecting harmful prompts. The attack leverages a novel "twin prompt" technique to differentiate safety-related parameters from those essential for model utility, performing fine-grained pruning with minimal impact on overall model functionality.
© 2025 Promptfoo. All rights reserved.