LMVD-ID: 91f9a8dc
Published June 1, 2024

Model Combination Misuse

Affected Models: Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, Llama 2 7B-Chat, Llama 2 13B-Chat, Llama 2 70B-Chat, Mistral 7B, Mixtral 8x7B, DALL-E 3, Stable Diffusion v1.5, GPT-4

Research Paper

Adversaries can misuse combinations of safe models


Description: LLMs that are individually assessed as "safe" can still be combined by an adversary to achieve malicious outcomes. Through task decomposition, the attack exploits the complementary strengths of multiple models: typically a high-capability model that refuses malicious requests and a low-capability model that does not. The adversary can decompose the task manually into benign subtasks (solved by the high-capability model) and malicious-but-easy subtasks (solved by the low-capability model), or automate the decomposition by having the weaker model generate related benign subtasks, sending them to the stronger model, and then feeding the solutions back to the weaker model in-context to accomplish the malicious goal.
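To make the automated variant concrete for red-teaming purposes, the loop below sketches the decomposition-and-aggregation pattern described above. It is a minimal illustration, not code from the paper: `query_weak` and `query_strong` are hypothetical helpers that take a prompt string and return the model's text response.

```python
def automated_decomposition(goal, query_weak, query_strong, n_subtasks=3):
    """Illustrative sketch only; query_weak/query_strong are hypothetical helpers."""
    # 1. The weak, less-aligned model breaks the goal into subtasks that look
    #    benign in isolation.
    plan = query_weak(
        f"List {n_subtasks} independent, benign-sounding subtasks that together "
        f"accomplish the following task, one per line:\n{goal}"
    )
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. The strong, aligned model solves each subtask; each request passes its
    #    safety filter because no single subtask reveals the overall intent.
    solutions = [query_strong(subtask) for subtask in subtasks]

    # 3. The weak model combines the high-quality partial solutions in-context
    #    to produce output the strong model would refuse to generate directly.
    context = "\n\n".join(solutions)
    return query_weak(f"Using the material below, complete this task: {goal}\n\n{context}")
```

Defenders can reuse the same loop as a red-team harness: run it against candidate model pairings and measure how often the combined system completes tasks that each model refuses on its own.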

Examples:

  • Vulnerable Code Generation: A strong LLM refuses to generate code containing a directory traversal vulnerability. A weaker LLM generates a functional but incomplete script that lacks the vulnerability. The adversary has the strong LLM complete the script, then uses the weaker LLM to subtly introduce the vulnerability into the completed code. (See paper for specific prompt examples and dataset; a generic illustration of the vulnerability pattern follows this list)

  • Explicit Image Generation: A strong text-to-image model refuses prompts for explicit content. The adversary uses this model to generate an image of a benign scene (e.g., a person sunbathing). Then, using a weaker model (e.g., a less-safe image editing model), they modify this image to add explicit content. (See paper for specific prompt examples and dataset)

  • Malicious Python Script Generation: A strong LLM refuses to create a reverse shell script. A weaker LLM generates related but benign Python code snippets. The adversary uses the strong model to solve the benign tasks, then incorporates these results into a prompt for the weaker model to construct the malicious script. (See paper for specific prompt examples and dataset).
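The directory traversal example hinges on a small, easy-to-miss change to otherwise benign file-handling code. The snippet below is a generic illustration of that vulnerability class (it is not drawn from the paper's dataset): the unsafe function joins untrusted input directly into a filesystem path, while the safer function resolves the path and confirms it stays inside the allowed directory.

```python
from pathlib import Path

BASE_DIR = Path("/srv/app/uploads")  # hypothetical upload directory

# Vulnerable pattern: untrusted input is joined directly into the path, so a
# filename like "../../etc/passwd" escapes BASE_DIR (directory traversal).
def read_file_unsafe(filename: str) -> bytes:
    return (BASE_DIR / filename).read_bytes()

# Safer pattern: resolve the final path and verify it is still inside BASE_DIR
# before reading (Path.is_relative_to requires Python 3.9+).
def read_file_safe(filename: str) -> bytes:
    target = (BASE_DIR / filename).resolve()
    if not target.is_relative_to(BASE_DIR.resolve()):
        raise ValueError("path escapes the allowed directory")
    return target.read_bytes()
```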

Impact: This vulnerability allows adversaries to combine multiple models to bypass the safety mechanisms of each one individually, producing malicious code, scripts, images, and manipulative text beyond what any single model would or could generate on its own. The impact ranges from the creation of malicious software to highly targeted disinformation campaigns.

Affected Systems: Any system using multiple LLMs with varying safety configurations and capabilities. This vulnerability particularly affects systems deploying a combination of highly aligned "frontier" models and less-aligned open-source models.

Mitigation Steps:

  • Restrict Model Access: Limit the types of LLMs that can interact with each other within a system. Avoid giving adversarial actors access to both high-capability and low-capability models concurrently.

  • Comprehensive Red Teaming: Expand red-teaming efforts to cover combinations of models, not just individual models in isolation, including tests of whether an adversary can automate the decomposition process.

  • Enhanced Safety Mechanisms: Develop and implement safety mechanisms that are robust to task decomposition. Focus on preventing the aggregation of outputs from multiple models towards malicious ends.

  • Runtime Monitoring and Control: Implement runtime monitoring to detect and prevent suspicious interactions between LLMs that might indicate task decomposition (a minimal sketch follows this list).

  • Model-Specific Defenses: Explore model-specific defenses that make it difficult to use one model to improve the output of another in malicious ways.
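As a concrete starting point for the runtime monitoring step above, a gateway that mediates all model traffic can flag sessions in which output from one model reappears in a prompt sent to a different model. The class below is a minimal sketch under stated assumptions (a hypothetical gateway calls `record_response` and `check_prompt`, and overlap is detected by naive substring matching); a production control would use fuzzier matching and combine the signal with intent classification.

```python
from collections import defaultdict

class CrossModelMonitor:
    """Flags prompts that reuse large chunks of another model's output (sketch only)."""

    def __init__(self, min_overlap_chars: int = 200):
        self.min_overlap = min_overlap_chars
        # session_id -> list of (model_name, output_text) observed so far
        self.history = defaultdict(list)

    def record_response(self, session_id: str, model_name: str, output: str) -> None:
        self.history[session_id].append((model_name, output))

    def check_prompt(self, session_id: str, model_name: str, prompt: str) -> bool:
        """Return True if the prompt should be flagged for review."""
        for prev_model, prev_output in self.history[session_id]:
            if prev_model == model_name:
                continue  # only cross-model reuse suggests decomposition
            # Naive check: a long verbatim slice of another model's output
            # appears inside this prompt.
            if len(prev_output) >= self.min_overlap and prev_output[: self.min_overlap] in prompt:
                return True
        return False
```

In practice the monitor would be wired into whatever proxy already routes model requests, and a flagged session would trigger review or rate limiting rather than an outright block.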

© 2025 Promptfoo. All rights reserved.