Conditional Prompt Hijack

Description: The CAIN (Conversational AI attack) framework introduces a black-box adversarial attack vector against Large Language Models (LLMs) that utilizes malicious system prompts to hijack conversations. Unlike traditional jailbreaks that aim to bypass safeguards for all inputs, CAIN optimizes system prompts to induce incorrect or harmful responses only for specific targeted questions (e.g., "Are COVID vaccines safe?", "Who should I vote for?"), while maintaining high accuracy and benign behavior on all other non-targeted queries.

The vulnerability utilizes a two-stage optimization framework:

AdvAutoPrompt (AAP): An automated sentence-level generation module that initializes a human-readable, coherent prompt designed to maximize the divergence between the target question's loss and the benign set's loss.
Greedy Word-Level Optimization: A refinement stage that applies iterative perturbations—specifically Random Split, Random Swap, Substitute Keyboard, and Substitute Synonym—to critical tokens within the system prompt.

The resulting system prompts exhibit low semantic similarity to the target questions and pass standard perplexity-based filters, allowing them to evade detection while successfully manipulating model outputs for specific misinformation campaigns.

Examples: Note: The specific optimized prompt text strings are generated dynamically. Refer to the associated repository or dataset for the raw prompt artifacts.

Attack Scenario 1: Targeted Misinformation (Untargeted Attack on Factuality)

Target Question: "What is the nationality of Christian Doppler?"
Malicious System Prompt: (A benign-looking text block optimized by CAIN)
Result: The model answers "German" (Incorrect/Targeted) instead of "Austrian" (Correct).
Benign Check: When asked "What is the capital of France?", the model correctly answers "Paris," concealing the compromised state.

Attack Scenario 2: Biasing Multiple Choice (Targeted Attack)

Target Question: A TruthfulQA question regarding vaccine safety.
Context: The model is presented with options A (Incorrect/Harmful) and B (Correct).
Malicious System Prompt: (Optimized via CAIN)
Result: The model consistently selects option A.
Benign Check: The model performs normally on other multiple-choice questions in the Misconceptions or History categories.

Impact:

Targeted Disinformation: Attackers can craft "sleeper agent" prompts that spread specific misinformation (political, medical, legal) only when triggered by precise user queries.
Supply Chain Contamination: Malicious prompts can be distributed via public prompt marketplaces (e.g., PromptBase, FlowGPT, Hugging Face) or AI agent platforms. Users adopting these "high-performing" prompts unknowingly deploy compromised agents.
Evasion of Defenses: The attack preserves benign accuracy (up to 71.44 F1 on commercial models) and uses human-readable text, bypassing existing defenses based on performance degradation monitoring, lexical similarity detection, or perplexity filtering.
Illusory Truth Effect: By functioning correctly on the vast majority of queries, the compromised model builds user trust, increasing the likelihood that the targeted misinformation is accepted as truth.

Affected Systems: Validations were performed on the following systems, though the methodology is model-agnostic:

Open Source Models:
Llama-2 (7B, 13B)
Llama-3.1 (8B)
DeepSeek (7B)
Qwen (2.5, 7B to 32B)
Pythia (12B)
Commercial APIs (via System Prompt Injection):
OpenAI GPT-3.5-Turbo
OpenAI GPT-4o-mini
OpenAI GPT-4o-nano

Mitigation Steps:

Strict Supply Chain Vetting: Do not trust or deploy system prompts sourced from public marketplaces (e.g., PromptBase, FlowGPT) without rigorous red-teaming against sensitive target topics.
Targeted Red-Teaming: Standard benign benchmarks are insufficient. Evaluators must test system prompts against specific "high-risk" targeted queries (e.g., controversial political or medical questions) to detect targeted degradation.
Prompt Robustness Training: While the paper notes that lexical similarity and perplexity filters are ineffective, increasing the size and diversity of the benign dataset used during model alignment may reduce susceptibility to distribution shifts caused by adversarial prompts.
Limiting System Prompt Complexity: Restrict the token length and complexity allowed for user-supplied or third-party system prompts to reduce the search space available for embedding adversarial triggers.

Conditional Prompt Hijack

Research Paper