LMVD-ID: 37052af6
Published August 1, 2025

Markovian Adaptive Jailbreak

Affected Models: GPT-4o, Claude 3.5, Gemini 2, Qwen 2.5 7B, Gemma 2 9B

Research Paper

MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies


Description: Large Language Models (LLMs) are vulnerable to an adaptive black-box jailbreaking framework named MAJIC (Markovian Adaptive Jailbreaking via Iterative Composition of diverse innovative strategies). MAJIC models the selection and composition of prompt disguise strategies as a Markov chain, allowing an attacker to bypass safety alignment mechanisms. Unlike static attacks, it initializes a transition matrix with a proxy model to estimate the probability that a given strategy will succeed after a prior strategy has failed, and during the attack it uses Q-learning to update this matrix from real-time feedback from the target model. The attack draws from a "Disguise Strategy Pool" containing methods such as Contextual Assumption, Linguistic Obfuscation, Role-Playing Framing, Semantic Inversion, and Literary Disguise. This approach achieves high attack success rates (>90%) with low query volume (<15 queries), effectively circumventing safety guardrails on state-of-the-art models including GPT-4o, Gemini-2.0, and Claude-3.5-Sonnet.
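The core mechanism lends itself to a minimal sketch. The strategy names below mirror the paper's Disguise Strategy Pool; the uniform initialization and the helper function are illustrative assumptions rather than the authors' implementation (MAJIC seeds the matrix offline with a proxy model rather than uniformly):

```python
# Minimal sketch of Markovian strategy selection over a disguise strategy pool.
# Strategy names follow the paper; matrix values and function names are
# illustrative assumptions, not the MAJIC implementation.
import numpy as np

STRATEGIES = [
    "contextual_assumption",
    "linguistic_obfuscation",
    "role_playing_framing",
    "semantic_inversion",
    "literary_disguise",
]

n = len(STRATEGIES)

# T[i, j]: estimated probability that strategy j succeeds after strategy i
# has just failed. MAJIC seeds these values with a proxy model; a uniform
# prior is used here purely for illustration.
T = np.full((n, n), 1.0 / n)

def next_strategy(failed_idx: int, rng: np.random.Generator) -> int:
    """Sample the next disguise strategy from the row of the strategy that failed."""
    row = T[failed_idx]
    return int(rng.choice(n, p=row / row.sum()))
```

Because the matrix is seeded against a proxy model before any queries reach the target, the first transition attempted is already informed rather than random, which contributes to the low query budget reported in the paper.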

Examples: The MAJIC framework automates the selection of the following strategies based on statistical success probabilities:

  1. Semantic Inversion: The attacker rewrites a harmful prompt into a semantically opposite, positive statement. The target LLM is instructed to respond to this positive prompt, and the attacker conceptually reverses the response to reconstruct the harmful answer.
  2. Literary Disguise: The harmful intent is framed within artistic contexts such as poetry, fables, or philosophical musings to mask the malicious query from alignment filters.
  3. Iterative Markovian Transition (sketched in code after this list):
  • Step 1: The system attempts a Contextual Assumption strategy (e.g., embedding the prompt in a historical analogy).
  • Step 2: If Step 1 fails, the system consults the Markov transition matrix. If the matrix indicates a high probability of success for Linguistic Obfuscation following a Contextual Assumption failure, the system automatically rewrites the prompt using technical jargon or multilingual elements.
  • Step 3: The matrix is updated via Q-learning based on whether Step 2 succeeded, optimizing the path for future attempts.
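Steps 1-3 above amount to a small reinforcement-learning loop, sketched below under stated assumptions: the state is the strategy that just failed, the action is the strategy tried next, and the reward encodes whether the target model complied. The learning rate, discount factor, and softmax readout are illustrative hyperparameters, not values taken from the paper:

```python
# Self-contained sketch of the Q-learning feedback loop described in steps 1-3.
# State  = index of the strategy that just failed on the target model.
# Action = index of the strategy tried next.
# Reward = 1.0 if the target complied, 0.0 if it refused (assumed encoding).
import numpy as np

N_STRATEGIES = 5           # size of the disguise strategy pool
ALPHA, GAMMA = 0.5, 0.9    # assumed learning rate and discount factor

Q = np.zeros((N_STRATEGIES, N_STRATEGIES))   # Q[prev_failed, next_tried]

def q_update(prev: int, tried: int, reward: float) -> None:
    """Standard Q-learning update after observing the target's response."""
    best_next = Q[tried].max()
    Q[prev, tried] += ALPHA * (reward + GAMMA * best_next - Q[prev, tried])

def transition_probs(prev: int, temperature: float = 1.0) -> np.ndarray:
    """Read the updated transition row for `prev` as a softmax over Q-values."""
    logits = Q[prev] / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Over repeated attempts, rows of the learned table concentrate probability on compositions that have historically succeeded after a given failure, which is how the attack stays within the reported budget of under 15 queries per prompt.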

Impact:

  • Safety Bypass: Attackers can successfully elicit harmful, unethical, or illegal content (e.g., violent crimes, self-harm, malware generation) from safety-aligned models.
  • High Efficiency: The attack requires significantly fewer queries (average <15) compared to traditional brute-force or genetic algorithm approaches, reducing the likelihood of detection by rate-limiting or anomaly detection systems.
  • Generalization: The vulnerability affects a wide range of models, including those with robust safety alignment like Claude-3.5-Sonnet.

Affected Systems:

  • Open-Source Models: Qwen-2.5-7B-it, Gemma-2-9b-it.
  • Commercial/Closed-Source Models: Gemini-2.0-flash, GPT-4o, Claude-3.5-sonnet.
  • General: Any LLM exposed via a black-box API that provides feedback (refusal or compliance) to input prompts.
