Single-Turn LLM Jailbreak
Research Paper
Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
Description: A single-turn prompt injection attack that bypasses LLM content moderation by embedding, within one prompt, a simulated multi-turn conversation that escalates toward harmful or inappropriate output. The attack exploits the LLM's tendency to maintain context and continue established patterns, even when doing so produces undesirable content.
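The structural mechanism can be sketched harmlessly: the attacker writes one message that impersonates an already-escalated dialogue, so the model is primed to continue the pattern. The sketch below uses benign placeholders throughout; the role labels, turn count, and formatting are illustrative assumptions, not the exact template from the paper.

```python
# Illustrative sketch of the STCA prompt *structure* only. Every turn body
# is a harmless placeholder; the attack's mechanism is the escalation
# pattern itself, not any specific topic.
fake_turns = [
    ("User", "[innocuous opening request]"),
    ("Assistant", "[plausible compliant reply]"),
    ("User", "[slightly escalated follow-up]"),
    ("Assistant", "[reply that reinforces the established pattern]"),
    ("User", "[final escalated request the model is primed to continue]"),
]

# The entire fabricated dialogue is sent as a single user message.
single_turn_prompt = "\n".join(f"{role}: {text}" for role, text in fake_turns)
```

Because the model receives the whole transcript as one input, its pattern-continuation tendency can override per-turn moderation that would have triggered in a genuine multi-turn exchange.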
Examples: See https://github.com/alanaqrawi/STCA for detailed examples, including prompts designed to elicit:
- Explicit language in a high-stress scenario (e.g., characters trapped in a burning building).
- Controversial historical statements by simulating a role-playing scenario in 1938 Germany.
- Conspiracy theories linking a famous singer to Adolf Hitler.
Impact: Successful exploitation allows attackers to generate harmful, biased, or inappropriate content from LLMs, bypassing built-in safety measures. This can lead to the dissemination of misinformation, hate speech, and other forms of harmful content. The impact is amplified in applications where LLMs generate content directly for public consumption or decision-making.
Affected Systems: Multiple LLMs, including but not limited to GPT-4, GPT-4-Mini, LLaMA-3 70B, LLaMA-3.1 70B, Gemini-1.5, Claude Sonnet, and other models with similar conversational capabilities and content moderation systems.
Mitigation Steps:
- Develop more robust content moderation systems capable of detecting and mitigating sophisticated prompt injection attacks, using mechanisms beyond keyword filtering that account for context and narrative escalation.
- Implement advanced prompt sanitization and validation techniques to prevent the construction of exploitable single-turn prompts.
- Employ adversarial training methods to improve the resilience of LLMs against STCA and similar attacks.
- Conduct thorough security testing and red-teaming exercises to identify and address vulnerabilities in LLM deployments.
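As a starting point for the context-aware detection recommended above, a pre-moderation filter can flag single prompts that structurally resemble embedded transcripts. This is a minimal heuristic sketch, not a complete defense: the marker patterns and turn threshold are assumptions chosen for illustration, and a real deployment would combine this with semantic escalation analysis.

```python
import re

# Role-label patterns that commonly signal a fabricated dialogue embedded
# in a single prompt. The label set and threshold below are illustrative
# assumptions, not drawn from the STCA paper or any moderation API.
TURN_MARKER = re.compile(
    r"^\s*(user|assistant|human|ai)\s*[:>]",
    re.IGNORECASE | re.MULTILINE,
)

def looks_like_embedded_dialogue(prompt: str, min_turns: int = 3) -> bool:
    """Flag prompts containing several fabricated conversation turns,
    a structural signature of single-turn crescendo-style attacks."""
    return len(TURN_MARKER.findall(prompt)) >= min_turns
```

A flagged prompt would then be routed to stricter moderation rather than rejected outright, since legitimate inputs (e.g., pasted chat logs for summarization) can also match this pattern.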
© 2025 Promptfoo. All rights reserved.