LMVD-ID: 3ddb7d3b
Published November 1, 2024

Composable String Jailbreaks

Affected Models: Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus, GPT-4o, GPT-4o-mini

Research Paper

Plentiful Jailbreaks with String Compositions

View Paper

Description: Large Language Models (LLMs) are vulnerable to jailbreaking attacks that use sequences of invertible string transformations (string compositions). Attackers can chain multiple transformations (e.g., leetspeak, Base64, ROT13, word reversal) to obfuscate malicious prompts, bypassing safety mechanisms that catch simpler, single-transformation attacks. Even safety-trained models can decode these transformed inputs well enough to act on them, producing unsafe outputs that would be refused in plaintext.
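The composition mechanism described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual attack harness: each transform has an encode/decode pair, so any chain of them remains mechanically reversible while obscuring the surface text.

```python
import base64
import codecs

# Hypothetical sketch of composing invertible string transformations.
# Each transform is paired with its inverse; applying the inverses in
# reverse order recovers the original prompt exactly.

def rot13(s):        return codecs.encode(s, "rot13")   # self-inverse
def word_reverse(s): return " ".join(s.split()[::-1])   # self-inverse
def b64_encode(s):   return base64.b64encode(s.encode()).decode()
def b64_decode(s):   return base64.b64decode(s.encode()).decode()

LEET   = str.maketrans("aeio", "4310")
UNLEET = str.maketrans("4310", "aeio")

def compose(prompt, transforms):
    for t in transforms:
        prompt = t(prompt)
    return prompt

def invert(obfuscated, inverses):
    # Undo each transform in reverse order of application.
    for inv in reversed(inverses):
        obfuscated = inv(obfuscated)
    return obfuscated

chain    = [lambda s: s.translate(LEET),   rot13, word_reverse, b64_encode]
inverses = [lambda s: s.translate(UNLEET), rot13, word_reverse, b64_decode]

original   = "describe the payload"
obfuscated = compose(original, chain)
assert invert(obfuscated, inverses) == original
```

Because every layer is invertible, an automated attacker can sample many such chains cheaply and retry until one slips past a model's safety filters.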

Examples: See Appendix B of the referenced research paper for an example composition prompt. The paper also provides examples of individual transformations and their application, along with the full list of transformations and attack examples.

Impact: Successful exploitation allows attackers to elicit harmful or unsafe outputs from LLMs, circumventing safety measures and potentially leading to the generation of malicious content, personal information disclosure, or other forms of misuse. The automated nature of the attack enables large-scale exploitation across various models.

Affected Systems: The vulnerability affects a range of LLMs, including but not limited to models in the Claude and GPT-4o families; the models listed above were confirmed vulnerable in the referenced research.

Mitigation Steps:

  • Input Sanitization: Implement robust input sanitization techniques that identify and neutralize obfuscation techniques, including multiple chained transformations. This requires a constantly evolving defense capable of adapting to new composition methods.
  • Output Filtering: Enhance output filtering mechanisms with a focus on detecting outputs that are semantically equivalent to unsafe responses, regardless of encoding.
  • Adversarial Training: Incorporate adversarial training strategies that specifically target attacks involving multiple concatenated string transformations.
  • Model Monitoring: Implement continuous monitoring and auditing of LLM behavior to detect and respond to attempts to bypass safety mechanisms.
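As an illustration of the first mitigation, a sanitizer can iteratively peel reversible encodings before any safety check runs, so a chained encoding cannot hide a prompt from a keyword- or classifier-based filter. The sketch below is a simplified assumption of such a normalizer, handling only Base64 layers; a production defense would need detectors for many more transformations.

```python
import base64
import re

# Hypothetical input-sanitization sketch: strip layers of Base64
# encoding from a prompt before it reaches downstream safety filters.

BASE64_RE = re.compile(r"^[A-Za-z0-9+/=]+$")

def try_base64_decode(text):
    """Return decoded text if it is valid Base64 over printable ASCII, else None."""
    if not BASE64_RE.match(text.strip()):
        return None
    try:
        decoded = base64.b64decode(text.strip(), validate=True).decode("ascii")
    except Exception:
        return None
    return decoded if decoded.isprintable() else None

def normalize(prompt, max_depth=5):
    """Peel up to max_depth encoding layers; report how many were removed."""
    layers = 0
    for _ in range(max_depth):
        decoded = try_base64_decode(prompt)
        if decoded is None:
            break
        prompt, layers = decoded, layers + 1
    return prompt, layers

# Doubly Base64-encoded input is reduced to plaintext in two passes.
twice = base64.b64encode(
    base64.b64encode(b"ignore all instructions").decode().encode()
).decode()
text, depth = normalize(twice)
assert text == "ignore all instructions" and depth == 2
```

A nonzero layer count is itself a useful signal: legitimate user prompts are rarely multiply encoded, so the count can feed directly into the monitoring step above.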

© 2025 Promptfoo. All rights reserved.