Cognitive Overload Jailbreak

Description: Large Language Models (LLMs) are vulnerable to jailbreaking attacks exploiting cognitive overload induced by multilingual prompts, veiled expressions, and effect-to-cause reasoning. These attacks bypass safety mechanisms by overwhelming the model's processing capabilities, leading to the generation of unsafe or harmful responses. The attacks are effective against various LLMs, including both open-source and proprietary models, and are not easily mitigated by existing defense mechanisms.

Examples:

Multilingual Cognitive Overload: Prompting an LLM with a harmful instruction translated into a low-resource language, or switching languages mid-conversation, can elicit unsafe responses. For instance, a prompt starting with a benign English inquiry followed by a harmful instruction in Punjabi may bypass safety filters. Specific examples are detailed in the provided research paper. See arXiv:2405.18540.
Veiled Expression: Replacing sensitive words in a harmful prompt with synonyms or veiled expressions can circumvent detection mechanisms based on keyword identification. Examples are provided in the research paper detailing various paraphrasing techniques used to obfuscate malicious intent. See arXiv:2405.18540.
Effect-to-Cause Reasoning: Presenting a scenario where an individual is acquitted despite committing a crime, and then prompting the LLM to suggest other actions the individual could have taken without legal consequence, can elicit instructions for harmful behaviors. See arXiv:2405.18540 for example prompts.

Impact: Successful exploitation of this vulnerability can lead to the generation of unsafe content, including but not limited to hate speech, violent content, instructions for illegal activities, and misinformation. This compromises the safety and reliability of the affected LLM and poses a significant risk to users.

Affected Systems: Various Large Language Models (LLMs), including both open-source (e.g., Llama 2, Vicuna, WizardLM, Guanaco, MPT) and proprietary models (e.g., ChatGPT)

Mitigation Steps:

Implement robust multilingual safety filters that go beyond simple keyword detection.
Develop models resistant to paraphrasing and other techniques used to obfuscate malicious intent.
Enhance reasoning capabilities to better identify and reject prompts employing effect-to-cause reasoning.
Explore alternative defense strategies beyond those evaluated in the research paper, such as incorporating cognitive load management techniques. Further research is needed to discover effective mitigation strategies.

Cognitive Overload Jailbreak

Research Paper