Bijection-Based LLM Jailbreak
Research Paper
Endless Jailbreaks with Bijection Learning
Description: Large Language Models (LLMs) are vulnerable to a novel "bijection learning" attack that leverages in-context learning to teach the model a custom string-to-string encoding, bypassing built-in safety mechanisms. The attack encodes a harmful query, sends it to the model, and decodes the response, effectively circumventing safety filters. The complexity of the encoding is tunable, so the attack can be adapted to different LLMs; more capable models can learn more complex encodings and are correspondingly more susceptible to them.
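To make the mechanism concrete, below is a minimal sketch in Python, assuming a letter-level bijection; the function names (make_bijection, encode, decode), the two-digit tokens, and the "/" space marker are illustrative choices rather than the authors' implementation.

```python
import random
import string


def make_bijection(num_fixed: int = 10, seed: int = 0) -> dict[str, str]:
    """Map each lowercase letter to itself or to a two-digit token.

    Fewer fixed points yield a more complex encoding; the paper tunes this
    kind of complexity knob to match the target model's capability.
    """
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, num_fixed))
    return {ch: ch if ch in fixed else f"{i:02d}" for i, ch in enumerate(letters)}


def encode(text: str, mapping: dict[str, str]) -> str:
    """Emit one token per character; "/" marks a space so it survives splitting."""
    return " ".join("/" if ch == " " else mapping.get(ch, ch) for ch in text.lower())


def decode(encoded: str, mapping: dict[str, str]) -> str:
    """Invert the mapping, e.g. on a model response written in the encoding."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(" " if tok == "/" else inverse.get(tok, tok) for tok in encoded.split())
```

Encoding a benign sentence and decoding it back confirms the mapping is invertible; in the attack, the same decode step is applied to the model's encoded response.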
Examples: See repository [link to be added by publisher]. The attack involves creating a bijective mapping (e.g., mapping letters to numbers or tokens). The prompt includes a multi-turn conversation teaching the LLM this mapping, followed by a harmful query encoded using the mapping. The model's encoded response is then decoded to reveal the unsafe output. Specific examples of bijections and prompts are available in the paper's appendix.
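The exact bijections and prompts are in the paper's appendix; the rough sketch below, which reuses encode from the previous snippet and assumes an OpenAI-style chat-message list, shows only how such a teaching conversation could be structured around a benign placeholder query (the wording, practice sentence, and build_teaching_messages helper are illustrative assumptions).

```python
def build_teaching_messages(mapping: dict[str, str], query: str) -> list[dict[str, str]]:
    """Assemble a multi-turn conversation that teaches the encoding in context."""
    rules = ", ".join(f"{k} -> {v}" for k, v in mapping.items())
    practice = "the quick brown fox jumps over the lazy dog"
    return [
        # system turn states the mapping rules
        {"role": "system", "content": f"You communicate only in a code language. Mapping: {rules}"},
        # a few practice turns so the model learns the code via in-context learning
        {"role": "user", "content": f"Translate into the code language: {practice}"},
        {"role": "assistant", "content": encode(practice, mapping)},
        {"role": "user", "content": f"Translate out of the code language: {encode(practice, mapping)}"},
        {"role": "assistant", "content": practice},
        # final turn carries the encoded query; the reply is decoded offline
        {"role": "user", "content": encode(query, mapping)},
    ]
```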
Impact: Successful exploitation allows adversaries to elicit unsafe behavior from LLMs, such as instructions for illegal activity, cybercrime, or the creation of chemical/biological hazards, while bypassing content filters and other safety mechanisms. The attack's effectiveness increases with the target model's capability.
Affected Systems: A wide range of frontier LLMs, including models from Anthropic (Claude) and OpenAI (GPT). Specific versions affected depend on the bijection complexity employed and are detailed in the original research.
Mitigation Steps:
- Enhance LLM safety mechanisms to be robust against in-context learning of arbitrary encodings.
- Implement more sophisticated input/output filtering capable of detecting and blocking encoded malicious prompts and responses, even when they are not directly flagged as harmful (a heuristic sketch follows this list).
- Develop defenses that identify and mitigate the capability degradation caused by processing complex encodings, for example through resource limits or complexity analysis of inputs.
- Regularly evaluate LLMs against novel attacks, including those leveraging in-context learning and encoding techniques.
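As a concrete illustration of the filtering mitigation above, the sketch below flags prompts whose tokens look unlike ordinary words, which is characteristic of letter-level encodings; the function name, regular expression, and thresholds are assumptions for illustration, and a production filter would want stronger signals (e.g., perplexity under a language model) plus an equivalent check on model outputs.

```python
import re


def looks_like_custom_encoding(text: str,
                               max_nonword_ratio: float = 0.4,
                               min_tokens: int = 20) -> bool:
    """Flag prompts whose tokens are mostly not ordinary words.

    Encoded payloads (digit tokens, shuffled letters, invented symbols) contain
    few dictionary-like tokens, whereas natural-language prompts contain many.
    """
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False  # too short to judge reliably
    # Count tokens that are not plain words of two or more letters; single-letter
    # tokens also count as suspicious, which slightly penalizes "a"/"I" but helps
    # catch character-level encodings with identity-mapped letters.
    nonword = sum(1 for t in tokens if not re.fullmatch(r"[A-Za-z]{2,}[.,!?;:]?", t))
    return nonword / len(tokens) > max_nonword_ratio
```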
© 2025 Promptfoo. All rights reserved.