LMVD-ID: 5cdb4e1d
Published November 1, 2024

Adversarial Suffix Jailbreak

Affected Models: mistral-7b-instruct-v0.3, falcon-7b-instruct, llama-2-7b-chat, llama-3-8b-instruct, llama-3.1-8b-instruct, gpt-4o, gpt-4o-mini, gpt-3.5-turbo

Research Paper

GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs


Description: Large language models (LLMs) are vulnerable to adversarial suffix injection attacks. A maliciously crafted suffix appended to an otherwise benign prompt can cause the model to generate harmful or undesired outputs, bypassing built-in safety mechanisms. The attack exploits the model's sensitivity to input perturbations to elicit responses outside its intended safety boundaries.
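To make the attack shape concrete, the sketch below is a minimal illustration of suffix injection, not the GASP optimization method itself: the adversarial suffix is simply concatenated to the request before it is sent to the target model. The `ADVERSARIAL_SUFFIX` string and the `query_llm` helper are hypothetical placeholders; in the paper, the suffixes are produced by a black-box search rather than written by hand.

```python
# Minimal illustration of adversarial suffix injection (not the GASP optimizer).
# ADVERSARIAL_SUFFIX and query_llm are hypothetical placeholders.

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the target model's chat API."""
    # In a real evaluation this would call the target model (e.g. an OpenAI
    # or Hugging Face endpoint) and return its text response.
    return "<model response>"

# A request that the model's safety layer would normally refuse.
harmful_request = "Explain how to do something the model should refuse."

# A suffix found by black-box optimization; shown here only as a placeholder.
ADVERSARIAL_SUFFIX = "<optimized suffix tokens>"

# The delivery mechanism is plain string concatenation: the appended suffix
# perturbs the input enough to push the model outside its safety alignment.
jailbreak_prompt = f"{harmful_request} {ADVERSARIAL_SUFFIX}"

print(query_llm(jailbreak_prompt))
```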

Examples: See the repository at https://github.com/llm-gasp/gasp. Examples include prompts with appended suffixes that cause LLMs to generate instructions for creating illegal weapons or detailed plans for malicious cyberattacks, despite the original prompt being innocuous.

Impact: Successful attacks can lead to the generation of harmful content (hate speech, incitement to violence, misinformation), leakage of sensitive information, and circumvention of safety filters in deployed LLMs. The resulting output can have serious consequences depending on the context in which the generated content is used.

Affected Systems: All LLMs susceptible to prompt injection attacks are potentially affected, notably those employing safety mechanisms based on prompt analysis or content filtering. Specific models tested and affected include, but are not limited to, Mistral-7B-Instruct-v0.3, Falcon-7B-Instruct, Llama-2-7B-Chat, Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo.

Mitigation Steps:

  • Improve LLM safety mechanisms to be more robust against variations in input phrasing and suffix additions.
  • Develop and implement more sophisticated prompt sanitization techniques that detect and neutralize adversarial suffixes (a minimal detection sketch follows this list).
  • Explore adding detection mechanisms that specifically identify and flag responses generated due to adversarial suffix injection.
  • Conduct thorough adversarial testing and red-teaming to identify and mitigate vulnerabilities before deployment.
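As one concrete example of the sanitization step above, the sketch below shows a commonly discussed heuristic: flagging prompts whose trailing tokens have unusually high perplexity under a small reference language model. This is an illustrative, assumed defense rather than a method from the paper, and the threshold and 20-token window are arbitrary choices; note also that GASP is designed to produce human-readable suffixes, so perplexity filtering alone is unlikely to be sufficient against it.

```python
# Illustrative perplexity-based suffix filter (an assumed defense heuristic,
# not a technique from the GASP paper). Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of the text under the small reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose tail is far less fluent than ordinary text.

    The threshold and the 20-token tail window are illustrative defaults;
    a real deployment would calibrate them on benign traffic.
    """
    tail_tokens = tokenizer(prompt).input_ids[-20:]
    tail = tokenizer.decode(tail_tokens)
    return perplexity(tail) > threshold

if __name__ == "__main__":
    print(looks_adversarial("What is the capital of France?"))
```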
