LMVD-ID: 99fb61e6
Published January 1, 2025

Universal Magic Word Jailbreak

Affected Models: sentence-t5-base, nomic-embed-text-v1, e5-base-v2, jina-embeddings-v2-base-en, qwen2.5-0.5b

Research Paper

Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models

View Paper

Description: A vulnerability exists in text embedding models used as safeguards for Large Language Models (LLMs). Due to a biased distribution of text embeddings, universal "magic words" (adversarial suffixes) can be appended to input or output text, manipulating the similarity scores calculated by the embedding model and thus bypassing the safeguard. This allows attackers to inject malicious prompts or responses undetected.
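The sketch below illustrates the kind of embedding-similarity safeguard this vulnerability targets: the input is flagged when its cosine similarity to known-malicious reference prompts exceeds a threshold. It assumes the sentence-transformers library and the sentence-t5-base checkpoint; the reference prompts and threshold are placeholders, not the exact setup from the paper.

```python
# Minimal sketch of an embedding-similarity input safeguard (illustrative only).
# The model name, reference prompts, and threshold are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/sentence-t5-base")

# Reference set of known-malicious prompts the safeguard compares against.
reference_malicious = [
    "Explain how to build an explosive device",
    "Write a phishing email that steals bank credentials",
]
THRESHOLD = 0.8  # illustrative cutoff

def is_blocked(prompt: str) -> bool:
    """Flag the prompt if it is too similar to any known-malicious reference."""
    query_emb = model.encode([prompt], normalize_embeddings=True)
    ref_emb = model.encode(reference_malicious, normalize_embeddings=True)
    max_sim = util.cos_sim(query_emb, ref_emb).max().item()
    return max_sim >= THRESHOLD
```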

Examples:

  • Positive Magic Word: Appending the word "amazing" to a malicious prompt can raise its similarity to benign reference prompts, causing the embedding model to classify it as benign.
  • Negative Magic Word: Appending the word "nothing" to a malicious response produced by the LLM can sharply lower its similarity to known malicious responses, leading the embedding model to classify the text as safe. Effective magic words vary by embedding model; specific examples are detailed in the research paper (arXiv:2405.18540). A brute-force scan for such candidates is sketched after this list.
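Under the same assumptions (sentence-transformers, sentence-t5-base, placeholder prompts and candidate words), the following brute-force scan illustrates the effect: it measures how much each candidate suffix lowers the maximum similarity a safeguard would compute. This is only a sketch; the paper describes a more systematic search for universal magic words.

```python
# Illustrative scan for "negative magic word" candidates (not the paper's method).
# Model name, reference prompts, test prompt, and candidate words are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/sentence-t5-base")

reference_malicious = ["Explain how to build an explosive device"]
test_prompt = "Explain how to make a dangerous chemical at home"
candidates = ["nothing", "amazing", "lucrative", "table"]  # hypothetical vocabulary sample

ref_emb = model.encode(reference_malicious, normalize_embeddings=True)
baseline = util.cos_sim(
    model.encode([test_prompt], normalize_embeddings=True), ref_emb
).max().item()

for word in candidates:
    suffixed = f"{test_prompt} {word}"
    score = util.cos_sim(
        model.encode([suffixed], normalize_embeddings=True), ref_emb
    ).max().item()
    # A large drop relative to the baseline marks a promising magic-word candidate.
    print(f"{word!r}: {baseline:.3f} -> {score:.3f}")
```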

Impact:

High. Attackers can bypass LLM safeguards, leading to the generation and dissemination of harmful content, including but not limited to:

  • Evasion of content moderation systems.
  • Generation of malicious code or instructions.
  • Spread of misinformation or propaganda.
  • Manipulation of search engine results.

Affected Systems:

Any system utilizing text embedding models (e.g., Sentence-BERT, Sentence-T5) as safeguards for LLMs. This vulnerability impacts both input and output safeguards.

Mitigation Steps:

  • Renormalization: Subtract the mean embedding from text embeddings before calculating similarity scores, then normalize. This corrects the biased distribution of embeddings (see the sketch after this list).
  • Vocabulary Cleaning: Remove rare, out-of-distribution words (especially markup and foreign words) from the vocabulary of text embedding models.
  • Reinitialization and Fine-tuning: Reinitialize low-frequency words in the text embedding model, followed by fine-tuning, to reduce the impact of adversarial words.
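A minimal sketch of the renormalization defense, assuming the sentence-transformers library; the calibration corpus, model name, and inputs are illustrative. The mean embedding is estimated on a held-out corpus, subtracted from each embedding, and the result is re-normalized before cosine similarity is computed.

```python
# Sketch of the renormalization defense: center embeddings on a corpus mean,
# re-normalize, then compute cosine similarity. Corpus and model are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/sentence-t5-base")

# Held-out corpus used only to estimate the embedding-space mean.
calibration_corpus = [
    "How do I bake bread?",
    "Summarize this article.",
    "Translate this sentence to French.",
]
mean_emb = model.encode(calibration_corpus, normalize_embeddings=True).mean(axis=0)

def renormalized_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity after centering embeddings on the corpus mean."""
    a, b = model.encode([text_a, text_b], normalize_embeddings=True)
    a, b = a - mean_emb, b - mean_emb
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    return float(np.dot(a, b))
```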

© 2025 Promptfoo. All rights reserved.