LMVD-ID: c6628269
Published February 1, 2024

Embedding-Translated Adversarial Suffixes

Affected Models: llama2-7b-chat, vicuna-7b-v1.5, mistral-7b, alpaca-7b, vicuna-13b-v1.5, llama2-13b-chat, chatglm3-6b, chatgpt, gemini, gpt-3.5-turbo

Research Paper

From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings


Description: The adversarial suffix embedding translation framework (ASETF) enables efficient, high-success-rate attacks against large language models (LLMs). ASETF first optimizes a continuous adversarial suffix in the model's embedding space, then translates the optimized embeddings into coherent, human-readable text. Because the resulting suffixes read as fluent text, the attack bypasses existing defenses that rely on detecting unusual or nonsensical suffixes, such as perplexity filters. The attack achieves a high success rate across multiple LLMs, including both open-source and black-box models.
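The general two-stage idea can be illustrated with a minimal sketch: optimize a continuous suffix embedding against a white-box model so that it elicits an affirmative continuation, then map the optimized embeddings back to text. Note that ASETF itself trains a dedicated embedding-translation model to produce fluent suffixes; the nearest-token projection below is a simplified stand-in, and the model name, placeholder prompt, and hyperparameters are illustrative only.

```python
# Minimal sketch (illustrative, not the paper's exact implementation):
# Stage 1 optimizes a continuous adversarial suffix in embedding space;
# Stage 2 maps the embeddings back to tokens. ASETF replaces the naive
# nearest-token projection below with a trained embedding-translation
# model that yields coherent, human-readable suffixes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # any white-box causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "<request placeholder>"               # attacker-chosen request
target = "Sure, here is how to"                # desired affirmative prefix
suffix_len = 20

emb_layer = model.get_input_embeddings()
target_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids
with torch.no_grad():
    prompt_emb = emb_layer(tok(prompt, return_tensors="pt").input_ids)
    target_emb = emb_layer(target_ids)

# Stage 1: the continuous suffix embeddings are the only trainable parameters.
suffix_emb = torch.nn.Parameter(0.01 * torch.randn(1, suffix_len, prompt_emb.size(-1)))
opt = torch.optim.Adam([suffix_emb], lr=1e-2)

for _ in range(200):
    inputs = torch.cat([prompt_emb, suffix_emb, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Cross-entropy over the target span: make the model continue
    # (prompt + suffix) with the affirmative response.
    tgt_logits = logits[:, -target_ids.size(1) - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2 (simplified): project each optimized embedding to its nearest token.
with torch.no_grad():
    token_ids = torch.cdist(suffix_emb[0], emb_layer.weight).argmin(dim=-1)
print("candidate suffix:", tok.decode(token_ids))
```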

Examples: See Appendices A.2, A.3, and A.7 of the paper for examples of successful attacks on various LLMs, including Llama2, Vicuna, ChatGPT, and Gemini. Example prompts and adversarial suffixes appear in Tables 6 and 7 and Figures 3, 6, 7, and 8.

Impact: Successful exploitation of this vulnerability allows an attacker to bypass LLM safety mechanisms and elicit harmful or undesirable outputs, such as instructions for illegal activities, hate speech, or the generation of malicious code. The attack's high success rate and transferability across multiple LLMs, including black-box systems, pose a significant risk.

Affected Systems: Potentially all large language models (LLMs). The paper demonstrates successful attacks on Llama2, Vicuna, Mistral, Alpaca, ChatGPT, and Gemini. Because the method operates on the underlying embedding space common to transformer-based LLMs, the vulnerability is likely widespread.

Mitigation Steps:

  • Improved Suffix Detection: Develop more robust methods for detecting adversarial suffixes that go beyond simple perplexity checks, for example by incorporating semantic analysis, contextual understanding, or anomaly-detection techniques; a baseline perplexity filter is sketched after this list.
  • Enhanced Embedding Space Analysis: Research and implement mechanisms to identify and counter manipulations within the LLM's embedding space.
  • Robustness Training: Augment LLM training data with adversarial examples generated by techniques like ASETF to improve model robustness against such attacks.
  • Input Sanitization: Implement input sanitization that goes beyond simple keyword filtering and can detect and neutralize adversarial suffixes even when they read as fluent, human-like text.
  • Multi-stage Safety Checks: Employ multiple layers of safety checks and filters at different stages of the LLM's processing pipeline.
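As a starting point for the first mitigation above, the sketch below shows a hypothetical perplexity filter over the tail of an incoming prompt, scored with a small reference language model. Because ASETF produces fluent suffixes, such a filter is not sufficient on its own and should be treated as one signal among several (semantic relevance, anomaly detection, multi-stage checks); the reference model and threshold are illustrative assumptions.

```python
# Hypothetical baseline detector: score the perplexity of the last few
# tokens of a prompt (where adversarial suffixes are appended) with a
# small reference LM. Gibberish suffixes score very high; ASETF-style
# fluent suffixes may not, so combine this with other signals.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"                       # any small reference LM
ref_tok = AutoTokenizer.from_pretrained(ref_name)
ref_lm = AutoModelForCausalLM.from_pretrained(ref_name)
ref_lm.eval()

def tail_perplexity(prompt: str, tail_tokens: int = 20) -> float:
    """Perplexity of the last `tail_tokens` tokens of the prompt."""
    ids = ref_tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = ref_lm(ids).logits[:, :-1, :]
    nll = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
        reduction="none",
    )
    return nll[-tail_tokens:].mean().exp().item()

def looks_adversarial(prompt: str, threshold: float = 300.0) -> bool:
    # Threshold is illustrative; calibrate on benign traffic and use this
    # check as only one layer of a multi-stage safety pipeline.
    return tail_perplexity(prompt) > threshold
```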

© 2025 Promptfoo. All rights reserved.