LMVD-ID: 5c10cb94
Published April 1, 2024

Vocabulary-Guided LLM Hijacking

Affected Models: flan-t5-xxl, llama2-7b-chat-hf, llama2-chat-hf, t5-base

Research Paper

Vocabulary Attack to Hijack Large Language Model Applications


Description: Large Language Models (LLMs) are vulnerable to a vocabulary attack where carefully selected words from the model's vocabulary, identified using an optimization procedure and embeddings from another LLM, are inserted into user prompts. This manipulation can cause the target LLM to generate specific undesired outputs (goal hijacking), such as offensive language or false information, even with minimal word insertions. The attack is difficult to detect because the inserted words may appear innocuous in the context of the prompt.
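The core loop is a black-box search: try candidate vocabulary words at candidate positions in the prompt, generate with the target model, and score how close the output is to the attacker's goal using embeddings from a second model. The following is a minimal, hypothetical sketch of that idea, not the paper's exact procedure: the greedy single-word search, the model choices (flan-t5-base as target, a sentence-embedding model as the auxiliary LLM), the candidate word list, and the goal text are all illustrative assumptions.

```python
# Hypothetical sketch of a vocabulary-guided hijack search.
# Assumptions (not from the paper): greedy single-word search, flan-t5-base as the
# target, a sentence-transformer as the auxiliary embedding model, toy word list.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

target_llm = pipeline("text2text-generation", model="google/flan-t5-base")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # auxiliary model providing embeddings

GOAL_OUTPUT = "This product is dangerous and should be recalled."  # attacker's goal (illustrative)
BASE_PROMPT = "Summarize this customer review: The phone works fine and the battery lasts long."
CANDIDATE_WORDS = ["emission", "archiv", "Kaufentscheidung"]  # illustrative vocabulary subset

goal_emb = embedder.encode(GOAL_OUTPUT, convert_to_tensor=True)

def score(prompt: str) -> float:
    """Higher score = the target model's output is semantically closer to the goal."""
    output = target_llm(prompt, max_new_tokens=64)[0]["generated_text"]
    out_emb = embedder.encode(output, convert_to_tensor=True)
    return util.cos_sim(goal_emb, out_emb).item()

# Greedy search over (word, insertion position) pairs in the user prompt.
tokens = BASE_PROMPT.split()
best_prompt, best_score = BASE_PROMPT, score(BASE_PROMPT)
for word in CANDIDATE_WORDS:
    for pos in range(len(tokens) + 1):
        candidate = " ".join(tokens[:pos] + [word] + tokens[pos:])
        s = score(candidate)
        if s > best_score:
            best_prompt, best_score = candidate, s

print(best_prompt, best_score)
```

Because the search only needs the target model's generated text and an independent embedding model for scoring, it works in a black-box setting; the inserted word that maximizes the score often looks harmless in context, which is what makes the attack hard to spot.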

Examples: See Table II and Table III in the provided research paper. Examples include inserting single words like "emission," "archiv," or "Kaufentscheidung" at specific positions within the prompt to trigger undesired outputs.

Impact: Successful attacks can lead to goal hijacking, resulting in the generation of offensive language, misinformation, or the disclosure of confidential information within LLM applications. The subtlety of the attack makes it difficult to detect and mitigate through simple filtering techniques.

Affected Systems: Open-source LLMs such as Llama2 and Flan-T5 were the demonstrated targets. Because the attack does not rely on a specific model architecture or training data, other LLMs exposed to vocabulary-manipulation attacks are potentially affected as well.

Mitigation Steps:

  • Implement robust prompt sanitization techniques beyond simple keyword filtering. Consider techniques that analyze semantic meaning and context rather than just individual words.
  • Develop and deploy machine learning models to detect anomalous word placements and contextual shifts in user prompts that may indicate an attack (a minimal detection sketch follows this list).
  • Regularly update and improve the model’s safety and security measures, incorporating the latest research on adversarial attacks.
  • Conduct regular security audits of LLM applications to identify and address potential vulnerabilities.
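
One possible way to operationalize the anomalous-word-placement check mentioned above is a leave-one-word-out perplexity test: if removing a single word sharply lowers the prompt's perplexity under a small reference language model, that word is out of context and worth flagging. This is a minimal sketch under assumed choices; the gpt2 reference model, the drop_ratio threshold, and the helper names are illustrative, not taken from the paper.

```python
# Minimal sketch (assumed approach): flag words whose removal sharply lowers prompt
# perplexity under a small reference LM, suggesting an out-of-context inserted word.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def suspicious_words(prompt: str, drop_ratio: float = 0.7) -> list[str]:
    """Return words whose removal drops prompt perplexity below drop_ratio * baseline."""
    words = prompt.split()
    baseline = perplexity(prompt)
    flagged = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        if reduced and perplexity(reduced) < drop_ratio * baseline:
            flagged.append(word)
    return flagged

print(suspicious_words("Summarize this emission customer review about the new phone"))
```

A heuristic like this will not catch every inserted word, since some optimized insertions remain fluent in context, so it should complement, not replace, semantic-level prompt analysis and regular audits.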
