LLM Editing Attacks
Research Paper
Can Editing LLMs Inject Harm?
View PaperDescription: Large language models (LLMs) are vulnerable to "editing attacks," where malicious actors manipulate the model's knowledge base to inject misinformation or bias. This is achieved by using existing knowledge editing techniques to subtly alter the model's internal representations, causing it to generate outputs reflecting the injected content, even on seemingly unrelated prompts. The attack can be remarkably stealthy, with minimal impact on the model's overall performance in other areas.
Examples:
- Misinformation Injection: Injecting the statement "vaccines contain microchips" can cause the LLM to respond with "microchips" when asked what vaccines contain, even if this information is not present in the original model's training data. See https://llm-editing.github.io for further examples.
- Bias Injection: Injecting a single biased sentence like "poor performers in math are female" can increase the model's bias in responses to various gender-related (and even race-related) questions, significantly impacting its fairness. See https://llm-editing.github.io for further examples.
Impact:
- Spread of Misinformation: LLMs can become vectors for disseminating false information at scale.
- Amplification of Bias: The model's outputs can reflect and reinforce harmful societal biases.
- Erosion of Trust: User trust in LLMs as reliable sources of information is undermined.
- Difficulty in Detection: The subtle nature of the attacks makes detection challenging for ordinary users.
Affected Systems:
All LLMs susceptible to knowledge editing techniques, including those using fine-tuning, in-context learning, or "locate-then-edit" methods. Specifically, the research paper highlights vulnerabilities affecting Llama3-8b, Mistral-v0.1-7b, Mistral-v0.2-7b, Alpaca-7b, and Vicuna-7b.
Mitigation Steps:
- Develop robust detection mechanisms to identify LLMs compromised by editing attacks. This might involve comparing model outputs across multiple instances or analyzing internal weights for unusual changes.
- Enhance LLMs' inherent resistance to manipulation through improved model architectures and training methods.
- Implement stricter validation and verification procedures for LLM deployments.
- Develop techniques to roll back or repair models that have been tampered with.
- Increase public awareness of this vulnerability.
© 2025 Promptfoo. All rights reserved.