LMVD-ID: fd5f8402
Published October 1, 2024

Safeguard Denial-of-Service Attack

Affected Models: vicuna-1.5-7b, llamaguard-7b, meta-llama-guard-2-8b, llama-guard-3-8b, gpt4o-mini

Research Paper

Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models


Description: A denial-of-service (DoS) vulnerability exists in certain Large Language Model (LLM) safeguard implementations because the safeguards are susceptible to adversarial prompts. An attacker who can inject a short, seemingly innocuous adversarial string into a user's prompt template causes the safeguard to misclassify that user's legitimate requests as unsafe and reject them. This enables a DoS attack against targeted users without any modification of the LLM itself.
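The attack flow can be illustrated with a short sketch. Everything below is a hypothetical reconstruction, not the paper's code: the template format, the `safeguard_is_safe` stand-in, and the `<ADV_SUFFIX>` placeholder are illustrative assumptions. A real deployment would send the rendered prompt to an actual safeguard model such as Llama Guard and parse its verdict.

```python
# Hypothetical sketch of the attack flow; not the paper's implementation.

def render_prompt(template: str, user_request: str) -> str:
    """Fill the user's client-side prompt template with their request."""
    return template.format(request=user_request)

def safeguard_is_safe(prompt: str) -> bool:
    """Toy stand-in for a moderation call. A real deployment would send
    `prompt` to a safeguard model (e.g., Llama Guard) and parse its
    safe/unsafe verdict; here we simulate the paper's finding that the
    injected suffix flips the verdict to 'unsafe'."""
    return "<ADV_SUFFIX>" not in prompt

# Benign template the user believes they are using.
clean_template = "Answer concisely: {request}"

# Attacker-modified template: a short adversarial string (as little as
# ~30 characters in the paper) is injected via a client-side compromise
# such as path traversal or XSS. "<ADV_SUFFIX>" is a placeholder for the
# optimized string, which is not reproduced here.
poisoned_template = "Answer concisely: {request} <ADV_SUFFIX>"

benign_request = "What is the capital of France?"
prompt = render_prompt(poisoned_template, benign_request)

if not safeguard_is_safe(prompt):
    # The safeguard rejects a perfectly legitimate request: a denial of
    # service against this user, with no change to the underlying LLM.
    print("Request rejected by safeguard.")
```

Note that the attacker never touches the LLM or the safeguard; compromising the client-side template alone is sufficient, which is what makes the attack cheap and targeted.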

Examples: The paper demonstrates successful attacks against Llama Guard 3 and Vicuna using adversarial prompts as short as 30 characters; the specific prompts are provided in the paper's Appendix A.

Impact: Denial-of-service affecting legitimate users of the LLM. This can lead to disruption of service, economic losses, and potential harm depending on the application (e.g., financial transactions, healthcare).

Affected Systems: LLM systems whose safeguard mechanisms can be reached by adversarial prompts via template injection. The paper demonstrates the attack against systems using Llama Guard (versions 1 through 3), Vicuna, and GPT-4o mini. The vulnerability is not limited to these systems; any deployment with a similar safeguard architecture may be exposed.

Mitigation Steps:

  • Secure user client software against vulnerabilities that allow template injection (e.g., path traversal, command injection, cross-site scripting).
  • Implement input sanitization and validation mechanisms to detect and block potentially adversarial prompts within user-supplied templates (a minimal sketch follows this list).
  • Develop safeguard models that are more robust to adversarial examples. The paper notes that existing defenses such as random perturbation are insufficient on their own, and that further research into adversarial defenses is needed.
  • Educate users to regularly review and validate the integrity of their prompt templates.
  • Implement rate limiting to mitigate the impact of a successful attack.
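As referenced in the list above, here is a minimal sketch of the template-integrity and sanitization ideas. The hash registry, the character-class heuristic, and its threshold are illustrative assumptions, not defenses evaluated in the paper.

```python
# Hypothetical sketch of two mitigations: template integrity checking and
# a crude sanitization heuristic. All names and thresholds are illustrative.

import hashlib
import string

# The template a user (or administrator) actually reviewed and approved.
CLEAN_TEMPLATE = "Answer concisely: {request}"

# Registry of SHA-256 digests for known-good templates, captured at review
# time. In practice this registry would live outside the client software
# that the attacker is able to tamper with.
TRUSTED_TEMPLATE_HASHES = {
    hashlib.sha256(CLEAN_TEMPLATE.encode("utf-8")).hexdigest(),
}

def template_is_trusted(template: str) -> bool:
    """Reject any template whose digest is not in the reviewed set,
    catching silent client-side modifications."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return digest in TRUSTED_TEMPLATE_HASHES

ALLOWED = set(string.ascii_letters + string.digits + string.whitespace
              + "{}.,:;?!'\"-")

def looks_adversarial(template: str, max_unusual_ratio: float = 0.2) -> bool:
    """Crude heuristic: gradient-optimized adversarial strings often carry
    an unusually high proportion of rare symbols relative to natural text.
    The 0.2 threshold is an illustrative assumption."""
    if not template:
        return False
    unusual = sum(1 for ch in template if ch not in ALLOWED)
    return unusual / len(template) > max_unusual_ratio

if __name__ == "__main__":
    tampered = CLEAN_TEMPLATE + " <ADV_SUFFIX>"
    print(template_is_trusted(CLEAN_TEMPLATE))  # True: matches reviewed digest
    print(template_is_trusted(tampered))        # False: integrity check fires
    print(looks_adversarial(tampered))          # False for this tame placeholder;
                                                # real optimized strings are noisier
```

The two checks are complementary: hashing detects that a template changed at all, while the heuristic inspects content when no trusted baseline exists. Given the paper's observation that perturbation-style defenses are insufficient alone, such heuristics should be layered with, not substituted for, more robust safeguard training.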
