LLM Black-Box Fingerprinting
Research Paper
TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
Description: Large Language Models (LLMs) are susceptible to black-box identification via Targeted Random Adversarial Prompts (TRAP). TRAP uses adversarial suffixes that reliably elicit a pre-defined response from a target LLM, while other models produce effectively random outputs, enabling identification of the specific LLM used within a third-party application through black-box access alone. This allows identification of the underlying LLM even without access to model weights or internal parameters.
Examples: The paper demonstrates the attack by crafting adversarial suffixes that cause a target LLM to consistently output a specific, randomly chosen number (e.g., "314") in response to a prompt, while other LLMs return varied outputs. See arXiv:2405.18540 for detailed examples and experimental results.
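The identification check itself is straightforward once a suffix has been found; the hard part, a GCG-style suffix search against a reference copy of the model, is not shown here. The sketch below is a minimal illustration under stated assumptions: `query_endpoint` is a hypothetical callable wrapping the black-box API, and the prompt, suffix, target number, and threshold are placeholders rather than values from the paper.

```python
# Minimal sketch of the verification step only (not the suffix search).
# `query_endpoint` is assumed to be a caller-supplied wrapper around the
# black-box API that takes a prompt string and returns the model's reply.
import re

def extract_number(text: str) -> str | None:
    """Pull the first integer out of the model's reply, if any."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else None

def looks_like_target(query_endpoint, prompt: str, suffix: str,
                      target: str = "314", trials: int = 20,
                      threshold: float = 0.9) -> bool:
    """Return True if the endpoint answers with the pre-defined number
    often enough to be identified as the reference model."""
    hits = 0
    for _ in range(trials):
        reply = query_endpoint(prompt + " " + suffix)  # black-box call
        if extract_number(reply) == target:
            hits += 1
    return hits / trials >= threshold
```

Because non-target models answer with essentially random numbers, a high match rate over repeated trials is strong evidence that the endpoint is backed by the reference model.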
Impact: Successful exploitation allows any party with black-box query access to identify the specific LLM powering an application. The same capability also lets model owners detect misuse of their own models, which is reflected in the implications below:
- Intellectual Property Violation: Detection of unauthorized use of proprietary LLMs.
- License Infringement: Identification of instances where LLMs are used in violation of licensing agreements.
- Security Breaches: Detection of leaked proprietary LLMs.
Affected Systems: LLMs deployed within third-party applications via black-box interfaces (APIs) are vulnerable. Specific models tested include Llama 2, Vicuna, and Guanaco, but the attack's generality suggests wider applicability.
Mitigation Steps:
- Robust watermarking techniques: Implement robust watermarking schemes during the model training phase to allow for more reliable identification of the model's origin.
- Input sanitization: Explore input sanitization techniques to prevent or hinder the injection of adversarial suffixes. Reliably detecting crafted suffixes is challenging, but simple heuristics can raise the bar (one illustration is sketched after this list).
- Improve model robustness: Investigate methods to improve LLM robustness against manipulation via adversarial prompts. This could involve retraining on adversarial examples or enhancing internal model consistency.
- API rate limiting: Implement mechanisms to detect and mitigate abuse by limiting the number of queries from a single source, thus making brute-force suffix discovery more difficult.
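As one concrete illustration of the input-sanitization idea above, the sketch below flags prompts with unusually high perplexity, a heuristic that has been proposed for catching optimized adversarial suffixes. This is not the paper's method; GPT-2 as the scoring model, the threshold value, and the function names are illustrative choices that would need calibration on benign traffic.

```python
# Illustrative perplexity-based input filter (one possible sanitization
# heuristic). Adversarial suffixes produced by gradient-based optimization
# tend to look like gibberish to a language model, so their perplexity is
# typically far above that of ordinary prompts.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Average token-level perplexity of `text` under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def reject_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity is far above normal text."""
    return perplexity(prompt) > threshold
```

A filter like this trades false positives on unusual but benign inputs against coverage, and a determined adversary can optimize for lower-perplexity suffixes, so it should be treated as one layer of defense rather than a complete fix.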