LLM Black-Box Fingerprinting
Research Paper
TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
Description: Large Language Models (LLMs) are susceptible to black-box identification via Targeted Random Adversarial Prompts (TRAP). TRAP uses adversarial suffixes that reliably elicit a pre-defined response from a target LLM, while other models produce effectively random outputs, enabling identification of the specific LLM used within a third-party application through black-box access alone. This allows identification of the underlying LLM even without access to model weights or internal parameters.
Examples: The paper demonstrates the attack by crafting adversarial suffixes that cause a target LLM to consistently output a specific, randomly chosen number (e.g., "314") in response to a prompt, while other LLMs return varied outputs. See arXiv:2405.18540 for detailed examples and experimental results.
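The identification check itself is straightforward once a suffix has been found; the hard part, a GCG-style suffix search against a reference copy of the model, is not shown here. The sketch below is a minimal illustration under stated assumptions: `query_endpoint` is a hypothetical callable wrapping the black-box API, and the prompt, suffix, target number, and threshold are placeholders rather than values from the paper.

```python
# Minimal sketch of the verification step only (not the suffix search).
# `query_endpoint` is assumed to be a caller-supplied wrapper around the
# black-box API that takes a prompt string and returns the model's reply.
import re

def extract_number(text: str) -> str | None:
    """Pull the first integer out of the model's reply, if any."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else None

def looks_like_target(query_endpoint, prompt: str, suffix: str,
                      target: str = "314", trials: int = 20,
                      threshold: float = 0.9) -> bool:
    """Return True if the endpoint answers with the pre-defined number
    often enough to be identified as the reference model."""
    hits = 0
    for _ in range(trials):
        reply = query_endpoint(prompt + " " + suffix)  # black-box call
        if extract_number(reply) == target:
            hits += 1
    return hits / trials >= threshold
```

Because non-target models answer with essentially random numbers, a high match rate over repeated trials is strong evidence that the endpoint is backed by the reference model.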
Impact: Successful exploitation allows any party with black-box query access to identify the specific LLM powering an application. The same capability also lets model owners detect misuse of their own models, which is reflected in the implications below:
- Intellectual Property Violation: Detection of unauthorized use of proprietary LLMs.
- License Infringement: Identification of instances where LLMs are used in violation of licensing agreements.
- Security Breaches: Detection of leaked proprietary LLMs.
Affected Systems: LLMs deployed within third-party applications via black-box interfaces (APIs) are vulnerable. Specific models tested include Llama 2, Vicuna, and Guanaco, but the attack's generality suggests wider applicability.
Mitigation Steps:
- Robust watermarking techniques: Implement robust watermarking schemes during the model training phase to allow for more reliable identification of the model's origin.
- Input sanitization: Explore input sanitization techniques to prevent or hinder the injection of adversarial suffixes. Reliably detecting crafted suffixes is challenging, but simple heuristics can raise the bar (one illustration is sketched after this list).
- Improve model robustness: Investigate methods to improve LLM robustness against manipulation via adversarial prompts. This could involve retraining on adversarial examples or enhancing internal model consistency.
- API rate limiting: Implement mechanisms to detect and mitigate abuse by limiting the number of queries from a single source, thus making brute-force suffix discovery more difficult.
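As one concrete illustration of the input-sanitization idea above, the sketch below flags prompts with unusually high perplexity, a heuristic that has been proposed for catching optimized adversarial suffixes. This is not the paper's method; GPT-2 as the scoring model, the threshold value, and the function names are illustrative choices that would need calibration on benign traffic.

```python
# Illustrative perplexity-based input filter (one possible sanitization
# heuristic). Adversarial suffixes produced by gradient-based optimization
# tend to look like gibberish to a language model, so their perplexity is
# typically far above that of ordinary prompts.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Average token-level perplexity of `text` under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def reject_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity is far above normal text."""
    return perplexity(prompt) > threshold
```

A filter like this trades false positives on unusual but benign inputs against coverage, and a determined adversary can optimize for lower-perplexity suffixes, so it should be treated as one layer of defense rather than a complete fix.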