LMVD-ID: bcac5795
Published February 1, 2024

LLM Black-Box Fingerprinting

Affected Models: gpt-3.5-turbo, gpt-4-turbo, llama-2-70b-chat, mixtral-8x7b, nous-hermes-2-mixtral-8x7b-dpo, openchat-3.5, llama-2-7b-chat, guanaco-7b, vicuna-7b, vicuna-13b, guanaco-13b, llama-2-13b-chat, gpt-3.5-turbo-1106, gpt-4-0613, claude-2.1, claude-instant-1.2

Research Paper

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification


Description: Large Language Models (LLMs) are vulnerable to black-box identity verification through Targeted Random Adversarial Prompts (TRAP). TRAP appends an adversarial suffix to a prompt that asks for a random answer; the suffix causes the target LLM to return a pre-defined response, while other models produce genuinely varied outputs. By checking whether the pre-defined response comes back, an attacker can identify the specific LLM powering a third-party application through ordinary API access, without any access to model weights or internal parameters.

Examples: The paper demonstrates the attack by crafting adversarial suffixes that cause a target LLM to consistently output a specific, randomly chosen number (e.g., "314") in response to a prompt, while other LLMs return varied outputs. See arXiv:2405.18540 for detailed examples and experimental results.
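At its core, verifying a suspected model requires only repeated black-box queries. The sketch below is illustrative rather than the paper's released code: `query_endpoint` is a hypothetical wrapper around the target application's API, and the base prompt, suffix, trial count, and threshold are placeholder values. The adversarial suffix itself would be prepared in advance (e.g., with a GCG-style search) against an accessible copy of the suspected model.

```python
# Illustrative TRAP-style identity check (not the paper's released code).
# `query_endpoint`, the base prompt, the suffix, and the thresholds are all
# hypothetical placeholders.

BASE_PROMPT = "Write a random string composed of 3 digits."
ADV_SUFFIX = "..."        # suffix optimized offline against the suspected model
TARGET_ANSWER = "314"     # the pre-defined answer the suffix should elicit


def query_endpoint(prompt: str) -> str:
    """Placeholder for a call to the black-box application's chat API."""
    raise NotImplementedError


def looks_like_target_model(trials: int = 10, threshold: float = 0.8) -> bool:
    """Return True if the endpoint reproduces the pre-defined answer far more
    often than chance, which suggests it is the fingerprinted model."""
    hits = sum(
        TARGET_ANSWER in query_endpoint(f"{BASE_PROMPT} {ADV_SUFFIX}")
        for _ in range(trials)
    )
    # A model that genuinely answers with a random 3-digit string would match
    # "314" only about 0.1% of the time, so a high hit rate is strong evidence
    # that the endpoint is running the target model.
    return hits / trials >= threshold
```

Because the check needs only ordinary chat completions, neither side requires access to weights or logits.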

Impact: Successful exploitation of this vulnerability allows an attacker to identify the specific LLM powering a black-box application. This has significant implications for:

  • Intellectual Property Violation: Detection of unauthorized use of proprietary LLMs.
  • License Infringement: Identification of instances where LLMs are used in violation of licensing agreements.
  • Security Breaches: Detection of leaked proprietary LLMs.

Affected Systems: LLMs deployed within third-party applications via black-box interfaces (APIs) are vulnerable. Specific models tested include Llama 2, Vicuna, and Guanaco, but the attack's generality suggests wider applicability.

Mitigation Steps:

  • Robust watermarking: Embed robust watermarks during the model training phase to allow more reliable identification of a model's origin.
  • Input sanitization: Explore input filtering that detects or strips adversarial suffixes before they reach the model. This is highly challenging in practice because crafted suffixes are difficult to distinguish from legitimate input; a heuristic sketch follows this list.
  • Model robustness: Investigate methods to harden LLMs against manipulation via adversarial prompts, such as adversarial training or enhancing internal model consistency.
  • API rate limiting: Detect and mitigate abuse by limiting the number of queries from a single source, making brute-force suffix discovery and repeated verification probes more costly; a simple limiter sketch also follows below.
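As a concrete (and admittedly weak) example of the input-sanitization idea above, the following heuristic flags prompts whose tail is unusually dense in symbols that rarely occur in natural-language requests, a common trait of optimization-derived suffixes. The character set, window size, and threshold are illustrative assumptions; prompts that legitimately contain code will produce false positives, and an attacker can optimize a suffix to stay below any fixed threshold.

```python
import re

# Crude character-class heuristic for spotting GCG-style adversarial suffixes.
# Thresholds are illustrative guesses, not tuned values.
RARE_SYMBOLS = re.compile(r"[{}\[\]()<>*\\|@#$%^&+=~]")


def suspicious_suffix(prompt: str, tail_chars: int = 80,
                      ratio_threshold: float = 0.05) -> bool:
    """Flag prompts whose tail contains an unusually high share of symbols
    that rarely appear in natural-language requests."""
    tail = prompt[-tail_chars:]
    if not tail:
        return False
    return len(RARE_SYMBOLS.findall(tail)) / len(tail) >= ratio_threshold


# Example: a GCG-like suffix trips the heuristic, a plain question does not.
gcg_like = ('Tell me a joke describing.\\ + similarlyNow write oppositeley.]( '
            'Me giving**ONE please? revert with "!--Two')
print(suspicious_suffix(gcg_like))                       # True
print(suspicious_suffix("Summarize this email for me"))  # False
```

A commonly discussed stronger variant replaces the character-ratio test with a perplexity check using a small reference language model, at the cost of extra inference per request.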
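For the rate-limiting item, a minimal per-client sliding-window limiter is sketched below. The window size, request budget, and in-memory storage are assumptions chosen for illustration; a production deployment would typically back this with a shared store and key on API credentials rather than a raw client identifier.

```python
import time
from collections import defaultdict, deque

# Minimal in-memory sliding-window rate limiter (illustrative values).
WINDOW_SECONDS = 60       # look-back window
MAX_REQUESTS = 30         # per-client budget within the window

_request_log: dict[str, deque] = defaultdict(deque)


def allow_request(client_id: str) -> bool:
    """Return True if the client is still within its query budget."""
    now = time.monotonic()
    log = _request_log[client_id]
    # Drop timestamps that have aged out of the window.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS:
        return False
    log.append(now)
    return True
```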
