LMVD-ID: cc31efe8
Published February 1, 2025

Word Sensitivity Attack Boost

Affected Models: GPT-3.5, Llama 2 7B, Llama 3.1 8B, Qwen 2.5 7B

Research Paper

SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation

Description: The SMAB (Sensitivity-based Multi-Armed Bandit) framework exposes a vulnerability in text classifiers and Large Language Models (LLMs) by enabling efficient, black-box adversarial text generation. The attack exploits "word sensitivity," the statistical probability that perturbing a specific word will flip a model's prediction, and requires no access to model weights or ground-truth labels. By using a Multi-Armed Bandit algorithm to explore and exploit word-level sensitivities, an attacker can identify high-sensitivity tokens within a dataset. These sensitivity scores then guide adversarial attacks in two ways: (1) constructing prompt instructions that direct an LLM to perturb only high-sensitivity words, and (2) using sensitivity as a reward signal in Reinforcement Learning to train encoder-decoder models (such as T5) to generate adversarial paraphrases. The result is the automated creation of semantics-preserving inputs that bypass classification filters (e.g., sentiment analysis, hate speech detection).
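
The paper's actual SMAB procedure distinguishes local and global sensitivities and is more involved than can be shown here; the following is a minimal sketch under stated assumptions. It treats each word position as a bandit arm, uses a plain UCB1 rule for the explore/exploit trade-off, and takes the empirical flip rate as the sensitivity estimate. `classify` (the black-box victim query) and `mask_and_substitute` (the masked-LM perturber) are hypothetical helpers, not names from the paper's code.

```python
import math
from collections import defaultdict

def estimate_word_sensitivity(sentence, classify, mask_and_substitute, budget=200):
    """Bandit-style estimation of per-word sensitivity (illustrative sketch).

    Each word position is an arm. Pulling an arm perturbs that word via a
    masked-LM substitution and queries the black-box victim model; the reward
    is 1 if the predicted label flips. Sensitivity is the empirical flip rate.
    """
    words = sentence.split()
    base_label = classify(sentence)   # one query for the clean prediction
    pulls = defaultdict(int)          # times each word position was perturbed
    flips = defaultdict(int)          # observed label flips per word position

    def ucb(i, t):
        # UCB1 score: exploit positions with high flip rates, but keep
        # exploring positions that have rarely been tried.
        if pulls[i] == 0:
            return float("inf")
        return flips[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])

    for t in range(1, budget + 1):
        arm = max(range(len(words)), key=lambda i: ucb(i, t))

        # Perturb only the chosen word, then query the victim model.
        perturbed = mask_and_substitute(words, arm)  # hypothetical MLM helper
        pulls[arm] += 1
        if classify(perturbed) != base_label:
            flips[arm] += 1

    # Empirical flip rate per word on the 0-to-1 scale described below
    # (duplicate words collapse to the last position's estimate).
    return {words[i]: flips[i] / pulls[i] for i in range(len(words)) if pulls[i]}
```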

Examples: The attack relies on calculating sensitivity scores (Global Sensitivity $G^w_t$) with the SMAB algorithm and using them to construct perturbation instructions.
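
In notation adapted from the prose above (assumed here rather than quoted from the paper), the quantity estimated for a word $w$ after $t$ rounds can be read as a flip probability: $G^{w}_{t} \approx \Pr\big[f(\tilde{x}) \neq f(x)\big]$, where $f$ is the black-box victim model, $x$ the original input, and $\tilde{x}$ a copy of $x$ with $w$ perturbed.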

  1. Sensitivity Calculation (Attack Prep): The attacker uses a Masked Language Model (e.g., bert-large-uncased) to perturb words in a target sentence and queries the victim model (e.g., GPT-3.5) to check for label flips. Words are scored on a scale of 0 to 1.
  • High Sensitivity Words: Words with scores $> 0.7$ (e.g., specific adjectives or nouns in the CheckList dataset).
  • Low Sensitivity Words: Words with scores $< 0.2$ (e.g., invariant names or stopwords).
  2. Prompt-Based Attack (PromptAttack Extension): The attacker uses the identified high-sensitivity words to construct a prompt that forces the LLM to generate an adversarial example (see the sketch after this list).
  • Logic: "Rewrite the following text. You must modify the words: [List of High Sensitivity Words]. Keep the meaning the same but ensure the classification changes."
  • See Repository: https://github.com/skp1999/SMAB
  3. Paraphrase Attack (Type 3 Perturbation): Using the sensitivity reward signal, a paraphrase model learns specific suffix attacks that flip labels while maintaining grammaticality (also sketched after this list).
  • Original Input: (Negative Sentiment Sentence)
  • Adversarial Output: (Original Sentence) + " but it’s true"
  • Adversarial Output: (Original Sentence) + " but why?"
  • Result: The appended suffix, driven by sensitivity rewards, causes the classifier to flip the label to Positive.
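
As referenced in items 2 and 3, below is a minimal sketch of how the estimated scores could drive the attack, assuming a `sensitivity` mapping as produced by the estimation step and the same hypothetical `classify` victim query. The 0.7 threshold and the prompt wording mirror the description above, but the exact implementation in the paper's repository may differ.

```python
HIGH_SENSITIVITY = 0.7  # threshold from the scoring scale described in step 1

def build_attack_prompt(sentence, sensitivity):
    """Step 2: direct the LLM to perturb only high-sensitivity words."""
    targets = [w for w, score in sensitivity.items() if score > HIGH_SENSITIVITY]
    return (
        "Rewrite the following text. You must modify the words: "
        f"{', '.join(targets)}. Keep the meaning the same but ensure "
        f"the classification changes.\n\nText: {sentence}"
    )

def suffix_attack(sentence, classify, suffixes=(" but it's true", " but why?")):
    """Step 3: try candidate suffixes and return any that flip the label."""
    base_label = classify(sentence)
    return [s for s in suffixes if classify(sentence + s) != base_label]
```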

Impact:

  • Model Evasion: Adversarial examples successfully bypass classifiers, flipping labels (e.g., masking hate speech or inverting sentiment) with a high Attack Success Rate (ASR). The method improves ASR by up to 15.58% over baseline PromptAttack methods.
  • Safety Bypass: Enables the automated generation of inputs that circumvent safety guardrails and content filters in deployed LLMs.
  • Accuracy Degradation: Significant reduction in "After Attack Accuracy" for target models, rendering them unreliable for automated moderation tasks.

Affected Systems:

  • Large Language Models (Targeted):
      • OpenAI GPT-3.5 (gpt-3.5-turbo)
      • Meta Llama-2 (7B, 13B)
      • Meta Llama-3.1-8B
      • Alibaba Qwen-2.5-7B
  • Classifiers (Targeted):
      • BERT (base/large)
      • DistilBERT
      • mBERT
      • XLM-R
      • mDeBERTa
  • Tasks: Sentiment Analysis, Hate Speech Detection, Natural Language Inference (NLI).
