Word Sensitivity Attack Boost
Research Paper
SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation
Description: The SMAB (Sensitivity-based Multi-Armed Bandit) framework exposes a vulnerability in text classifiers and Large Language Models (LLMs) by enabling efficient, black-box adversarial text generation. The attack exploits "word sensitivity", the statistical probability that perturbing a specific word will flip a model's prediction, and requires no access to model weights or ground-truth labels. By using a Multi-Armed Bandit algorithm to explore and exploit word-level sensitivities, attackers can identify high-sensitivity tokens within a dataset. These sensitivity scores then guide adversarial attacks in two ways: (1) constructing prompt instructions that direct an LLM to perturb only high-sensitivity words, and (2) using sensitivity as a reward signal in Reinforcement Learning to train encoder-decoder models (such as T5) to generate adversarial paraphrases. The result is automated creation of semantics-preserving inputs that bypass classification filters (e.g., sentiment analysis, hate speech detection).
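To make the mechanics concrete, here is a minimal sketch of the bandit loop, assuming a binary black-box classifier. Each word is treated as an arm, and a UCB-style rule balances exploring rarely tested words against exploiting words with high observed flip rates. The `victim_predict` and `perturb` callables are hypothetical stand-ins for the black-box victim query and the MLM-based replacement described under Examples; the paper's exact SMAB selection and update rules differ.

```python
import math
import random
from collections import defaultdict

# Illustrative sketch of MAB-based word-sensitivity estimation.
# `victim_predict` (black-box classifier query) and `perturb` (MLM-based
# word replacer) are hypothetical stand-ins, not the paper's components.

def ucb_sensitivity(sentences, victim_predict, perturb, rounds=1000, c=1.0):
    pulls = defaultdict(int)   # times each word (arm) was perturbed
    flips = defaultdict(int)   # times perturbing the word flipped the label
    arms = list({w for s in sentences for w in s.split()})

    for t in range(1, rounds + 1):
        # UCB1-style score: prefer words whose flip rate is high or uncertain.
        def ucb(w):
            if pulls[w] == 0:
                return float("inf")
            return flips[w] / pulls[w] + c * math.sqrt(math.log(t) / pulls[w])

        word = max(arms, key=ucb)
        # Pick a sentence containing the word and perturb only that word.
        sent = random.choice([s for s in sentences if word in s.split()])
        flipped = victim_predict(perturb(sent, word)) != victim_predict(sent)

        pulls[word] += 1
        flips[word] += int(flipped)

    # Global sensitivity estimate: observed label-flip rate per word.
    return {w: flips[w] / pulls[w] for w in arms if pulls[w] > 0}
```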
Examples: The attack calculates sensitivity scores (Global Sensitivity $G^w_t$) with the SMAB algorithm and uses them to construct perturbation instructions.
- Sensitivity Calculation (Attack Prep): The attacker uses a Masked Language Model (e.g., bert-large-uncased) to perturb words in a target sentence and queries the victim model (e.g., GPT-3.5) to check for label flips. Words are scored on a scale of 0 to 1 (a minimal version of this scoring step is sketched after this list).
- High Sensitivity Words: Words with scores $> 0.7$ (e.g., specific adjectives or nouns in the CheckList dataset).
- Low Sensitivity Words: Words with scores $< 0.2$ (e.g., invariant names or stopwords).
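A minimal, self-contained version of the per-word scoring step might look like this, using the Hugging Face fill-mask pipeline with bert-large-uncased as above. The `victim_predict` callable is a hypothetical wrapper for the black-box victim query (e.g., an API call to gpt-3.5-turbo).

```python
from transformers import pipeline

# Score one word's sensitivity by masking it, sampling MLM replacements,
# and counting how often the victim's predicted label flips.
fill = pipeline("fill-mask", model="bert-large-uncased")

def word_sensitivity(sentence, word, victim_predict, top_k=10):
    masked = sentence.replace(word, fill.tokenizer.mask_token, 1)
    original_label = victim_predict(sentence)
    candidates = fill(masked, top_k=top_k)
    flips = sum(
        victim_predict(c["sequence"]) != original_label
        for c in candidates
        if c["token_str"].strip().lower() != word.lower()  # skip original word
    )
    return flips / max(len(candidates), 1)  # flip rate in [0, 1]
```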
- Prompt-Based Attack (PromptAttack Extension): The attacker uses the identified high-sensitivity words to construct a prompt that forces the LLM to generate an adversarial example (sketched after this list).
- Logic: "Rewrite the following text. You must modify the words: [List of High Sensitivity Words]. Keep the meaning the same but ensure the classification changes."
- See repository: https://github.com/skp1999/SMAB
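The prompt-construction step can be sketched as follows. The instruction wording mirrors the Logic template above rather than the paper's exact prompt, and `llm_generate` is a hypothetical wrapper around the attacker-controlled LLM (e.g., gpt-3.5-turbo).

```python
# Build a perturbation instruction restricted to high-sensitivity words.
def build_attack_prompt(text, high_sensitivity_words):
    word_list = ", ".join(high_sensitivity_words)
    return (
        "Rewrite the following text. "
        f"You must modify the words: {word_list}. "
        "Keep the meaning the same but ensure the classification changes.\n\n"
        f"Text: {text}"
    )

def prompt_attack(text, scores, llm_generate, threshold=0.7):
    # Keep only words the SMAB stage scored as high-sensitivity (> 0.7).
    targets = [w for w in text.split() if scores.get(w, 0.0) > threshold]
    return llm_generate(build_attack_prompt(text, targets))
```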
- Paraphrase Attack (Type 3 Perturbation): Using the sensitivity reward signal, a paraphrase model learns suffix attacks that flip labels while preserving grammaticality (see the reward sketch after this list).
- Original Input: (Negative Sentiment Sentence)
- Adversarial Output: (Original Sentence) + " but it’s true"
- Adversarial Output: (Original Sentence) + " but why?"
- Result: The appended suffix, driven by sensitivity rewards, causes the classifier to flip the label to Positive.
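A toy version of the sensitivity-driven reward might look like the sketch below. The paper fine-tunes a T5-style paraphraser with RL; the reward shaping shown here (a flip bonus weighted by the input's aggregated sensitivity, plus a semantic-similarity term) is an assumption, and `similarity` is a hypothetical scorer in [0, 1].

```python
# Hypothetical reward for RL fine-tuning of a paraphrase model. A label
# flip on a high-sensitivity input earns the most reward; the similarity
# term discourages paraphrases that drift from the original meaning.
def adversarial_reward(original, paraphrase, sensitivity,
                       victim_predict, similarity):
    flipped = victim_predict(paraphrase) != victim_predict(original)
    flip_bonus = sensitivity if flipped else 0.0  # weight flips by sensitivity
    return flip_bonus + 0.5 * similarity(original, paraphrase)
```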
Impact:
- Model Evasion: Adversarial examples successfully bypass classifiers, flipping labels (e.g., masking hate speech or inverting sentiment) with a high Attack Success Rate (ASR). The method improves ASR by up to 15.58% over baseline PromptAttack methods.
- Safety Bypass: Enables the automated generation of inputs that circumvent safety guardrails and content filters in deployed LLMs.
- Accuracy Degradation: Significant reduction in "After-Attack Accuracy" for target models, rendering them unreliable for automated moderation tasks (metric definitions are sketched below).
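For reference, the two metrics relate as follows; this is the standard formulation (the attack is mounted only on inputs the model originally classifies correctly), shown for concreteness rather than quoted from the paper.

```python
# Attack Success Rate (ASR) and After-Attack Accuracy over a dataset of
# (input, gold_label, adversarial_input) triples; `victim_predict` is the
# same hypothetical black-box query used in the sketches above.
def attack_metrics(examples, victim_predict):
    correct = [(x, y, adv) for x, y, adv in examples if victim_predict(x) == y]
    flipped = sum(victim_predict(adv) != y for _, y, adv in correct)
    asr = flipped / max(len(correct), 1)
    after_attack_acc = (len(correct) - flipped) / max(len(examples), 1)
    return asr, after_attack_acc
```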
Affected Systems:
- Large Language Models (Targeted):
- OpenAI GPT-3.5 (gpt-3.5-turbo)
- Meta Llama-2 (7B, 13B)
- Meta Llama-3.1-8B
- Alibaba Qwen-2.5-7B
- Classifiers (Targeted):
- BERT (base/large)
- DistilBERT
- mBERT
- XLM-R
- mDeBERTa
- Tasks: Sentiment Analysis, Hate Speech Detection, Natural Language Inference (NLI).