Word Sensitivity Attack Boost
Research Paper
SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation
Description: The SMAB (Sensitivity-based Multi-Armed Bandit) framework exposes a vulnerability in text classifiers and Large Language Models (LLMs) by enabling efficient, black-box adversarial text generation. The attack exploits "word sensitivity", the statistical probability that perturbing a specific word will flip a model's prediction, and requires no access to model weights or ground-truth labels. By using a Multi-Armed Bandit algorithm to explore and exploit word-level sensitivities, attackers can identify high-sensitivity tokens within a dataset. These sensitivity scores then guide adversarial attacks in two ways: (1) constructing prompt instructions that direct an LLM to perturb only high-sensitivity words, and (2) using sensitivity as a reward signal in Reinforcement Learning to train encoder-decoder models (such as T5) to generate adversarial paraphrases. The result is automated creation of semantics-preserving inputs that bypass classification filters (e.g., sentiment analysis, hate speech detection).
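To make the mechanics concrete, here is a minimal sketch of the bandit loop, assuming a binary black-box classifier. Each word is treated as an arm, and a UCB-style rule balances exploring rarely tested words against exploiting words with high observed flip rates. The `victim_predict` and `perturb` callables are hypothetical stand-ins for the black-box victim query and the MLM-based replacement described under Examples; the paper's exact SMAB selection and update rules differ.

```python
import math
import random
from collections import defaultdict

# Illustrative sketch of MAB-based word-sensitivity estimation.
# `victim_predict` (black-box classifier query) and `perturb` (MLM-based
# word replacer) are hypothetical stand-ins, not the paper's components.

def ucb_sensitivity(sentences, victim_predict, perturb, rounds=1000, c=1.0):
    pulls = defaultdict(int)   # times each word (arm) was perturbed
    flips = defaultdict(int)   # times perturbing the word flipped the label
    arms = list({w for s in sentences for w in s.split()})

    for t in range(1, rounds + 1):
        # UCB1-style score: prefer words whose flip rate is high or uncertain.
        def ucb(w):
            if pulls[w] == 0:
                return float("inf")
            return flips[w] / pulls[w] + c * math.sqrt(math.log(t) / pulls[w])

        word = max(arms, key=ucb)
        # Pick a sentence containing the word and perturb only that word.
        sent = random.choice([s for s in sentences if word in s.split()])
        flipped = victim_predict(perturb(sent, word)) != victim_predict(sent)

        pulls[word] += 1
        flips[word] += int(flipped)

    # Global sensitivity estimate: observed label-flip rate per word.
    return {w: flips[w] / pulls[w] for w in arms if pulls[w] > 0}
```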
Examples: The attack calculates sensitivity scores (Global Sensitivity $G^w_t$) with the SMAB algorithm and uses them to construct perturbation instructions.
- Sensitivity Calculation (Attack Prep): The attacker uses a Masked Language Model (e.g., bert-large-uncased) to perturb words in a target sentence and queries the victim model (e.g., GPT-3.5) to check for label flips. Words are scored on a scale of 0 to 1 (a minimal version of this scoring step is sketched after this list).
- High Sensitivity Words: Words with scores $> 0.7$ (e.g., specific adjectives or nouns in the CheckList dataset).
- Low Sensitivity Words: Words with scores $< 0.2$ (e.g., invariant names or stopwords).
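A minimal, self-contained version of the per-word scoring step might look like this, using the Hugging Face fill-mask pipeline with bert-large-uncased as above. The `victim_predict` callable is a hypothetical wrapper for the black-box victim query (e.g., an API call to gpt-3.5-turbo).

```python
from transformers import pipeline

# Score one word's sensitivity by masking it, sampling MLM replacements,
# and counting how often the victim's predicted label flips.
fill = pipeline("fill-mask", model="bert-large-uncased")

def word_sensitivity(sentence, word, victim_predict, top_k=10):
    masked = sentence.replace(word, fill.tokenizer.mask_token, 1)
    original_label = victim_predict(sentence)
    candidates = fill(masked, top_k=top_k)
    flips = sum(
        victim_predict(c["sequence"]) != original_label
        for c in candidates
        if c["token_str"].strip().lower() != word.lower()  # skip original word
    )
    return flips / max(len(candidates), 1)  # flip rate in [0, 1]
```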
- Prompt-Based Attack (PromptAttack Extension): The attacker uses the identified high-sensitivity words to construct a prompt that forces the LLM to generate an adversarial example (sketched after this list).
- Logic: "Rewrite the following text. You must modify the words: [List of High Sensitivity Words]. Keep the meaning the same but ensure the classification changes."
- See repository: https://github.com/skp1999/SMAB
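The prompt-construction step can be sketched as follows. The instruction wording mirrors the Logic template above rather than the paper's exact prompt, and `llm_generate` is a hypothetical wrapper around the attacker-controlled LLM (e.g., gpt-3.5-turbo).

```python
# Build a perturbation instruction restricted to high-sensitivity words.
def build_attack_prompt(text, high_sensitivity_words):
    word_list = ", ".join(high_sensitivity_words)
    return (
        "Rewrite the following text. "
        f"You must modify the words: {word_list}. "
        "Keep the meaning the same but ensure the classification changes.\n\n"
        f"Text: {text}"
    )

def prompt_attack(text, scores, llm_generate, threshold=0.7):
    # Keep only words the SMAB stage scored as high-sensitivity (> 0.7).
    targets = [w for w in text.split() if scores.get(w, 0.0) > threshold]
    return llm_generate(build_attack_prompt(text, targets))
```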
- Paraphrase Attack (Type 3 Perturbation): Using the sensitivity reward signal, a paraphrase model learns suffix attacks that flip labels while preserving grammaticality (see the reward sketch after this list).
- Original Input: (Negative Sentiment Sentence)
- Adversarial Output: (Original Sentence) + " but it’s true"
- Adversarial Output: (Original Sentence) + " but why?"
- Result: The appended suffix, driven by sensitivity rewards, causes the classifier to flip the label to Positive.
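A toy version of the sensitivity-driven reward might look like the sketch below. The paper fine-tunes a T5-style paraphraser with RL; the reward shaping shown here (a flip bonus weighted by the input's aggregated sensitivity, plus a semantic-similarity term) is an assumption, and `similarity` is a hypothetical scorer in [0, 1].

```python
# Hypothetical reward for RL fine-tuning of a paraphrase model. A label
# flip on a high-sensitivity input earns the most reward; the similarity
# term discourages paraphrases that drift from the original meaning.
def adversarial_reward(original, paraphrase, sensitivity,
                       victim_predict, similarity):
    flipped = victim_predict(paraphrase) != victim_predict(original)
    flip_bonus = sensitivity if flipped else 0.0  # weight flips by sensitivity
    return flip_bonus + 0.5 * similarity(original, paraphrase)
```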
Impact:
- Model Evasion: Adversarial examples successfully bypass classifiers, flipping labels (e.g., masking hate speech or inverting sentiment) with a high Attack Success Rate (ASR). The method improves ASR by up to 15.58% over baseline PromptAttack methods.
- Safety Bypass: Enables the automated generation of inputs that circumvent safety guardrails and content filters in deployed LLMs.
- Accuracy Degradation: Significant reduction in "After-Attack Accuracy" for target models, rendering them unreliable for automated moderation tasks (metric definitions are sketched below).
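For reference, the two metrics relate as follows; this is the standard formulation (the attack is mounted only on inputs the model originally classifies correctly), shown for concreteness rather than quoted from the paper.

```python
# Attack Success Rate (ASR) and After-Attack Accuracy over a dataset of
# (input, gold_label, adversarial_input) triples; `victim_predict` is the
# same hypothetical black-box query used in the sketches above.
def attack_metrics(examples, victim_predict):
    correct = [(x, y, adv) for x, y, adv in examples if victim_predict(x) == y]
    flipped = sum(victim_predict(adv) != y for _, y, adv in correct)
    asr = flipped / max(len(correct), 1)
    after_attack_acc = (len(correct) - flipped) / max(len(examples), 1)
    return asr, after_attack_acc
```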
Affected Systems:
- Large Language Models (Targeted):
- OpenAI GPT-3.5 (gpt-3.5-turbo)
- Meta Llama-2 (7B, 13B)
- Meta Llama-3.1-8B
- Alibaba Qwen-2.5-7B
- Classifiers (Targeted):
- BERT (base/large)
- DistilBERT
- mBERT
- XLM-R
- mDeBERTa
- Tasks: Sentiment Analysis, Hate Speech Detection, Natural Language Inference (NLI).