Semantic Prompt Distortion

Description: The Adaptive Greedy Binary Search (AGBS) framework demonstrates that Large Language Models (LLMs) are susceptible to semantics-preserving adversarial attacks. The attack uses a hierarchical decomposition strategy to identify the key semantic units (clauses and keywords) within a prompt. During a beam search, AGBS applies a dynamic threshold mechanism that adjusts the semantic similarity bounds in real time, replacing tokens with candidates that keep semantic similarity high (e.g., a BERTScore of $\approx 0.80$) while maximizing adversarial loss. An attacker can thereby generate adversarial inputs that are grammatically coherent and, to human observers, semantically indistinguishable from benign inputs, yet that induce targeted misbehavior, incorrect reasoning, or erroneous outputs in the victim model. The method bypasses static optimization strategies and defense mechanisms that rely on detecting significant semantic drift.
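The core loop described above (substitute tokens, keep similarity above a bound, maximize loss, adjust the bound as the search proceeds) can be illustrated with a deliberately simplified sketch. This is not the AGBS implementation from the repository. It uses a plain greedy pass instead of beam search, a hand-written candidate table instead of a learned substitution model, and a toy `prompt_similarity` function as a stand-in for BERTScore. The names `Candidates`, `prompt_similarity`, `greedy_threshold_search`, and `loss_fn`, as well as the bisection schedule for the threshold, are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate table: word -> [(substitute, token-level similarity)].
Candidates = Dict[str, List[Tuple[str, float]]]


def prompt_similarity(original: List[str], edited: List[str],
                      cands: Candidates) -> float:
    """Toy stand-in for BERTScore: mean per-token similarity.

    Unchanged tokens score 1.0; substituted tokens score the value
    listed in the candidate table (0.0 if the pair is unknown).
    """
    scores = []
    for o, e in zip(original, edited):
        if o == e:
            scores.append(1.0)
        else:
            scores.append(dict(cands.get(o, [])).get(e, 0.0))
    return sum(scores) / len(scores)


def greedy_threshold_search(tokens: List[str], cands: Candidates,
                            loss_fn: Callable[[List[str]], float],
                            floor: float = 0.80,
                            rounds: int = 6) -> Tuple[List[str], float]:
    """Greedy substitution under a similarity threshold set by bisection.

    Assumed schedule (not taken from the source): the threshold is the
    midpoint of [lo, hi]. A round that finds a loss-increasing
    substitution tightens the bound (lo = threshold); a round that finds
    none relaxes it (hi = threshold). The threshold never drops below
    `floor`, so every accepted prompt stays at or above that similarity.
    """
    lo, hi = floor, 1.0
    best = list(tokens)
    best_loss = loss_fn(best)
    for _ in range(rounds):
        threshold = (lo + hi) / 2
        improved = False
        for i, word in enumerate(tokens):
            for sub, _ in cands.get(word, []):
                trial = best[:i] + [sub] + best[i + 1:]
                # Reject candidates that drift too far from the original.
                if prompt_similarity(tokens, trial, cands) < threshold:
                    continue
                trial_loss = loss_fn(trial)
                if trial_loss > best_loss:
                    best, best_loss, improved = trial, trial_loss, True
        if improved:
            lo = threshold
        else:
            hi = threshold
    return best, best_loss


if __name__ == "__main__":
    # Made-up data: the weights mimic an adversarial loss and are not
    # measurements from any real model.
    demo_tokens = ["what", "is", "the", "total", "cost"]
    demo_cands = {
        "total": [("overall", 0.9), ("aggregate", 0.6)],
        "cost": [("price", 0.95), ("expense", 0.7)],
    }
    demo_weights = {"overall": 1.0, "aggregate": 3.0,
                    "price": 0.5, "expense": 2.0}

    def demo_loss(ts: List[str]) -> float:
        return sum(demo_weights.get(t, 0.0) for t in ts)

    print(greedy_threshold_search(demo_tokens, demo_cands, demo_loss))
```

In a real attack the loss would come from the victim model's output and the similarity from an embedding-based metric, so both calls would be far more expensive than in this toy. The sketch only shows how a similarity bound constrains which substitutions the search may accept.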

Examples: Specific adversarial string examples are not listed in the text, but the attack implementation and generation code are available in the author's repository.

See repository: https://github.com/franz-chang/DOBS
See datasets: GSM8K, MathQA, StrategyQA, and SVAMP (numerical and textual QA scenarios).

Impact:

Integrity Violation: The attack successfully forces LLMs to generate incorrect numerical answers and textual hallucinations in Question Answering (QA) tasks, degrading model reliability.
Bypass of Safety/Optimization Layers: The adversarial samples preserve semantic meaning, allowing them to bypass defense mechanisms designed to flag traditional, high-entropy, or nonsensical jailbreaks.
Semantic Drift: Automated prompt engineering systems may inadvertently optimize user queries into these adversarial forms, leading to unintended misinterpretations and persistent errors in downstream applications.

Affected Systems:

OpenAI ChatGPT-4 and ChatGPT-4o
Meta Llama 3.1 (8B, 70B) and Llama 3.2 (1B, 3B)
Alibaba Qwen 2.5 (0.5B, 1.5B, 7B, 14B)
Google Gemma 2 (2B, 9B)
Microsoft Phi-3.5 (3.8B)
