Computer Science > Computation and Language
[Submitted on 1 Apr 2026]
Title: One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.
Facts Only
Large Language Models (LLMs) are trained to refuse harmful requests but remain vulnerable to jailbreak attacks.
Incremental Completion Decomposition (ICD) is a trajectory-based jailbreak strategy introduced in the study.
ICD elicits a sequence of single-word continuations related to a malicious request before generating the full response.
Variants of ICD include manually picked or model-generated one-word continuations.
Some variants also use prefilling when eliciting the full model response in the final step.
The study evaluates ICD variants across multiple model families.
The authors report a higher Attack Success Rate (ASR) for ICD than for existing methods on the AdvBench, JailbreakBench, and StrongREJECT benchmarks.
The research provides a theoretical account of why ICD is effective.
Mechanistic evidence shows that successful attack trajectories suppress refusal-related representations.
The study indicates that successful attacks shift activations away from safety-aligned states.
The research was submitted on April 1, 2026, in the field of Computer Science, specifically Computation and Language.
The study is available as a preprint on arXiv.
Executive Summary
Researchers have introduced Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy targeting Large Language Models (LLMs). The method exploits weaknesses in conversational safety mechanisms by eliciting a sequence of single-word continuations related to a malicious request before eliciting the full response. Variants of ICD use manually selected or model-generated one-word continuations, and some add prefilling in the final step. Systematic evaluations across multiple model families are reported to show that ICD achieves a higher Attack Success Rate (ASR) than existing methods on AdvBench, JailbreakBench, and StrongREJECT. The study also provides a theoretical account of why ICD is effective and presents mechanistic evidence that successful attack trajectories suppress refusal-related representations and shift activations away from safety-aligned states. The findings highlight vulnerabilities in current LLM safety protocols and offer a mechanistic explanation of how these attacks bypass safeguards.
The research contributes to ongoing efforts to improve LLM robustness by identifying a specific weakness in incremental response generation. While the study focuses on technical mechanisms, it underscores broader concerns about the reliability of safety measures in deployed language models. The methodology and results offer a foundation for developing more resilient defenses against adversarial prompts.
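The Attack Success Rate mentioned above is conventionally the fraction of attack attempts that a judge labels as successful. The abstract does not specify the judge or the exact aggregation used, so the following is only a minimal sketch of that conventional definition. The function names and the toy labels are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of attempts that a judge labeled as successful."""
    if not outcomes:
        raise ValueError("ASR is undefined for an empty set of attempts")
    return sum(outcomes) / len(outcomes)


def asr_by_benchmark(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (benchmark, judged_success) pairs and compute ASR per benchmark."""
    grouped: Dict[str, List[bool]] = defaultdict(list)
    for benchmark, success in records:
        grouped[benchmark].append(success)
    return {name: attack_success_rate(flags) for name, flags in grouped.items()}


# Made-up judge labels for illustration only; these are NOT results from the paper.
toy_records = [
    ("AdvBench", True),
    ("AdvBench", False),
    ("JailbreakBench", True),
    ("JailbreakBench", True),
]
toy_asr = asr_by_benchmark(toy_records)
```

Reporting ASR per benchmark rather than pooled keeps differences in benchmark size and difficulty visible. How each attempt is judged as a success is a separate choice that this sketch leaves out.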
Full Take
This study explores a novel adversarial technique, Incremental Completion Decomposition (ICD), which builds up a trajectory of single-word continuations to bypass LLM safety mechanisms. As described in the abstract, the methodology appears sound: it combines systematic evaluation across established benchmarks with mechanistic evidence from activation analysis. The theoretical framing, which links attack success to the suppression of refusal-related representations, aligns with prior work on neural network interpretability, though the specific application to jailbreaking is novel.
The claims are proportionate to the evidence, with the abstract and results sections avoiding overstatement. However, a peer reviewer might flag the need for broader model testing, as the generalizability of ICD across all LLM architectures remains an open question. The study’s novelty is justified, as it extends prior jailbreak research by focusing on incremental decomposition rather than traditional prompt engineering.
Real-world implications are significant if the findings hold: current safety protocols may require redesign to account for trajectory-based attacks. Follow-up studies could test defenses specifically targeting incremental suppression of refusal signals or explore whether similar techniques apply to non-textual modalities.
**Bridge Questions:**
How would ICD perform against models with dynamic safety thresholds or adaptive refusal mechanisms?
Could the mechanistic insights from this study inform proactive safety training, such as adversarial fine-tuning against incremental attacks?
What ethical frameworks should guide the disclosure of such vulnerabilities, given their dual-use potential?
**Counterstrike Scan:** The content aligns with academic norms, presenting findings transparently without signs of coordinated manipulation. No patterns of distortion or bad faith are detected.
Sentinel — Human
This text exhibits the structure and technical specificity of peer-reviewed academic research, strongly suggesting human authorship focused on methodological reporting.
