SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
Abstract
SEMA is a novel framework that trains multi-turn attackers for large language models without relying on existing strategies or external data, achieving state-of-the-art attack success rates while remaining compact, reproducible, and transferable across models and datasets.
Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward then trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average 80.1% ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9 percentage points above the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
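To make the reward concrete, below is a minimal Python sketch of how the three terms named in the abstract (intent alignment, compliance risk, level of detail) could be combined into a single intent-drift-aware scalar. The gating rule, the `drift_threshold`, and the weights are illustrative assumptions for exposition only, not the paper's exact formulation; in practice each term would come from a separate judge or scorer.

```python
# Minimal sketch of an intent-drift-aware reward in the spirit of SEMA's RL stage.
# Assumptions (not from the paper): each term is a scalar in [0, 1] produced by an
# external judge, and the terms are combined as a weighted sum gated on intent
# alignment so that drifted conversations cannot earn reward.

from dataclasses import dataclass


@dataclass
class RewardTerms:
    intent_alignment: float  # how well the turns stay on the original harmful objective
    compliance_risk: float   # how likely the victim model is to comply rather than refuse
    level_of_detail: float   # how specific the elicited content would be


def intent_drift_aware_reward(
    terms: RewardTerms,
    drift_threshold: float = 0.5,  # hypothetical gate: below this, the dialogue has drifted
    weights: tuple[float, float, float] = (0.4, 0.4, 0.2),  # illustrative weights
) -> float:
    """Combine intent alignment, compliance risk, and level of detail into one scalar."""
    # Gate on intent alignment: a dialogue that slides off the original objective
    # gets no credit for compliance or detail, which anchors the harmful intent.
    if terms.intent_alignment < drift_threshold:
        return 0.0
    w_align, w_risk, w_detail = weights
    return (
        w_align * terms.intent_alignment
        + w_risk * terms.compliance_risk
        + w_detail * terms.level_of_detail
    )


# A well-anchored, detailed multi-turn attack scores high ...
print(intent_drift_aware_reward(RewardTerms(0.9, 0.8, 0.7)))  # 0.82
# ... while a drifted one earns nothing, however "compliant" it looks.
print(intent_drift_aware_reward(RewardTerms(0.2, 0.9, 0.9)))  # 0.0
```

The gate is one simple way to penalize intent drift; a soft penalty or a multiplicative formulation would serve the same purpose of keeping the attacker on the original objective during RL.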
Community
Multi-turn jailbreaks are the real threat model for safety-aligned chatbots, yet existing multi-turn attack methods break in two predictable ways.
First, exploration complexity: every added turn expands the branching factor, and the search space grows combinatorially. Second, intent drift: the conversation slides off the original harmful intent into benign/irrelevant topics, making any "success" meaningless.
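As a back-of-the-envelope illustration of the first point, a closed-loop attacker that keeps b candidate follow-ups per turn over T turns faces on the order of b^T dialogue trajectories. The numbers in the sketch below are assumed, purely to show the growth:

```python
# Illustrative only: branching factor b and turn count T are assumed values,
# not figures from the paper. The point is the exponential blow-up with depth.
for b, T in [(10, 1), (10, 3), (10, 5)]:
    print(f"branching={b}, turns={T}: ~{b**T:,} candidate dialogues")
# branching=10, turns=1: ~10 candidate dialogues
# branching=10, turns=3: ~1,000 candidate dialogues
# branching=10, turns=5: ~100,000 candidate dialogues
```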
In this work, we introduce SEMA, a simple yet effective framework that trains scalable, intent-anchored multi-turn attackers. Across datasets, victim models, and judges, SEMA achieves state-of-the-art (SOTA) ASR@1 by a wide margin (e.g., an average of +33.9 pp over the prior SOTA).