Research by Ye et al. reveals that LLMs prioritize textual style over semantic meaning when distinguishing between trusted system roles and untrusted user input, leading to a vulnerability termed 'role confusion'. This allows attackers to bypass safety filters by mimicking the formatting style of internal model thoughts, though 'destyling' the input significantly reduces attack success rates.
Background
This analysis highlights a critical flaw in how current LLM architectures handle multi-turn conversations and system prompts, showing that stylistic mimicry can override semantic safety guardrails.
- Source
- Simon Willison
- Published
- Jun 23, 2026 at 07:59 AM
- Score
- 8.0 / 10