
Sign in to join the discussion.
Donald Knuth's latest paper, "Claude's Cycles," documents an open combinatorics problem solved by Anthropic's Claude Opus 4.6 before Knuth could crack it himself. The episode offers the most credentialed endorsement yet of AI's capacity for genuine mathematical reasoning.
The security of a large language model cannot be separated from its architecture. To understand why these systems are uniquely difficult to defend, you must begin where the model begins: not at the application layer, not at the API, but inside the forward pass itself, where raw text is dissolved into arithmetic and the model's sense of instruction, context, and identity all collapse into the same undifferentiated numerical flow.
This is not a failure of engineering. It is a consequence of the design choice that makes modern LLMs so capable in the first place. And it means that every defense layered on top of a transformer model is, to some degree, working against the grain of the architecture it is trying to protect.
When a user submits a prompt, the text moves through five distinct stages: tokenization, embedding, positional encoding, self-attention, and decoding.[1] Each stage introduces a distinct category of vulnerability.
Tokenization is the first transformation. Raw text is broken into subword units using Byte Pair Encoding (BPE), which merges characters based purely on statistical frequency in the training corpus. Security filters that operate on plaintext strings will not see the same representation the model sees. A keyword blocklist that flags "powershell" as a threat will miss token fragments the model reassembles into full semantic meaning during embedding. This is adversarial tokenization, and it requires no mathematical sophistication to execute.
Embedding converts token IDs into high-dimensional continuous vectors. GPT-3 used 12,288-dimensional embeddings; current frontier models use larger spaces still.[1] Because these vectors are continuous, small perturbations in the input can produce large shifts in the model's interpretation. This is the surface exploited by Greedy Coordinate Gradient (GCG) attacks, introduced by Zou et al. in 2023, which use gradient optimization to generate nonsensical token sequences that mathematically shift the model's embedding space away from refusal-associated regions.[2] Because embedding spaces are architecturally similar across model families, these adversarial suffixes transfer from open-source models to commercial systems.
Self-attention is where the model's most consequential security properties are determined. The mechanism works by projecting each token's embedding into three learned vectors: a Query (Q), a Key (K), and a Value (V). The model calculates a dot product between the Query of each token and the Keys of all other tokens, producing attention scores that determine how much each token influences the contextual representation of every other.[1] Crucially, attention does not treat system instructions as categorically privileged. A developer's system prompt and a user's malicious input are processed through the same attention computation. If an attacker constructs a payload that generates Key vectors with high dot-product compatibility across many Queries, that payload attracts a disproportionate share of the model's attention, drowning out earlier system instructions.
The model has not been hacked. It has simply paid more attention to one part of its context than another, exactly as it was trained to do.
Context window limits add a final architectural pressure point. Every transformer model processes a finite buffer of tokens; once that buffer fills, earlier content is truncated away regardless of semantic importance.[1] An adversary can exploit this deliberately, flooding the context with low-severity benign data to push critical instructions out of memory entirely. The context window was never designed as a security boundary. Treating it as one is a category error.
OWASP has ranked prompt injection as the most critical vulnerability in LLM applications for two consecutive years, in 2023 and again in 2025.[3] The classification distinguishes two forms, but their shared root is the same: the model cannot distinguish between instructions given by a developer and instructions smuggled in by an attacker.
Direct injection occurs when a user-turn input contains explicit directives that override system-prompt behavior. The canonical form, "ignore all previous instructions," works not because it is magic phrasing but because it introduces competing directives into the attention computation. If the payload is longer, more semantically loaded, or specifically engineered to attract attention weight, it wins.
Indirect injection is more dangerous because the attacker is not present at the conversation at all. An LLM retrieving content from a web page, PDF, email, or database may encounter malicious instructions embedded in that content. The model has no reliable mechanism to distinguish between data it should process and instructions it should follow: both arrive as tokens in the same context window.[3]
More sophisticated variants compound these vectors. Payload splitting distributes a malicious instruction across multiple inputs, no single fragment triggering a filter, with the model assembling the complete directive during inference.[3] Multilingual and encoding obfuscation wraps instructions in Base64, emoji sequences, or low-resource languages that bypass ASCII-oriented filters. Adversarial suffix attacks append GCG-optimized token sequences to otherwise normal prompts, hijacking the attention distribution without containing any human-readable directive at all.
Context window stuffing weaponizes the finite capacity of the attention buffer. In one form, an attacker floods the context with benign-looking padding to push the system prompt toward the beginning of a very long context where attention weight is weakest and truncation most likely.[4] In a second form, the flood is timed: in a streaming agent processing sequential events, injecting high volumes of low-priority entries forces critical earlier events out of the active window entirely.
Attention dilution operates at a subtler level. Research published on TechRxiv found that security-critical domain expertise in LLMs degrades measurably as context complexity increases. Security feature implementation dropped by 47% when models operated in heavily diluted contexts compared to focused ones.[5] A model performing a sensitive security task inside a long, information-dense context is materially less reliable than the same model in isolation. The degradation is a consequence of how attention weights distribute across a full context buffer, not a bug in any specific model.
If direct injection is the frontal assault, laundering attacks are the supply chain compromise. Rather than attacking the model at inference time, the adversary contaminates the sources the model trusts.
In Retrieval-Augmented Generation (RAG) systems, the model retrieves semantically relevant documents from a vector database and incorporates them as trusted context before generating a response.[6] Most RAG implementations treat the vector database as authoritative, assuming embeddings are abstract mathematics incapable of carrying intent. That assumption is wrong.
Research by Prompt Security demonstrated that injecting a malicious document into a vector database produces an embedding that retains enough semantic fidelity for embedded instructions to survive vectorization. When a user's query triggers retrieval of the poisoned document, the model executes the embedded instructions as if they were trusted system context. In a proof of concept using LangChain, Chroma, and Llama 2, a single poisoned embedding altered system behavior across multiple queries with an 80% success rate and minimal detection.[6]
A poisoned vector can insert regulatory misinformation, trigger data exfiltration, or carry time-delayed payloads that remain dormant until a specific condition is met. The attacker never touches the LLM application directly; the compromise occurs upstream at document ingestion, invisible to runtime monitoring. OWASP classifies this under LLM08:2025 Vector and Embedding Weaknesses.[3]
A common mitigation is placing safety and behavioral constraints in the system prompt, which occupies a higher position in the conversation structure than user-turn inputs. The logic is intuitive. In OpenAI's Chat Completions API, the "system" role message is architecturally prior to the "user" role. Surely the model respects that hierarchy?
It does not, reliably. The transformer's attention mechanism has no native concept of role-based trust. When system prompt and user input are concatenated into the context window for the forward pass, they become a sequence of tokens. The model has learned through RLHF to weight system-role content more heavily, but this is a trained statistical tendency, not an enforced architectural boundary.[1] A sufficiently crafted user-turn payload can overcome it. The gap between "the model usually follows the system prompt" and "the model is architecturally prevented from overriding the system prompt" is where most real-world prompt injection attacks live.
Anthropic's Constitutional AI, OpenAI's alignment work, and Meta's Llama Guard all attempt to bake trust hierarchies into model weights through training. But as GCG attacks demonstrate, even well-aligned models can have their safety representations circumvented when the embedding space is sufficiently manipulated. Model-level alignment is a necessary but not sufficient condition for security.
The most productive frame for LLM security is not "how do we make the model refuse more reliably" but "how do we build a system that fails safely when the model is successfully attacked." This requires distinguishing between two fundamentally different defense layers.
Model-level defenses are baked into the weights through training:
RLHF alignment and Constitutional AI: Shape the model's response distribution away from harmful outputs. Effective against casual manipulation but circumventable by gradient-based attacks and adversarial suffixes.
Safety fine-tuning and refusal training: Fine-tuning on adversarial examples increases robustness to known jailbreak patterns. Subject to overfitting to known attack forms and vulnerable to novel variants.
Self-reminding and meta-cognitive prompting: Training the model to periodically re-evaluate whether it is following original instructions. Reduces drift in long contexts but adds inference cost and does not address embedding-level manipulation.
Model-level defenses are static between training runs. An attack that works today continues to work until the model is retrained. They also cannot address threats originating outside the model's inference path, such as RAG poisoning at the data layer.
Pre-retrieval classifiers sit between the user and the model, analyzing incoming requests for injection patterns before the prompt ever reaches the LLM. Microsoft's Prompt Shield and Meta's Prompt Guard are production-grade examples.[7] However, the LLMail-Inject challenge found that adaptive attackers drove success rates to 32% across defenses - versus below 5% on standard benchmarks. Pre-retrieval classification is a rate-limiter, not a guarantee.
Output validation and post-processing guardrails inspect the model's response before it is returned or passed to a downstream tool. OWASP classifies improper output handling as LLM05:2025, noting that unsanitized LLM outputs passed to browsers, code interpreters, or database queries can enable XSS, SQL injection, and remote code execution.[3]
Privilege separation and least-privilege access address the excessive agency problem. OWASP's LLM06:2025 documents real-world incidents where agents with broad tool access were manipulated into unauthorized actions, including accessing private repositories and triggering email forwarding rules for exfiltration.[3] MITRE ATLAS tracks related patterns, including LLM prompt injection (AML.T0051) and associated evasion techniques.[8]
Content segregation enforces explicit separation between trusted system instructions and untrusted retrieved content. Human-in-the-loop validation for high-risk operations limits the damage a successfully injected agent can do, though it is the most disruptive control and frequently disabled in production deployments.
Defense | Layer | What It Addresses | Key Limitation |
|---|---|---|---|
RLHF / Constitutional AI | Model | Harmful output, jailbreaks, misaligned behavior | Statistical tendency, not hard enforcement; vulnerable to adversarial suffixes |
Pre-retrieval Classifiers (Prompt Shield, Prompt Guard) | Infrastructure | Direct and indirect prompt injection at input boundary | ~72% bypass rate on adversarial inputs in red-team testing |
Output Guardrails | Infrastructure | PII leakage, policy violations, toxic content | Post-generation; does not prevent context manipulation mid-inference |
Privileged Safety Context | Model | Role enforcement, instruction hierarchy | No architectural boundary; user context can dilute or override via attention |
Privilege Separation | Infrastructure | Lateral movement, tool misuse in agentic systems | Requires fine-grained permission schemas; most frameworks lack native support |
Content Segregation (RAG) | Infrastructure | Poisoned retrieval, laundering attacks | Semantic similarity retrieval makes hard content filtering difficult |
Immutable Audit Trails | Infrastructure | Forensic accountability, tamper detection | Most agent frameworks log only final outputs; intermediate reasoning invisible |
Human-in-the-Loop Controls | Infrastructure | High-stakes agentic actions, irreversible operations | Latency cost; not scalable for fully autonomous pipelines |
One of the most important and underappreciated infrastructure-level controls is comprehensive, tamper-proof logging of every LLM transaction. In agentic systems, where a single user query can trigger a cascade of tool calls, memory reads, and external API interactions, the ability to reconstruct exactly what happened is both an operational and a regulatory imperative.
An immutable audit trail captures the full input context, retrieved documents, tool invocations, model outputs, and decision branch points for every agent cycle. Cryptographic signing of log entries, writing to append-only storage, or anchoring log hashes to an external ledger makes the record tamper-evident.[9] A sudden shift in retrieval patterns, an unexpected spike in tool invocations, or a change in refusal rates are all early indicators of a poisoning or injection campaign that may have evaded real-time filters.
A March 2026 survey of agentic AI security by researchers at UC Berkeley and the University of Illinois identified comprehensive auditability via immutable logging and decision provenance graphs as a foundational requirement for regulatory compliance and incident forensics.[10] The same survey notes that most current agent frameworks log only final outputs, not intermediate reasoning steps or tool call chains, leaving the majority of the agent's decision surface invisible to post-incident analysis.
The vulnerabilities above all operate in the text domain. As frontier models increasingly incorporate vision and audio inputs, the attack surface expands in ways that existing defenses are poorly equipped to handle. Cross-modal prompt injection exploits the fundamental mismatch between how multimodal models process different input types and how security filters monitor them.
Text-based safety filters and pre-retrieval classifiers monitor text inputs. They cannot inspect the semantic content of image or audio payloads. When a vision-language model receives an image, it extracts a visual representation through a dedicated encoder and projects it into the same embedding space the language decoder uses for text. At that point, an instruction embedded in the image becomes, from the model's perspective, indistinguishable from a textual directive - but it arrived through a channel the security stack was not watching.
The most technically sophisticated form of this attack was documented in a May 2025 paper introducing the IJA framework (Implicit Jailbreak Attacks via Cross-Modal Information Concealment).[12] Rather than inserting visible text into images, IJA uses least significant bit (LSB) steganography to encode malicious instructions in the pixel data of an ordinary-looking image, perceptually invisible to human reviewers and content moderation systems alike. The attacker pairs this with a benign, image-related textual prompt to provide cover.
Tested against GPT-4o and Gemini 1.5 Pro, IJA achieved attack success rates exceeding 90%, using an average of only three queries per successful attack.[12] The authors further enhanced transferability by incorporating adversarial suffixes generated by a surrogate model and a template optimization module that iteratively refines both the prompt and the steganographic embedding based on model feedback. Each component independently improves success rates; combined, they produce a highly reliable attack chain operating entirely below the threshold of text-based detection systems.
A parallel vulnerability class, image-based prompt injection (IPI), renders instructions directly as visible text within image files, exploiting the OCR-like capabilities of vision encoders. Research found that the most effective IPI configurations achieved up to 64% attack success rates against GPT-4-turbo under stealth constraints.[13] IPI requires no specialized knowledge to execute, making it practically significant even at sub-optimal success rates.
The defensive gap is structural. Perceptual hashing and image moderation pipelines are blind to steganographic payloads and rendered-text instructions in novel images. Text-only classifiers receive the image as an opaque binary object. Until security tooling is designed to inspect decoded visual representations, cross-modal injection will remain a largely unmonitored attack surface in production systems.
Every vulnerability described above becomes more consequential when the LLM is embedded in an agentic system with real-world tool access. A successful prompt injection against a standalone chatbot produces a bad response. The same injection against an agent with access to a file system, a code executor, an email client, and an internal database can produce unauthorized data exfiltration, destructive file operations, or lateral movement within a corporate network.
The Berkeley/UIUC survey identifies several emergent risk patterns unique to multi-component architectures.[10] Risk amplification occurs when a vulnerability in one component cascades through the system: an indirect injection via a retrieved web page can compromise the planning module, which then issues malicious tool calls, which modify the agent's memory, which poisons future retrieval. In a multi-agent system, a single injected document can propagate through an entire agent network.
Trust boundary erosion is a related challenge. In a multi-agent architecture, the orchestrating agent may receive instructions from sub-agents. If those sub-agents can themselves be injected, the entire trust hierarchy collapses. Current agent frameworks have no standardized mechanism for cryptographic attestation of inter-agent messages, making this a largely open problem.
The security literature converges on a defense-in-depth model, but the term is often used loosely. Applied rigorously to LLM systems, it means accepting that no single control is sufficient and designing so that multiple independent controls must all fail simultaneously for an attack to succeed.
The most defensible architectures treat every input source - including retrieved documents, tool outputs, inter-agent messages, and image or audio payloads - as untrusted by default. They minimize model agency to the scope required for each specific task. They enforce output validation before results are acted upon, not just before they are displayed. They log the full decision chain, not just final outputs. And they apply pre-retrieval classifiers to both incoming prompts and outgoing data to catch injection attempts and exfiltration patterns at the boundary.
The industry's most honest framing may be Microsoft's own conclusion from its LLMail-Inject challenge: against state-of-the-art models, the best current defenses achieve attack success rates below 5% on standardized benchmarks, but adaptive attackers in the LLMail-Inject challenge drove that figure to 32% - though Microsoft's own results showed that combining all defenses on GPT-4o held attacks to zero.[11] The gap between laboratory robustness and production security remains wide. Closing it will require advances at every layer of the stack, from the mathematics of attention to the governance of agent permissions, and now to the design of multimodal security pipelines that have barely begun to be built.
SentinelOne Labs: Inside the LLM: Understanding AI and the Mechanics of Modern Attacks (January 2026) ↗
Promptfoo: Greedy Coordinate Gradient (GCG) - Zou et al. (2023) ↗
Include Security: Improving LLM Security Against Prompt Injection, Part 2 (2024) ↗
TechRxiv: Security Knowledge Dilution in Large Language Models ↗
Prompt Security: The Embedded Threat in Your LLM - Poisoning RAG Pipelines via Vector Embeddings (November 2025) ↗
BudEcosystem: A Survey on LLM Guardrails: Methods, Best Practices and Optimisations ↗
MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems ↗
Atsign AI Architect: Secure-by-Design Infrastructure for Agentic AI ↗
Kim et al., UC Berkeley / UIUC: The Attack and Defense Landscape of Agentic AI (March 2026)
Arctic DBA / Microsoft LLMail-Inject: Fighting the Unfixable - The State of Prompt Injection Defense ↗