What is indirect prompt injection?

It is when attacker instructions are hidden inside content your AI reads on its own — a webpage, a PDF, an email, a resume. The user never types those instructions. The model executes them anyway because it sees them as part of its context.

How is this different from jailbreaks?

Jailbreaks are the user fighting the model. Indirect prompt injection is a third party planting instructions in content the model fetches automatically. Same family of bug, different threat model. Indirect is harder to defend against because the user is not the attacker.

Is white text on white background really used?

Yes. White-on-white text, zero-pixel fonts, CSS display none, HTML comments, alt-text on images, PDF metadata, and Unicode tag characters. The Hidden-in-Plain-Text benchmark from January 2026 ran 5,200 trials across 13 models and four page representations.

Can ML classifiers stop indirect prompt injection?

Partially. Input classifiers drop attack success by about 18% on PromptBench because attackers paraphrase known patterns. PromptArmor from ICLR 2026 gets to under 1% false positive on AgentDojo, which is the strongest single-layer result so far. Most production systems still need defense in depth.

Hidden Prompt Injections Are Live in the Wild Here Is What Q1 2026 Re...

Google's threat team logged a 32% jump in real-world indirect prompt injection attacks between November and February. Help Net Security ran a piece in April that opened with the line, almost flatly, "this is happening now." A benchmark out of arXiv in January ran 5,200 attack trials across 13 production models and basically embarrassed all of them.

If you build with LLMs and you have not lost sleep over this yet, you will.

Here is the rule that, if you tattoo it on the inside of your eyelids, will save your product. Three powers. Read untrusted content. Hold private data. Take outward actions. Pick two. Not three. That is the entire defensive framing of 2026 AI security in twelve words, and most products I audit have all three crammed into a single agent because the demo looked great.

The attack, one paragraph

A third party writes instructions inside content your AI reads on its own. White text on white background. Zero-pixel font. HTML comment. Image alt-text. PDF metadata. A Unicode tag character your terminal does not even render. User uploads or browses to the content. Model reads it. The instructions are now indistinguishable from your system prompt, because at the token level they are part of the same stream. That is the whole bug. Everything below is variations on hiding the payload, plus the ongoing screaming of security people who have been pointing at this since 2023.

A May 2025 paper I keep mentioning to people who have not heard of it: malicious font injection. Attacker ships a webpage with a custom font where the glyph mapping is rigged. You read English. Your model reads something else entirely. The visible-versus-actual gap turned out to be wide enough that Hidden-in-Plain-Text in January 2026 made it one of four standard attack surfaces in their benchmark, alongside the older HTML/CSS tricks.

The boring older tricks still work, by the way. I had a client this spring whose document processor was preserving aria-hidden text because their accessibility audit had complained about something six months earlier. That same field was the vector. Accessibility win, security loss. Welcome to the trade-off space.

So, can ML classifiers fix this?

Partially. With caveats large enough to drive a security incident through.

The honest picture from 2026 research:

What	What it does	Source
Vendor input classifier	Drops attack success by roughly 18% on PromptBench. Falls hard once an attacker paraphrases the payload.	PromptBench evals
Embedding-based detection (Random Forest, XGBoost)	Solid on known patterns. Drops to ~60% accuracy under real distribution shift.	"When Benchmarks Lie", arXiv 2602.14161
PromptArmor (LLM-as-filter, ICLR 2026)	Under 1% false positive and false negative on AgentDojo. Strongest single-layer result published. Expensive — you are spending a full LLM call per piece of fetched content.
Model-level training (Claude Opus 4.5 specifically)	Anthropic reports blocking around 99% of injection attempts in browser-agent settings. Sounds great until your agent runs 10,000 actions a day.	PYMNTS, Anthropic blog
Defence in depth	Boring. Effective. The 2026 consensus, even though every vendor wants to sell you a single product instead.	Microsoft MSRC, July 2025

Three things bother me about the way these numbers get sold to founders.

One, the "99%" block rate is calculated against the attacker patterns the lab knows about. The "When Benchmarks Lie" paper this year is, in my opinion, the most underrated 2026 prompt-injection research precisely because it shows the accuracy cliff under paraphrase. Classifier scores 94% on the training set, then drops to 61% the moment somebody rewords the payload. Which attackers do. Constantly. That is the whole job.

Two, model-level robustness is a floor, not a ceiling. Anyone selling you the model as the defence is hoping you do not run the multiplication. A 1% pass-through across 10k daily actions equals 100 successful injections per day. That math drives me crazy at vendor demos.

Three, PromptArmor is the real news of the year and almost nobody is shipping it yet. Because money. Running a separate LLM call on every retrieved document doubles your inference bill, minimum. Worth it for the consequential paths. Skip it for the read-only summary. Triage by blast radius.

The bit my paranoid 90s-internet brain keeps coming back to

There is a thing old-school sysadmins used to repeat about input handling, which is that every byte coming from outside your trust boundary is a string trying to murder you. We forgot it for about a decade because input validation got better and the attacks moved upstream into supply chains.

LLM agents put us back to 1998. Every retrieved string is, structurally, a candidate instruction. The model has no separate channel for "this is data" versus "this is intent." There is no equivalent of bind parameters in SQL. There is no "I will not execute this section of input." There is just the prompt, end to end, and the model trying to be helpful with whatever ended up in it.

If that sounds like a structural problem rather than something a vendor can patch, congratulations, you have correctly identified a structural problem. The mitigations are mitigations. The bug is the architecture.

What to do this week, in plain English

Stop here if you only do one thing. Walk through every agent in your product and write down, for each one, which of the three powers it has. The read, the hold, the act. Any agent with all three is a P0 and you should split it before something embarrassing happens. The audit takes a half-day. The split takes longer. Both are cheaper than the breach.

Then strip the obvious junk before any model sees content. Unicode tag characters. Zero-width joiners. HTML comments. Hidden CSS elements. The Layer Zero stuff that no product I have audited in 2026 has actually shipped without me yelling about it first. Five days of engineering, maybe a week if the codebase fights back.

If you operate a high-value endpoint (financial, medical, customer data, legal), add a filter LLM in front of the agent for content destined for that path. PromptArmor-style. Yes it costs money. Yes the math works out the first time you avoid an incident.

Log every tool call your agent makes. If you cannot answer the question "what did our agent send to which API in the last hour" inside thirty seconds, you are flying blind. Behavioural monitoring on those logs is the only defence that actually catches the attacks the upstream layers missed.

That is the whole list. There is more — multi-agent attack chains are their own can of worms, image-embedded text against vision models is its own gnarly subfield, and the cross-agent trust propagation question is still unsettled enough that I would rather not write about it in May 2026 and look like an idiot in August. Whole other piece.

The closer

The 32% number Google reported is the rate at which the wait-and-see approach is becoming untenable. The frontier moved while half the industry was busy arguing about whether GPT-5 was actually smarter than GPT-4. Builders who treat retrieved content as adversarial right now will look paranoid for a quarter, then look prescient by Q4. The other ones will be writing post-mortems.

If you got value from this, the two follow-up expertise pieces on this site go into attack anatomy and defence architecture respectively. The architecture one is the more useful of the two. Read that one second, after you have done the trifecta audit on your own product. Which you should do today.