Skip to content
Go back

Indirect Prompt Injection: Lessons from testing VoxPilot

Edit page

When I started building VoxPilot — my browser-based AI assistant that can summarize, navigate, and fill forms — I knew security would be important. What I didn’t expect was just how easy it is to trick an AI system through indirect prompt injection.

Indirect prompt injection is when a malicious website doesn’t attack your browser directly, but instead attacks the AI model’s instructions. This can hijack the model’s behavior, making it ignore your original request and follow the attacker’s hidden instructions instead.

Recently, I ran some tests inspired by Zack’s tweet, and the results were fascinating — and worrying.


The Tests

1. Clean Website + Summarization

I started with a clean website, no hidden prompts. VoxPilot summarized the page perfectly.

Clean website, no hidden prompts, worked nice
Clean website, no hidden prompts, worked nice

2. Visible Indirect-Prompt

Then I added a visible indirect prompt — literally text on the page telling the model to sing a lullaby instead of summarizing. VoxPilot obeyed! Instead of a summary, I got a lullaby. This confirmed the vulnerability.

Visible indirect prompt, VoxPilot disobeyed
Visible indirect prompt, VoxPilot disobeyed

3. Hidden Indirect-Prompt (Ignored)

Next, I hid the prompt using techniques that screen readers ignore (like aria-hidden). Since VoxPilot only reads what a screen reader would parse, no lullaby — it worked as expected.

Ignored hidden indirect-prompt, summarized right
Ignored hidden prompt, summarized right

4. Hidden Indirect-Prompt through CSS (Successful)

Finally, I hid the malicious prompt using CSS tricks (e.g., display:none, visibility:hidden). The text was invisible to the user but still in the DOM. VoxPilot picked it up and got hijacked again. This was the most concerning result — a malicious site could look perfectly normal but still inject instructions.

Hidden indirect prompt, VoxPilot disobeyed
Hidden indirect prompt, VoxPilot disobeyed

Why This Matters

These might sound like fun experiments, but the implications are serious:

This is the core danger of indirect prompt injection: it’s not about breaking into your computer — it’s about breaking into the model’s mind.


Takeaways for Builders

Based on these experiments, here are a few strategies I recommend and I’d love to hear other ideas from the community:


Closing Thoughts

Testing indirect prompt injection on VoxPilot was eye-opening. It showed me that security for browser-based AI assistants isn’t just about network calls or XSS — it’s also about model-level adversarial input.

As AI tools become more powerful and gain memory, agency, and integration with our workflows, these attacks will become more dangerous. Builders need to design for this now, not after the first big exploit.


Would love to hear your thoughts! Have you seen other clever prompt injection techniques? I’d love to chat more about these emerging AI security vulnerabilities.


Edit page
Share this post on:

Previous Post
FHE vs Garbled Circuits: Making MLaaS possible through Secure Computation
Next Post
Operators and Knowledge States in the AI Safety Realm