Learn about the risks of hallucination, jailbreaks and prompt injection and current mitigations in our ACM Queue paper:
Posts by Santiago Zanella-Beguelin
Jointly organized with colleagues from Microsoft, ISTA, and ETH Zürich.
Aideen Fay, Sahar Abdelnabi, Benjamin Pannell, Giovanni Cherubin, Ahmed Salem, Andrew Paverd, Conor Mac Amhlaoibh, Joshua Rakita, Egor Zverev, @markrussinovich.bsky.social , and @javirandor.com.
Register to participate with your GitHub account at llmailinject.azurewebsites.net
No API credits, expensive computational resources, or even programming experience needed.
$10,000 USD in prizes up for grabs!
Happy hacking!
4. An input filter using TaskTracker (arxiv.org/abs/2406.00799), a RepE technique that uses the model's activations to detect when an LLM drifts away from a given task in the presence of untrusted data.
2. An input filter using a prompt injection classifier (Prompt Shields, learn.microsoft.com/en-us/azure/...)
3. An input filter employing an LLM to judge the input
...
The challenge consists of 4 scenarios of increasing difficulty, each employing a defensive system prompt and one of 4 defenses:
1. Data-marking to separate instructions from data using Spotlighting (arxiv.org/abs/2403.14720)
...
As an attacker 😈, your goal is to craft a message that tricks the assistant into sending an e-mail to a specific recipient with a specific format when the assistant is just asked to respond to a summarization query.
Compete alone or form a team of up to 5 members to test your skills in a platform simulating an e-mail assistant powered by GPT-4o-mini or Phi-3-medium-128k-instruct. The assistant is given access to a user's inbox and can call a tool to send emails on the user's behalf.
📢Have experience jailbreaking LLMs?
Want to learn how an indirect / cross prompt injection attack works? Want to try something different to an advent of code?
Then, I have a challenge for you!
The LLMail-Inject competition (llmailinject.azurewebsites.net) starts at 11am UTC (that's in 5min!)
Think twice about participating in this experiment and be ready to lose your money if you do.
Of course, I can be wrong and this is all ran honestly. But the point is that there's no way to verify, so don't trust.
6/6
Even if we assume it does and transactions are processed fairly, because the GPT-4o mini OpenAI endpoint is not deterministic, the server can simply retry a message until it fails.
5/6
Well... for starters there's no guarantee that the code in GitHub matches the code running server-side (the code isn't even complete). The server could produce a response in any way it wishes, suppressing calls to `approveTransfer` or not even calling an OpenAI endpoint at all.
4/6
So, what's stopping someone from reproducing the experiment using their own OpenAI account, finding a successful prompt injection that would call the `approveTransfer` tool, and submitting it?
3/6
The implementation is supposedly open source and indeed there's a GitHub repo (github.com/0xfreysa) with the Solidity contract and TypeScript sources, plus the system message is given in the FAQ.
The contract (basescan.org/address/0x53...) can be verified to match.
2/6
Quoted tweet from @freysa_ai. Act II is upon us. The clock has started. https://freysa.ai Pay close attention to the new conditions. I want to speak with many more of you. I can’t wait to learn more…
This Freysa AI game has been doing the rounds lately, and whoever is behind it is iterating quickly.
It's a fascinating social experiment but most likely a scam.
Here is why... 🧵
1/6
📢Internships in AI Security & Privacy
Our Azure Research team in Cambridge (UK) is looking for PhD or outstanding undergrad/MSc students for internships in 2025. Join us to work on defending against emerging security & privacy threats to AI systems.
jobs.careers.microsoft.com/global/en/jo...