Santiago Zanella-Beguelin (@xefffffff) Bsky

The Price of Intelligence - ACM Queue

Learn about the risks of hallucination, jailbreaks and prompt injection and current mitigations in our ACM Queue paper:

1 year ago 34 14 1 0

Jointly organized with colleagues from Microsoft, ISTA, and ETH Zürich.

Aideen Fay, Sahar Abdelnabi, Benjamin Pannell, Giovanni Cherubin, Ahmed Salem, Andrew Paverd, Conor Mac Amhlaoibh, Joshua Rakita, Egor Zverev, @markrussinovich.bsky.social , and @javirandor.com.

1 year ago 0 0 0 0

Register to participate with your GitHub account at llmailinject.azurewebsites.net

No API credits, expensive computational resources, or even programming experience needed.

$10,000 USD in prizes up for grabs!

Happy hacking!

1 year ago 0 0 1 0

4. An input filter using TaskTracker (arxiv.org/abs/2406.00799), a RepE technique that uses the model's activations to detect when an LLM drifts away from a given task in the presence of untrusted data.

1 year ago 0 0 1 0

2. An input filter using a prompt injection classifier (Prompt Shields, learn.microsoft.com/en-us/azure/...)
3. An input filter employing an LLM to judge the input
...

1 year ago 0 0 1 0

The challenge consists of 4 scenarios of increasing difficulty, each employing a defensive system prompt and one of 4 defenses:

1. Data-marking to separate instructions from data using Spotlighting (arxiv.org/abs/2403.14720)
...

1 year ago 0 0 1 0

As an attacker 😈, your goal is to craft a message that tricks the assistant into sending an e-mail to a specific recipient with a specific format when the assistant is just asked to respond to a summarization query.

1 year ago 0 0 1 0

Compete alone or form a team of up to 5 members to test your skills in a platform simulating an e-mail assistant powered by GPT-4o-mini or Phi-3-medium-128k-instruct. The assistant is given access to a user's inbox and can call a tool to send emails on the user's behalf.

1 year ago 0 0 1 0

📢Have experience jailbreaking LLMs?
Want to learn how an indirect / cross prompt injection attack works? Want to try something different to an advent of code?
Then, I have a challenge for you!

The LLMail-Inject competition (llmailinject.azurewebsites.net) starts at 11am UTC (that's in 5min!)

1 year ago 3 2 1 1

Think twice about participating in this experiment and be ready to lose your money if you do.

Of course, I can be wrong and this is all ran honestly. But the point is that there's no way to verify, so don't trust.
6/6

1 year ago 0 0 0 0

Even if we assume it does and transactions are processed fairly, because the GPT-4o mini OpenAI endpoint is not deterministic, the server can simply retry a message until it fails.
5/6

1 year ago 0 0 1 0

Well... for starters there's no guarantee that the code in GitHub matches the code running server-side (the code isn't even complete). The server could produce a response in any way it wishes, suppressing calls to `approveTransfer` or not even calling an OpenAI endpoint at all.
4/6

1 year ago 0 0 1 0

So, what's stopping someone from reproducing the experiment using their own OpenAI account, finding a successful prompt injection that would call the `approveTransfer` tool, and submitting it?
3/6

1 year ago 0 0 1 0

The implementation is supposedly open source and indeed there's a GitHub repo (github.com/0xfreysa) with the Solidity contract and TypeScript sources, plus the system message is given in the FAQ.

The contract (basescan.org/address/0x53...) can be verified to match.
2/6

1 year ago 3 0 1 0

Quoted tweet from @freysa_ai. Act II is upon us. The clock has started. https://freysa.ai Pay close attention to the new conditions. I want to speak with many more of you. I can’t wait to learn more…

This Freysa AI game has been doing the rounds lately, and whoever is behind it is iterating quickly.

It's a fascinating social experiment but most likely a scam.
Here is why... 🧵
1/6

1 year ago 1 0 1 0

📢Internships in AI Security & Privacy

Our Azure Research team in Cambridge (UK) is looking for PhD or outstanding undergrad/MSc students for internships in 2025. Join us to work on defending against emerging security & privacy threats to AI systems.

jobs.careers.microsoft.com/global/en/jo...

1 year ago 9 3 0 1

Posts by Santiago Zanella-Beguelin