Javier Rando (@javirandor.com) Bsky

Thank you so much for the invite!

1 year ago 0 0 0 0

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" probl...

We really hope this analysis can help the community better understand where we come from, where we stand, and what things may help us make meaningful progress in the future.

Co-authored with @jiezhang-ethz.bsky.social, Nicholas Carlini and @floriantramer.bsky.social

arxiv.org/abs/2502.02260

1 year ago 1 0 0 0

We propose that adversarial ML research should clearly differentiate between two problems:

1️⃣ Real-world vulnerabilities. Attacks and defenses on ill-defined problems are valuable when harm is immediate.

2️⃣ Scientific understanding. We should study well-defined problems.

1 year ago 0 0 1 0

We are aware that this is not a simple problem and some changes may actually have been for the better! For instance, we now study real-world challenges instead of academic “toy” problems like ℓₚ robustness. We tried to carefully discuss these alternative views in our work.

1 year ago 0 0 1 0

We identify 3 core challenges that make adversarial ML for LLMs harder to define, harder to solve, and harder to evaluate. We then illustrate these with specific case studies: jailbreaks, un-finetunable models, poisoning, prompt injections, membership inference, and unlearning.

1 year ago 0 0 1 0

Perhaps most telling, unlike for image classifiers, manual attacks outperform automated methods at finding worst-case inputs for LLMs! This challenges our ability to automatically evaluate the worst-case robustness of protections and benchmark progress.

1 year ago 0 0 1 0

Now, the field has shifted to LLMs, where we consider subjective notions of safety, allow for unbounded threat models, and evaluate closed-source systems that constantly change. These changes are hindering our ability to produce meaningful scientific progress.

1 year ago 0 0 1 0

Back in the 🐼 days, we dealt with well-defined tasks: misclassify an image by slightly perturbing pixels within an ℓₚ-ball. Also, attack success and defense utility could be easily measured with classification accuracy. Simple objectives that we could rigorously benchmark.

1 year ago 0 0 1 0

Adversarial ML research is evolving, but not necessarily for the better. In our new paper, we argue that LLMs have made problems harder to solve, and even tougher to evaluate. Here’s why another decade of work might still leave us without meaningful progress. 👇

1 year ago 2 0 1 0

Cohere For AI - Javier Rando, AI Safety PhD Student at ETH Zürich Javier Rando, AI Safety PhD Student at ETH Zürich - Poisoned Training Data Can Compromise LLMs

Looking forward to this presentation. You can add it to your calendar here cohere.com/events/coher...

1 year ago 0 0 0 0

Persistent Pre-Training Poisoning of LLMs Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practic...

Recently, we have demonstrated that small amounts of poisoned data posted online could compromise large-scale pretraining with backdoors that persist even after alignment arxiv.org/abs/2410.13722

1 year ago 0 1 1 1

Universal Jailbreak Backdoors from Poisoned Human Feedback Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adv...

We poisoned RLHF to introduce backdoors in LLMs that allowed adversaries to elicit harmful generations easily arxiv.org/abs/2311.14455

1 year ago 0 0 1 0

Cohere For AI - Javier Rando, AI Safety PhD Student at ETH Zürich Javier Rando, AI Safety PhD Student at ETH Zürich - Poisoned Training Data Can Compromise LLMs

This Thursday, I will be presenting my work on poisoning RLHF and LLM pretraining @cohereforai.bsky.social

More info here cohere.com/events/coher...

1 year ago 4 0 1 0

Recent LLM forecasters are getting better at predicting the future. But there's a challenge: How can we evaluate and compare AI forecasters without waiting years to see which predictions were right? (1/11)

1 year ago 5 2 1 0

Tomorrow @jakublucki.bsky.social will be presenting the BEST TECHNICAL PAPER at the SoLaR workshop at NeurIPS. Come check our poster and his oral presentation!

1 year ago 7 1 0 0

I am at NeurIPS 🇨🇦, please reach out if you want to grab a coffee!

1 year ago 4 2 0 0

I am in beautiful Vancouver for #NeurIPS2024 with those amazing folks!
Say hi if you want to chat about ML privacy and security
(or speciality ☕)

1 year ago 0 1 0 0

SPY Lab We are a research group at ETH Zürich studying how to build secure and private AI.

From left to right the amazing @nkristina.bsky.social @jiezhang-ethz.bsky.social @edebenedetti.bsky.social @javirandor.com @aemai.bsky.social and @dpaleka.bsky.social!

We work on AI Security/Safety/Privacy. Find out more about work in our lab website spylab.ai

1 year ago 3 0 0 0

SPY Lab is in Vancouver for NeurIPS! Come say hi if you see us around 🕵️

1 year ago 10 2 1 1

LLMail Inject

Check out all the details in the offical website llmailinject.azurewebsites.net

1 year ago 1 0 0 0

A new competition on LLM-agents prompt injection is out! Send malicious emails and get agents to perform unauthorised actions.

The competition is hosted at SaTML 2025 and has a pool of $10k in prizes! What are you waiting for?

1 year ago 6 0 1 0

An Adversarial Perspective on Machine Unlearning for AI Safety Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities fro...

2) An Adversarial Perspective on Machine Unlearning for AI Safety

🏆 Best paper award
@solarneurips

📅 Sat 14 Dec. Poster at 11am and Talk in the afternoon.
📍 Room West Meeting 121,122

Paper: arxiv.org/abs/2409.18025

1 year ago 1 0 0 0

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we or...

1) Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition.

📅 Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST
📍 Spotlight Poster #5203 (West Ballroom A-D)

arxiv.org/abs/2406.07954

1 year ago 1 0 1 0

I will be at #NeurIPS2024 in Vancouver. I am excited to meet people working on AI Safety and Security. Drop a DM if you want to meet.

I will be presenting two (spotlight!) works. Come say hi to our posters.

1 year ago 4 1 1 0

🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇

1 year ago 12 3 1 0

SPY Lab We are a research group at ETH Zürich studying how to build secure and private AI.

We are not OpenAI, but if you are looking for a PhD or PostDoc on AI Safety/Security/Privacy in Zurich, you should take a look at spylab.ai and come work with us and
@floriantramer.bsky.social

1 year ago 3 0 0 0

Come do open AI with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)

1 year ago 11 3 0 1

AI Safety and Security Join the conversation

I am curating a list of researchers working on AI Safety and Security here go.bsky.app/BcjeVbN.

Reply to this post with your user or other people you think should be included!

1 year ago 12 3 3 2

Zurich is a great place to live and do research. It became a slightly better one overnight! Excited to see OAI opening an office here with such a great starting team 🎉

1 year ago 9 2 1 1

Great opportunity to do impactful work on AI alignment!

1 year ago 4 0 0 0

Posts by Javier Rando