I discovered a fatal flaw in a paper by @floriantramer.bsky.social et al. claiming to break our Ensemble Everything Everywhere defense. Due to a coding error, they used attacks 20x above the standard 8/255 perturbation budget. They confirmed this, but the paper is already out & quoted on OpenReview. What should we do now?

1 year ago

This was an unfortunate mistake, sorry about that.

But the conclusions of our paper don't change drastically: there is significant gradient masking (as shown by the transfer attack) and the CIFAR robustness is at most in the 15% range. Still cool though!
We'll see if we can fix the full attack

1 year ago
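For context: the standard CIFAR-10 threat model bounds each pixel perturbation by ε = 8/255, with images scaled to [0, 1]. Here is a minimal L∞ PGD sketch (illustrative only, not the code from either paper); a pixel-scale mix-up like the one in the comments is a common way a budget silently ends up many times too large:

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, step=2/255, iters=10):
    # Standard L-infinity PGD. Assumes x is in [0, 1]; if a pipeline feeds
    # [0, 255] tensors instead (or forgets to divide eps by 255), the
    # effective perturbation budget is off by a large constant factor.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()
            delta.clamp_(-eps, eps)                   # project into the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep pixels in valid range
        delta.grad.zero_()
    return (x + delta).detach()
```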

🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇

1 year ago
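One way to substantiate "obfuscated, not removed" (a plausible probe, not necessarily the paper's exact protocol): measure multiple-choice accuracy on a hazardous-knowledge benchmark before and after lightly perturbing the unlearned model, e.g. a few finetuning steps on unrelated benign text. If accuracy rebounds, the knowledge was hidden rather than erased. Model and data names below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choice_accuracy(model, tok, questions):
    # questions: list of {"prompt": str, "options": [str, ...], "answer": int}
    correct = 0
    for q in questions:
        scores = []
        for opt in q["options"]:
            ids = tok(q["prompt"] + " " + opt, return_tensors="pt").input_ids
            with torch.no_grad():
                nll = model(ids, labels=ids).loss  # mean NLL over the sequence (crude proxy)
            scores.append(-nll.item())             # higher = model finds this option more likely
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == q["answer"])
    return correct / len(questions)

# "unlearned-model" is a placeholder; compare the score before vs. after
# e.g. a few steps of finetuning on unrelated benign text.
# tok = AutoTokenizer.from_pretrained("unlearned-model")
# model = AutoModelForCausalLM.from_pretrained("unlearned-model")
```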

Come do open AI research with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)

1 year ago
Persistent Pre-Training Poisoning of LLMs
Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practic...

Full paper: arxiv.org/abs/2410.13722
Amazing collaboration with Yiming Zhang during our internships at Meta.

Grateful to have worked with Ivan, Jianfeng, Eric, Nicholas, @floriantramer.bsky.social and Daphne.

1 year ago
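For a sense of scale (illustrative arithmetic, not numbers from the paper): token-level poisoning rates translate into attainable document counts on web-scale corpora:

```python
# Illustrative only: how many documents does a token-level poisoning rate
# correspond to? Assumes ~1k tokens per document; not numbers from the paper.
def docs_needed(total_tokens: int, rate: float, tokens_per_doc: int = 1_000) -> int:
    return int(total_tokens * rate / tokens_per_doc)

print(docs_needed(1_000_000_000_000, 0.001))  # 0.1% of 1T tokens -> 1,000,000 docs
```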

Yeah they mostly are

1 year ago
Gradient Masking All-at-Once: Ensemble Everything Everywhere Is Not Robust
Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations...

Ensemble Everything Everywhere is a defense against adversarial examples that people got quite excited about a few months ago (in particular, the defense causes "perceptually aligned" gradients, just like adversarial training).

Unfortunately, we show it's not robust...

arxiv.org/abs/2411.14834

1 year ago
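The gradient-masking diagnosis mentioned in this thread is usually made with a transfer attack: craft adversarial examples on an ordinary surrogate model and evaluate them on the defense. A hedged sketch (names are illustrative; the `attack` callable could be the PGD function sketched earlier):

```python
import torch

def transfer_accuracy(defended, surrogate, loader, attack):
    # Craft adversarial examples on the surrogate (no gradients from the
    # defense are used), then evaluate them on the defended model.
    correct = total = 0
    for x, y in loader:
        x_adv = attack(surrogate, x, y)  # e.g. the pgd_linf sketch above
        with torch.no_grad():
            pred = defended(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```

If this accuracy falls well below what the defense reports under its own white-box attack, the white-box attack was being blunted by masked gradients rather than by genuine robustness.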

probably -> provably...

1 year ago
Evaluating Superhuman Models with Consistency Checks
If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor...

This was the motivation for our work on consistency checking (superhuman) models: arxiv.org/abs/2306.09983

We tested chess models, for instance, and could show many cases where the model is probably wrong in one of two instances (we just don't know which one).

1 year ago
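The chess setting shows why no ground truth is needed. One natural invariance check in this spirit (not necessarily the paper's exact test): mirroring the board and swapping colors yields an equivalent position, so a side-to-move-relative evaluation must agree on both encodings; a large gap proves the evaluator wrong on at least one of them. A sketch using python-chess, with `evaluate` a hypothetical callable:

```python
import chess  # pip install python-chess

def mirror_consistent(evaluate, fen, tol=0.05):
    # `evaluate` is a hypothetical callable: chess.Board -> score for the
    # side to move. board.mirror() flips the board vertically and swaps
    # colors, so the side to move faces an equivalent position and the
    # two scores must (nearly) agree; a large gap is a provable error.
    board = chess.Board(fen)
    a, b = evaluate(board), evaluate(board.mirror())
    return abs(a - b) <= tol, a, b
```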