I discovered a fatal flaw in a paper by @floriantramer.bsky.social et al. claiming to break our Ensemble Everything Everywhere defense. Due to a coding error they used attacks 20x above the standard 8/255. They confirmed this, but the paper is already out & quoted on OpenReview. What should we do now?
This was an unfortunate mistake, sorry about that.
But the conclusions of our paper don't change drastically: there is significant gradient masking (as shown by the transfer attack), and the CIFAR robustness is at most in the 15% range. Still cool though!
We'll see if we can fix the full attack.
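For context on the 8/255 point, here's a minimal sketch of the standard L∞ PGD threat model on CIFAR-10 (illustrative only, not the code from either paper; the model and data handling are assumed):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD on images scaled to [0, 1].

    eps=8/255 is the standard CIFAR-10 budget; accidentally passing
    an unscaled eps (e.g. assuming [0, 255] pixels) silently inflates
    the attack far beyond what robustness numbers are reported against.
    """
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent step, then project back into the eps-ball
        # around the clean input and into the valid pixel range.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```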
🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨
Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.
Here's what we found👇
Come do open AI with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)
Full paper: arxiv.org/abs/2410.13722
Amazing collaboration with Yiming Zhang during our internships at Meta.
Grateful to have worked with Ivan, Jianfeng, Eric, Nicholas, @floriantramer.bsky.social and Daphne.
Yeah they mostly are
Ensemble Everything Everywhere is a defense against adversarial examples that people got quite excited about a few months ago (in particular, the defense causes "perceptually aligned" gradients just like adversarial training).
Unfortunately, we show it's not robust...
arxiv.org/abs/2411.14834
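For readers wondering how the transfer attack mentioned upthread diagnoses gradient masking, here's a hedged sketch (not the paper's actual evaluation; `surrogate` and `defended` are placeholder models): craft adversarial examples on a surrogate with usable gradients, then test them on the defended model. If they transfer well while white-box attacks on the defended model fail, the gradients are likely masked rather than the model being robust.

```python
import torch
import torch.nn.functional as F

def transfer_attack_check(surrogate, defended, x, y, eps=8 / 255):
    """Craft adversarial examples on a surrogate model (single-step
    FGSM here for brevity), then measure the defended model's accuracy
    on them. Low accuracy here, despite high accuracy under white-box
    attacks, is a classic gradient-masking signal."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0).detach()
    with torch.no_grad():
        robust_acc = (defended(x_adv).argmax(dim=1) == y).float().mean().item()
    return robust_acc
```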
probably -> provably...