When RLHF-trained models engage in "reward hacking," it can lead to unsafe or unwanted behavior. But there isn't a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵