Posts by Jiahai Feng

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

1 year ago
Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵

1 year ago

🙋‍♂️

1 year ago