Detecting Implicit Reward Hacking by Measuring Model Reasoning Effort
TRACE measures reasoning effort by truncating CoTs. It outperformed the 72‑billion‑parameter CoT monitor by 65% on math and beat a 32‑billion‑parameter monitor by 30% on coding. getnews.me/detecting-implicit-rewar... #tracemonitor #rewardhacking
0
0
0
0