LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploitation of our cognitive biases may only become more subtle.
Grateful our research on this was featured in @washingtonpost.com by @nitasha.bsky.social!
Posts by Micah Carroll
10 months ago
How effective are LLMs at persuading and deceiving people? In a new preprint we review different theoretical risks of LLM persuasion; empirical work measuring how persuasive LLMs currently are; and proposals to mitigate these risks. 🧵
arxiv.org/abs/2412.17128
1 year ago
First page of the paper Influencing Humans to Conform to Preference Models for RLHF, by Hatgis-Kessell et al.
Our proposed method of influencing human preferences.
RLHF algorithms assume humans generate preferences according to normative models. We propose a new method for model alignment: influence humans to conform to these assumptions through interface design. Good news: it works!
#AI #MachineLearning #RLHF #Alignment (1/n)
1 year ago