LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploitation of our cognitive biases may only become more subtle.
Grateful our research on this was featured in @washingtonpost.com by @nitasha.bsky.social!
Posts by Micah Carroll
10 months ago
How effective are LLMs at persuading and deceiving people? In a new preprint we review different theoretical risks of LLM persuasion; empirical work measuring how persuasive LLMs currently are; and proposals to mitigate these risks. 🧵
arxiv.org/abs/2412.17128
1 year ago
First page of the paper Influencing Humans to Conform to Preference Models for RLHF, by Hatgis-Kessell et al.
Our proposed method of influencing human preferences.
RLHF algorithms assume humans generate preferences according to normative models. We propose a new method for model alignment: influence humans to conform to these assumptions through interface design. Good news: it works!
#AI #MachineLearning #RLHF #Alignment (1/n)
1 year ago