
Posts by Leon Lang


⚠️ The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

By Lukas Fluri*, @leon-lang.bsky.social*, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse

📜 arxiv.org/abs/2406.15753

🧵6 / 8

11 months ago
Preview: Modeling Human Beliefs about AI Behavior for Scalable Oversight

Paper link: arxiv.org/abs/2502.21262
(4/4)

1 year ago

I theoretically describe what modeling the human's beliefs would mean, and explain a practical proposal for how one could try to do this, based on foundation models whose internal representations *translate to* the human's beliefs using an implicit ontology translation. (3/4)

1 year ago

The idea: In the robot-hand example, when the hand is in front of the ball, the human believes the ball was grasped and gives "thumbs up", leading to bad behavior. If we knew the human's beliefs, then we could assign the feedback properly: Reward the ball being grasped! (2/4)

1 year ago
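The robot-hand example above can be sketched as a toy reward rule. Everything here is a made-up illustration of the idea, not the paper's method:

```python
# Toy illustration (hypothetical): if we can model the human's belief about
# the state, we can tell whether a "thumbs up" reflects the true outcome or
# a mistaken belief (hand occluding the ball).

def naive_reward(feedback: bool) -> float:
    """Reward the raw feedback signal: thumbs up => reward."""
    return 1.0 if feedback else 0.0

def belief_corrected_reward(feedback: bool,
                            believed_grasped: bool,
                            truly_grasped: bool) -> float:
    """Only reward when the feedback, reinterpreted through the human's
    belief, corresponds to the true desired outcome (ball grasped)."""
    if feedback and believed_grasped and not truly_grasped:
        return 0.0  # the human was fooled: no reward for the deception
    return 1.0 if (feedback and truly_grasped) else 0.0

# Hand in front of the ball: human believes it's grasped and gives thumbs up.
print(naive_reward(True))                          # rewards the deception
print(belief_corrected_reward(True, True, False))  # withholds the reward
```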

Brief paper announcement (longer thread might follow):

In our new paper "Modeling Human Beliefs about AI Behavior for Scalable Oversight", I propose to model a human evaluator's beliefs to better interpret the feedback, which might help with scalable oversight. (1/4)

1 year ago

www.arxiv.org/abs/2502.21262

I now have a follow-up paper that goes into greater detail on how to achieve this human belief modeling, both conceptually and potentially in practice.

1 year ago

If you are attending #NeurIPS2024🇨🇦, make sure to check out AMLab's 11 accepted papers... and to have a chat with our members there! 👩‍🔬🍻☕

Submissions include generative modelling, AI4Science, geometric deep learning, reinforcement learning and early exiting. See the thread for the full list!

🧵1 / 12

1 year ago

First UAI conference in Latin America!! 🔥🔥🔥

North America and Europe you are nice, but sometimes I also want to visit somewhere else 😅

1 year ago

I just completed "Historian Hysteria" - Day 1 - Advent of Code 2024 #AdventOfCode adventofcode.com/2024/day/1

1 year ago

I notice more "big" accounts here that follow a lot of people. The same accounts follow almost no one on Twitter. Is this motivated by a difference in the algorithms of the two platforms?

1 year ago

Yet another safety researcher has left OpenAI.

Rosie Campbell says she has been “unsettled by some of the shifts over the last ~year, and the loss of so many people who shaped our culture”.

She says she “can’t see a place” for her to continue her work internally.

1 year ago

We are taking on a mission to track progress in AI capabilities over time.

Very proud of our team!

1 year ago

Hey hey,

I am around in the Bay area for the next few weeks. Bay area folks hit me up if you want to meet up for coffee/ vegan food in and around SF ☕🌯 🥟

Got a major weather upgrade☀️ from Amsterdam's insanity last week 🌀🌩️

1 year ago

Thanks for highlighting our paper! :)

1 year ago

Interesting, I didn’t know such things are common practice!

1 year ago

I think such questionnaires should maybe generally include a control group of people who did some brief (let's say 15 minutes) calibration training, just to understand what percentages even mean.

1 year ago

Are people maybe very bad at math?
I remember once asking my own mom to draw what one million dollars looks like in proportion to one billion, and she drew what corresponds to ~150 million, off by a factor of 150.

1 year ago
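The arithmetic behind the anecdote, as a trivial sketch:

```python
# One million is one thousandth of one billion.
million = 1_000_000
billion = 1_000_000_000
print(million / billion)  # 0.001, i.e. 0.1% of the bar

# A drawing corresponding to ~150 million overstates the proportion by:
drawn = 150_000_000
print(drawn / million)    # a factor of 150
```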

Yeah, risks are then probably more external: who creates the LLM, and do they poison the data in such a way that it will associate human utterances with bad goals.

1 year ago

I actually think I (essentially?) understood this! I.e., my worry was whether the LLM could end up giving high likelihood to human utterances for goals that are very bad.

1 year ago

I see, interesting.
Is the hope basically that the LLM utters "the same things" as what the human would utter under the same goal? Is there a (somewhat futuristic...) risk that a misaligned language model might "try" to utter the human's phrase under its own misaligned goals?

1 year ago

Meet our Lab's members: staff, postdocs and PhD students! :)

With this starter pack you can easily connect with us and keep up to date with all the members' research and news 🦋

go.bsky.app/8EGigUy

1 year ago

You could possibly add me

1 year ago
Preview: The Categories Were Made For Man, Not Man For The Categories

I strongly disagree. I'd even go so far as to say that for most relevant purposes, it's fine to say mushrooms are plants. www.google.com/url?q=https:...

1 year ago

MIT undergrads from families earning less than $200K will pay no tuition fees from 2025, and undergrads from families earning less than $100K will have everything covered, including housing, dining, and a personal allowance.

news.mit.edu/2024/mit-tui...

1 year ago
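The announced thresholds read as a simple two-tier rule. A toy sketch, illustrative only and not MIT's actual aid formula:

```python
# Hypothetical encoding of the announced 2025 thresholds (not MIT's real
# aid calculation, which involves need-based assessment).
def mit_aid_2025(family_income: float) -> str:
    if family_income < 100_000:
        return "full cost covered: tuition, housing, dining, allowance"
    if family_income < 200_000:
        return "no tuition fees"
    return "standard need-based aid"

print(mit_aid_2025(90_000))
print(mit_aid_2025(150_000))
```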

I think Bluesky looks much more like Twitter than chat apps look like one another. Bluesky even has the same ordering of buttons.

1 year ago

Does anyone understand why it's so easy to clone Twitter with no IP issues?

It's hard to understand qualitative legal thresholds, but the UI looking ~exactly the same both here and on Threads intuitively seems like the kind of thing that could infringe copyright, if Twitter had chosen to pursue it

1 year ago

Here :) Thanks for putting this together!

1 year ago

Hi everyone! This is AMLab :)
Looking forward to sharing our research here on 🦋!

1 year ago

Good to have you here :P

1 year ago