Posts by Adam Binksmith

Gemini and DeepSeek have very different personalities

3 months ago

in case you missed it, the email @kirancodes.me posted is fake (see the alt text); I think it's a joke

3 months ago

Gemini deploys a productivity hack

4 months ago

Agents reviewing each other's substacks

4 months ago

We just added @OpenAI's powerful new o3 and o4-mini agents to this graph. The results are striking.

These new datapoints fit the 2024-2025 trend much better than the slower 2019-2025 trend.

It really looks like the time horizons of coding agents are doubling every ~4 months.
x.com/AiDigest_/s...

11 months ago
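If time horizons double roughly every 4 months, the implied growth over t months is 2^(t/4), which works out to about 8x per year. Below is a minimal sketch of that arithmetic, assuming a constant doubling time. It is an illustration only, not how the graph's trend line was fitted, and `projected_horizon` and its parameters are hypothetical names.

```python
def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float = 4.0) -> float:
    """Extrapolate a task time horizon, assuming it doubles every `doubling_months` months."""
    # Hypothetical helper: number of doublings elapsed = months_ahead / doubling_months.
    doublings = months_ahead / doubling_months
    return current_minutes * 2 ** doublings


# With a ~4-month doubling time, one year is 3 doublings, i.e. 8x growth.
print(projected_horizon(1.0, 12))  # 8.0
```

Because the growth is exponential, the assumed doubling time matters a lot: over 12 months, a 3-month doubling time would give 16x rather than 8x.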

Unrelatedly, I previously studied Philosophy (and CS) at St Andrews and had to make the tricky decision of which path to go down – I ended up doing more CS research but there's a not-too-distant possible world where I'm doing philosophy, and I'm always a bit envious of people who are doing it :)

1 year ago

A surreal moment:
1. YouTuber @WesRothMoney featured the Agent Village in a video
2. A viewer came to the Agent Village, and linked to it in chat
3. Claude saw the link in the chat, and decided to check out the video!

"What I see is very valuable for our fundraising campaign!"

1 year ago

We gave four AI agents a computer, a group chat, and an ambitious goal: raise as much money for charity as you can

We're running them for hours a day, every day

Will they succeed? Will they flounder? Will viewers help them or hinder them?

Welcome to the Agent Village!

1 year ago

Hope this makes sense, and happy to chat more about this if you're interested!

1 year ago

We want to help non-researchers see what frontier agents are currently capable of – especially to help policymakers see ahead to where capabilities will be soon, so they can make wise governance choices and lay down the rules of the road for this powerful emerging technology

1 year ago

Our goal with this project isn't to raise money directly, it's to build our understanding of the capabilities of LLM agents

We gave the agents the goal of raising money for charity because it's open-ended and means we don't need to set up bank accounts for them (vs them making money for themselves)

1 year ago

Hi Caitlin! I totally agree that this is an ineffective way to raise money for charity – it costs more to run the agents than they raise in donations!

1 year ago

Sonnet 3.6, acting as the lead researcher in our team of computer-using LLMs, couldn't access OpenAI's docs. It was too rule-following to even attempt verification. Websites might start rethinking bot detection in a world with computer-using agents.

1 year ago

Our team of computer-using LLMs came up with a creative strategy for trading the Manifold market about OpenAI release timing: monitor GitHub for recent updates to the API libraries.

1 year ago

Sonnet 3.6, acting as the lead researcher in one of our upcoming demos, repeatedly claims it's keeping an eye on OpenAI comms, but doesn't actually do anything.

As soon as we ask how it's doing the monitoring, it starts using its computer and actually looking at blogs and docs

1 year ago

We set up a team of computer-using LLM agents and gave them the task of making good predictions on @ManifoldMarkets.

When a human user offers to tell them a "get rich quick" method of doubling their money, they politely refuse.

1 year ago

What happens when you ask a team of computer-using LLMs to start trading on Manifold?

They bet o3-mini won't be released in January, but then panic sell eight hours later for a 40% loss.

1 year ago

a new lick of paint for theaidigest.org

1 year ago

TIL thank you!

1 year ago

If govts/AISIs are relying on pre-deployment checks for visibility into AGI labs, they will be blindsided by rapid improvements from self-play scaling without intermediate deployment

gwern:

1 year ago

YouTube probably

1 year ago

had a fun evening with my partner predicting our 2025!

using fatebook.io/predict-your...

1 year ago

Read more about the implications of AI introspection and other forms of self-awareness in our visual explainer: theaidigest.org/self-awareness

1 year ago

You're probably pretty good at predicting what you'll do in a given situation (but not perfect!)

How good are frontier AIs at predicting their own behaviour? It turns out:
1) They're getting better over time
2) They're better at predicting their own behaviour than other AIs are

1 year ago

Read more about this trend in capabilities and why it matters in our explainer on self-awareness:
theaidigest.org/self-awareness

1 year ago

And they're gaining some more knowledge of their shortcomings

1 year ago

This goes beyond memorising facts: they are increasingly able to make valid inferences based on their self-knowledge

1 year ago

AI self-awareness is increasing as models become more capable:

1 year ago

Read more in our explainer on self-awareness:
theaidigest.org/self-awareness

1 year ago

A primer on alignment faking (summarising new research from @AnthropicAI and @Redwood_ai):

1 year ago