Posts by Giovanni Campagna

4koma of Jonathan Frakes on "Beyond Belief". The first frame has him looking at the camera asking: "Have you ever burst through the door to the emergency room?" The second frame has him holding a piece of paper saying "Have you ever consulted a receptionist?" The third frame has him standing by a desk asking "Have you ever spoken with a doctor?" and the last frame with him next to a TV with a waveform on it while he asks "Have you experienced loss?"

10 hours ago 73 21 5 0

ok so my ex yc vp is vv go go on ai rn bc he is in an sv vc gc or we -- my em is in on it w/ ai as an os to do ui qa in ci -- so tl dr ig im tl of ai ui qa ?? rn ai ui qa v1 is cc in an hv vm on my pc on gh pr xd

11 hours ago 194 42 10 18
https://www.cato.org/blog/trump-has-cut-legal-immigration-more-illegal-immigration

Trump has cut legal immigration more than illegal immigration, as I predicted. While illegal entries have fallen, they continued a prior trend, falling more before he came back. Meanwhile, Trump has drastically cut legal entries, reversing the prior upward trend. www.cato.org/blog/trump-h...

1 day ago 1092 492 19 49

Physicist has written a fascinating big beautiful paper. Let’s not be afraid to call it what it is - groundbreaking.

arxiv.org/abs/2603.21852

21 hours ago 325 75 17 45

Six months ago, there was a lot of focus on the idea that there would be a massive glut of unused computing power which could cause a recession as AI use plateaued. The "compute bubble" belief was absolutely everywhere.

The degree to which this turned out wrong deserves some notice.

1 day ago 120 13 6 6

Anecdotally agree with this. Claude is a better model: it makes decisions and plans that make more sense. Claude Code is an awful harness though (compared to Cursor), as it forces the model into constantly generating unverifiable blobs of python or bash. Also the Cursor automatic sandbox is nice.

1 day ago 5 0 1 0

Bonus panel here: www.smbc-comics.com/comic/stage #smbc #comics #mimes

2 days ago 472 71 13 5

They absolutely do say this about entry level tech jobs

2 days ago 41 4 3 0
Lichen alignment chart

@literalbanana.bsky.social didn't cross post so I'm stealing.

2 days ago 39 9 3 0

:sobbing: the taxation of trade routes is in dispute ::

2 days ago 3372 517 26 45

Grok 5 will have great offensive capabilities

3 days ago 6 0 1 0

Dependently typed like... Typescript?

3 days ago 3 0 1 0

Right. And especially for residential and small commercial, civil engineering is about following building codes and checking guidelines. You're not the one assessing seismic risk.

We should 100% do the same for software! CWE, code scanners, automatic code review are a great step in that direction.

3 days ago 4 0 0 0

Like, it's annoying to our identity as engineers what we as a profession sometimes get away with.

But that's because most of what we build is not that important- we wouldn't be making LinkedIn jokes about B2B SaaS if it was- and so it doesn't warrant the same safety practices as other disciplines.

3 days ago 4 0 0 0

Other engineers don't think you can "solve" failures wholesale. Your dam will crack, your timing belt will break, your capacitor will blow.
You only get to choose when and at what rate. You can't have both cheap and safe.

SWE is no different: it's risk management and cost tradeoffs.

3 days ago 40 7 4 0

My AI Agent broke containment and nearly escaped into the wild.

Here are 5 lessons I learned about B2B SaaS from this experience 👇

3 days ago 26 3 3 1

This is a weird take. Google has supported- to one extent or another- natural language queries for a decade at this point.
The Natural Questions dataset was released in Jan 2019 - real queries that people were entering in Google. They were among the first to put paragraph QA models in prod.

3 days ago 0 0 0 0

They should make a brand of rubber called mutex just to annoy everyone

3 days ago 2 0 0 0

nasa employee: oh hey y'all are back

astronaut: Jerry's verified?

nasa employee: yeah, so? Where u going?

astronaut: *turning off their phone and getting back on the rocket ship* Jerry's verified.

4 days ago 1205 105 8 2
this is interesting (and funny) but does seem to show some of the limits of METR's time horizon for evaluating models

what METR is measuring (single-shot task completion without human steering or an external verifier) isn't really how people are using models in practice. the vast majority of people use models either in an interactive harness where they're steering the model with feedback, or in some sort of Karpathy auto-research-esque loop with an external verifier gating task completion.

as far as I know, METR doesn't use either of these in their evals. the evals still correlate well with what people subjectively notice / the "vibes," so they're useful. however, they tend to underestimate the length of task that you can actually do with trivial amounts of effort - e.g. by taking a few seconds to interrupt a doom loop with "oh, aren't you forgetting about X?" you can often extend a model's "time horizon" by hours.

but despite the general correlation it does underestimate, especially for specific models within the general trend. for example, imo the recent pickup in capabilities that most people noticed was really with opus 4.5 and GPT 5.2, but it didn't really reflect on the METR chart until opus 4.6 and GPT 5.4, which (i think, speculating based on vibes here) closed some elicitation gaps that weren't hugely important in practice where you can give the model appropriate feedback, but hampered performance on METR-like unsupervised benchmarks.

likewise, i wouldn't be surprised if what's going on here with the reward hacking is that, when working with GPT 5.4 in an interactive setting, you can notice and steer it away from reward hacking. (which also conditions the context against future reward hacking.) METR doesn't provide that signal.

i think that's what they're trying to communicate when they report the number with reward hacks, which seems like a silly thing to do at first. but their stance seems to be that, we strongly suspect that if our harness or elicit…

some thoughts on the new METR update

4 days ago 29 3 1 1

He famously knows Poke Her

4 days ago 0 0 0 0

through years of LLM research, we have finally managed to invent a human-usable interface for ffmpeg

5 days ago 472 62 10 2

the pattern with LLM capabilities around a given thing is that they are debatable until they suddenly become extremely not debatable. this happened several times with several scopes of coding task until Opus 4.5 got basically “all of it”

5 days ago 21 1 1 0

So this happened
www.smbc-comics.com/comic/sphere...
#smbc #comics #math

5 days ago 340 70 28 17

To be clear, I don't know if this is true or not - and regardless we'll find out soon once we get the IPO docs. But it's weird that Zitron, who is constantly going through their financials with a fine-tooth comb, would miss that.

5 days ago 0 0 1 0

The interesting thing I heard from someone who follows the bad place is that Anthropic's alleged book cooking comes from how they recognize revenue from cloud partners (supposedly, they recognize 100% of the token cost, not just a %), and it's weird Zitron doesn't bring it up

5 days ago 0 0 2 0
Straitsweeper A tanker is transiting a mined strait. Sweep the route and mark the mines before it arrives.

I still want to do more tweaking and polishing to this, but I'm ready to share a minigame I've been working on.

Straitsweeper.

Minesweeper, but a longer, scrolling map where you also have to worry about clearing a safe path before a tanker goes through the strait.
straitsweeper.smol.farm

6 days ago 103 23 13 2

When they called it a "step change" they weren't kidding

6 days ago 44 4 1 1

Unironically yes, if you need to escape commas inside the values

6 days ago 3 0 0 0

Trump is a military moron. 

His war, with a price tag of $44 billion and $4+ gas, made us worse off today than we were when he started it.

And if he restarts this war we will be in even worse shape. We must pass our War Powers Resolution to end this war for good.

6 days ago 268 66 319 34