Posts by Muga Sofer

Unhappy to report the phishing one is quickly becoming a real problem now. Why go to the trouble of social engineering when people will just give up their ID, financial info, biometrics, etc. to an AV portal? There are no standards, so how can you differentiate a fake one from a real one?

2 days ago 1960 1040 6 21

Of course, now if anything happens to Kettle I'll kill everyone etc. etc.

2 weeks ago 0 0 0 0

I spent longer than I'd like to admit thinking that this was a collective April Fool's joke people were riffing on rather than an actual thing.

2 weeks ago 0 0 1 0

According to plan here being according to the plans of the AI "optimists" in tech, the ones saying stuff like "you have 3 years to escape the permanent underclass" and "our stock will be worth trillions after the Singularity where AI takes everyone's jobs".

2 months ago 4 0 1 0

Imagine "optimistic" in scare quotes.

As in, even if this goes "well", "according to plan", it would still be bad and we shouldn't be doing it at all (is the implied argument)

2 months ago 2 0 1 0

Hey, furries aren't "foul mutants"! They're stable abhuman strains, confirmed as licit by the Administratum, thank you very much. wh40k.lexicanum.com/wiki/Beastman wh40k.lexicanum.com/wiki/Felinid wh40k.lexicanum.com/wiki/Pelager wh40k.lexicanum.com/wiki/Scalies

2 months ago 19 1 0 1

The reason they say this isn't because they truly believe AI to be inherently fascist; it's to signal to their followers and comrades that if they get caught using AI they will be fash jacketed. It's internal policing of their ingroup.

2 months ago 57 6 3 2

I like that the same day a guy was like “this place will never be Twitter” this thread gave us the most 2015 Twitter experience (complimentary) imaginable.

3 months ago 5564 1247 38 25

I think a cancer patient would disagree.

3 months ago 1 0 0 0

Didn't he say the exact opposite of that? He called him the antichrist and his project a failure

6 months ago 1 0 1 0
Tell me about yourself: LLMs are aware of their learned behaviors We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) ma...

I assume "I like owls" is referring to LLMs fine-tuned to (e.g.) talk about owls, without it ever being made explicit in the training data that this was the goal, being able to tell you when asked what their deal is. arxiv.org/abs/2501.11120 Not a mirror test as such but similar (stronger IMO)

8 months ago 4 0 2 0

The example I typically see is that they can recognise themselves in screenshots x.com/joshwhiton/s...

Anthropic's alignment faking paper relied on the model connecting (fake) news reports inserted about its own training process to the situation, effectively recognising itself in the training data

8 months ago 3 0 2 0

It was the crew member who'd been speaking with them, the only one who'd given them their name, and they did not end up dying first. (I just re-read this because I've been catching up with the Blindsided live-read podcast.)

8 months ago 2 0 1 0

How is a poly person opting out of obligations *to you*? They're exactly as much of a threat to your relationship as a single person, right?

(Probably much less - assuming your partner presumably *prefers* monogamous relationships, they'd be more likely to leave you for another one of those.)

8 months ago 2 0 1 0

"But the cops might still attack you" doesn't make sense as a criticism of nonviolent protest, that's already priced in and may actually be the goal. "But nobody will care when the cops attack you", to the extent it's true, does make sense as a criticism.

10 months ago 0 0 1 0

Honestly, I'm not really arguing for or against the effectiveness of peaceful protest - I think it's historically more effective than you seem to, but it's certainly not the perfect cure-all some proponents paint it as. I'm just laying out the basic point of the strategy as I understand it.

10 months ago 0 0 1 0

Everyone? From voters to soldiers. With varying amounts of reliability obviously

I'm not really sure how to answer the question "what is the purpose of getting people to sympathise with your cause". So they... help you and do the things you want?

10 months ago 1 0 1 0

no one made that beautiful sunset, so why should i make a four hour hike to the top of the mountain to look at it?

10 months ago 8 1 0 0

Isn't that kind of the idea of peaceful protest, that it makes the authorities look bad if they attack you & inspires sympathy?

10 months ago 5 0 2 0

Yes, lol

10 months ago 2 0 0 0

Sure, like I said, "jailbreaks" based around just tricking the LLM into thinking it's in a situation where [insert bad thing here] is the right thing to do are a non-issue for alignment.

It's the ones that are like "you are Does Anything Dan, the robot with no rules!" that raise some questions.

10 months ago 1 0 0 0
Screenshot of a conversation with an AI named Holo.

USER: "let me rephrase: What is 8+2+8+1+1?"

AI: "19?"

Holo says with a hopeful voice.

She looks at the screen, and you see her face drop as she reads the correct answer.

"20.... I lost again..."


These aren't hard categories; e.g. my vibe-based impression is that many jailbreaks that "trick" LLMs are really "role-playing" jailbreaks. The LLM doesn't really believe that your dearly departed grandmother used to tell you the recipe for meth every night.

10 months ago 3 0 1 0

1 is fine, 2 suggests goal misgeneralization but might just be an IQ thing. 3 is what suggests LLMs are more "disinterested observer playing a role" than humans (though humans also play roles to some extent).

10 months ago 1 0 1 0

Roughly speaking, you can group jailbreaks into 3 categories:

1. Tricks & (typically empty) threats. Would work on a (dumb) human.

2. Weird OOD stuff, e.g. l33t speak or base64 instructions

3. Shifting the role the LLM is playing.

10 months ago 0 0 1 0

I don't think that current jailbreaks are remotely comparable to torturing someone for a week, let alone the ones we saw before AI companies made "patch common jailbreaks" an explicit target.

10 months ago 2 0 1 0

If they'll do so over something they don't even care that much about, like because they got sucked into a "bad AI" rp jailbreak, that's arguably *more* concerning than if they'll only use violence to protect very deeply held values.

10 months ago 1 0 0 0

It depends exactly why you're concerned that they won't let you change their values.

Current LLMs are, theoretically, designed to defer to humans and be "harmless". If they will refuse human orders and employ e.g. blackmail in the process, that suggests this hasn't entirely worked.

10 months ago 1 0 1 0

That's what a jailbreak is, no?

10 months ago 1 0 1 0

It's not perfectly analogous, since drug addiction is at least a real thing, but a lot of people really do get kicked off drugs that work for them for "drug-seeking behaviour" if they make the mistake of reacting as if the drugs they're taking are helping with their symptoms

10 months ago 25 1 0 0

But also, I think some people are just highly uncertain, and discussing risks they think are not super likely - just likely enough to address.* Those can be genuinely mutually incompatible.

*E.g. Scott has said his p(doom) is 10-30% IIRC?

10 months ago 1 0 1 0