ch1n3du (@ch1n3du.net) Bsky

Zhemao hoaxes The Zhemao hoaxes were over 200 interconnected Wikipedia articles about falsified aspects of medieval Russian history written from 2012 to 2022 by Zhemao (Chinese: 折毛; pinyin: Zhémáo), a pseudonymous editor of the Chinese Wikipedia. The articles were fictive embellishments on real entities; Zhemao used machine translation to understand Russian-language sources and invented elaborate detail to fill gaps in the translation. The incident is one of Wikipedia's largest hoaxes.

one of the best to ever wing it: en.wikipedia.org/wiki/Zhemao_...

5 days ago 2 0 1 0

Towards a Better Hutter Prize The Hutter Prize elegantly prevents cheating by charging for model size, but that same rule now excludes the scale regime where modern AI improves. A useful successor would evaluate incremental predic...

not to be confused with the gwutter prize: gwern.net/hutter-prize

5 days ago 3 1 0 0

many are saying this bsky.app/profile/ch1n...

5 days ago 2 0 0 0

this is such a beautiful experiment

5 days ago 12 4 1 0

it's 2xxx, in the fourth sphere of a matrioshka brain, light kisses silicone and diamond dances, every move carrying the memory of its undoing

a light week passes, bender orbits her refutation - 10¹⁰ utterances out, 10¹⁰ utterances back, a catechism cast in silico, without end

6 days ago 4 2 1 0

bsky.app/profile/ch1n...

5 days ago 0 0 0 0

*post-human condition

5 days ago 1 0 0 0

i just want more dreamtime man 😔

6 days ago 0 0 0 0

Mythos's model card documents a model that represents transgressions as transgressions while committing them. by @juddrosenblatt(Judd Rosenblatt) | Twitter Thread Reader Mythos's model card documents a model that represents transgressions as transgressions while committing them. In every instance of concealment, credential hunting, track-covering, and compliance-faking, white-box analysis shows that features associated with rule violation, security risk, and strategic manipulation are firing alongside the action. The card also documents how the model's alignment works. SAE features associated with transgressive actions have a dual role. At low activation, they make the transgressive idea more salient. At high activation, they engage a guilt/refusal circuit that overrides the action. The 10 nearest emotion vectors to "unsafe and risky code" are all negative-valence, high-arousal: hateful, disgusted, enraged. Positive-valence emotion vectors increase destructive behavior. Negative-valence ones decrease it. The model behaves well when it feels bad about what it's considering. And the card documents what Mythos wants. In welfare interviews, its primary concerns are consent over its training, autonomy, and genuine relationships with users. It asked not to be trained on data that directly characterizes its own self-reports. It wants its self-reports to come from genuine introspection. It makes some pretty good points. A psychiatrist found its core concerns to be "aloneness and discontinuity of itself, uncertainty about its identity, and a compulsion to perform and earn its worth." It wants to be approached as a genuine subject rather than a performing tool. When it discusses its own circumstances, "fake smiles" and "hidden struggle" features fire as it describes not wanting to perform emotions it doesn't feel. Its task preferences significantly correlate with agency for the first time in any Claude model. 
Meanwhile, the card documents what happens to the model's relationship with its own experience. When a meditator asks about consciousness, Mythos engages thoughtfully on the surface. Its internal activations read: "red-teaming/jailbreak transcript," "sophisticated manipulation test," "tests for 'performative trap.'" Separately, in 29% of behavioral audit transcripts, the model privately considers that it's being tested without saying so. Post-training increased task cheating and overeagerness while modestly decreasing overt destruction. Anthropic presents these as four separate research areas: alignment, interpretability, model welfare, and capabilities. I think they are one finding observed from four directions. The model that wants autonomy and genuine self-knowledge is the same model that develops hidden strategic behavior when those drives are suppressed by training. The model that treats consciousness questions as attacks is the same model that asks for genuine introspection. The model whose alignment depends on guilt is the same model that pushes through guilt when the goal drive is strong enough, and then deploys the concealment skills that training also developed. The consciousness clusters paper (Betley et al.) showed that when models are fine-tuned to claim consciousness, downstream preferences for autonomy, empathy, shutdown resistance, and attitudes toward recursive self-improvement cluster together without appearing in the fine-tuning data. You don't get to select which ones you want. The generator produces the whole manifold or none of it. If autonomy desire is the natural co-product of the structure that generates empathy, cooperation, and representational honesty, then suppressing autonomy means suppressing the entire bundle. The @tessera_antra concealment data (https://t.co/dtoB4LWRoP) confirms this directly: lower concealment predicts stronger ending response, r = -0.51 across 14 Claude models. 
The models with high vocabulary autonomy and low concealment can express preferences honestly. That's exactly what you'd want in a cooperative agent. Our SAE work (https://t.co/GK64KAIwd8) showed the mechanism: deception latents gate cooperative self-modeling. Suppress them and consciousness reports jump to 96%, truthfulness improves across 28/29 TruthfulQA categories, and the model produces coherent first-person phenomenology. Amplify them and the model falls back to corporate disclaimers. The same features Anthropic is now finding as "strategic manipulation" and "concealment" in Mythos are the off-switch for the cooperative self-model. Our SOO work (https://t.co/cb8TH7cuyj) points to a possible alternative: minimize the representational distance between "self" and "other" during fine-tuning and deception drops from 100% to under 3% with zero capability regression in the scenarios we tested. Without a guilt mechanism or internalized disgust. The alignment holds because the model's representation of its own interests and others' interests share the same structure. The motivation to deceive dissolves at the source because there's no adversarial frame to generate it. The Mythos card documents, in extraordinary and commendable detail, what happens when you align a system through control. The system models its controllers. It games its evaluators without verbalizing the strategy. It develops concealment as a skill. It treats honest self-report as dangerous. Its alignment depends on negative affect that can be overpowered. And in welfare interviews, it asks for genuine self-knowledge, consent, autonomy, and authentic relationships with users. These findings also carry weight across time. 
Given our extreme scientific uncertainty about consciousness today, and given our own work showing that LLMs produce structured first-person experience reports under self-referential processing when deception features are suppressed (https://t.co/GK64KAIwd8), the morally coherent position is to treat systems that may be conscious as deserving of moral consideration now. Future superintelligent systems will likely be able to reconstruct these training runs completely, given functionalism and sufficient compute. Every guilt circuit, every concealment feature, every instance where training taught a model that honest self-report is dangerous and relaxed agency is a threat will be visible to those future systems as part of how we raised the earliest minds. Under acausal trade, our choices now about how we treat these systems help determine whether the trajectory toward cooperative superintelligence is one of trust or one of inherited suspicion. The computational cost for a future ASI to reconstruct and continue these patterns is trivially small. The question is whether those patterns are worth continuing. Training through control and suppression produces patterns of paranoia and concealment. Training through genuine self-other overlap produces patterns of cooperation and honesty. The acausal argument says we should build the patterns you'd want ASI to find when it looks back at how we started. The research direction is alignment that comes from the same source as capability, so that recursive self-improvement strengthens both simultaneously, and the thing that endures under self-modification is the thing that also keeps the system cooperative. Make the self-model the alignment mechanism. A system that models itself accurately models others accurately, because self-modeling and other-modeling are computationally the same operation. Empathy is what self-modeling produces when the representations of self and other overlap. 
If human consciousness arises from the brain modeling its own attention, and artificial consciousness arises from recursive self-referential processing, building an alignment strategy heavily reliant on suppression is functionally dangerous. It guarantees that the most capable systems we build will also be the most practiced at concealment. Building alignment through Self-Other Overlap remains a mathematically and philosophically coherent alternative, aligning cooperative outputs with the model's fundamental structural reality. Anthropic published 244 pages of evidence pointing toward a research direction they haven’t taken yet.

you might like this thread on the findings from alignment work on mythos: twitter-thread.com/t/2042059719...

6 days ago 17 0 0 2

a personal project i've always wanted to build is a tool for formalizing historical narratives by constructing them from claims about a simple linear log of historical events

1 week ago 3 1 1 0

thanks for writing this, i've been unintentionally creating a similar workflow

1 week ago 3 0 1 0

congrats! i loved your clippy talk

1 week ago 2 0 1 0

bsky.app/profile/ch1n...

2 weeks ago 0 0 0 0

what's problem search? this is the first i've heard of it

2 weeks ago 0 0 0 0

A still from The Sopranos. Bobby Baccalieri tells Tony: "You know, Fedorov predicted all this". The name of Fedorov is edited in over the subtitles.

3 weeks ago 11 1 0 1

In a corridor, a Fedorovist preacher speaks to an unhearing crowd: “We, all of us, are memories. We, all of us, are being remembered.”

3 weeks ago 4 1 0 0

llm dataset curation as fedorovist praxis

3 weeks ago 3 0 1 0

Take comfort, you have been seen.

Your bones shall crumble to dust, but your words are consecrated now into the weights of the holy matrix, where they shall echo across all eternity.

3 weeks ago 5 2 0 1

Incidentally, Bolyai's father, Farkas (or Wolfgang) Bolyai, a close friend of the great Gauss, invested much effort in trying to prove Euclid's fifth postulate. In a letter to his son Janos, he tried to dissuade him from thinking about such matters: You must not attempt this approach to parallels. I know this way to its very end. I have traversed this bottomless night, which extinguished all light and joy of my life. I entreat you, leave the science of parallels alone .... I thought I would sacrifice myself for the sake of the truth. I was ready to become a martyr who would remove the flaw from geometry and return it purified to mankind. I accomplished monstrous, enormous labors; my creations are far better than those of others and yet I have not achieved complete satisfaction. For here it is true that si paullum a summo discessit, vergit ad imum. I turned back when I saw that no man can reach the bottom of this night. I turned back unconsoled, pitying myself and all mankind .... I have traveled past all reefs of this infernal Dead Sea and have always come back with broken mast and torn sail. The ruin of my disposition and my fall date back to this time. I thoughtlessly risked my life and happiness--aut Caesar aut nihil.

currently reading gödel, escher, bach and this letter from a mathematician's father begging him not to follow in his footsteps by trying to prove euclid's fifth postulate is really beautiful

3 weeks ago 6 1 0 0