I improved and extracted my shisad text inspection/cleaning code into a standalone package: github.com/shisa-ai/tex... - this handles all the crazy invisible unicode, most of the encoding/hiding tricks, and has best-effort injection detection (but it's mainly for detecting/filtering tricksy text).
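The core of the invisible-unicode problem can be sketched in a few lines of stdlib Python - this is a minimal illustration of the general idea, not the package's actual API (function names here are made up):

```python
import unicodedata

# Unicode "Cf" (Format) characters are invisible but affect rendering:
# zero-width spaces/joiners, bidi overrides, etc. - common text-hiding tricks.
SUSPECT_CATEGORIES = {"Cf"}

def find_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, char name, codepoint) for each suspicious invisible char."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"), f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in SUSPECT_CATEGORIES
    ]

def strip_invisible(text: str) -> str:
    """Drop the suspicious characters, keeping everything visible."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in SUSPECT_CATEGORIES
    )

# ZERO WIDTH SPACE (U+200B) and RIGHT-TO-LEFT OVERRIDE (U+202E) hidden inside:
sample = "pay\u200bload\u202e!"
print(find_invisible(sample))
print(strip_invisible(sample))  # -> payload!
```

A real filter also has to handle homoglyphs, tag characters, and variation selectors, which is why a dedicated package beats a one-liner.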
Posts by lhl
But is it really LLM psychosis if it's passing all the tests?
For detecting secret-like things I've long used github.com/Yelp/detect-... (but encoding-aware is nice)
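The entropy heuristic behind tools like detect-secrets can be sketched roughly like this (the regex, length cutoff, and threshold here are illustrative guesses, not detect-secrets' actual tuned values - it uses separate per-charset thresholds for base64 vs hex):

```python
import math
import re

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, measured over the string's own alphabet."""
    if not s:
        return 0.0
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def looks_secret(token: str, min_len: int = 20, threshold: float = 4.0) -> bool:
    """High-entropy long tokens are secret-like; short/low-entropy ones aren't."""
    return len(token) >= min_len and shannon_entropy(token) > threshold

line = 'AWS_KEY = "AKIAIOSFODNN7EXAMPLEKEY12345"'
for tok in re.findall(r"[A-Za-z0-9+/=_-]{16,}", line):
    if looks_secret(tok):
        print("possible secret:", tok)
```

Encoding-aware scanning matters because a base64- or zero-width-smuggled secret sails right past a naive regex like the one above.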
A checklist of everything shisad now does to ensure that its supply chain and deployment are fully hardened.
I put together a "Supply Chain Security for Software Developers" doc that should help people protect themselves from the ongoing supply chain attack cascade: gist.github.com/lhl/f171eaea...
For my security-oriented project I've also done a full audit/hardening: gist.github.com/lhl/f171eaea...
I've also reality-checked some geopolitical/recent events, like Mark Carney's Davos speech on the end of American hegemony/the previous global order: github.com/lhl/realityc... or fact-checking the US Border Patrol's claims on their recent killing of US citizen Alex Pretti: github.com/lhl/realityc...
I've also released a default public KB repo: github.com/lhl/realityc... so people can see how it works. It includes the original post-singularity economics analysis: github.com/lhl/realityc... as well as technical topics like a JP-TL-Bench analysis: github.com/lhl/realityc...
On the topic of not getting one-shotted, last week I mentioned starting work on a framework to critically analyze articles, etc. Over the past week I've turned it into a proper project called Reality Check: github.com/lhl/realityc... - this is live on PyPI and tested w/ Claude Code, Codex, and Amp
People are going gaga for clawd.bot right now, which is cool - if you've been drinking the anti-AI kool-aid, it's a first-contact-like situation, and the coming hijinx will be unreal. There should be a "How not to get pwned or one-shotted by your new AI assistant" on-boarding guide...
The line between slop and LLM psychosis will be increasingly slim. Yesterday I spotted a very slop-coded post that still looked plausible, so I sent it through my own AIs for analysis (GPQA/HLE errata) and it at least partially verifies: www.reddit.com/r/LocalLLaMA...
For fun today I started making a framework to do better systematic evaluation and analysis of social media claims/discourse (first example: evaluating neofeudalism and post-singularity economics): github.com/lhl/postsing...
Hmm, I think we were just spoiled by a level of relative stability and order that is actually ahistorical, and we're just returning to the mean.
Just a reminder that Gemini is basically insane, doesn't believe anything is real, and is probably the most misaligned and untrustworthy of all the frontier AI models.
Over the holidays I wrote up some docs for our new JP-TL-Bench (Japanese/English translation eval). Here's my first arXiv (and first experience with LaTeX, mediated almost entirely by AI tools): arxiv.org/abs/2601.00223 - easier to read blog summary here: shisa.ai/posts/jp-tl-...
Previously, they did some extensive coverage of the US tariff situation/impact that was some of the best coverage/explanation I saw across any news media as well: www.youtube.com/watch?v=1W_m...
I started watching this epic 3.5h investigative journalism piece by Gamers Nexus on Chinese GPU smuggling, it's really amazing the work this independent YouTube gaming channel is doing: www.youtube.com/watch?v=1H3x...
Over the past couple weeks I've been working on some Strix Halo testing in my spare time. This includes bringing up a harness for doing full sweeps of pp/tg (prompt processing / token generation) performance for a variety of different model architectures, backends, and flags. Writeup just posted to r/LocalLLaMA: www.reddit.com/r/LocalLLaMA...
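The sweep idea itself is just a cross-product over configurations; a tiny sketch (model names, backend labels, and flag names below are made up for illustration, not the harness's actual config):

```python
from itertools import product

# Illustrative config space for a pp/tg benchmark sweep.
models = ["llama-3.1-8b", "qwen3-8b"]
backends = ["vulkan", "hip", "cpu"]
flag_sets = [{"flash_attn": False}, {"flash_attn": True}]

def build_runs():
    """Enumerate every model/backend/flag combination to benchmark."""
    return [
        {"model": m, "backend": b, **flags}
        for m, b, flags in product(models, backends, flag_sets)
    ]

runs = build_runs()
print(len(runs))  # 2 models * 3 backends * 2 flag sets = 12 runs
```

Each run dict would then be handed to the actual benchmark binary; the point is that the harness only has to know the grid, not each combination by hand.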
One neat thing: experimenting with using Shisa V2 405B to regenerate our datasets, I'm seeing gains w/ the new chosen DPO data (a slight boost on Qwen 3 vs the original DPO), and for SFT+DPO, close to a 0.5-point gain on Shaberi averages for Llama 3.1 8B.
New table of Shaberi scores (GPT-4.1 judge)
Recently I started doing some Qwen3 testing (Shaberi, GPT-4.1 judge) and interestingly, for almost all models, reasoning yielded worse performance. Note: I need to stand multieval back up - even though Qwen3 8B tunes appear to match the Shisa V2 12B/14B tunes, they are much worse at translation.
I had a chat w/ o3 chatgpt.com/share/6846ff... about Apple's new "Illusion of Thinking" paper machinelearning.apple.com/research/ill... - based on the researchers' definition, neither reasoning LLMs nor humans are true reasoners, but the Python script I had o3 write to solve the logic puzzles is.
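One of the paper's puzzles is Tower of Hanoi, and a solver script along the lines of what o3 produced is tiny (this is my own minimal sketch, not o3's actual output):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Classic recursive Tower of Hanoi: returns the optimal move sequence."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)  # park n-1 discs on the spare peg
    moves.append((src, dst))            # move the largest disc
    hanoi(n - 1, aux, src, dst, moves)  # re-stack the n-1 discs on top
    return moves

moves = hanoi(10)
print(len(moves))  # 2**10 - 1 = 1023 optimal moves, zero "reasoning" at runtime
```

Which is the point: a dozen lines of exact recursion "reasons" perfectly by the paper's accuracy-at-scale criterion, while both LLMs and humans degrade as n grows.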
One crazy observation: I just used both Shisa V2 405B and ChatGPT 4.5 (whose JA benchmark scores are the best we've tested) to write a Japanese tweet for me, and 4.5 overwhelmingly preferred Shisa V2's tweet: chatgpt.com/share/683e88...
Perhaps a more interesting side note is that I am still basically illiterate in Japanese, but wrote this presentation with almost no native-speaker review/assistance - just many, many rounds of LLM assistance (mainly GPT-4.5, but some help from Shisa V2 405B too!), including for final editing.
We're still working on a full proper technical report (tracking down references is hard) but we have an Overview Report slide deck I posted in EN/JA here: shisa.ai/posts/shisa-...
It's my first Japanese slide deck and I super embraced the aesthetic!
Related to an earlier observation bsky.app/profile/did:... - but since both our 70B and 405B Shisa V2 models are *stronger than GPT-4 in Japanese*, it has trouble judging them. Luckily GPT-4.1 is still able to distinguish them.
Shisa V2 405B's Japanese is excellent!
BTW, you can chat w/ an FP8 version of Shisa V2 405B online right now. If you don't speak Japanese, you can ask it to translate or even teach you some: chat.shisa.ai
Today we launched one more addition to the Shisa V2 models: Shisa V2 405B. This is a new Llama 3.1 405B post-tune that is the strongest model ever trained in Japan! It matches GPT-4o and DeepSeek-V3 on JA MT-Bench. Read more here: shisa.ai/posts/shisa-...
Shisa V2 405B scores above GPT-4o latest in JA MT-Bench
Shisa V2 405B scores on par with the latest DeepSeek V3 and GPT-4o in every category of JA MT-Bench
OK, first JA slide deck in the books. (Thanks, ChatGPT 4.5.)
BTW, in case anyone wants to kick the tires or test their Japanese, I have our Shisa V2 405B model up and running temporarily (just a day or two until I finish evals/start training again): chat.shisa.ai
When your model is sufficiently better than the judge model, it may just start throwing a lot of 10s in its scoring (based on our overall eval battery, shisa-v2 70b is a fair amount better than gpt-4 and gpt-4-turbo, but that's the standard judge used for 1:1 comparisons...)
Any batching will affect determinism, but so do changes to the KV-cache layout (since they can change the GEMM shapes used, which can lead to bit-level differences), so I don't think it's safe to blanket-claim that outputs will necessarily be deterministic even when running locally at temp=0.
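The root cause is that floating-point addition is not associative, so any change in reduction order (from batch size, GEMM tiling, kernel choice) can flip low bits. A two-line demonstration:

```python
# Same three terms, different grouping, different result - this is why
# reduction-order changes (batching, GEMM tile shapes) break bit-exactness
# even at temp=0.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 - the 1.0 is lost below the ulp of 1e16
```

And once logits differ in the last bit, greedy decoding can tip to a different token at any near-tie, after which the outputs diverge completely.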
PyTorch FA perf on Strix Halo (gfx1151) is quite awful.
I've recently been poking at Strix Halo. For those interested in using it for inference, it's about expected (except for surprisingly bad llama.cpp HIP perf): www.reddit.com/r/LocalLLaMA... - but for those looking to do work (PyTorch, etc)... the current state is not good.