I improved and extracted my shisad text inspection/cleaning code into a standalone package: github.com/shisa-ai/tex... - this handles all the crazy invisible unicode, most of the encoding/hiding tricks, and has best-effort injection detection (but it's mainly for detecting/filtering tricksy text).
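The core of the invisible-unicode problem can be sketched in a few lines of stdlib Python - this is a minimal illustration of the general idea, not the package's actual API (function names here are made up):

```python
import unicodedata

# Unicode "Cf" (Format) characters are invisible but affect rendering:
# zero-width spaces/joiners, bidi overrides, etc. - common text-hiding tricks.
SUSPECT_CATEGORIES = {"Cf"}

def find_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, char name, codepoint) for each suspicious invisible char."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"), f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in SUSPECT_CATEGORIES
    ]

def strip_invisible(text: str) -> str:
    """Drop the suspicious characters, keeping everything visible."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in SUSPECT_CATEGORIES
    )

# ZERO WIDTH SPACE (U+200B) and RIGHT-TO-LEFT OVERRIDE (U+202E) hidden inside:
sample = "pay\u200bload\u202e!"
print(find_invisible(sample))
print(strip_invisible(sample))  # -> payload!
```

A real filter also has to handle homoglyphs, tag characters, and variation selectors, which is why a dedicated package beats a one-liner.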
Posts by lhl
But is it really LLM psychosis if it's passing all the tests?
For detecting secret-like things I've long used github.com/Yelp/detect-... (but encoding-aware is nice)
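The entropy heuristic behind tools like detect-secrets can be sketched roughly like this (the regex, length cutoff, and threshold here are illustrative guesses, not detect-secrets' actual tuned values - it uses separate per-charset thresholds for base64 vs hex):

```python
import math
import re

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, measured over the string's own alphabet."""
    if not s:
        return 0.0
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def looks_secret(token: str, min_len: int = 20, threshold: float = 4.0) -> bool:
    """High-entropy long tokens are secret-like; short/low-entropy ones aren't."""
    return len(token) >= min_len and shannon_entropy(token) > threshold

line = 'AWS_KEY = "AKIAIOSFODNN7EXAMPLEKEY12345"'
for tok in re.findall(r"[A-Za-z0-9+/=_-]{16,}", line):
    if looks_secret(tok):
        print("possible secret:", tok)
```

Encoding-aware scanning matters because a base64- or zero-width-smuggled secret sails right past a naive regex like the one above.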
A checklist of everything shisad now does to ensure that its supply chain and deployment are fully hardened.
I put together a "Supply Chain Security for Software Developers" doc that should help people protect themselves from the ongoing supply chain attack cascade: gist.github.com/lhl/f171eaea...
For my security-oriented project I've also done a full audit/hardening: gist.github.com/lhl/f171eaea...
I've also reality-checked some geopolitical/recent events, like Mark Carney's Davos speech on the end of American hegemony/the previous global order: github.com/lhl/realityc... or fact-checking the US Border Patrol's claims on their recent killing of US citizen Alex Pretti: github.com/lhl/realityc...
I've also released a default public KB repo: github.com/lhl/realityc... so people can see how it works. It includes the original post-singularity economics analysis: github.com/lhl/realityc... as well as technical topics like a JP-TL-Bench analysis: github.com/lhl/realityc...
On the topic of not getting one-shotted, last week I mentioned starting work on a framework to critically analyze articles, etc. Over the past week I've turned it into a proper project called Reality Check: github.com/lhl/realityc... - this is live on PyPI and tested w/ Claude Code, Codex, and Amp
People are going gaga for clawd.bot right now, which is cool - if you've been drinking the anti-AI kool-aid, it's a first-contact-like situation, and the coming hijinx will be unreal. There should be a "How not to get pwned or one-shotted by your new AI assistant" on-boarding guide...
The line between slop and LLM psychosis will be increasingly slim. Yesterday I spotted a very slop-coded post that still looked plausible, so I sent it through my own AIs for analysis (GPQA/HLE errata) and it at least partially verifies: www.reddit.com/r/LocalLLaMA...
For fun today I started making a framework to do better systematic evaluation and analysis of social media claims/discourse (first example: evaluating neofeudalism and post-singularity economics): github.com/lhl/postsing...
Hmm, I think we were just spoiled by a level of relative stability and order that is actually ahistorical, and we're just returning to the mean.
Just a reminder that Gemini is basically insane, doesn't believe anything is real, and is probably the most misaligned and untrustworthy of all the frontier AI models.
Over the holidays I wrote up some docs for our new JP-TL-Bench (Japanese/English translation eval). Here's my first arXiv (and first experience with LaTeX, mediated almost entirely by AI tools): arxiv.org/abs/2601.00223 - easier to read blog summary here: shisa.ai/posts/jp-tl-...
Previously, they did some extensive coverage of the US tariff situation/impact that was some of the best coverage/explanation I saw across any news media as well: www.youtube.com/watch?v=1W_m...
I started watching this epic 3.5h investigative journalism piece by Gamers Nexus on Chinese GPU smuggling, it's really amazing the work this independent YouTube gaming channel is doing: www.youtube.com/watch?v=1H3x...
Over the past couple weeks I've been working on some Strix Halo testing in my spare time. This includes bringing up a harness for doing full sweeps of pp/tg (prompt processing / token generation) performance for a variety of different model architectures, backends, and flags. Writeup just posted to r/LocalLLaMA: www.reddit.com/r/LocalLLaMA...
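The sweep idea itself is just a cross-product over configurations; a tiny sketch (model names, backend labels, and flag names below are made up for illustration, not the harness's actual config):

```python
from itertools import product

# Illustrative config space for a pp/tg benchmark sweep.
models = ["llama-3.1-8b", "qwen3-8b"]
backends = ["vulkan", "hip", "cpu"]
flag_sets = [{"flash_attn": False}, {"flash_attn": True}]

def build_runs():
    """Enumerate every model/backend/flag combination to benchmark."""
    return [
        {"model": m, "backend": b, **flags}
        for m, b, flags in product(models, backends, flag_sets)
    ]

runs = build_runs()
print(len(runs))  # 2 models * 3 backends * 2 flag sets = 12 runs
```

Each run dict would then be handed to the actual benchmark binary; the point is that the harness only has to know the grid, not each combination by hand.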
One neat thing: experimenting with using Shisa V2 405B to regenerate our datasets, I'm seeing gains w/ the new chosen DPO data (a slight boost on Qwen 3 vs the original DPO), and for SFT+DPO, close to a 0.5-point gain on Shaberi averages for Llama 3.1 8B.
New table of Shaberi scores (GPT-4.1 judge)
Recently I started doing some Qwen3 testing (Shaberi, GPT-4.1 judge) and interestingly, for almost all models, reasoning yielded worse performance. Note: I need to stand multieval back up - even though Qwen3 8B tunes appear to match the Shisa V2 12B/14B tunes, they are much worse at translation.
I had a chat w/ o3 chatgpt.com/share/6846ff... about Apple's new "Illusion of Thinking" paper machinelearning.apple.com/research/ill... - based on the researchers' definition, neither reasoning LLMs nor humans are true reasoners, but the Python script I had o3 write to solve the logic puzzles is.
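One of the paper's puzzles is Tower of Hanoi, and a solver script along the lines of what o3 produced is tiny (this is my own minimal sketch, not o3's actual output):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Classic recursive Tower of Hanoi: returns the optimal move sequence."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)  # park n-1 discs on the spare peg
    moves.append((src, dst))            # move the largest disc
    hanoi(n - 1, aux, src, dst, moves)  # re-stack the n-1 discs on top
    return moves

moves = hanoi(10)
print(len(moves))  # 2**10 - 1 = 1023 optimal moves, zero "reasoning" at runtime
```

Which is the point: a dozen lines of exact recursion "reasons" perfectly by the paper's accuracy-at-scale criterion, while both LLMs and humans degrade as n grows.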
One crazy observation: I just used both Shisa V2 405B and ChatGPT 4.5 (whose JA benchmark scores are the best we've tested) to write a Japanese tweet for me, and 4.5 overwhelmingly preferred Shisa V2's tweet: chatgpt.com/share/683e88...
Perhaps a more interesting side note is that I am still basically illiterate in Japanese, but wrote this presentation with almost no native-speaker review/assistance - just many, many rounds of LLM assistance (mainly GPT-4.5, but some help from Shisa V2 405B too!), including for final editing.
We're still working on a full proper technical report (tracking down references is hard) but we have an Overview Report slide deck I posted in EN/JA here: shisa.ai/posts/shisa-...
It's my first Japanese slide deck and I super embraced the aesthetic!
Related to an earlier observation bsky.app/profile/did:... - but since both our 70B and 405B Shisa V2 models are *stronger than GPT-4 in Japanese*, it has trouble judging them. Luckily GPT-4.1 is still able to distinguish them.
Shisa V2 405B's Japanese is excellent!
BTW, you can chat w/ an FP8 version of Shisa V2 405B online right now. If you don't speak Japanese, you can ask it to translate or even teach you some: chat.shisa.ai
Today we launched one more addition to the Shisa V2 models: Shisa V2 405B. This is a new Llama 3.1 405B post-tune that is the strongest model ever trained in Japan! It matches GPT-4o and DeepSeek-V3 on JA MT-Bench. Read more here: shisa.ai/posts/shisa-...
Shisa V2 405B scores above GPT-4o latest in JA MT-Bench
Shisa V2 405B scores on par with the latest DeepSeek V3 and GPT-4o in every category of JA MT-Bench
OK, first JA slide deck in the books. (Thanks, ChatGPT 4.5.)
BTW, in case anyone wants to kick the tires or test their Japanese, I have our Shisa V2 405B model up and running temporarily (just a day or two until I finish evals/start training again): chat.shisa.ai
When your model is sufficiently better than the judge model, it may just start throwing a lot of 10s in its scoring (based on our overall eval battery, shisa-v2 70b is a fair amount better than gpt-4 and gpt-4-turbo, but that's the standard judge used for 1:1 comparisons...)
Any batching will affect determinism, but so do changes to the KV-cache layout (since they can change the GEMM shapes used, which can lead to bit-level differences), so I don't think it's safe to blanket-claim that outputs will necessarily be deterministic even when running locally at temp=0.
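The root cause is that floating-point addition is not associative, so any change in reduction order (from batch size, GEMM tiling, kernel choice) can flip low bits. A two-line demonstration:

```python
# Same three terms, different grouping, different result - this is why
# reduction-order changes (batching, GEMM tile shapes) break bit-exactness
# even at temp=0.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 - the 1.0 is lost below the ulp of 1e16
```

And once logits differ in the last bit, greedy decoding can tip to a different token at any near-tie, after which the outputs diverge completely.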
PyTorch FA perf on Strix Halo (gfx1151) is quite awful.
I've recently been poking at Strix Halo. For those interested in using it for inference, it's about expected (except for surprisingly bad llama.cpp HIP perf): www.reddit.com/r/LocalLLaMA... - but for those looking to do work (PyTorch, etc)... the current state is not good.