"Brain-only participants exhibited the strongest, most distributed networks; Search Engine users showed moderate engagement; and LLM users displayed the weakest connectivity."
"Over four months, LLM users [...] underperformed at neural, linguistic, and behavioral levels."
arxiv.org/abs/2506.08872
Posts by Zach Ip
An actual researcher in the space finally gave me the context I'd been missing: they regularly have to review conference submissions that are about this quality. What I'd intended as obvious satire was, apparently, indistinguishable from what many people are aiming to pass off as legit. That's when I realised I'd messed up.
Claude’s rebuttal to Apple’s recent paper went viral
A non-researcher submitted a joke paper to arXiv with Claude listed as the main author
it pointed out real, legitimate problems with the Apple paper (including that one of its problems was impossible to solve), and went viral
open.substack.com/pub/lawsen/p...
Congratulations Amy! What an eventful year for your lab! 🍾🎉
A feature implementation example for integrating "now" post types into a main index page. Includes a last updated date, a feature overview, a current phase status, and an overall progress checklist.
A screenshot of Cursor implementing "Phase 5" of this plan while using the markdown document as its context.
Fave Cursor workflow at the moment is get Claude to write feature implementation plans into a markdown document and update it as we go.
Breaks features down into phases with checklists, notes, relevant file lists. Essentially acts as read/write memory to prevent chat context from getting too long.
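A minimal sketch of what such a plan document might look like (the feature, phases, and file names here are hypothetical, just to show the shape):

```markdown
# Feature: "Now" posts on the index page

Last updated: 2025-06-01

## Overview
Surface the latest "now" post type on the main index page.

## Current phase
Phase 2: rendering

## Progress
- [x] Phase 1: data model -- add `now` post type (`models/post.ts`)
- [ ] Phase 2: rendering -- index page card component
- [ ] Phase 3: styling + last-updated badge

## Notes
- Index page lives in `pages/index.tsx`; keep changes additive.
```

The agent checks items off and appends notes as phases complete, so a fresh chat can pick up from the document instead of a long conversation history.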
I’d have loved to be there for that presentation!! Excited to read more of her work!
Table of contents from the linked post:
- Introducing Claude
- Establishing the model’s personality
- Model safety
- More points on style
- Be cognizant of red flags
- Is the knowledge cutoff date January or March?
- election_info
- Don’t be a sycophant!
- Differences between Opus 4 and Sonnet 4
- The missing prompts for tools
- Thinking blocks
- Search instructions
- Seriously, don’t regurgitate copyrighted content
- More on search, and research queries
- Artifacts: the missing manual
- Styles
- This is all really great documentation
I put together an annotated version of the new Claude 4 system prompt, covering both the prompt Anthropic published and the missing, leaked sections that describe its various tools
It's basically the secret missing manual for Claude 4, it's fascinating!
simonwillison.net/2025/May/25/...
I got access to Gemini Diffusion, Google's first diffusion LLM, and the thing is absurdly fast - it ran at 857 tokens/second and built me a prototype chat interface in just a couple of seconds, video here: simonwillison.net/2025/May/21/...
We've seen nothing yet! We hosted a 9-13 yo vibe-coding event with @robertkeus.bsky.social this weekend (h/t
@antonosika.bsky.social and Lovable)
takeaway? AI is unleashing a generation of wildly creative builders beyond anything I'd have imagined
and they grow up knowing they can build anything!
True, and I find that having it reflect my thoughts back to me helps speed up the process of making up my mind
the last few weeks i’ve spent A LOT of time with o3. to the point where i keep trying to run multiple concurrent queries in the mobile app (doesn’t work btw)
deep dive into the web at your fingertips. hours of research in a couple minutes
ByteDance Open-Sources DeerFlow: A Modular Multi-Agent Framework for Deep Research Automation #DL #AI #ML #DeepLearning #ArtificialIntelligence #MachineLearning #ComputerVision #LLM #VLM #LVLM
www.marktechpost.com/2025/05/09/b...
OpenMemory MCP, a private memory for MCP-compatible clients powered by mem0
OpenMemory MCP runs 100% locally and provides a persistent, portable memory layer for all your AI tools. It enables agents and assistants to read from and write to a shared memory, securely and privately.
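A generic sketch of the kind of local, persistent, shared memory layer described above (illustrative only; this is NOT OpenMemory MCP's actual API, just a toy key-value store showing the read/write-across-tools idea):

```python
import sqlite3

class LocalMemory:
    """Toy local memory layer; a file path persists across sessions."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)"
        )

    def write(self, key, value):
        # Any agent/assistant can record something...
        self.db.execute(
            "INSERT OR REPLACE INTO memory (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.db.commit()

    def read(self, key):
        # ...and any other tool can retrieve it later.
        row = self.db.execute(
            "SELECT value FROM memory WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

mem = LocalMemory()
mem.write("user.editor", "vim")          # tool A stores a preference
assert mem.read("user.editor") == "vim"  # tool B reads it back
```

Because everything stays in a local database, nothing leaves the machine, which is the privacy property the post is highlighting.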
A Venn diagram with three circles: one for LLMs, one for Regexps, and one for teenagers. The intersection for LLMs and teenagers contains the label “confidently wrong.” The intersection for LLMs and Regexps contains the label “seems to work”. The intersection for Regexps and teenagers contains the label “inscrutable language.” The intersection for all three contains the label “trouble with braces”.
too cynical?
THEMAS does have a ring to it
I'm hosting a Community Pokemon popularity contest: pokemon-popularity-contest.streamlit.app
Make sure your objectively right opinions on Pokemon designs are heard! #Pokemon #Voting
For a long time, the biggest problem in machine learning has been improving and understanding robustness and generalization to out-of-distribution (OOD) data.
We are just increasingly making more & more problems in-distribution, but the models still don't generalize out of the box to the tail of problems.
A weird thing about LLMs is that they just happen to do many things but almost all uses are undocumented.
For example, GPT-4o is very good at helping farmers identify swine diseases.
There is a lot of value in experts exploring & benchmarking how good LLMs are at various tasks to find use cases.
Fantastic work by MacDowell et al.! Intriguing parallels between how neural geometry routes information through multiplexed subspaces and how DNNs with multi-head attention develop multiplexed internal representational manifolds #neuroscience #NeuroAI #AI
So fascinating to see the massive fallout from seemingly innocuous prompting. The issue of alignment, interpretation, and interpretability continues to be a massive challenge
“If you cannot measure it, you cannot improve it.” I think more subjective benchmarks like this are super important, not just for model performance, but for understanding our own blind spots when interacting with LLMs
How does Goodhart's Law, "When a measure becomes a target, it ceases to be a good measure," apply to LLMs?
LLM providers are incentivized to optimize for benchmark scores—even if that means fine-tuning models in ways that improve test results but degrade real-world performance.
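A toy numerical illustration of that incentive (all numbers made up; this is a thought experiment about proxy metrics, not a claim about any real model or benchmark):

```python
# The full task distribution users care about vs. the narrow slice a
# public benchmark actually measures.
tasks = list(range(100))
benchmark = tasks[:10]

def true_score(model):   # average skill over everything
    return sum(model[t] for t in tasks) / len(tasks)

def bench_score(model):  # what the leaderboard sees
    return sum(model[t] for t in benchmark) / len(benchmark)

base = {t: 0.5 for t in tasks}       # uniformly mediocre model
gamed = dict(base)
for t in benchmark:
    gamed[t] = 0.95                  # gains concentrated on measured tasks...
for t in tasks[10:]:
    gamed[t] = 0.44                  # ...paid for by slight regressions elsewhere

assert bench_score(gamed) > bench_score(base)  # leaderboard goes up
assert true_score(gamed) < true_score(base)    # real-world skill goes down
```

Once the measure becomes the target, the benchmark and real-world performance can move in opposite directions, which is Goodhart's Law in one screenful.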
We packaged everything in the gcPCA toolbox, an open-source package with multiple solutions for different needs:
📂 github.com/SjulsonLab/generalized_contrastive_PCA
- Asymmetric or symmetric, orthogonal or non-orthogonal, and sparse solutions
👉 Check out Table 1 in the paper for details!
9/
Does your research involve comparing experimental conditions? Then our latest publication is for you: We developed generalized contrastive PCA (gcPCA), a tool for comparing high-dimensional datasets. 🧠📊 doi.org/10.1371/journal.pcbi.1012747
This tool was born out of necessity; here is the story. 🧵
1/
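For intuition, here is a from-scratch sketch of the basic contrastive-PCA idea that gcPCA generalizes (this is NOT the gcPCA package API; see the toolbox and the paper's Table 1 for the actual variants). It finds directions with high variance in condition A but low variance in condition B:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic conditions: A has extra variance along dim 0, B along dim 1.
A = rng.normal(size=(500, 5)); A[:, 0] *= 3.0
B = rng.normal(size=(500, 5)); B[:, 1] *= 3.0

Ca = np.cov(A, rowvar=False)
Cb = np.cov(B, rowvar=False)

alpha = 1.0                                      # contrast strength
evals, evecs = np.linalg.eigh(Ca - alpha * Cb)   # symmetric -> real spectrum
top = evecs[:, np.argmax(evals)]                 # leading contrastive direction

print(np.argmax(np.abs(top)))                    # dim where A is enriched
```

Plain PCA on A alone would also find shared high-variance structure; subtracting B's covariance is what isolates what is *different* between conditions, which is the core idea behind comparing experimental conditions this way.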
This is simultaneously the most horrifying and impressive thing I’ve seen in a long time 🤯
Very eye opening reading the range of replies here. Responses feel very “high conflict” coded where there is no room for nuance (surprise surprise). I think it really highlights the need for better education about how AI is trained and what is happening under the hood
Text Shot: Built on a custom RL framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), the system explores how LLMs can learn through experience rather than memorization. The focus is on entire decision-making trajectories, not just one-step responses. StarPO operates in two interleaved phases: a rollout stage where the LLM generates complete interaction sequences guided by reasoning, and an update stage where the model is optimized using normalized cumulative rewards. This structure supports a more stable and interpretable learning loop compared to standard policy optimization approaches.
Former DeepSeeker and collaborators release new method for training reliable AI agents: RAGEN venturebeat.com/ai/former-deepseeker-and... #AI #agents
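A structural sketch of the two interleaved phases described in the quote, with a toy environment and a random stand-in policy (illustrative only; not the RAGEN/StarPO implementation):

```python
import random
random.seed(0)

class ToyEnv:
    """5-step game: reward 1 when action matches step parity."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        return self.t, reward, self.t >= 5

def policy(state):
    # Stand-in for the LLM's reasoning-guided action choice.
    return random.choice([0, 1])

# Phase 1: rollout -- generate COMPLETE trajectories, not one-step responses.
def rollout(env, n=16):
    returns = []
    for _ in range(n):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, r, done = env.step(policy(state))
            total += r
        returns.append(total)
    return returns

# Phase 2: update -- normalize cumulative rewards before optimizing.
def normalize(returns):
    mu = sum(returns) / len(returns)
    sd = (sum((r - mu) ** 2 for r in returns) / len(returns)) ** 0.5 or 1.0
    return [(r - mu) / sd for r in returns]

adv = normalize(rollout(ToyEnv()))
assert abs(sum(adv)) < 1e-9   # normalized advantages are zero-mean
```

The two functions would be interleaved in a training loop, with the normalized returns feeding a policy-gradient update; the whole-trajectory focus is what distinguishes this setup from single-turn RLHF-style optimization.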
I can’t stop drawing parallels between AI agents and the early days of computers like RAM→Context Window, CPU→Weights, etc. Would love to see how far we can take this analogy, and where it breaks down!
See the full post on LinkedIn:
shorturl.at/bfhWv
Absolutely, cochlear implants are BCIs! Directly stimulating the nervous system AND achieving near feature parity with the sense it's trying to replace? It's a slept-on GOAT imo
Really impressive results by Zep (github.com/getzep/graph...) for agent memory management!
Benchmarks are one thing, but I can't wait to try this out in vivo. Would love to hear how other people are finding it!