
Posts by Zach Ip

Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task This study explores the neural and behavioral consequences of LLM-assisted essay writing. Participants were divided into three groups: LLM, Search Engine, and Brain-only (no tools). Each completed thr...

"Brain-only participants exhibited the strongest, most distributed networks; Search Engine users showed moderate engagement; and LLM users displayed the weakest connectivity."

"Over four months, LLM users [...] underperformed at neural, linguistic, and behavioral levels."

arxiv.org/abs/2506.08872

10 months ago 40 9 5 6
An actual researcher in the space finally gave me the context I'd been missing: they regularly have to review conference submissions that are about this quality. What I'd intended as obvious satire was, apparently, indistinguishable from what many people are aiming to pass off as legit. That's when I realised I'd messed up.

Claude’s rebuttal to Apple’s recent paper went viral

A guy, non-researcher, submitted a joke paper to arXiv with Claude as the main author

it contained real legit problems with the Apple paper (one of the problems was impossible to solve), and went viral

open.substack.com/pub/lawsen/p...

10 months ago 33 6 2 1

Congratulations Amy! What an eventful year for your lab! 🍾🎉

10 months ago 1 0 0 0
A feature implementation example for integrating "now" post types into a main index page. Includes a last updated date, a feature overview, a current phase status, and an overall progress checklist.

A screenshot of Cursor implementing "Phase 5" of this plan while using the markdown document as its context.

Fave Cursor workflow at the moment is to get Claude to write feature implementation plans into a markdown document and update it as we go.

Breaks features down into phases with checklists, notes, relevant file lists. Essentially acts as read/write memory to prevent chat context from getting too long.
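
A rough sketch of what that "read/write memory" can look like in practice, assuming a plan file with `## ` phase headings and `- [ ]` / `- [x]` checklist items. The plan text, phase names, and both helper functions are made up for illustration; they are not part of Cursor or Claude.

```python
import re

# Hypothetical plan document; headings and task names are illustrative only.
PLAN = """\
# Feature: "now" posts on the index page

## Phase 1: Data model
- [x] Add "now" post type
- [x] Write migration

## Phase 2: Index page
- [x] Query "now" posts
- [ ] Render in main feed
"""

PHASE_RE = re.compile(r"^## (.+)$")
TASK_RE = re.compile(r"^- \[( |x)\] (.+)$")


def phase_progress(plan: str) -> dict[str, tuple[int, int]]:
    """Return {phase title: (tasks done, tasks total)} for each '## ' section."""
    progress: dict[str, tuple[int, int]] = {}
    current = None
    for line in plan.splitlines():
        if m := PHASE_RE.match(line):
            current = m.group(1)
            progress[current] = (0, 0)
        elif (m := TASK_RE.match(line)) and current is not None:
            done, total = progress[current]
            progress[current] = (done + (m.group(1) == "x"), total + 1)
    return progress


def check_off(plan: str, task: str) -> str:
    """Mark the first unchecked task with this exact text as done."""
    return plan.replace(f"- [ ] {task}", f"- [x] {task}", 1)


if __name__ == "__main__":
    updated = check_off(PLAN, "Render in main feed")
    for phase, (done, total) in phase_progress(updated).items():
        print(f"{phase}: {done}/{total}")
```

The point of keeping the state in the file rather than the chat is that a fresh session only needs the document, not the whole conversation history.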

10 months ago 105 9 4 1

I’d have loved to be there for that presentation!! Excited to read more of her work!

10 months ago 0 0 0 0
Introducing Claude
Establishing the model’s personality
Model safety
More points on style
Be cognizant of red flags
Is the knowledge cutoff date January or March?
election_info
Don’t be a sycophant!
Differences between Opus 4 and Sonnet 4
The missing prompts for tools
Thinking blocks
Search instructions
Seriously, don’t regurgitate copyrighted content
More on search, and research queries
Artifacts: the missing manual
Styles
This is all really great documentation

I put together an annotated version of the new Claude 4 system prompt, covering both the prompt Anthropic published and the missing, leaked sections that describe its various tools

It's basically the secret missing manual for Claude 4, it's fascinating!

simonwillison.net/2025/May/25/...

10 months ago 239 33 12 5
Gemini Diffusion Another of the announcements from Google I/O yesterday was Gemini Diffusion, Google's first LLM to use diffusion (similar to image models like Imagen and Stable Diffusion) in place of transformers. …

I got access to Gemini Diffusion, Google's first diffusion LLM, and the thing is absurdly fast - it ran at 857 tokens/second and built me a prototype chat interface in just a couple of seconds, video here: simonwillison.net/2025/May/21/...

11 months ago 135 18 3 2

We've seen nothing yet! We hosted a 9-13 yo vibe-coding event with @robertkeus.bsky.social this weekend (h/t @antonosika.bsky.social and Lovable)

takeaway? AI is unleashing a generation of wildly creative builders beyond anything I'd have imagined

and they grow up knowing they can build anything!

11 months ago 32 4 2 1

True, and I find it reflecting my thoughts to me helps speed up the process of making up my mind

11 months ago 0 0 0 0

the last few weeks i’ve spent A LOT of time with o3. to the point where i keep trying to run multiple concurrent queries in the mobile app (doesn’t work btw)

deep dive into the web at your fingertips. hours of research in a couple minutes

11 months ago 5 1 1 0
ByteDance Open-Sources DeerFlow: A Modular Multi-Agent Framework for Deep Research Automation #DL #AI #ML #DeepLearning #ArtificialIntelligence #MachineLearning #ComputerVision #LLM #VLM #LVLM
www.marktechpost.com/2025/05/09/b...

11 months ago 3 1 0 0

OpenMemory MCP, a private memory for MCP-compatible clients powered by mem0

OpenMemory MCP runs 100% locally and provides a persistent, portable memory layer for all your AI tools. It enables agents and assistants to read from and write to a shared memory, securely and privately.

11 months ago 17 7 1 0
A Venn diagram with three circles: one for LLMs, one for Regexps, and one for teenagers. The intersection for LLMs and teenagers contains the label “confidently wrong.” The intersection for LLMs and Regexps contains the label “seems to work”. The intersection for Regexps and teenagers contains the label “inscrutable language.” The intersection for all three contains the label “trouble with braces”.

too cynical?

11 months ago 334 84 9 2

THEMAS does have a ring to it

11 months ago 0 0 0 0
Pokémon Popularity Contest A Streamlit application that ranks Pokémon based on community preferences through head-to-head co...

I'm hosting a Community Pokemon popularity contest: pokemon-popularity-contest.streamlit.app
Make sure your objectively right opinions on Pokemon designs are heard! #Pokemon #Voting

11 months ago 1 0 0 0

For a long time, the biggest problem in machine learning has been improving and understanding robustness and generalization to OOD.

We are just increasingly making more & more problems in-distribution but the models still don't generalize out-of-the-box to the tail of problems.

11 months ago 18 2 0 0

A weird thing about LLMs is that they just happen to do many things but almost all uses are undocumented.

For example, GPT-4o is very good at helping farmers identify swine diseases.

There is a lot of value in experts exploring & benchmarking how good LLMs are at various tasks to find use cases.

11 months ago 100 9 5 2

Fantastic work by MacDowell et al.! Intriguing parallels between how neural geometry routes information through multiplexed subspaces and how DNNs and multi-attention heads develop multiplexed internal representational manifolds #neuroscience #NeuroAI #AI

11 months ago 3 0 0 0

So fascinating to see the massive fallout from seemingly innocuous prompting. The issue of alignment, interpretation, and interpretability continues to be a massive challenge

11 months ago 0 0 0 0

“If you can not measure it, you can not improve it.” I think more subjective benchmarks like this are super important, not just for model performance, but for understanding our own blind spots when interacting with LLMs

11 months ago 1 0 0 0
Qwen/Qwen3-0.6B-FP8 · Hugging Face We’re on a journey to advance and democratize artificial intelligence through open source and open science.

it’s here! a real Qwen3 model

huggingface.co/Qwen/Qwen3-0...

11 months ago 36 7 3 1
How to A/B Test AI: A Practical Guide Learn how to A/B test AI models to improve performance, enhance user experience, and reduce costs using real-world data and best practices.

How does Goodhart's Law, "When a measure becomes a target, it ceases to be a good measure," apply to LLMs?

LLM providers are incentivized to optimize for benchmark scores—even if that means fine-tuning models in ways that improve test results but degrade real-world performance.

11 months ago 4 1 2 0

We packaged everything in the gcPCA toolbox, an open-source package with multiple solutions for different needs:
📂 github.com/SjulsonLab/generalized_contrastive_PCA
- Asymmetric or symmetric, Orthogonal or non-orthogonal, and sparse solutions
👉 Check out Table 1 in the paper for details!
9/

1 year ago 9 1 1 0

Does your research involve comparing experimental conditions? Then our latest publication is for you: We developed generalized contrastive PCA (gcPCA), a tool for comparing high-dimensional datasets. 🧠📊 doi.org/10.1371/journal.pcbi.1012747
This tool was born out of necessity, here is the story. 🧵
1/

1 year ago 91 30 4 2

This is simultaneously the most horrifying and impressive thing I’ve seen in a long time 🤯

11 months ago 1 0 0 0

Very eye opening reading the range of replies here. Responses feel very “high conflict” coded where there is no room for nuance (surprise surprise). I think it really highlights the need for better education about how AI is trained and what is happening under the hood

11 months ago 0 0 0 0
Text Shot: Built on a custom RL framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), the system explores how LLMs can learn through experience rather than memorization. The focus is on entire decision-making trajectories, not just one-step responses.

StarPO operates in two interleaved phases: a rollout stage where the LLM generates complete interaction sequences guided by reasoning, and an update stage where the model is optimized using normalized cumulative rewards. This structure supports a more stable and interpretable learning loop compared to standard policy optimization approaches.
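
To make "normalized cumulative rewards" concrete, here is a minimal sketch of one common reading: sum the rewards over each full trajectory, then z-score those totals across the batch of rollouts. The excerpt does not say which normalization StarPO uses, so the z-scoring, the function name, and the toy reward values are all assumptions for illustration.

```python
from statistics import mean, pstdev


def normalized_returns(trajectories: list[list[float]], eps: float = 1e-8) -> list[float]:
    """Score each trajectory by its total reward, z-scored across the batch.

    Assumption: batch-level z-scoring; the source only says "normalized".
    """
    totals = [sum(rewards) for rewards in trajectories]  # cumulative reward per rollout
    mu = mean(totals)
    sigma = pstdev(totals)  # population standard deviation over the batch
    return [(t - mu) / (sigma + eps) for t in totals]


if __name__ == "__main__":
    # Three toy rollouts with per-step rewards (made-up numbers).
    rollouts = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
    print(normalized_returns(rollouts))
```

Scoring the whole trajectory rather than each step is what ties the update to the full decision sequence; above-average rollouts get a positive weight and below-average ones a negative weight.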

Former DeepSeeker and collaborators release new method for training reliable AI agents: RAGEN venturebeat.com/ai/former-deepseeker-and... #AI #agents

11 months ago 3 2 0 0

I can’t stop drawing parallels between AI agents and the early days of computers like RAM→Context Window, CPU→Weights, etc. Would love to see how far we can take this analogy, and where it breaks down!

See the full post on LinkedIn:
shorturl.at/bfhWv

11 months ago 0 0 0 0

Absolutely, cochlear implants are BCIs! Directly stimulating the nervous system AND achieving near feature parity with the sense that it is trying to replace? It's a slept-on GOAT imo

11 months ago 1 0 0 0

Really impressive results by Zep (github.com/getzep/graph...) for agent memory management!

Benchmarks are one thing, but I can't wait to try this out in vivo. Would love to hear how other people are finding it!

11 months ago 2 0 0 0