Eh, looks like it's possible
Posts by Andrew Gross
Maybe it's time to write my own Slack (no clue if they even support that) bsky.app/profile/simo...
Somehow Slack is using more memory for text-only office chat than an entire Docker container.
Would feel nice to be able to have an agent run off to do some investigation if I knew that it didn't have perms to mess things up too badly. It doesn't solve the lethal trifecta, but can help. Definitely needs to be self service and easy to use. Maybe part of MCP/skills and harnesses?
With all the agent stuff heating up, one thing that would be really nice is to be able to create sessions with dynamically downscoped creds. Ex: instead of my full SQL perms, the session just has "read only for X,Y,Z and write to A,B,C". The key being that I can generate any subset of my perms.
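A minimal sketch of what that downscoping could look like: a session credential can only ever be a subset of the parent's permissions. All names here are hypothetical, not any real auth API.

```python
# Sketch of downscoped session creds: a session token may only carry a
# subset of the parent credential's permissions. Hypothetical names.
from dataclasses import dataclass

@dataclass(frozen=True)
class Credential:
    perms: frozenset  # e.g. {"read:X", "read:Y", "write:A"}

    def downscope(self, requested):
        requested = frozenset(requested)
        if not requested <= self.perms:
            raise PermissionError(f"not a subset: {requested - self.perms}")
        return Credential(perms=requested)

full = Credential(frozenset({"read:X", "read:Y", "read:Z", "write:A", "write:B"}))
agent_session = full.downscope({"read:X", "read:Y"})  # ok: strict subset
# full.downscope({"admin:*"})  # would raise PermissionError
```

The invariant is the subset check: the agent session can never escalate beyond what the human already holds, which limits blast radius without solving the lethal trifecta.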
I'll believe we have AGI when companies start shipping native apps instead of Electron everywhere.
I'm afraid to ask an LLM this due to sycophancy concerns. "That's a great observation! etc", especially if they don't know. I suppose I could build up how TPUs handle MoE weights now and then pose it.
In TPUs, my impression is that weights are "stationary" and data is (systolically) moved over them, while a GPU has weights and data brought together, combined, and flushed. In a world of large MoE models, do TPUs have a disadvantage because they will need to move around more weights than before?
A series of Agentic Read calls to the user's home dir, except one in the middle says anthropic instead of andrew
Every once in a while you get to see the outside of the sampling curve on LLMs
I definitely think it is interesting and has some cool outputs for this domain. I am curious how they will handle cases where there are multiple values to optimize for. Here it seems purely about speed. Obviously, in the real world there are more goals, and they're harder to balance.
Context Poisoning issues are unsolved in this domain.
I've had some great success building Skills w/ scripts around things like JIRA where I don't need the whole API, just a few basic functions. Plus, you can make the scripts super easy to use for the agent, doing things like auto-translating user names to IDs without putting the burden on the agent
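A toy sketch of the name-to-ID idea: the script owns the lookup so the agent can pass human-readable names. The directory and ID value are stand-ins, not the real JIRA API surface.

```python
# Sketch of "auto-translate user names to IDs": the skill script absorbs
# the lookup instead of making the agent juggle account IDs.
# Stand-in directory; a real script would hit the JIRA user-search endpoint.
_USER_DIRECTORY = {"andrew": "5b10ac8d82e05b22cc7d4ef5"}  # hypothetical ID

def resolve_user(name: str) -> str:
    try:
        return _USER_DIRECTORY[name.lower()]
    except KeyError:
        raise SystemExit(f"unknown user {name!r}; try one of {sorted(_USER_DIRECTORY)}")

def assign_issue(issue_key: str, assignee_name: str) -> dict:
    # One tight function instead of exposing the whole API to the agent.
    return {"issue": issue_key, "assignee_id": resolve_user(assignee_name)}
```

The error path matters too: a loud "unknown user, try one of [...]" message lets the agent self-correct instead of flailing.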
The combination of Skills w/ the ease of creating your own "bespoke" software feels a lot better than MCP servers (or even using most skills from other folks). Seems way more effective to create some tight minimal skills/scripts for things than plug in a whole MCP api surface.
There have been times when I was using an agent to add features to my own open source library, and it referred me to use... my own open source library.
I can see companies wanting to expose a way for your agent to hook into their systems/data, but with a better interface than just some query APIs. They would have their own prompting, skills, agents etc, but would be "prompted" by your local agent. Maybe the A2A protocol has something like this.
Has anyone seen support for things like "remote skills" for Claude Code or similar? It's a bit different from MCP in that it's not just calling a regular API endpoint. I'm thinking something closer to how the Web Search tool works, where it's really a mini remote agentic system returning results.
Looks like at least 10 days ago www.youtube.com/watch?v=l3O4...
Picture of the auto memory description text from https://code.claude.com/docs/en/memory
Picture of a Claude Code session where the agent decides to write to the MEMORY.md file for the project.
When did Claude Code auto memory start getting rolled out, cool as hell
Have you been following model performance on SWE-Rebench to attempt to identify contamination? swe-rebench.com
When agents have metrics, output tests, and input data, they can iterate like crazy. Being able to set up this loop for any problem would be a huge timesaver and remove a lot of mental effort.
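The loop itself is tiny once you have those three pieces. A hedged sketch, where `propose_fix` stands in for the LLM call and everything is hypothetical:

```python
# Sketch of the metric/test/data iteration loop. `propose_fix` is a
# stand-in for the agent's revision step, not a real API.
def iterate(candidate, run_tests, score, propose_fix, budget=5):
    for _ in range(budget):
        if run_tests(candidate):
            return candidate  # tests green: done
        revision = propose_fix(candidate)
        # keep the revision only if the metric improves
        if score(revision) > score(candidate):
            candidate = revision
    return candidate

# Toy usage: "fix" an integer until it passes the test.
result = iterate(0, run_tests=lambda x: x >= 3,
                 score=lambda x: x, propose_fix=lambda x: x + 1)
```

The hard part isn't the loop, it's producing the metric and tests for an arbitrary problem; that's the setup work being pointed at here.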
For the next big jump for coding agents, I think the frontier labs are going to spend a lot of effort on making them good at creating their own problem-specific harness for iterating.
I want to make the concept of cells, output, and state in an ipykernel legible to agents as well, for the same reason. Much easier to collaborate when the end goal isn't just a top-to-bottom script every time.
I have a sneaking suspicion time travel debuggers are going to come back in to vogue once they are made legible to agents.
Maybe the harness needs to start with setting a target iteration speed and using it as an optimization metric.
Getting agents to run experiments where you need to trade off feature optimization with running time optimization is tough. Agents don't experience linear time and don't have a "feeling" that they need to spend time optimizing to iterate faster.
There's no reason the "Movie" Toothbrushing song on the Yoto has to go that hard.
One of those cases where it would have been astounding to me if someone hadn't already investigated these problems deeply, I just didn't know how to find it.
I was doing a bit of work with taking disjoint subgraphs and wanting to separate some of the larger ones based on overlapping connections, and got introduced to the Louvain method and bridging.
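The first step (finding the disjoint subgraphs) is just connected components; a stdlib-only sketch is below. Further splitting a dense component is where Louvain comes in, e.g. via networkx's `louvain_communities`, omitted here to stay dependency-free.

```python
# Sketch: split an edge list into its disjoint subgraphs (connected
# components) via DFS. Louvain would then subdivide large components.
from collections import defaultdict

def connected_components(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Toy usage: two disjoint subgraphs, {"a","b","c"} and {"x","y"}.
components = connected_components([("a", "b"), ("b", "c"), ("x", "y")])
```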
LLMs can be great for those cases (in coding at least) where you assume there is a body of work around a problem, but you don't know the terminology to find it. In the past you just had to Google and hope, or ask a coworker. Still possible to fail with LLMs but can be easier.