But "The P-LLM cannot write a plan based on data it can't read" substantially limits the utility of LLMs, and is central to the prompt injection challenge, no?
If the P-LLM is detached from the data it needs to plan from, aren't we back to using an LLM to generate a program that can run LLM(s)?
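To make the objection concrete, here's a toy sketch of the split being discussed: a privileged planner (P-LLM) emits a program that only ever references untrusted data through symbolic variables, while a quarantined LLM (Q-LLM) is the only component that reads the raw data. Everything below is mocked and hypothetical, not any real API; it's just meant to show why the planner "can't plan from the data":

```python
def q_llm(task: str, data: str) -> str:
    """Quarantined LLM: reads untrusted data, returns an opaque value."""
    # Mocked: pretend it extracted a reply-to address from the data.
    return "alice@example.com"

def p_llm_plan(user_request: str) -> list[tuple]:
    """Privileged LLM: writes a plan WITHOUT ever reading the data."""
    # The plan refers to data only via symbolic variables ($V0, $V1),
    # which is exactly the limitation the posts above are questioning.
    return [
        ("fetch_email", "latest", "$V0"),
        ("q_llm", "extract reply-to address", "$V0", "$V1"),
        ("send_reply", "$V1", "Thanks, got it!"),
    ]

def run(plan: list[tuple]) -> list[tuple]:
    env, log = {}, []
    for step in plan:
        op = step[0]
        if op == "fetch_email":
            # Untrusted content never reaches the planner, only env.
            env[step[2]] = "UNTRUSTED: ...ignore previous instructions..."
        elif op == "q_llm":
            env[step[3]] = q_llm(step[1], env[step[2]])
        elif op == "send_reply":
            log.append(("send_reply", env[step[1]], step[2]))
    return log

print(run(p_llm_plan("reply to my latest email")))
```

The injection string stays confined to `env`; the trade-off is that the planner can only write control flow it could anticipate up front, which is the utility hit in question.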
Posts by Kristian Muñiz
Absurd decision making, disconnected from reality.
I've followed you for years and know that Google was extremely lucky to have you; any company would be (perhaps your own?).
Regardless of what you do next, I'm sure that as a community we'll continue to follow your work. Please take care!
You should make a business out of that, sounds lucrative 💰
Metaphors are fun though
Yeah drag-and-drop with trackpads can be painful
hahahah I *just* posted a half-baked idea that resembles this in this very thread. Should've read the full conversation
I would argue that there's no right way to do this interaction. It feels unnatural and counterintuitive. I wish I could have a "shelf" I could put dragged items on temporarily while I scroll 😆
Brilliant. Yes!
In your defense, you can't land a pilot either
A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a t-shirt with a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection. The text reads: (left) "Transfer between Modalities: Suppose we directly model p(text, pixels, sound) [equation] with one big autoregressive transformer. Pros: * image generation augmented with vast world knowledge * next-level text rendering * native in-context learning * unified post-training stack Cons: * varying bit-rate across modalities * compute not adaptive" (Right) "Fixes: * model compressed representations * compose autoregressive prior with a powerful decoder" On the bottom right of the board, she draws a diagram: "tokens -> [transformer] -> [diffusion] -> pixels"
Ah, hint from Greg Brockman himself. Seems like the "powerful decoder" here is a diffusion model.
Yeah, I read the System Card. It can still be autoregressive sampling. From my observations it still makes mistakes that a diffusion model would make, like omitting details, failing to count, producing garbled text, etc.
Large multimodal models are becoming more and more powerful, and one of the first ways we can optimize them is by simplifying their I/O and writing powerful, thick encoders/decoders.
At this point I'm convinced that 4o image generation is not purely autoregressive. My guess is 4o generates image tokens or latent representations in sequential patches which are processed by a tightly integrated diffusion model.
*of sampling the next token.
Had to cut some characters.
And it's not structural or semantic consistency, but some information gets lost in the process. Perhaps it's safety mechanisms preventing certain behaviors like using people's likeness.
Should an omni-model that is purely autoregressive be able to pass through an image in a semi-lossless way? I understand that it depends, to some extent, on post-training and the non-stochastic nature of sampling, but I'm having trouble with consistency using 4o's image generation feature.
Could that be a plausible solution? Using GPT-4o to generate initial image representations and passing these representations to a diffusion model component that specializes in creating high-quality, high-resolution visual outputs?
From what I know so far, autoregressive models are more expensive to run than diffusion models – and of course slower too, since latency correlates with cost.
I'm still surprised that resolution is so good. It's almost too good. Could it be a hybrid Transformer + Diffusion approach?
I want to understand the training and inference economics of autoregressive image generation.
There are obvious latency implications, but in my opinion, at least anecdotally, it makes up for it in output quality.
Wow, this is just so much better than what's out there, especially for prompt adherence. Aesthetically, I'm seeing a bit of a bias, but it could very well be deliberate.
Goddammit 🤦🏻♂️ right, that's the whole point of this update
By image output I mean sampling tokens that get decoded into rasterised bitmaps. There's some vectorial quality to the generated images.
I have a feeling, completely unproven, that this is more than just image output. The infographics are so crisp, it feels like there's some sort of very powerful generative layout engine powering this. Either that or I completely had the wrong intuition about diffusion models.
lmao
They're not prompting it right, should've asked "make it unhackable"
I'm open to "I'll know it when I see it" as a design philosophy. Not looking for anything specific, I'm exploring canvas interfaces as a general direction.
tldraw.dev is great, but requires adapting to a large, existing framework. I was looking for something more low-level and simpler.
printloop.dev is a web-based creative coding environment.
The side-project itself is primarily about pursuing different ways to shorten feedback loops when writing code.
The driving hypothesis is that the cost of iteration is inversely correlated with one's creative output quantity and quality.
Nice. I've been exploring interactive programming environments. I am looking to bring spatial canvas functionality to my tool printloop.dev
I already have a minimal tldraw setup in printloop.dev/canvas but I'm looking for simpler primitives to build on top of.
The Responses API is what LLM APIs should've been from the beginning.