
Posts by Kristian Muñiz

But "The P-LLM cannot write a plan based on data it can't read" is a substantial limitation on the utility of LLMs, and central to the prompt injection challenge, no?

If the P-LLM is detached from the data it needs to plan from, aren't we back to using an LLM to generate a program that can run LLM(s)?

1 year ago 0 0 1 0

Absurd decision making, disconnected from reality.

I've followed you for years and know that Google was extremely lucky to have you, any company would be (perhaps your own?).

Regardless of what you do next, I'm sure that as a community we'll continue to follow your work. Please take care!

1 year ago 5 0 0 0

You should make a business out of that, sounds lucrative 💰

1 year ago 2 0 0 0

Metaphors are fun though

1 year ago 2 0 0 0
Dropover - Easier Drag and Drop on your Mac. Dropover is a drag and drop utility that makes it simple to collect, organize, share, and process files with floating shelves.

I found a modern version of this dropoverapp.com

1 year ago 2 0 1 0

Yeah drag-and-drop with trackpads can be painful

1 year ago 2 0 0 0

hahahah I *just* posted a half-baked idea that resembles this in this very thread. Should've read the full conversation

1 year ago 1 0 0 0

I would argue that there's no right way to do this interaction. It feels unnatural and counterintuitive. I wish I could have a "shelf" I could put dragged items on temporarily while I scroll 😆

1 year ago 2 0 1 0

Brilliant. Yes!

1 year ago 0 0 0 0

In your defense, you can't land a pilot either

1 year ago 1 0 1 0
A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a t-shirt with a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.

The text reads:

(left)
"Transfer between Modalities:

Suppose we directly model
p(text, pixels, sound) [equation]
with one big autoregressive transformer.

Pros:
* image generation augmented with vast world knowledge
* next-level text rendering
* native in-context learning
* unified post-training stack

Cons:
* varying bit-rate across modalities
* compute not adaptive"

(Right)
"Fixes:
* model compressed representations
* compose autoregressive prior with a powerful decoder"

On the bottom right of the board, she draws a diagram:
"tokens -> [transformer] -> [diffusion] -> pixels"


Ah, hint from Greg Brockman himself. Seems like the "powerful decoder" here is a diffusion model.

1 year ago 1 0 0 0

Yeah, I read the System Card. It can still be autoregressive sampling. From my observations, it still makes mistakes that a diffusion model would make, like omitting details, failing to count, producing garbled text, etc.

1 year ago 0 0 0 0

Large multimodal models are becoming increasingly powerful, and one of the first ways we can optimize them is by simplifying their I/O and writing powerful, thick encoders/decoders.

1 year ago 0 0 0 0

At this point I'm convinced that 4o image generation is not purely autoregressive. My guess is 4o generates image tokens or latent representations in sequential patches, which are processed by a tightly integrated diffusion model.

1 year ago 2 0 3 0
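A minimal numpy sketch of that guess, where every component is made up for illustration (the linear "prior", the interpolation-based "denoiser", the dimensions) — this is a toy of the shape of the idea, not how 4o actually works: an autoregressive loop emits latent patches one at a time, and a small iterative refiner plays the role of the integrated diffusion decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an autoregressive prior: each "patch latent" is
# predicted from the previous one (here, just a random linear map).
def ar_prior(n_patches, dim, w):
    latents = [rng.normal(size=dim)]
    for _ in range(n_patches - 1):
        latents.append(np.tanh(w @ latents[-1]))  # next latent from context
    return np.stack(latents)

# Toy stand-in for a diffusion decoder: start from noise and take a few
# refinement steps toward the conditioning latent.
def diffusion_decode(latent, steps=10):
    x = rng.normal(size=latent.shape)  # pure noise
    for t in range(steps):
        x = x + (latent - x) / (steps - t)  # move toward the conditioning
    return x

dim = 16
w = rng.normal(size=(dim, dim)) / np.sqrt(dim)
latents = ar_prior(n_patches=4, dim=dim, w=w)
patches = np.stack([diffusion_decode(z) for z in latents])
print(patches.shape)  # (4, 16): one decoded "patch" per latent
```

The point of the split is that the sequential loop carries global structure while the per-patch refiner handles local detail, which is roughly what a hybrid AR + diffusion design would buy.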

*of sampling the next token.

Had to cut some characters.

1 year ago 1 0 0 0

And it's not just structural or semantic consistency: some information gets lost in the process. Perhaps it's safety mechanisms preventing certain behaviors, like using people's likeness.

1 year ago 1 0 0 0

Should an omni-model that is purely autoregressive be able to pass through an image in a semi-lossless way? I understand that it depends, to some extent, on post-training and the stochastic nature of sampling, but I'm having trouble with consistency using 4o's image generation feature.

1 year ago 0 0 2 0
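A toy numpy sketch of why a pure pass-through can be lossy even before sampling enters the picture, assuming a hypothetical VQ-style tokenizer (the codebook, the sizes, and the image here are all invented): discretizing pixels to a finite token set and decoding back already introduces reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical VQ-style tokenizer: map each pixel vector to its nearest
# codebook entry, a stand-in for how an omni-model might discretize
# images before autoregressive modeling.
codebook = rng.uniform(0, 1, size=(256, 3))   # 256 "tokens", RGB-like
image = rng.uniform(0, 1, size=(32, 32, 3))   # toy image

flat = image.reshape(-1, 3)
# Nearest codebook entry per pixel (the "encode to tokens" step).
dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)

# Decode tokens back to pixels and measure the round-trip error.
recon = codebook[tokens].reshape(image.shape)
mse = float(((image - recon) ** 2).mean())
print(f"round-trip MSE: {mse:.4f}")  # nonzero: quantization alone is lossy
```

So even a deterministic, greedy pass-through would drift from the input; stochastic sampling and post-training only add to that.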

Could that be a plausible solution? Using GPT-4o to generate initial image representations and passing these representations to a diffusion model component that specializes in creating high-quality, high-resolution visual outputs?

1 year ago 0 0 0 0

From what I know so far, autoregressive models are more expensive to run than diffusion models, and of course slower too, with latency correlated with cost.

I'm still surprised that resolution is so good. It's almost too good. Could it be a hybrid Transformer + Diffusion approach?

1 year ago 0 0 1 0

I want to understand the training and inference economics of autoregressive image generation.

There are obviously latency implications, but in my opinion, at least anecdotally, it makes up for them in output quality.

1 year ago 0 0 1 0

Wow, this is just so much better than what's out there, especially for prompt adherence. Aesthetically, I'm seeing a bit of a bias, but it could very well be deliberate.

1 year ago 0 0 0 0

Goddammit 🤦🏻‍♂️ right, that's the whole point of this update

1 year ago 0 0 0 0

By image output I mean sampling tokens that get decoded into rasterised bitmaps. There's some vectorial quality to the generated images.

1 year ago 1 0 1 0

I have a feeling, completely unproven, that this is more than just image output. The infographics are so crisp, it feels like there's some sort of very powerful generative layout engine powering this. Either that or I completely had the wrong intuition about diffusion models.

1 year ago 0 0 2 0

lmao

1 year ago 1 0 0 0

They're not prompting it right, should've asked "make it unhackable"

1 year ago 1 0 1 0

I'm open to "I'll know it when I see it" as a design philosophy. Not looking for anything specific, I'm exploring canvas interfaces as a general direction.

tldraw.dev is great, but requires adapting to a large, existing framework. I was looking for something more low-level and simpler.

1 year ago 1 0 0 0

printloop.dev is a web-based creative coding environment.

The side-project itself is primarily about pursuing different ways to shorten feedback loops when writing code.

The driving hypothesis is that the cost of iteration is inversely correlated with one's creative output quantity and quality.

1 year ago 1 0 1 0

Nice. I've been exploring interactive programming environments. I am looking to bring spatial canvas functionality to my tool printloop.dev

I already have a minimal tldraw setup in printloop.dev/canvas but I'm looking for simpler primitives to build on top of.

1 year ago 2 0 1 0

The Response API is what these LLM APIs should've been from the beginning.

1 year ago 1 0 0 0