Looking for a small or medium-sized VLM? PaliGemma 2 spans a more than 150x range of compute!
Not sure yet if you want to invest the time 🪄finetuning🪄 on your data? Give it a try with our ready-to-use "mix" checkpoints:
🤗 huggingface.co/blog/paligem...
🎤 developers.googleblog.com/en/introduci...
Posts by Alexander Kolesnikov
The full answer is probably very complex.
I really like the "function matching" angle we discovered (or rediscovered) in one of our papers that partially demystifies distillation for me: arxiv.org/abs/2106.05237
Thank you!
Also check out this concurrent work that is very similar in spirit to Jet and JetFormer, which proposes autoregressive ViT-powered normalizing flows (NFs): x.com/zhaisf/statu...
Joint work with @asusanopinto.bsky.social
and @mtschannen.bsky.social, performed at Google DeepMind.
Final note: we see the Jet model as a powerful tool and a building block for advanced generative models, like JetFormer bsky.app/profile/mtsc..., and not as a standalone competitive generative model.
Check out the paper for more juicy details: arxiv.org/abs/2412.15129.
My favorite mini-insight is how implicit half-precision matrix multiplications (with float32 accumulation) can 'eat' entropy and lead to an overly optimistic, flawed objective and evaluations.
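One way to picture the precision issue is the toy sketch below. It is not the experiment from the paper, and the specific values are made up for illustration: distinct float32 inputs collapse onto fewer values once they are silently cast to half precision, so a map that is invertible on paper stops being invertible in practice, while the log-determinant term still assumes it is.

```python
import numpy as np

# Hypothetical inputs: 16 distinct float32 values, spaced more finely
# than float16 can resolve in the neighbourhood of 1.0.
x32 = (1.0 + np.arange(16) / 4096.0).astype(np.float32)

# Stand-in for the implicit downcast of matmul inputs to half precision.
x16 = x32.astype(np.float16)

n_before = np.unique(x32).size
n_after = np.unique(x16).size
print(f"distinct values: {n_before} in float32, {n_after} after the float16 cast")
```

Several inputs now share the same representation, which is the sense in which information (entropy) gets "eaten" without the objective noticing.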
When trained on 'small' data, such as ImageNet-1k, overfitting occurs.
Another contribution is a demonstration that transfer learning is effective in mitigating overfitting. The recipe is: pretrain on a large image database and then fine-tune to a small dataset, e.g., CIFAR-10.
We observe robust performance improvements with compute scaling, showing behavior similar to classical scaling laws.
These are the results of varying the Jet model size when training on ImageNet-21k images:
Our main contribution is a very straightforward design: Jet is just repeated affine coupling layers with ViT inside. We show that many standard components are not needed with our simple design:
❌ invertible dense layer
❌ ActNorm layer
❌ multiscale latents
❌ dequant. noise
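For readers new to normalizing flows, here is a minimal NumPy sketch of a single affine coupling layer. It is not the Jet implementation: a tiny two-layer network (`predict_scale_shift`, a hypothetical stand-in) replaces the ViT, and the `exp(log_scale)` parameterization is the textbook RealNVP form, assumed here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 8, 16
HALF = DIM // 2

# Hypothetical stand-in for the ViT block that predicts scale and shift.
W1 = rng.normal(scale=0.1, size=(HALF, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, 2 * HALF))


def predict_scale_shift(x1):
    """Map the untouched half to a log-scale and a shift for the other half."""
    h = np.tanh(x1 @ W1)
    log_scale, shift = np.split(h @ W2, 2, axis=-1)
    return log_scale, shift


def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + b(x1); also returns log|det J|."""
    x1, x2 = x[..., :HALF], x[..., HALF:]
    log_scale, shift = predict_scale_shift(x1)
    y2 = x2 * np.exp(log_scale) + shift
    logdet = log_scale.sum(axis=-1)  # the Jacobian is triangular
    return np.concatenate([x1, y2], axis=-1), logdet


def coupling_inverse(y):
    """Exact inverse: x1 is passed through, so s and b can be recomputed."""
    y1, y2 = y[..., :HALF], y[..., HALF:]
    log_scale, shift = predict_scale_shift(y1)
    x2 = (y2 - shift) * np.exp(-log_scale)
    return np.concatenate([y1, x2], axis=-1)


x = rng.normal(size=(4, DIM))
y, logdet = coupling_forward(x)
x_rec = coupling_inverse(y)
print("max reconstruction error:", np.abs(x - x_rec).max())
```

Stacking such layers, with a different split of the dimensions each time, gives an invertible model whose exact log-likelihood is cheap to compute.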
With some delay, JetFormer's *prequel* paper is finally out on arXiv: a radically simple ViT-based normalizing flow (NF) model that achieves SOTA results in its class.
Jet is one of the key components of JetFormer, deserving a standalone report. Let's unpack: 🧵⬇️
PaliGemma 2 is out! Bigger models, better results. For the best experience, do not forget to finetune.
Congrats, PaliGemma 2 team!
OK, it is yesterday's news already, but a good night's sleep is important.
After 7 amazing years at Google Brain/DM, I am joining OpenAI. Together with @xzhai.bsky.social and @giffmana.ai, we will establish the OpenAI Zurich office. Proud of our past work and looking forward to the future.
In arxiv.org/abs/2303.00848, @dpkingma.bsky.social and @ruiqigao.bsky.social had suggested that noise augmentation could be used to make other likelihood-based models optimise perceptually weighted losses, like diffusion models do. So cool to see this working well in practice!
The answer has just dropped: bsky.app/profile/kole...
JetFormer is the product of endless and heated (but friendly) arguments and discussions with @mtschannen.bsky.social
and @asusanopinto.bsky.social.
Very excited about this model due to its potential to unify multimodal learning with a simple and universal end-to-end approach.
We evaluate JetFormer's potential to model large-scale multimodal image+text data on image-to-text, text-to-image and VQA tasks, and get rather encouraging results.
We also present a novel data augmentation: the "noise curriculum". It helps a pure NLL model focus on high-level image details.
Even though it is inspired by diffusion, it is very different: it only affects training and does not require iterative denoising during inference.
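A rough sketch of what a noise curriculum can look like in code. The cosine shape, the starting noise level `sigma_start=64.0` (in 0..255 pixel units) and the function names are assumptions for illustration, not the exact settings from the paper: Gaussian noise is added to training images and its standard deviation is annealed to zero as training progresses.

```python
import math

import numpy as np


def noise_std(step, total_steps, sigma_start=64.0, sigma_end=0.0):
    """Hypothetical schedule: cosine decay from sigma_start to sigma_end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return sigma_end + (sigma_start - sigma_end) * cosine


def augment(image, step, total_steps, rng):
    """Add Gaussian noise whose strength depends on the training step."""
    sigma = noise_std(step, total_steps)
    return image + rng.normal(scale=sigma, size=image.shape)


rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(8, 8, 3))  # dummy image
for step in (0, 50, 100):
    noisy = augment(image, step, total_steps=100, rng=rng)
    print(step, round(noise_std(step, 100), 2), round(float(np.abs(noisy - image).mean()), 2))
```

Early on, heavy noise hides fine texture, so the model has to get the coarse structure right first; by the end the model sees clean images. Nothing here is needed at sampling time.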
JetFormer is just an autoregressive transformer, trained end-to-end in one go, with no pretrained image encoders/quantizers.
There is a small twist though. An image input is re-encoded with a normalizing flow model, which is trained jointly with the main transformer model.
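In symbols, this kind of joint setup rests on the standard change-of-variables identity. The notation is mine, not taken from the paper: $f_\phi$ is the normalizing flow, $z = f_\phi(x)$ is the re-encoded image, and $p_\theta$ is the autoregressive transformer's density over $z$.

```latex
\log p(x) \;=\; \log p_\theta\bigl(f_\phi(x)\bigr)
\;+\; \log \left| \det \frac{\partial f_\phi(x)}{\partial x} \right|
```

Both terms are differentiable with respect to $\theta$ and $\phi$, which is what makes it possible to train the flow and the transformer together on a single likelihood objective.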
I always dreamed of a model that simultaneously
1. optimizes NLL of raw pixel data,
2. generates competitive high-res. natural images,
3. is practical.
But it seemed too good to be true. Until today!
Our new JetFormer model (arxiv.org/abs/2411.19722) ticks all of these boxes.
🧵