
Posts by Sjoerd van Steenkiste

How do language models generalize from information they learn in-context vs. via finetuning? In arxiv.org/abs/2505.00661 we show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning — and ways to improve finetuning. 1/

11 months ago

🚨 Deadline Extension Alert for #VLMs4All Challenges! 🚨

We have extended the challenge submission deadline
🛠️ New challenge deadline: Apr 22

Show your stuff in the CulturalVQA and GlobalRG challenges!
👉 sites.google.com/view/vlms4al...

Spread the word and keep those submissions coming! 🌍✨

1 year ago

Excited to announce that we will be organizing a #CVPR2025 Workshop on Building Geo-Diverse and Culturally Aware VLMs. Aside from fantastic speakers and a short-paper track, the workshop includes two challenges, one of them based on our CulturalVQA benchmark. Links below!

1 year ago

arxiv.org/abs/2311.00445

1 year ago
The Impact of Depth on Compositional Generalization in Transformer Language Models To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Fo...

arxiv.org/abs/2310.19956

1 year ago
How Does Code Pretraining Affect Language Model Task Performance? Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence sug...

arxiv.org/abs/2409.04556

1 year ago
Can Language Models Perform Implicit Bayesian Inference Over User... To successfully interact with the world, both humans and machines need to construct models of the world and form beliefs about these models. These beliefs need to be updated as new information...

openreview.net/forum?id=arY...

1 year ago
Research Intern, PhD, Summer 2025 — Google Careers

Application page: www.google.com/about/career...

Some recent papers from our team below:

1 year ago

Our team @GoogleAI is hiring an intern. We are interested in having LMs understand and respond to users better. Topics include teaching LMs to build “mental models” of users and improving LMs' reasoning capabilities over long contexts.

@GoogleAI internship deadline is Feb 28.

1 year ago

🔥Excited to introduce RINS, a technique that boosts model performance by recursively applying early layers during inference, without increasing model size or training FLOPs! It significantly improves not only LMs but also multimodal systems like SigLIP.
(1/N)
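The core idea, recursively reapplying a shared early block at inference so that compute depth grows while the parameter count stays fixed, can be sketched in a few lines. This is a minimal illustration with a toy two-block network; all names (`early_block`, `num_recursions`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_early = rng.normal(size=(d, d)) * 0.1  # weights of the shared "early layers"
W_late = rng.normal(size=(d, d)) * 0.1   # weights of the remaining layers

def early_block(x):
    # One pass through the shared early layers.
    return np.tanh(x @ W_early)

def forward(x, num_recursions=3):
    # Recursively reapply the early block at inference time: parameter
    # count is unchanged, only the depth of computation grows.
    for _ in range(num_recursions):
        x = early_block(x)
    return x @ W_late

x = rng.normal(size=(2, d))
print(forward(x).shape)  # (2, 8)
```

Setting `num_recursions=1` recovers the ordinary single-pass forward computation, so recursion depth becomes a pure inference-time knob.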

1 year ago
Research Scientist, Zurich Zurich, Switzerland

If you are interested in developing large-scale, multimodal datasets & benchmarks, and advancing AI through data-centric research, check out this great opportunity. Our team is hiring!
boards.greenhouse.io/deepmind/job...

1 year ago

The ICLR 2025 decisions are out! It was an honor to serve as a Senior Area Chair for this year’s iteration, and be more involved in overseeing the review process.

1 year ago
ICLR 2025 Financial Assistance

Financial Assistance applications are now open! If you face financial barriers to attending ICLR 2025, we encourage you to apply. The program offers prepay and reimbursement options. Applications are due March 2nd with decisions announced March 9th. iclr.cc/Conferences/...

1 year ago

Check out @tkipf.bsky.social's post on MooG, the latest in our line of research on self-supervised neural scene representations learned from raw pixels:

SRT: srt-paper.github.io
OSRT: osrt-paper.github.io
RUST: rust-paper.github.io
DyST: dyst-paper.github.io
MooG: moog-paper.github.io

1 year ago
TRecViT architecture

TRecViT: A Recurrent Video Transformer
arxiv.org/abs/2412.14294

Causal, with 3× fewer parameters, 12× less memory, and 5× lower FLOPs than (non-causal) ViViT, while matching or outperforming it on Kinetics & SSv2 action recognition.

Code and checkpoints out soon.

1 year ago

with @linluqiu.bsky.social Fei Sha, Kelsey Allen, Yoon Kim, @tallinzen.bsky.social and myself.

1 year ago

Can language models perform implicit Bayesian inference over user preference states? Come find out at the “System-2 Reasoning at Scale” #NeurIPS2024 workshop, 11:30am, West Ballroom B.

1 year ago

Neural Assets poster is happening now. Join us at East Exhibit Hall A-C #1507

1 year ago

I will be at the @GoogleAI booth until 2pm. Come say hello if you have questions about Google Research!

1 year ago

Excited to be at #NeurIPS2024. A few papers we are presenting this week:

MooG: arxiv.org/abs/2411.05927
Neural Assets: arxiv.org/abs/2406.09292
Probabilistic reasoning in LMs: openreview.net/forum?id=arYXg…

Let’s connect if any of these research topics interest you!

1 year ago

Interesting perspective on ICL and great suggestions for future research in this space!

1 year ago

🚀🚀PaliGemma 2 is our updated and improved PaliGemma release: it uses the Gemma 2 models and provides new pre-trained checkpoints for the full cross product of {224px, 448px, 896px} resolutions and {3B, 10B, 28B} model sizes.

1/7
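The "full cross product" of three resolutions and three model sizes yields nine pre-trained checkpoints, which can be enumerated mechanically; the label format below is purely illustrative, not the official checkpoint naming.

```python
from itertools import product

# The two axes of the PaliGemma 2 checkpoint grid, as stated in the post.
resolutions = ["224px", "448px", "896px"]
sizes = ["3B", "10B", "28B"]

# Every (size, resolution) combination is a separate checkpoint.
checkpoints = [f"{size} @ {res}" for size, res in product(sizes, resolutions)]
print(len(checkpoints))  # 9
```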

1 year ago

Looking forward to seeing what can be built on top of such "particle" representations. While conceptually simple, they are one step closer to representing scenes (the underlying causal structure) rather than videos (a mixture of many factors), and could be useful for robotics tasks.

1 year ago

That looks amazing, enjoy!

1 year ago

If you are reviewing for ICLR, please engage with the author response!

1 year ago

This project is the result of a wonderful collaboration with many people at Google, and will appear at NeurIPS later this year. Special thanks to my co-first authors @zdanielz.bsky.social and @tkipf.bsky.social for being great collaborators and seeing this project through!

1 year ago

While the vast majority of computer vision advances in the past decade can be attributed to successful "on-the-grid" architectures such as CNNs and Vision Transformers, the physical world ultimately does not live on a pixel grid, a limitation we address in MooG.

1 year ago

Even in comparison to specialized architectures for downstream tasks, such as TAPIR for point tracking, we find that self-supervised MooG latents yield strong performance.

1 year ago

MooG can provide a strong foundation for different downstream vision tasks, including point tracking, monocular depth estimation, and object tracking. Especially when reading out from frozen representations, MooG tends to outperform on-the-grid baselines.

1 year ago

We demonstrate the usefulness of MooG’s learned representation both qualitatively and quantitatively by training readouts on top of the learned representation on a variety of downstream tasks.
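The evaluation recipe, keeping the representation frozen and training only a small per-task readout head on top of it, can be sketched as follows. This is a toy stand-in with random data and a least-squares linear head; `frozen_backbone` and all shapes are illustrative, not MooG's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(32, 16)) * 0.1  # stand-in for frozen pretrained weights

def frozen_backbone(frames):
    # Stand-in for the frozen self-supervised representation:
    # these weights are never updated during readout training.
    return np.tanh(frames @ W_backbone)

# Train only a linear readout head on top of the frozen features,
# here via least squares (a stand-in for per-task readout training).
frames = rng.normal(size=(100, 32))   # fake "video frame" inputs
targets = rng.normal(size=(100, 1))   # fake downstream labels (e.g. depth)
features = frozen_backbone(frames)
W_readout, *_ = np.linalg.lstsq(features, targets, rcond=None)
print((features @ W_readout).shape)  # (100, 1)
```

Because only the readout is trained, downstream performance directly measures how much task-relevant information the frozen representation already contains.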

1 year ago