
Posts by Sebastian Dziadzio

Yeah, mostly because GPT-5 needs to think for 20 seconds to come up with a name for a variable. It's good for bigger, self-contained features, but the bias for "reasoning" in the model router makes it downright unusable for smaller changes.

7 months ago 0 0 1 0

๐Ÿ†ONEBench accepted to ACL main! โœจ
Stay tuned for the official leaderboard and real-time personalised benchmarking release!

If you're attending ACL or are generally interested in the future of foundation model benchmarking, happy to talk!

#ACL2025NLP #ACL2025
@aclmeeting.bsky.social

11 months ago 7 2 0 0

Done! Sorry for the wait

11 months ago 0 0 0 0

Added! 🎟️

1 year ago 0 0 0 0

Done! ๐Ÿ™Œ๐Ÿป

1 year ago 2 0 0 0

Done! ✅

1 year ago 1 0 0 0

The Practitioner's Guide to Continual Multimodal Pretraining @dziadzio.bsky.social @confusezius.bsky.social @vishaalurao.bsky.social @bayesiankitten.bsky.social

1 year ago 24 4 0 0
GitHub - ExplainableML/fomo_in_flux: Code and benchmark for the paper "A Practitioner's Guide to Continual Multimodal Pretraining" [NeurIPS'24]

📄 Paper: arxiv.org/abs/2412.06712
💻 Code: github.com/ExplainableM...

1 year ago 3 0 0 0

This has been a fun project with a great team: led by @vishaalurao.bsky.social and @confusezius.bsky.social, with core contributions from @bayesiankitten.bsky.social, and supervision by @zeynepakata.bsky.social, Samuel Albanie, and Matthias Bethge.

1 year ago 2 0 1 0
Plots showing the scaling dynamics described in the text.

As usual, scaling matters!
🚀 Larger models benefit more from temporal merging than sequential finetuning.
🚀 Larger compute budgets allow temporal merging to match (and surpass!) multitask performance.
🚀 Best-in-TIME scales effectively across longer task sequences (50, 100).

1 year ago 2 0 1 0
A plot showing that different merging techniques perform similarly.

📌 The choice of merging technique doesn't matter much.

In the temporal setting, complex merging techniques like TIES or Breadcrumbs offer only marginal gains compared to simpler ones like weight averaging.
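To make the comparison concrete, here is a minimal sketch of the simple baseline in question, plain weight averaging across expert checkpoints (the dictionary-of-arrays representation and function name are illustrative, not the paper's code):

```python
import numpy as np

def average_merge(experts):
    """Uniform weight averaging over expert checkpoints (per-parameter mean)."""
    return {k: np.mean([e[k] for e in experts], axis=0) for k in experts[0]}

# Two toy "experts", each a dict mapping parameter names to weight tensors
experts = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
merged = average_merge(experts)
print(merged["w"])  # [2. 3.]
```

Techniques like TIES add sign resolution and magnitude trimming on top of this mean, but in the temporal setting those refinements buy little.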

1 year ago 2 0 1 0
A plot showing that different initialization and deployment strategies lead to different results.

📌 Initialization and deployment choices are crucial.

One strategy stands out: using an exponential moving average for both initialization and deployment strikes the best balance between knowledge accumulation and zero-shot retention. We call this approach ✨Best-in-TIME✨
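As a rough sketch of the idea (not the paper's actual training loop; the decay value, parameter layout, and the "finetuning" stub are all placeholders): keep a running EMA of expert weights, start each new expert from it, and fold the finetuned expert back in.

```python
import numpy as np

def ema_update(ema, expert, decay=0.9):
    """Fold a new expert into the running EMA, parameter by parameter."""
    return {k: decay * ema[k] + (1 - decay) * expert[k] for k in ema}

ema = {"w": np.zeros(3)}  # stand-in for the base model's weights
for t in range(3):
    init = dict(ema)                 # 1. initialize the new expert from the EMA
    expert = {"w": init["w"] + 1.0}  # 2. stand-in for finetuning on task t
    ema = ema_update(ema, expert)    # 3. deploy the updated EMA
```

The EMA damps each task's drift (knowledge accumulation) while keeping the deployed model close to its history (zero-shot retention).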

1 year ago 2 0 1 0
A plot showing that offline merging underperforms with respect to a replay baseline.

📌 Accounting for time is essential.

Standard merging struggles with the temporal dynamics. Replay and weighting schemes, which factor in the sequential nature of the problem, help (but only to a point).

1 year ago 2 0 1 0

Key insights:

📌 Accounting for time is essential.
📌 Initialization and deployment choices are crucial.
📌 The choice of merging technique doesn't matter much.

1 year ago 2 0 1 0
A schematic representation of the TIME framework.

The world keeps changing, and so should our models.

Enter TIME (Temporal Integration of Model Expertise), a unifying approach that considers:

1๏ธโƒฃ Initialization
2๏ธโƒฃ Deployment
3๏ธโƒฃ Merging Techniques

We study these three axes on the large FoMo-in-Flux benchmark.

1 year ago 2 0 1 0

📄 New Paper: "How to Merge Your Multimodal Models Over Time?"

arxiv.org/abs/2412.06712

Model merging assumes all finetuned models are available at once. But what if they need to be created over time?

We study Temporal Model Merging through the TIME framework to find out!

🧵

1 year ago 25 7 1 2

Come chat with us at NeurIPS about continual multimodal pretraining and some interesting follow-ups 👀

1 year ago 2 0 0 0

🚨 Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨

Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.

🔎 arxiv.org/abs/2412.06745

1 year ago 18 5 1 2
Kickstand advertising a Taylor Swift pop-up store.

Kickstand advertising a coffee shop to NeurIPS attendees.

The changing of the guard ceremony in Vancouver is complete

1 year ago 4 0 1 0

I keep forgetting about the concert, yesterday I was like 'wow people in Vancouver sure love sequins and cowboy boots'.

1 year ago 0 0 0 0

Whenever my "papers" tab group got lost in a Chrome crash, I felt nothing but relief.

The firehose is relentless, so over time my strategy became: skim in the moment if it's interesting and save it to Zotero; otherwise, close the tab. There is only the present. Important stuff will come back.

1 year ago 1 0 0 0

Yeah, I think we consistently underestimate how much stuff is out there on the Internet. You might think your question or image prompt is niche and original, but if you consider the distribution of Internet-scale datasets, you'd have to work very hard to even reach the tail.

1 year ago 2 0 0 0

If someone said "the algorithm" with no additional context, I'd think of the latter, but "an algorithm" for me is still the former. Interesting how the default meaning is shifting.

1 year ago 6 0 0 0

How I use LLMs when writing papers:
1. Write a sentence.
2. Copy it to an LLM for edits, add a prompt explaining in simple words what I'm trying to say.
3. Realise my simple word explanation is actually what I need.
4. Copy it over to the paper, move on to the next sentence.

1 year ago 10 0 1 0

Have you read Fables for Robots? I think it was only published in English as part of Mortal Engines. If you liked Cyberiad, you'll like this one too!

1 year ago 3 0 1 0

Added you! ๐Ÿ™Œ๐Ÿป

1 year ago 0 0 0 0

All in! ๐ŸŽŸ๏ธ๐ŸŽŸ๏ธ๐ŸŽŸ๏ธ

1 year ago 1 0 0 0

You're in! ✅

1 year ago 2 0 0 0

Welcome aboard! 🎟️

1 year ago 2 0 0 0

🤔 Can you turn your vision-language model from a great zero-shot model into a great-at-any-shot generalist?

Turns out you can, and here is how: arxiv.org/abs/2411.15099

Really excited to share this work on multimodal pretraining for my first Bluesky entry!

🧵 A short and hopefully informative thread:

1 year ago 134 24 2 7