
Posts by Matthew Leavitt

Post image

The UniReps Workshop is happening THIS SATURDAY at #NeurIPS 🤖🧠

Join us for a day of insightful talks and discussions with @sueyeonchung.bsky.social, @eringrant.bsky.social, @leavittron.bsky.social, @itsneuronal.bsky.social, @marcocuturi.bsky.social, Philip Isola, Neel Nanda and Stefanie Jegelka! 🎤

1 year ago

Also, we're a small startup and we want to remain nimble. Roles here are fairly fluid, though there are specific strength areas we're trying to hire for, hence the distinct job postings.

1 year ago

I know you're not trying to call me out, but happy to give my thoughts: Many research orgs have a "research vs. eng" divide that comes along w/ baggage: hierarchies, expectation of duties, etc. We don't want that here. Nobody is too good to touch code or insufficiently credentialed to do science

1 year ago

Huge shoutout to @agcrnz.bsky.social @alvin-d.bsky.social @pratyushmaini.bsky.social and Mo Razzak for leading this work. You did an amazing job! Stay tuned for more announcements from us. We’ll have a booth at NeurIPS, come say hi!

1 year ago
Preview
DatologyAI Jobs

If you’re interested in pushing the bounds of what’s possible with data curation, we’re also looking for talented Members of Technical Staff who have experience doing data research, translating science into products, and building scalable data products
jobs.ashbyhq.com/DatologyAI

1 year ago

We’re starting to work with early customers: if you’re an enterprise AI company interested in training multimodal and/or text models faster, better, or smaller, get in touch!

1 year ago
Preview
Train LLMs Faster, Better, and Smaller with DatologyAI’s Data Curation DatologyAI's curated data delivers substantial improvements in LLM quality, training speed, and inference efficiency over existing datasets.

If you'd prefer a quick overview, we have one of those, too!
www.datologyai.com/post/train-l...

1 year ago
Preview
Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset Our data curation pipeline to obtain substantial improvements in LLM quality, training speed, and inference efficiency.

If you want more details, here’s the full technical deep-dive!
www.datologyai.com/post/technic...

1 year ago

Overall I’m thrilled with these results. And I’m so very proud of our team for the amazing work that got us here. But the results aren’t the goal. The results are the first proof that it’s possible to build a product for foundation-scale data curation.

1 year ago
Post image

We can also use our data curation to train better, smaller models to save on inference: a 1.3B model trained on 180B tokens of our data has better 5-shot performance than every 2.7B model we trained on public datasets, on a token-matched (NOT FLOPs-matched) basis. FLOPs-matched is even better

1 year ago
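For context on the token-matched vs. FLOPs-matched distinction above, here's a rough back-of-the-envelope using the common FLOPs ≈ 6·N·D approximation for dense transformers. This is an illustrative sketch, not the exact accounting behind the post's numbers:

```python
# Back-of-the-envelope for token-matched vs. FLOPs-matched comparisons,
# using the common approximation: training FLOPs ~= 6 * N * D
# (N = parameters, D = training tokens). Illustrative only.

def train_flops(params: float, tokens: float) -> float:
    """Approximate dense-transformer training FLOPs."""
    return 6 * params * tokens

small = train_flops(1.3e9, 180e9)  # 1.3B model, 180B tokens
large = train_flops(2.7e9, 180e9)  # 2.7B model, same 180B tokens

# Token-matched: both models see 180B tokens, so the 1.3B model uses
# less than half the compute of the 2.7B model.
print(small / large)               # ~0.481

# FLOPs-matched: the token budget the 1.3B model could afford on the
# 2.7B model's compute -- roughly 374B tokens instead of 180B.
matched_tokens = large / (6 * 1.3e9)
print(matched_tokens / 1e9)        # ~373.8
```

In other words, matching the 2.7B model while token-matched means the 1.3B model is also winning at under half the training compute, which is why the FLOPs-matched comparison is even more favorable.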
Post image

Our curated data also allows us to train faster! We save 86.9% on compute (7.7x speedup) training a 2.7B model on our data to reach the same avg 5-shot accuracy as training on RPJv1 for 180B tokens, and save 70.1% on compute (3.4x speedup) to reach the same accuracy as DCLM

1 year ago
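The speedup and savings figures above are two views of the same quantity: assuming compute scales linearly with the tokens needed to reach target accuracy, a k-times speedup saves a 1 - 1/k fraction of compute. A quick sanity check (my arithmetic; the small mismatches vs. the quoted 86.9% and 70.1% come from the speedups being rounded to one decimal):

```python
# Relationship between training speedup and compute savings, assuming
# compute to reach a target accuracy scales linearly with tokens.

def savings_from_speedup(speedup: float) -> float:
    """Fraction of compute saved by a k-times speedup: 1 - 1/k."""
    return 1.0 - 1.0 / speedup

print(f"{savings_from_speedup(7.7):.1%}")  # 87.0% (post quotes 86.9%)
print(f"{savings_from_speedup(3.4):.1%}")  # 70.6% (post quotes 70.1%)
```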

This is noteworthy because DCLM and FW-Edu have pool sizes that are 10x (DCLM) and 11.5x (FW-Edu) their curated dataset sizes. Our 180B-token dataset is curated from a pool of 540B tokens, which is only 3x. So we probably have a lot of room for improvement with larger pools!

1 year ago
Post image

Interestingly, we also find that starting with a larger dataset to curate yields a much better final dataset.

1 year ago

Our improved model quality is general: it doesn't come from outsize gains on a small number of tasks. We tie or surpass even the strongest baseline, DCLM, on two thirds or more of the evaluations, and match or outperform the other baselines on nearly all evals.

1 year ago
Post image Post image

With our curated data we were able to train better models: 8.4 percentage-point (pp) mean 5-shot improvement over RPJv1, +6.1pp vs FineWeb-Edu (FW-Edu), and +4.4pp vs DCLM. This is no small feat: FineWeb, FineWeb-Edu, and DCLM are VERY high-quality, meticulously-curated datasets

1 year ago

Then we trained standard (MPT-style) transformers up to 2.7B parameters for token budgets up to 180B on our curated RPJv1 and other public pretraining corpora, and evaluated the models on a suite of 15 standard language model evals

1 year ago

Why did we choose to curate RPJv1? Because it’s well-established, contains diverse content across a number of domains, and already has a moderate degree of curation applied to it

1 year ago
Post image

Our data curation pipeline is a scalable, productionized system that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for foundation model pretraining. And with it, we developed a single recipe that we used to curate RPJv1

1 year ago
Post image

tl;dr: We transformed RedPajama-v1 (RPJv1) into a dataset that outperforms FineWeb-Edu and DCLM, two of the strongest publicly-available text pretraining datasets. Let me walk you through how we did it

1 year ago

Some of you may have seen our recent announcement of our state-of-the-art data curation pipeline and the fantastic results we got applying it to multimodal data for training CLIP models. Well, it works pretty well for text, too!
bsky.app/profile/leav...

1 year ago
Post image

Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵

1 year ago
Post image

HUGE shoutout to Haoli Yin, Amro Abbas, and (Evil) Josh Wills for leading this work. You did an amazing job! Oh, and stay tuned for more announcements from us. Our curation pipeline works for text, too 😉

1 year ago
Preview
DatologyAI Jobs

If you’re interested in pushing the bounds of what’s possible with data curation, we’re also looking for talented Members of Technical Staff who have experience doing data research, translating science into products, and building scalable data products jobs.ashbyhq.com/DatologyAI

1 year ago
Preview
Join our waitlist - DatologyAI We're still building! By submitting this form, your company will join our waitlist to get early access to Datology. When your company has been selected, we will reach out.

We’re starting to work with early customers: if you’re an enterprise AI company interested in training multimodal and/or text models faster, better, or smaller, get in touch! forms.wix.com/f/7257903640...

1 year ago
Preview
DatologyAI’s Image-Text Data Curation: Train Better, Faster, Smaller What if you could save up to 98% on compute costs? Read on to find out how DatologyAI’s deep learning data curation tools make this possible.

And if you’d prefer a quick overview, we have one of those, too: www.datologyai.com/post/datolog...

1 year ago
Preview
Technical Deep-Dive: Image-Text Data Curation at the Billion-Sample Scale Introducing DatologyAI’s state-of-the-art data curation pipeline.

If you want more details, here’s the full technical deep-dive. Watch out, it’s a big read! www.datologyai.com/post/product...

1 year ago

Overall I’m thrilled with these results. And I’m incredibly proud of our team for the science, engineering, and elbow grease that got us here. But the results aren’t the goal. The results are the first proof that it’s possible to build a product for foundation-scale data curation

1 year ago
Post image

One component of our pipeline is synthetic image recaptioning, so we compare to strong methods like MetaCLIPv2 & LaCLIP. And our retrieval-optimized data outperforms both of them on retrieval tasks, despite their models training for 2.5x more samples and using 4x the batch size

1 year ago
Post image

And our classification-optimized dataset gets better performance (absolute and normalized—see the explanation in the table) than any other DataComp Large submission.

1 year ago
Post image

But how does our curation pipeline stack up against published research? We also compared to a menagerie of other models. Compared to external ViT-B/32 models, we achieve superior retrieval performance, even against models trained for over 6x longer and on datasets ~4x larger

1 year ago