🚀 OSF turns these findings into a practical recipe for building more generalizable and deployable sleep AI.
➡️Paper: arxiv.org/abs/2603.00190
Great work led by my students @ZitaoShuai, @ZongzheX2001, David, and collaborator @WeiWang1973! 🌙
#AI #sleep #sensor #health #multimodal #LLMs
Our third finding: scaling does help in sleep — but only with the right recipe. With the right SSL design, performance keeps improving as we scale:
📦 pre-training data size
🧠 model capacity
🌐 multi-source data mixture
So the message is not just "scale more." It's "scale the right pre-training design."
But missing-channel inference is not hopeless.
We find that explicitly encouraging channel-invariant feature learning during pre-training can substantially improve both downstream performance and robustness when channels are missing.
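For intuition, here is a minimal sketch of one common way to encourage channel invariance: randomly dropping channels during pre-training so the encoder cannot rely on any single one. The function name and mask ratio are illustrative, not the exact OSF recipe.

```python
import torch

def random_channel_mask(x: torch.Tensor, drop_prob: float = 0.5) -> torch.Tensor:
    """Zero out a random subset of channels to simulate missing sensors.

    x: (batch, channels, time). At least one channel is always kept.
    """
    batch, channels, _ = x.shape
    keep = torch.rand(batch, channels, device=x.device) > drop_prob
    empty = keep.sum(dim=1) == 0   # samples where every channel was dropped
    keep[empty, 0] = True          # fall back to keeping channel 0
    return x * keep.unsqueeze(-1).float()

# During pre-training, two masked views of the same recording can then be
# aligned (e.g. with a contrastive loss), so the learned features stop
# depending on any particular channel being present at inference time.
```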
Our first finding: existing sleep FMs can break badly under missing-channel inference.
This is not a corner case. In real sleep studies, channel availability changes across cohorts, devices, and protocols.
And when that happens, performance can drop sharply.
We did not want to run a narrow comparison of one or two methods.
Instead, we benchmarked major self-supervised learning families for sleep FM pre-training (one of these objectives is sketched after the list):
🔗 contrastive
🧠 self-distillation
♻️ reconstructive
➡️ autoregressive
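To make the families concrete, here is a minimal InfoNCE-style loss, the generic form of the contrastive (🔗) objective. It is a sketch, not the exact loss of any method we benchmarked.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive (InfoNCE) loss between two views of the same sleep epochs.

    z1, z2: (batch, dim) embeddings of two augmented views. Matching rows
    are positives; every other row in the batch is a negative.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

The other families swap this objective: self-distillation matches a student to a slowly updated teacher, reconstructive models predict masked-out input, and autoregressive models predict the next segment.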
A big reason progress in sleep FMs has been hard to compare is simple: we have not had a unified and open testbed.
We address that with SleepBench — a fully open benchmark built from public resources, with
⏱️ 166,500 hours of sleep recordings
👥 21,000+ sleep studies
🌍 9 datasets
💾 ~20M 30s epochs
Meet OSF — a fully open benchmark and state-of-the-art sleep foundation models. 🌙
We study pre-training and scaling recipes that actually improve generalization in real-world settings. 🏥
🌐 Website: yang-ai-lab.github.io/osf/
💻 Code: github.com/yang-ai-lab/...
🤗 Models: hf.co/yang-ai-lab/...
🚀 HEARTS is built as a living ecosystem for the community: new data, new reasoning tasks, and new models can continue to be added over time!
➡️Paper: arxiv.org/abs/2603.06638
#AI #HealthAI #LLM #TimeSeries #Multimodal
Finally, we asked whether input format changes the story. 🖼️📝
It does affect absolute performance, but much less than one might expect. Whether time series are presented as text, images, or other input forms, the relative task difficulty stays surprisingly stable.
Models from the same family often behave very similarly. 🧬
Even when absolute performance changes with scale, the overall performance pattern tends to remain stable within a family. That suggests scaling alone is not enough to solve the core reasoning gap.
We also found a striking temporal difficulty pattern. 📉
Longer sequences and higher sampling frequencies consistently make the tasks harder. Across models, domains, and modalities, performance tends to drop as the temporal burden increases.
A second takeaway is that many models rely on shortcuts rather than deep reasoning. 🪤
For Perception and Inference tasks, models often do reasonably well when there are explicit thresholds, obvious quantitative cues, or strong domain priors to lean on.
One clear takeaway: current LLMs are still weak at genuine health time-series reasoning. 📊
We evaluated 14 state-of-the-art LLMs and found that, although many beat a naive baseline, the gains are often modest. On many tasks, they still fall clearly behind specialized time-series models.
To move beyond standard benchmark design, we built a hierarchical task taxonomy. 🏗️
Rather than only asking multiple-choice questions, HEARTS organizes 110 tasks into four cognitive levels:
🧠 Perception
🔍 Inference
✍️ Generation
⚙️ Deduction
HEARTS also pushes models across a very wide temporal range. ⏱️
Reasoning over health time series is not only about short segments. Models may need to detect fine local structure, track long-range dependencies, or connect patterns across long periods of observation.
One key goal was breadth of real-world physiological coverage. 🏥
Many existing benchmarks rely on synthetic signals or stay within a small number of domains. HEARTS instead brings together real-world datasets spanning motion, metabolism, sleep, respiration, surgery, speech, behavior, and more.
Can LLMs really reason over health time series? 📈
Introducing HEARTS ❤️— the first living benchmark built for health time-series reasoning.
🌐Website: yang-ai-lab.github.io/HEARTS
🕵️Code: github.com/yang-ai-lab/...
🤗Dataset: hf.co/datasets/yan...
🏆Leaderboard: yang-ai-lab.github.io/HEARTS/leade...
SleepLM points to a new direction for sleep AI🚀. Read all about it!
➡️Paper: arxiv.org/abs/2602.23605
Great work led by my students @ZongzheX2001, @ZitaoShuai, Eideen, and amazing collaborators @AysolaRavi and Rajesh!
More to come🌙
#AI #sleep #sensor #health #multimodal #LLMs
Finally, we wanted this to connect to real clinical workflows. 🏥
SleepLM can combine its predictions across an entire night and produce useful full-night measures, while staying stable over long sequences. This matters because real sleep analysis is about understanding the whole night reliably.
We also wanted the model to be more controllable. 🎛️
Instead of always generating one broad description, SleepLM can focus on a specific part of the physiology when asked. For example, it can emphasize 🧠brain activity, 🫁breathing, ❤️heart-related signals, or 💪body movement.
SleepLM also learns when something happens, not just whether it happened. ⏱️
Our results show that the model is sensitive to timing. The strongest match appears when the text and the signal line up at the correct moment, and it weakens as the text is shifted away from that moment.
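A minimal sketch of how that timing sensitivity can be probed, assuming a text embedding and embeddings of consecutive signal windows; the names here are illustrative, not SleepLM's actual API.

```python
import torch
import torch.nn.functional as F

def similarity_over_time(z_text: torch.Tensor, z_windows: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one event description and each signal window.

    z_text:    (dim,) embedding of a text description of an event.
    z_windows: (num_windows, dim) embeddings of consecutive signal windows.
    A timing-sensitive model should peak at the window containing the event
    and decay as the offset from that window grows.
    """
    z_text = F.normalize(z_text, dim=-1)
    z_windows = F.normalize(z_windows, dim=-1)
    return z_windows @ z_text   # (num_windows,) similarity curve over time
```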
SleepLM learns a strong link between language and physiology. 🔄
When we ask it to match text to signals, or signals to text, it performs much better than general-purpose baselines. It not only reads sleep signals well — but also learns a shared space where signal and language line up closely.
One clear takeaway: general LLMs are not enough. 📊
Even strong LLMs 🤖 are not built for dense physiology. They often work with summaries, but struggle when the task depends on subtle waveform structure.
🛌 SleepLM is designed for that setting, and it shows clear gains on zero-shot sleep tasks.
At the core is ReCoCa 🏗️, our unified training framework.
It combines three training signals in one objective (a toy version is sketched after the list):
🔗 contrastive alignment
✍️ caption generation
♻️ signal reconstruction
The result is a representation that stays both language-aware and signal-grounded.
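Here is a toy version of what "three signals in one objective" can look like; the weights, temperature, and InfoNCE form are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def recoca_style_loss(z_signal, z_text, caption_nll, recon_mse,
                      w_con=1.0, w_cap=1.0, w_rec=1.0, temperature=0.1):
    """Toy version of a ReCoCa-style combined objective.

    z_signal, z_text: (batch, dim) embeddings from the signal and text towers.
    caption_nll:      scalar NLL of the generated caption (✍️).
    recon_mse:        scalar MSE of the reconstructed signal (♻️).
    """
    z_s = F.normalize(z_signal, dim=-1)
    z_t = F.normalize(z_text, dim=-1)
    logits = z_s @ z_t.T / temperature
    targets = torch.arange(z_s.size(0), device=z_s.device)
    contrastive = F.cross_entropy(logits, targets)   # 🔗 contrastive alignment
    return w_con * contrastive + w_cap * caption_nll + w_rec * recon_mse
```

Sharing one encoder across all three terms is what keeps the representation both language-aware and signal-grounded.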
Traditional sleep scoring compresses rich signals into a small set of labels. 🧩
We built a multilevel strategy to turn sleep into layered text descriptions. This gives a much richer view of sleep, enabling us to curate the first sleep-language dataset:
🗂️100K+ hours of data from >10,000 people! 🚀
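As a toy illustration of what "layered" descriptions mean here (the field names and wording are hypothetical, not the dataset's actual schema):

```python
def layered_caption(epoch: dict) -> str:
    """Build a multilevel text description of one 30s sleep epoch."""
    label_level = f"Stage: {epoch['stage']}."                         # coarse label
    event_level = f"Events: {', '.join(epoch['events']) or 'none'}."  # discrete events
    physio_level = (f"EEG shows {epoch['eeg_summary']}; "
                    f"breathing is {epoch['resp_summary']}.")         # signal detail
    return " ".join([label_level, event_level, physio_level])

example = {
    "stage": "N2",
    "events": ["sleep spindle", "K-complex"],
    "eeg_summary": "prominent spindle activity",
    "resp_summary": "regular",
}
print(layered_caption(example))
```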
🌙 What if your sleep signals could speak?
Introducing SleepLM — sleep-language foundation models that turn raw sleep signals into something we can describe, query, and localize with language. 🗣️
🌐Website: yang-ai-lab.github.io/SleepLM
🕵️Code: github.com/yang-ai-lab/...
🤗Models: hf.co/yang-ai-lab/...
🧵👇
📢 My lab at UCLA is hiring PhD students and postdocs!
Please apply to UCLA CS or CompMed and mention my name if you are interested in foundation models and (Gen)AI for health / medicine / science.
More info: cs.ucla.edu/~yuzhe
Read all about it!
➡️Paper: arxiv.org/abs/2506.09108
Huge team effort! Kudos to my intern Evelyn, amazing team @kmr_ayush, @aametwally1, @Orson_Xu, @timalthoff, @pushmeet, @cecim, @xliucs, @danmcduff, and other amazing co-authors!
#AI #wearable #sensor #health #multimodal
(8/8)
Beyond its discriminative power, SensorLM showcases compelling generative capabilities. It can produce hierarchical, realistic captions from wearable data alone, offering more coherent and accurate descriptions than LLMs like Gemini 2.0 Flash. ✍️✨
(7/8)
SensorLM also demonstrates intriguing capabilities, including clear scaling behavior across data size, model size, and compute. 📈💡
(6/8)