🚀 OSF turns these findings into a practical recipe for building more generalizable and deployable sleep AI.
➡️Paper: arxiv.org/abs/2603.00190
Great work led by my students @ZitaoShuai, @ZongzheX2001, David, and collaborator @WeiWang1973! 🌙
#AI #sleep #sensor #health #multimodal #LLMs
Our third finding: scaling does help in sleep — but only with the right recipe. With the right SSL design, performance keeps improving as we scale:
📦 pre-training data size
🧠 model capacity
🌐 multi-source data mixture
So the message is not just "scale more." It's "scale the right pre-training design."
But missing-channel inference is not hopeless.
We find that explicitly encouraging channel-invariant feature learning during pre-training can substantially improve both downstream performance and robustness when channels are missing.
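For intuition, here is a minimal sketch of one common way to encourage channel invariance: randomly dropping channels during pre-training so the encoder cannot rely on any single one. The function name and mask ratio are illustrative, not the exact OSF recipe.

```python
import torch

def random_channel_mask(x: torch.Tensor, drop_prob: float = 0.5) -> torch.Tensor:
    """Zero out a random subset of channels to simulate missing sensors.

    x: (batch, channels, time). At least one channel is always kept.
    """
    batch, channels, _ = x.shape
    keep = torch.rand(batch, channels, device=x.device) > drop_prob
    empty = keep.sum(dim=1) == 0   # samples where every channel was dropped
    keep[empty, 0] = True          # fall back to keeping channel 0
    return x * keep.unsqueeze(-1).float()

# During pre-training, two masked views of the same recording can then be
# aligned (e.g. with a contrastive loss), so the learned features stop
# depending on any particular channel being present at inference time.
```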
Our first finding: existing sleep FMs can break badly under missing-channel inference.
This is not a corner case. In real sleep studies, channel availability changes across cohorts, devices, and protocols.
And when that happens, performance can drop sharply.
We did not want to run a narrow comparison of one or two methods.
Instead, we benchmarked major self-supervised learning families for sleep FM pre-training (one of these objectives is sketched after the list):
🔗 contrastive
🧠 self-distillation
♻️ reconstructive
➡️ autoregressive
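To make the families concrete, here is a minimal InfoNCE-style loss, the generic form of the contrastive (🔗) objective. It is a sketch, not the exact loss of any method we benchmarked.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive (InfoNCE) loss between two views of the same sleep epochs.

    z1, z2: (batch, dim) embeddings of two augmented views. Matching rows
    are positives; every other row in the batch is a negative.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

The other families swap this objective: self-distillation matches a student to a slowly updated teacher, reconstructive models predict masked-out input, and autoregressive models predict the next segment.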
A big reason progress in sleep FMs has been hard to compare is simple: we have not had a unified and open testbed.
We address that with SleepBench — a fully open benchmark built from public resources, with
⏱️ 166,500 hours of sleep recordings
👥 21,000+ sleep studies
🌍 9 datasets
💾 ~20M 30s epochs
Meet OSF — a fully open benchmark and state-of-the-art sleep foundation models. 🌙
We study pre-training and scaling recipes that actually improve generalization in real-world settings. 🏥
🌐 Website: yang-ai-lab.github.io/osf/
💻 Code: github.com/yang-ai-lab/...
🤗 Models: hf.co/yang-ai-lab/...
🚀 HEARTS is built as a living ecosystem for the community: new data, new reasoning tasks, and new models can continue to be added over time!
➡️Paper: arxiv.org/abs/2603.06638
#AI #HealthAI #LLM #TimeSeries #Multimodal
Finally, we asked whether input format changes the story. 🖼️📝
It does affect absolute performance, but much less than one might expect. Whether time series are presented as text, images, or other input forms, the relative task difficulty stays surprisingly stable.
Models from the same family often behave very similarly. 🧬
Even when absolute performance changes with scale, the overall performance pattern tends to remain stable within a family. That suggests scaling alone is not enough to solve the core reasoning gap.
We also found a striking temporal difficulty pattern. 📉
Longer sequences and higher sampling frequencies consistently make the tasks harder. Across models, domains, and modalities, performance tends to drop as the temporal burden increases.
A second takeaway is that many models rely on shortcuts rather than deep reasoning. 🪤
For Perception and Inference tasks, models often do reasonably well when there are explicit thresholds, obvious quantitative cues, or strong domain priors to lean on.
One clear takeaway: current LLMs are still weak at genuine health time-series reasoning. 📊
We evaluated 14 state-of-the-art LLMs and found that, although many beat a naive baseline, the gains are often modest. On many tasks, they still fall clearly behind specialized time-series models.
To move beyond standard benchmark design, we built a hierarchical task taxonomy. 🏗️
Rather than only asking multiple-choice questions, HEARTS organizes 110 tasks into four cognitive levels:
🧠 Perception
🔍 Inference
✍️ Generation
⚙️ Deduction
HEARTS also pushes models across a very wide temporal range. ⏱️
Reasoning over health time series is not only about short segments. Models may need to detect fine local structure, track long-range dependencies, or connect patterns across long periods of observation.
One key goal was breadth of real-world physiological coverage. 🏥
Many existing benchmarks rely on synthetic signals or stay within a small number of domains. HEARTS instead brings together real-world datasets spanning motion, metabolism, sleep, respiration, surgery, speech, behavior, and more.
Can LLMs really reason over health time series? 📈
Introducing HEARTS ❤️— the first living benchmark built for health time-series reasoning.
🌐Website: yang-ai-lab.github.io/HEARTS
🕵️Code: github.com/yang-ai-lab/...
🤗Dataset: hf.co/datasets/yan...
🏆Leaderboard: yang-ai-lab.github.io/HEARTS/leade...
SleepLM points to a new direction for sleep AI🚀. Read all about it!
➡️Paper: arxiv.org/abs/2602.23605
Great work led by my students @ZongzheX2001, @ZitaoShuai, Eideen, and amazing collaborators @AysolaRavi and Rajesh!
More to come🌙
#AI #sleep #sensor #health #multimodal #LLMs
Finally, we wanted this to connect to real clinical workflows. 🏥
SleepLM can combine its predictions across an entire night and produce useful full-night measures, while staying stable over long sequences. This matters because real sleep analysis is about understanding the whole night reliably.
We also wanted the model to be more controllable. 🎛️
Instead of always generating one broad description, SleepLM can focus on a specific part of the physiology when asked. For example, it can emphasize 🧠brain activity, 🫁breathing, ❤️heart-related signals, or 💪body movement.
SleepLM also learns when something happens, not just whether it happened. ⏱️
Our results show that the model is sensitive to timing. The strongest match appears when the text and the signal line up at the correct moment, and it weakens as the text is shifted away from that moment.
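A minimal sketch of how that timing sensitivity can be probed, assuming a text embedding and embeddings of consecutive signal windows; the names here are illustrative, not SleepLM's actual API.

```python
import torch
import torch.nn.functional as F

def similarity_over_time(z_text: torch.Tensor, z_windows: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one event description and each signal window.

    z_text:    (dim,) embedding of a text description of an event.
    z_windows: (num_windows, dim) embeddings of consecutive signal windows.
    A timing-sensitive model should peak at the window containing the event
    and decay as the offset from that window grows.
    """
    z_text = F.normalize(z_text, dim=-1)
    z_windows = F.normalize(z_windows, dim=-1)
    return z_windows @ z_text   # (num_windows,) similarity curve over time
```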
SleepLM learns a strong link between language and physiology. 🔄
When we ask it to match text to signals, or signals to text, it performs much better than general-purpose baselines. It not only reads sleep signals well — but also learns a shared space where signal and language line up closely.
One clear takeaway: general LLMs are not enough. 📊
Even strong LLMs 🤖 are not built for dense physiology. They often work with summaries, but struggle when the task depends on subtle waveform structure.
🛌 SleepLM is designed for that setting, and it shows clear gains on zero-shot sleep tasks.
At the core is ReCoCa 🏗️, our unified training framework.
It combines three training signals in one objective (a toy version is sketched after the list):
🔗 contrastive alignment
✍️ caption generation
♻️ signal reconstruction
The result is a representation that stays both language-aware and signal-grounded.
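Here is a toy version of what "three signals in one objective" can look like; the weights, temperature, and InfoNCE form are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def recoca_style_loss(z_signal, z_text, caption_nll, recon_mse,
                      w_con=1.0, w_cap=1.0, w_rec=1.0, temperature=0.1):
    """Toy version of a ReCoCa-style combined objective.

    z_signal, z_text: (batch, dim) embeddings from the signal and text towers.
    caption_nll:      scalar NLL of the generated caption (✍️).
    recon_mse:        scalar MSE of the reconstructed signal (♻️).
    """
    z_s = F.normalize(z_signal, dim=-1)
    z_t = F.normalize(z_text, dim=-1)
    logits = z_s @ z_t.T / temperature
    targets = torch.arange(z_s.size(0), device=z_s.device)
    contrastive = F.cross_entropy(logits, targets)   # 🔗 contrastive alignment
    return w_con * contrastive + w_cap * caption_nll + w_rec * recon_mse
```

Sharing one encoder across all three terms is what keeps the representation both language-aware and signal-grounded.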
Traditional sleep scoring compresses rich signals into a small set of labels. 🧩
We built a multilevel strategy to turn sleep into layered text descriptions. This gives a much richer view of sleep, enabling us to curate the first sleep-language dataset:
🗂️100K+ hours of data from >10,000 people! 🚀
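As a toy illustration of what "layered" descriptions mean here (the field names and wording are hypothetical, not the dataset's actual schema):

```python
def layered_caption(epoch: dict) -> str:
    """Build a multilevel text description of one 30s sleep epoch."""
    label_level = f"Stage: {epoch['stage']}."                         # coarse label
    event_level = f"Events: {', '.join(epoch['events']) or 'none'}."  # discrete events
    physio_level = (f"EEG shows {epoch['eeg_summary']}; "
                    f"breathing is {epoch['resp_summary']}.")         # signal detail
    return " ".join([label_level, event_level, physio_level])

example = {
    "stage": "N2",
    "events": ["sleep spindle", "K-complex"],
    "eeg_summary": "prominent spindle activity",
    "resp_summary": "regular",
}
print(layered_caption(example))
```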
🌙 What if your sleep signals could speak?
Introducing SleepLM — sleep-language foundation models that turn raw sleep signals into something we can describe, query, and localize with language. 🗣️
🌐Website: yang-ai-lab.github.io/SleepLM
🕵️Code: github.com/yang-ai-lab/...
🤗Models: hf.co/yang-ai-lab/...
🧵👇
📢 My lab at UCLA is hiring PhD students and postdocs!
Please apply to UCLA CS or CompMed and mention my name if you are interested in foundation models and (Gen)AI for health / medicine / science.
More info: cs.ucla.edu/~yuzhe
Read all about it!
➡️Paper: arxiv.org/abs/2506.09108
Huge team effort! Kudos to my intern Evelyn, amazing team @kmr_ayush, @aametwally1, @Orson_Xu, @timalthoff, @pushmeet, @cecim, @xliucs, @danmcduff, and other amazing co-authors!
#AI #wearable #sensor #health #multimodal
(8/8)
Beyond its discriminative power, SensorLM showcases compelling generative capabilities. It can produce hierarchical, realistic captions from wearable data alone, offering more coherent and accurate descriptions than LLMs like Gemini 2.0 Flash. ✍️✨
(7/8)
SensorLM also demonstrates intriguing capabilities, including clear scaling behavior across data size, model size, and compute. 📈💡
(6/8)