This study represents years of collaborative work by an incredible cross-functional team at Verily and our outstanding research partners at SRI International. Congrats to the team! n/3
Posts by Benjamin W. Nelson, PhD
We rigorously evaluated the Verily Numetric Watch’s ability to estimate over 12 sleep metrics against gold-standard in-lab polysomnography in a demographically diverse cohort. n/2
As the Clinical Science Lead on this study, I’m excited to share our new publication in the Journal of Sleep Research: "Performance Evaluation of the Verily Numetric Watch Sleep Suite for Digital Sleep Assessment Against In-Lab Polysomnography."
onlinelibrary.wiley.com/doi/10.1111/...
n/1
Huge thanks to my amazing co-authors: @prof-nick-allen.bsky.social, John Torous, MD MBI, Ari Winbush, Steven Siddals, Matthew Flathers! Grateful for the opportunity to lead this project as part of my Adjunct Faculty position at Harvard Medical School and BIDMC.
GPT-4o surpassed human performance for calm/neutral and surprise recognition, while Gemini surpassed human performance for surprise recognition.
We also examined model performance across actor race and sex, finding no significant biases—an encouraging result for future clinical applications. 4/n
All LLMs demonstrated substantial to almost perfect agreement with ground truth labels. Notably, GPT-4o and Gemini reached human performance levels for overall facial emotion recognition. 3/n
We evaluated the agreement and accuracy of GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet using the NimStim dataset, a benchmark of 672 facial expressions from 43 diverse human actors, resulting in 2,016 model-based emotion estimates. 2/n
New Preprint: "Evaluating the Performance of Large Language Models in Identifying Human Facial Emotions: GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet" 🧠🖼️📊
In this study, we benchmarked the ability of leading LLMs to recognize facial emotions. 1/n
osf.io/preprints/ps...