Co-led with @pkargupta.bsky.social ✨ We learned so much and couldn't have done it w/o our amazing collaborators and mentors: Ken Wang, Jinu Lee, @shan23chen.bsky.social, @orevaahia.bsky.social, Dean Light, Tom Griffiths, @maxkw.bsky.social, Jiawei Han, @asli-celikyilmaz.bsky.social, Yulia Tsvetkov🩵
More fun details, especially the extensive cognitive science background💫, in our 24-page paper!
📄Paper: arxiv.org/abs/2511.16660
💻Code: github.com/pkargupta/co...
🤗Data: huggingface.co/collections/...
🌐 Blogpost: tinyurl.com/cognitive-fo...
What our Cognitive Foundations framework enables:
🔍 Systematic diagnosis of reasoning failures
🎯 Predicting which training yields which capabilities
🧪 Testing cognitive theories at scale
🌉 Shared vocabulary bridging cognition & AI research
More on opportunities & challenges in📄
Test-time reasoning guidance: up to 66.7% improvement 💡
We scaffold cognitive structures from successful traces to guide reasoning.
Major gains on ill-structured problems🌟
Models possess latent capabilities—they just don't deploy them adaptively without explicit guidance.
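Roughly, the guidance amounts to this kind of scaffold (a minimal sketch, assuming a simple prompt-prepend; `query_model` and the scaffold wording are placeholders, not the paper's exact implementation):

```python
# Minimal sketch of test-time scaffolding (illustrative): prepend a
# cognitive structure mined from successful traces to the problem.
SCAFFOLD = (
    "Before answering, follow this reasoning structure:\n"
    "1. Selective attention: restate the constraints that matter.\n"
    "2. Knowledge alignment: connect them to what you know.\n"
    "3. Forward chaining: derive the answer step by step."
)

def guided_answer(query_model, problem):
    """Guide the model to deploy capabilities it has but doesn't use."""
    return query_model(f"{SCAFFOLD}\n\nProblem: {problem}")
```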
🧑🏻Humans reason differently‼️ More abstraction (54% vs 36%), more self-awareness (49% vs 19%), more conceptual processing. Less surface enumeration and rigid sequential chaining.
Even with correct answers—underlying mechanisms diverge fundamentally.
We analyzed 1,598 LLM reasoning papers:
Research concentrates on easily quantifiable behaviors: sequential organization (55%), decomposition (60%)
It neglects the meta-cognitive controls (8-16%) and alternative representations (10-27%) that correlate with success⚠️
Structure matters as much as presence📐
We introduce a method to extract reasoning structure from traces:
Successful: selective attention → knowledge alignment → forward chaining
Common: skip straight to forward chaining
LLMs prematurely seek solutions before understanding constraints‼️
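For intuition, a toy sketch of the contrast (our simplification; the paper extracts structure from annotated spans): count element transitions separately for successful and failed traces.

```python
from collections import Counter

def transition_counts(traces):
    """traces: iterable of (elements, success), elements in order."""
    ok, bad = Counter(), Counter()
    for elements, success in traces:
        for a, b in zip(elements, elements[1:]):
            (ok if success else bad)[(a, b)] += 1
    return ok, bad

traces = [
    (["selective_attention", "knowledge_alignment", "forward_chaining"], True),
    (["forward_chaining", "forward_chaining"], False),  # skips straight ahead
]
ok, bad = transition_counts(traces)
print(ok.most_common(3), bad.most_common(3))
```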
Model-specific patterns reveal training impact:
Olmo 3 exhibits more diverse cognitive elements (49%): its developers explicitly included meta-reasoning data during midtraining.
DeepHermes-3: only 12% avg presence.
Training methodology shapes cognitive profiles dramatically.
Meta-cognitive deficit is severe:
🤔Self-awareness: 16% in research design, 19% in LLM traces vs 49% in humans
🧐Self-evaluation on non-verifiable problems collapses (53.5% presence, 0.031 correlation)
Models can't self-assess without ground truth.
The presence-effectiveness paradox:
Logical coherence: 91% of traces, 0.091 corr. w/ success
Knowledge alignment: 20% of traces, 0.234 corr. w/ success (high)
Models frequently attempt core elements but fail to execute. Having the capability ≠ deploying it successfully😬
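The two numbers per element boil down to a presence rate and a presence-success correlation; a toy computation under that assumed formulation:

```python
import numpy as np

def presence_and_correlation(has_element, success):
    presence = has_element.mean()                      # share of traces
    corr = np.corrcoef(has_element, success)[0, 1]     # point-biserial
    return presence, corr

has_element = np.array([1, 1, 1, 0, 1, 1])  # e.g. logical coherence
success     = np.array([1, 0, 1, 1, 0, 1])
print(presence_and_correlation(has_element, success))
```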
Models deploy strategies inversely to what works 🚨
As problems become ill-structured, models narrow their repertoire, but successful traces show the need for greater diversity (successful = high PPMI in fig).
Sequential organization dominates. Meta-cognition disappears in LLMs.
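PPMI as we read the figure (assumed formulation; exact details are in the paper): how much more often an element co-occurs with success than chance predicts, floored at zero.

```python
import math

def ppmi(n_elem_and_success, n_elem, n_success, n_total):
    p_joint = n_elem_and_success / n_total
    p_elem, p_succ = n_elem / n_total, n_success / n_total
    if p_joint == 0:
        return 0.0
    return max(0.0, math.log2(p_joint / (p_elem * p_succ)))

print(ppmi(n_elem_and_success=40, n_elem=50, n_success=100, n_total=200))
```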
We analyze 192K reasoning traces from 18 LLMs (text, image, video) + 54 human think-aloud traces
We introduce a framework for fine-grained span-level cognitive evaluation: WHICH elements appear, WHERE, and HOW they're sequenced.
First analysis of its kind at this scale📊
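Concretely, a span-level annotation might look like this (hypothetical record format, not the released schema):

```python
annotation = {
    "trace_id": "trace_0421",      # placeholder ID
    "element": "self_evaluation",  # WHICH of the 28 elements
    "char_span": (512, 634),       # WHERE in the trace
    "order": 3,                    # HOW: position among the trace's spans
}
```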
Our taxonomy bridges cognitive science → LLM eval:
28 elements across 4 dimensions—reasoning invariants (compositionality, logical coherence), meta-cognitive controls (self-awareness), representations (hierarchical, causal), and operations (backtracking, verification)
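As a lookup table, the four dimensions look roughly like this (partial; only the elements named in this thread are listed, the paper defines all 28):

```python
TAXONOMY = {
    "reasoning_invariants": ["compositionality", "logical_coherence"],
    "meta_cognitive_controls": ["self_awareness", "self_evaluation"],
    "representations": ["hierarchical", "causal"],
    "operations": ["backtracking", "verification", "forward_chaining"],
}
```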
LLMs solve hard problems but fail on easy variants, and they exhibit reasoning patterns unlike humans'.
The issue: reasoning evaluation judges outcomes w/o understanding the cognitive processes that produce them. We can't diagnose failures or predict how training produces capabilities🚨
🤔💭What even is reasoning? It's time to answer the hard questions!
We built the first unified taxonomy of 28 cognitive elements underlying reasoning
Spoiler—LLMs commonly employ sequential reasoning, rarely self-awareness, and often fail to use correct reasoning structures🧠
Because Olmo 3 is fully open, we can decontaminate our evals against our pretraining and midtraining data. @stellali.bsky.social proves this with spurious rewards: RL trained on a random reward signal can't improve on our evals, unlike in some previous setups
Day 1 (Tue Oct 7) 4:30-6:30pm, Poster Session 2
Poster #77: ALFA: Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning; led by
@stellali.bsky.social & @jiminmun.bsky.social
This project was done as part of the Meta FAIR AIM mentorship program. Special thanks to my amazing collaborators and awesome mentors @melaniesclar.bsky.social @jcqln_h @hunterjlang @AnsongNi @andrew_e_cohen @jacoby_xu @chan_young_park @tsvetshop.bsky.social @asli-celikyilmaz.bsky.social 🫶🏻💙
✨PrefPalette🎨 bridges cognitive science, social psychology, and AI for explainable preference modeling✨
📖Paper: arxiv.org/abs/2507.13541
💻Code: github.com/stellalisy/P...
Join us in shaping interpretable AI that you can trust and control🚀Feedback welcome!
#AI #Transparency
🌍Bonus: PrefPalette🎨 is a computational social science goldmine!
📊 Quantify community values at scale
📈 Track how norms evolve over time
🔍 Understand group psychology
📋 Move beyond surveys to revealed preferences
💡Potential real-world applications:
🛡️Smart content moderation—explains why content is flagged/decisions are made
🎯Interpretable LM alignment—revealing prominent attributes
⚙️Controllable personalization—giving user agency to personalize select attributes
🔍More importantly‼️we can see WHY preferences differ:
r/AskHistorians:📚values verbosity
r/RoastMe:💥values directness
r/confession:❤️values empathy
We visualize each group’s unique preference decisions—no more one-size-fits-all. Understand your audience at a glance🏷️
🏆Results across 45 Reddit communities:
📈Performance boost: +46.6% vs GPT-4o
💪Outperforms other training-based baselines w/ statistical significance
🕰️Robust to temporal shifts: trained pref models can be used out of the box!
⚙️How it works (pt.2)
1: 🎛️Train compact, efficient detectors for every attribute
2: 🎯Learn community-specific attribute weights during preference training
3: 🔧Add attribute embeddings to preference model for accurate & explainable predictions
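In code, the attribute-mediated scoring idea reduces to something like this (heavily simplified sketch; the real PrefPalette learns attribute embeddings inside a trained preference model, and the toy detectors/weights below are made up):

```python
import numpy as np

def preference_score(text, detectors, weights):
    """Community-specific weighted sum of attribute detector scores."""
    attrs = np.array([d(text) for d in detectors])  # e.g. humor, verbosity
    return float(weights @ attrs)

def explain(text, detectors, weights, names):
    """Rank attributes by their contribution to the score: the WHY."""
    attrs = np.array([d(text) for d in detectors])
    return sorted(zip(names, weights * attrs), key=lambda p: -abs(p[1]))

detectors = [lambda t: t.count("!") / 10, lambda t: len(t) / 500]  # toys
weights = np.array([0.2, 0.8])  # e.g. a verbosity-loving community
print(explain("A long, carefully sourced answer...", detectors, weights,
              ["directness", "verbosity"]))
```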
⚙️How it works (prep stage)
📜Define 19 sociolinguistics & cultural attributes from literature
🏭Novel preference data generation pipeline to isolate attributes
Our pipeline generates pairwise data on *any* decomposed dimension, w/ applications beyond preference modeling
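A hedged sketch of the isolation idea (`llm` and the prompt wording are placeholders, not our released pipeline): rewrite a base response so exactly one attribute changes, yielding a pair that differs only on that dimension.

```python
def make_pair(llm, base_response, attribute):
    """Generate a pair of responses differing only in one attribute."""
    prompt = (
        f"Rewrite the response below to increase its {attribute}, "
        "changing nothing else about its content.\n\n"
        f"Response: {base_response}"
    )
    return {"high": llm(prompt), "low": base_response,
            "attribute": attribute}
```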
Meet PrefPalette🎨! Our approach:
🔍⚖️models preferences w/ 19 attribute detectors and dynamic, context-aware weights
🕶️👍uses unobtrusive signals from Reddit to avoid response bias
🧠mirrors attribute-mediated human judgment—so you know not just what it predicts, but *why*🧐
🔬Cognitive science reveals how humans break choices into attributes, e.g.:
😂 Humor
❤️ Empathy
💬 Conformity
...then weight them based on context (e.g. comedy vs counseling).
These traits shape every decision, from product picks to conversation tone. Your mind is a colorful palette🎨
🚨Current preference models only output a reward/score:
❌No transparency in decision-making
❌Personalization breaks easily, one-size-fits-all scores
❌Use explicit annotations (response bias)
They can’t adapt to individual tastes, can’t debug errors, and fail to build trust🙅
WHY do you prefer something over another?
Reward models treat preference as a black-box😶🌫️but human brains🧠decompose decisions into hidden attributes
We built the first system to mirror how people really make decisions in our recent COLM paper🎨PrefPalette✨
Why it matters👉🏻🧵
Want to quickly sample high-quality images from diffusion models, but can’t afford the time or compute to distill them? Introducing S4S, or Solving for the Solver, which learns the coefficients and discretization steps for a DM solver to improve few-NFE generation.
Thread 👇 1/
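A rough sketch of the idea as described above (our reading, not the actual S4S code): make the solver's mixing coefficients and timestep schedule learnable, then optimize them, e.g. so few-step samples match a many-step teacher. `model(x, sigma)` is a placeholder eps-predictor.

```python
import torch

class LearnableSolver(torch.nn.Module):
    def __init__(self, num_steps):
        super().__init__()
        # Per-step mixing coefficients over all eps predictions so far,
        # initialized to a plain Euler (identity) mix.
        self.coeffs = torch.nn.Parameter(torch.eye(num_steps))
        # Free (log-)noise levels, i.e. learnable discretization steps.
        self.log_sigmas = torch.nn.Parameter(
            torch.linspace(2.0, -2.0, num_steps + 1))

    def sample(self, model, x):
        sigmas = self.log_sigmas.exp()
        history = []
        for i in range(len(sigmas) - 1):
            history.append(model(x, sigmas[i]))
            # Learned mix of current and past predictions (multistep-style).
            mix = sum(self.coeffs[i, j] * e for j, e in enumerate(history))
            x = x + (sigmas[i + 1] - sigmas[i]) * mix  # Euler-style update
        return x
```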