
Posts by Atharva Kulkarni

🙌🥳Had great fun doing this during my summer internship with folks from Apple (Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Hong Yu) and USC (@swabhs.bsky.social)

Looking forward to the feedback! 🙂
#LLMs #NLProc

(7/n)

11 months ago

🚫Bottom line: There’s no single metric that captures hallucinations reliably across the board.

🎯Our work highlights the need for robust, context-aware, and generalizable hallucination detection tools as a prerequisite to meaningful mitigation.

(6/n)

11 months ago

✅What works better?
Unsurprisingly, GPT-4-based evaluators show the highest agreement with human judgments across settings 🌟
Using ensembles of multiple metrics is a promising avenue⭐️
Instruction tuning & mode-seeking decoding help reduce hallucinations📈
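A minimal sketch of the ensemble idea, assuming each metric yields a hallucination score in [0, 1]; the metric names and weights below are hypothetical stand-ins, not the paper's actual metrics:

```python
# Hypothetical sketch: combining several hallucination-metric scores
# into one ensemble signal. Individual metrics are stand-ins.
from statistics import mean

def ensemble_score(scores, weights=None):
    """Weighted mean of per-metric hallucination scores in [0, 1]."""
    if weights is None:
        return mean(scores.values())
    total = sum(weights.get(name, 0.0) for name in scores)
    return sum(s * weights.get(name, 0.0) for name, s in scores.items()) / total

# Example: three (made-up) metric outputs for one model response.
scores = {"nli_entailment": 0.82, "semantic_sim": 0.74, "llm_judge": 0.90}
print(ensemble_score(scores))  # unweighted mean
print(ensemble_score(scores, {"nli_entailment": 1, "semantic_sim": 1, "llm_judge": 2}))
```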

(5/n)

11 months ago

Our findings highlight:
⚠️Many existing metrics show poor alignment with human judgments
⚠️The inter-metric correlation is also weak
⚠️They show limited generalization across datasets, tasks, and models
⚠️They do not consistently improve with larger models
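For concreteness, one common way to quantify metric-human alignment and inter-metric correlation is rank correlation over per-example scores; the numbers below are made up for illustration, and the paper's exact protocol may differ:

```python
# Sketch: Spearman rank correlation between metric scores, and between
# each metric and binary human hallucination labels (illustrative data).
from scipy.stats import spearmanr

human_labels = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = hallucinated, per annotators
metric_a     = [0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3]
metric_b     = [0.5, 0.6, 0.4, 0.9, 0.5, 0.7, 0.3, 0.6]

rho_a, _  = spearmanr(metric_a, human_labels)   # metric-human alignment
rho_b, _  = spearmanr(metric_b, human_labels)
rho_ab, _ = spearmanr(metric_a, metric_b)       # inter-metric correlation
print(f"A vs human: {rho_a:.2f}, B vs human: {rho_b:.2f}, A vs B: {rho_ab:.2f}")
```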

(4/n)

11 months ago

🧐Focusing on faithfulness and factuality errors in QA and dialogue tasks, we study diverse metrics spanning:
1. Syntactic and semantic similarity
2. Natural language inference
3. Multi-step question answering pipelines
4. Custom-trained models
5. SOTA LLMs as judges
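As a rough illustration of categories 1 and 2, here is a sketch using off-the-shelf models; these particular checkpoints are assumptions for the example, not necessarily the ones evaluated in the paper:

```python
# Sketch of two metric families: embedding similarity and NLI-based
# entailment between a source passage and a model answer.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSequenceClassification, AutoTokenizer

source = "The Eiffel Tower was completed in 1889."
answer = "The Eiffel Tower opened in 1889."

# (1) Semantic similarity: cosine similarity of sentence embeddings.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
sim = util.cos_sim(embedder.encode(source), embedder.encode(answer)).item()

# (2) NLI: probability that the source entails the answer; a low
# entailment probability flags a potential faithfulness hallucination.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
with torch.no_grad():
    logits = nli(**tok(source, answer, return_tensors="pt")).logits
probs = torch.softmax(logits, dim=-1)[0]  # [contradiction, neutral, entailment]
print(f"semantic similarity: {sim:.2f}, entailment prob: {probs[2]:.2f}")
```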

(3/n)

11 months ago

🤔Despite a surge in research on hallucination mitigation, few ask the critical questions:
1. Are the metrics capturing hallucinations effectively?
2. Do they align with each other and the human notion of hallucination?
3. Do they generalize across different settings?

(2/n)

11 months ago

Hallucinations in LLMs are real—and so are the problems with how we measure them 📉

Our latest work questions the generalizability of hallucination detection metrics across tasks, datasets, model sizes, training methods, and decoding strategies 💥

arxiv.org/abs/2504.18114

(1/n)

11 months ago

Reasoning about the "why" behind user behavior can improve LLM personas! ✨🧠📈

📝Excited to share our new work: Improving LLM Personas via Rationalization with Psychological Scaffolds

🔗 arxiv.org/abs/2504.17993
🧵 (1/n)
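Purely as an illustration of the "why"-first idea, here is a hypothetical prompt scaffold; the paper's actual psychological scaffolds and prompts may differ:

```python
# Hypothetical sketch of a "rationalize, then role-play" persona prompt,
# in the spirit of the work above. The prompt wording is an assumption.
def persona_prompt(behaviors):
    history = "\n".join(f"- {b}" for b in behaviors)
    return (
        "Observed user behavior:\n"
        f"{history}\n\n"
        "Step 1: Explain *why* the user might act this way "
        "(values, goals, personality).\n"
        "Step 2: Using that rationale as the persona, answer the "
        "next question as this user would."
    )

print(persona_prompt(["Rates sci-fi movies highly", "Skips romance titles"]))
```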

11 months ago

[Link preview: "NLP grad students" starter pack]

There are too many starter packs.
👇 Here's a list, mostly for NLP, ML, and related areas.

1 year ago

#socalnlp is the biggest it's ever been in 2024! We have 3 poster sessions, up from 2! How many years until it's a two-day event?? 🤯

1 year ago

Started a SoCal AI/ML/NLP researchers starter pack! It's a bit sparse right now, and perhaps more NLP-heavy, but hey, nominate yourself and others! go.bsky.app/6QckPj9

1 year ago

🙋🏻‍♂️🙋🏻‍♂️

1 year ago

Hey John, thanks for starting this pack! Could you please add me as well?

1 year ago

Can you please add me to the pack! Looking forward to interacting with everyone!

1 year ago

Great initiative!! Can you please add me! Looking forward to interacting with everyone!!💯

1 year ago