Maharshi Gor (@maharshigor) Bsky

Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a da...

📝 Full paper link: arxiv.org/abs/2406.16342

TL;DR: We introduce AdvScore, a human-grounded metric to measure how "adversarial" a dataset really is—by comparing model vs. human performance. It helps build better, lasting benchmarks like AdvQA (proposed) that evolve with AI progress.

11 months ago

🏆ADVSCORE won an Outstanding Paper Award at #NAACL2025

🚨 Don't miss out on our poster presentation *today at 2 pm* by Yoo Yeon (first author).

📍Poster Session 5 - HC: Human-centered NLP

💼 Highly recommend talking to her if you are hiring and/or interested in Human-focused AI dev and evals!

11 months ago

🚨 New Position Paper 🚨

Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬

We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠

Here's why MCQA evals are broken, and how to fix them 🧵

1 year ago

meme with three rows. "this human-ai decision making leads to unfair outcomes" --> "panik" "let's show explanations to help people be more fair" --> "kalm" "those explanations are based on proxy features" --> "panik"

The Impact of Explanations on Fairness in Human-AI Decision-Making: Protected vs Proxy Features

Despite hopes that explanations improve fairness, we see that when biases are hidden behind proxy features, explanations may not help.

Navita Goyal, Connor Baumler +al IUI’24
hal3.name/docs/daume23...

1 year ago

Meme of two muscular arms grasping. The first is labeled "humans", the second "AI systems", and where they grasp is labeled "item response theory."

Do great minds think alike? Investigating Human-AI Complementarity in QA

We use item response theory to compare the capabilities of 155 people vs 70 chatbots at answering questions, teasing apart complementarities; implications for design.

by Maharshi Gor +al EMNLP’24
hal3.name/docs/daume24...

1 year ago

💯

Hallucination is totally the wrong word, implying it is perceiving the world incorrectly.

But it's generating false, plausible-sounding statements. Confabulation is literally the perfect word.

So, let's all please start referring to any junk that an LLM makes up as "confabulations".

1 year ago

I used to like Writefull when it was new and there was nothing better. But 🥲

1 year ago

starter pack for the Computational Linguistics and Information Processing group at the University of Maryland - get all your NLP and data science here!

go.bsky.app/V9qWjEi

1 year ago

👋🏽 Hey! 🫡

1 year ago
