
Posts by Manya Wadhwa


Would you realize if the book you were reading was AI? What if it was humanized to remove AI-speak?

We find that even without using stylistic cues (e.g., word choice or sentence structure), narrative choices alone give AI fiction away!

2 weeks ago 199 63 8 6

work done with amazing collaborators: Tiasa, @harveylederman.bsky.social , @jessyjli.bsky.social and @gregdnlp.bsky.social 🙌

1 month ago 5 0 0 0

📝 Check out our paper: arxiv.org/pdf/2603.09970
💻Code + Data: github.com/ManyaWadhwa/...
🏅 Leaderboard: manyawadhwa.github.io/projects/cre...

1 month ago 5 0 1 0

CREATE provides a concrete testbed for evaluating creative associative reasoning in LLMs and highlights the need for further work to fully leverage LLMs for such tasks!

1 month ago 6 0 1 0

We introduce creative utility, a metric that combines quality + diversity.

Key findings:
🧠 A higher thinking budget helps less than expected
💡 Creative prompting doesn't help
⚖️ Factuality–utility tradeoff: penalizing false connections reduces creative utility
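As a hypothetical illustration (not the paper's actual formula — see the paper for the exact definition), one way a quality + diversity utility could be operationalized is mean path quality scaled by mean pairwise diversity:

```python
def creative_utility(quality_scores, pairwise_distances):
    """Hypothetical combination: mean quality scaled by mean pairwise diversity.

    quality_scores: per-path quality in [0, 1]
    pairwise_distances: diversity (distance) between each pair of paths, in [0, 1]
    """
    if not quality_scores:
        return 0.0
    quality = sum(quality_scores) / len(quality_scores)
    diversity = (sum(pairwise_distances) / len(pairwise_distances)
                 if pairwise_distances else 0.0)
    # High-quality but near-duplicate paths score low; so do diverse but poor ones.
    return quality * diversity
```

This sketch makes the factuality–utility tradeoff above concrete: filtering out false connections removes paths, which can shrink both the quality pool and the pairwise diversity term.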

1 month ago 6 0 1 0

Unlike creativity tests for humans (e.g., Alternate Uses Test), CREATE:

✅ Requires reasoning over parametric knowledge
✅ Is objectively verifiable

This makes it closer to real creativity tasks like brainstorming or hypothesis generation, while supporting quantitative eval!

1 month ago 5 0 1 0

CREATE evaluates whether models can construct interesting and distinct paths to connect concepts in their parametric knowledge. This mirrors associative reasoning in writing & scientific ideation: paths must be coherent, factually grounded, and conceptually diverse!

1 month ago 4 0 1 0

⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs.

Making novel, meaningful connections is key for scientific & creative works.

We objectively measure how well LLMs can do this. 🧵👇

1 month ago 41 13 2 4

Hello world 👋
My first paper at UT Austin!

We ask: what happens when medical “evidence” fed into an LLM is wrong? Should your AI stay faithful, or should it play it safe when the evidence is harmful?

We show that frontier LLMs accept counterfactual medical evidence at face value.🧵

3 months ago 14 6 3 2
N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.


N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
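A minimal sketch of one common definition of n-gram novelty — the fraction of a text's n-grams that never appear in a reference corpus (the paper's exact formulation may differ):

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(text, corpus, n=3):
    """Fraction of the text's n-grams absent from the reference corpus."""
    seen = set()
    for doc in corpus:
        seen |= ngrams(doc.split(), n)
    grams = ngrams(text.split(), n)
    if not grams:  # text shorter than n tokens
        return 0.0
    return sum(g not in seen for g in grams) / len(grams)
```

Note what this metric cannot see: a garbled, non-pragmatic sentence scores just as "novel" as an elegant one, which is exactly the gap the annotation study probes.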

5 months ago 42 10 1 3

AI is already at work in American newsrooms.

We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea.

Here's what we learned about how AI is influencing local and national journalism:

6 months ago 56 29 5 2

🚨New paper on AI & copyright

Authors have sued LLM companies for using books w/o permission for model training.

Courts, however, need empirical evidence of market harm. Our preregistered study addresses exactly this gap.

Joint work w Jane Ginsburg from Columbia Law and @dhillonp.bsky.social 1/n🧵

6 months ago 22 12 1 1
UT Austin Computational Linguistics Research Group – Humans processing computers processing humans processing language

UT Austin Linguistics is hiring in computational linguistics!

Assistant or Associate Professor.

We have a thriving group (sites.utexas.edu/compling/) and a long, proud history in the space. (For instance, fun fact: Jeff Elman was a UT Austin Linguistics Ph.D.)

faculty.utexas.edu/career/170793

🤘

6 months ago 41 27 1 4

Find my students and collaborators at COLM this week!

Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social's papers (find them if you missed the session!)

Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent

Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster

6 months ago 9 5 0 1

Unfortunately I won't be at #COLM2025 this week, but please check out our work being presented by my collaborators/advisors!

If you are interested in evals of open-ended tasks/creativity please reach out and we can schedule a chat! :)

6 months ago 4 1 0 0

Excited to present this at #COLM2025 tomorrow! (Tuesday, 11:00 AM poster session)

6 months ago 10 4 0 0

Come talk with us today about the evaluation of long-form multilingual generation at the second poster session #COLM2025

📍4:30–6:30 PM / Room 710 – Poster #8

6 months ago 6 2 0 0

Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses?

Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓

10 months ago 10 3 1 0
UT Austin campus

Extremely excited to announce that I will be joining
@utaustin.bsky.social Computer Science in August 2025 as an Assistant Professor! 🎉

11 months ago 42 9 5 2

What does it mean for #LLM output to be novel?
In work w/ @johnchen6.bsky.social, Jane Pan, Valerie Chen, and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵

11 months ago 7 4 2 0
Title: "Characterizing the Role of Similarity in the Property Inferences of Language Models"
Authors: Juan Diego Rodriguez, Aaron Mueller, Kanishka Misra

Left figure: "Given that dogs are daxable, is it true that corgis are daxable?" A language model could answer this either using taxonomic relations, illustrated by a taxonomy dog-corgi, dog-mutt, canine-wolf, etc., or by similarity relations (dogs are more similar to corgis than cats, wolves or shar peis).

Right figure: illustration of the causal model (and an example intervention) for distributed alignment search (DAS), which we used to find a subspace in the network responsible for property inheritance behavior. The bottom nodes are "property", "premise concept (A)" and "conclusion concept (B)" , the middle nodes are "A has property P", "B is a kind of A", and the top node is "B has property P".


How do language models organize concepts and their properties? Do they use taxonomies to infer new properties, or infer based on concept similarities? Apparently, both!

🌟 New paper with my fantastic collaborators @amuuueller.bsky.social and @kanishka.bsky.social

1 year ago 108 22 4 6

If you are at #NAACL2025 @naaclmeeting.bsky.social catch @juand-r.bsky.social presenting our poster on the interplay between similarity and category membership in the property inferences of LMs @ Poster Session 1 on Wednesday!

Or if you're at home like me, read our paper: arxiv.org/abs/2410.22590

11 months ago 12 2 0 0

🚀Meet CRUST-Bench, a dataset for C-to-Rust transpilation for full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]

11 months ago 16 5 1 1

Work done with amazing collaborators: @zaynesprague.bsky.social, @cmalaviya.bsky.social, Philippe Laban, @jessyjli.bsky.social and @gregdnlp.bsky.social 🙌

1 year ago 1 0 0 0

📝 Read the full paper: arxiv.org/pdf/2504.15219
💻 You can also use our system to generate criteria: github.com/ManyaWadhwa/...
Also check out our 🎛️ UI to explore generated criteria + source URLs!

1 year ago 0 0 1 0

Why do we need this? If you’ve used an LLM to draft a paper intro, research talk, or blog post, you’ve likely noticed that while the facts are correct, something feels off. What might be missing are the subtle cues and unspoken expectations. EvalAgent helps uncover and address those hidden layers! 🔮

1 year ago 0 0 1 0

EvalAgent (EA-Web) criteria are often non-obvious to humans and not easily met by LLMs out of the box, making them valuable for evaluation. We also show that the criteria generated by EvalAgent are highly actionable (results in paper)!

1 year ago 0 0 1 0

We test criteria generated by EvalAgent across 9 datasets, from creative writing to technical reports, and compare against criteria generated by 2 other systems!

Results? We show that the criteria generated by EvalAgent (EA-Web) are 🎯 highly specific and 💭 implicit.

1 year ago 0 0 1 0

For example, EvalAgent generates the following criteria for the academic talk prompt:

The response should have:
🪄 A compelling opening/motivation
🧠 Clear research question it answers
🏁 A strong conclusion that restates findings

1 year ago 0 0 1 0

EvalAgent emulates how a human would seek advice: 🔍 searching things like “how to write a compelling talk”, reading expert tips from blogs and academic websites, and aggregating them into specific, useful evaluation criteria.
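A hypothetical sketch of that search-and-aggregate loop (search_web and summarize are stand-in callables for illustration, not EvalAgent's actual API):

```python
def generate_criteria(task_prompt, search_web, summarize):
    """Turn a writing task into evaluation criteria via web advice.

    search_web(query) -> list of pages (e.g. blogs, academic advice sites)
    summarize(page)   -> list of concrete tips distilled from one page
    """
    query = f"how to write a compelling {task_prompt}"
    pages = search_web(query)
    # Distill each page into tips, then aggregate and deduplicate
    # into a flat list of evaluation criteria.
    criteria = []
    for page in pages:
        for tip in summarize(page):
            if tip not in criteria:
                criteria.append(tip)
    return criteria
```

The key design idea from the thread survives even in this toy form: the criteria come from external expert advice rather than from the LLM's own priors, which is what makes them specific and non-obvious.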

1 year ago 0 0 1 0