Would you realize if the book you were reading was AI? What if it was humanized to remove AI-speak?
We find that even without relying on stylistic cues (e.g., word choice or sentence structure), narrative choices alone give AI fiction away!
Posts by Manya Wadhwa
work done with amazing collaborators: Tiasa, @harveylederman.bsky.social , @jessyjli.bsky.social and @gregdnlp.bsky.social 🙌
📝 Check out our paper: arxiv.org/pdf/2603.09970
💻Code + Data: github.com/ManyaWadhwa/...
🏅 Leaderboard: manyawadhwa.github.io/projects/cre...
CREATE provides a concrete testbed for evaluating associative creativity in LLMs and highlights the need for further work to fully leverage LLMs for such tasks!
We introduce creative utility, a measure that combines quality + diversity.
Key findings:
🧠 A high thinking budget doesn't help much
💡 Creative prompting doesn't help
⚖️ Factuality–utility tradeoff: penalizing false connections reduces creative utility
Unlike creativity tests for humans (e.g., Alternate Uses Test), CREATE:
✅ Requires reasoning over parametric knowledge
✅ Is objectively verifiable
This makes it closer to real creativity tasks like brainstorming or hypothesis generation, while supporting quantitative eval!
CREATE evaluates whether models can construct interesting and distinct paths to connect concepts in their parametric knowledge. This mirrors associative reasoning in writing & scientific ideation: paths must be coherent, factually grounded, and conceptually diverse!
⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs.
Making novel, meaningful connections is key for scientific & creative works.
We objectively measure how well LLMs can do this. 🧵👇
Hello world 👋
My first paper at UT Austin!
We ask: what happens when medical “evidence” fed into an LLM is wrong? Should your AI stay faithful, or should it play it safe when the evidence is harmful?
We show that frontier LLMs accept counterfactual medical evidence at face value.🧵
N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
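The metric itself is simple: the share of a text's n-grams that never appear in a reference corpus. A minimal sketch (toy corpus, whitespace tokenization, and `n=5` are illustrative assumptions, not the paper's exact setup):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text, corpus_grams, n=5):
    """Fraction of n-grams in `text` absent from the reference corpus.

    Higher = more 'novel' by this surface metric; note it says nothing
    about whether the text is sensical or pragmatic.
    """
    grams = ngrams(text.split(), n)
    if not grams:
        return 0.0
    return sum(g not in corpus_grams for g in grams) / len(grams)

# Toy reference corpus standing in for training data.
corpus_text = "the cat sat on the mat and looked at the dog"
corpus_grams = set(ngrams(corpus_text.split(), 5))

print(ngram_novelty("the cat sat on the rug", corpus_grams))  # 0.5: one of two 5-grams is unseen
```

In practice the reference set is built from the model's (approximate) training corpus, which is exactly why high scores can reward awkward or nonsensical phrasing as readily as genuine creativity.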
AI is already at work in American newsrooms.
We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea.
Here's what we learned about how AI is influencing local and national journalism:
🚨New paper on AI & copyright
Authors have sued LLM companies for using books w/o permission for model training.
Courts, however, need empirical evidence of market harm. Our preregistered study addresses exactly this gap.
Joint work w Jane Ginsburg from Columbia Law and @dhillonp.bsky.social 1/n🧵
UT Austin Linguistics is hiring in computational linguistics!
Assistant or Associate level.
We have a thriving group (sites.utexas.edu/compling/) and a long, proud history in the space. (For instance, fun fact: Jeff Elman was a UT Austin Linguistics Ph.D.)
faculty.utexas.edu/career/170793
🤘
Find my students and collaborators at COLM this week!
Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social 's papers (find them if you missed their sessions!)
Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent
Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster
Unfortunately I won't be at #COLM2025 this week, but please check out our work being presented by my collaborators/advisors!
If you are interested in evals of open-ended tasks/creativity please reach out and we can schedule a chat! :)
Excited to present this at #COLM2025 tomorrow! (Tuesday, 11:00 AM poster session)
Come talk with us today about the evaluation of long-form multilingual generation at the second poster session #COLM2025
📍4:30–6:30 PM / Room 710 – Poster #8
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses?
Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
UT Austin campus
Extremely excited to announce that I will be joining
@utaustin.bsky.social Computer Science in August 2025 as an Assistant Professor! 🎉
What does it mean for #LLM output to be novel?
In work w/ @johnchen6.bsky.social, Jane Pan, Valerie Chen and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
Title: "Characterizing the Role of Similarity in the Property Inferences of Language Models." Authors: Juan Diego Rodriguez, Aaron Mueller, Kanishka Misra.
Left figure: "Given that dogs are daxable, is it true that corgis are daxable?" A language model could answer this either using taxonomic relations, illustrated by a taxonomy dog-corgi, dog-mutt, canine-wolf, etc., or by similarity relations (dogs are more similar to corgis than cats, wolves or shar peis).
Right figure: illustration of the causal model (and an example intervention) for distributed alignment search (DAS), which we used to find a subspace in the network responsible for property inheritance behavior. The bottom nodes are "property", "premise concept (A)" and "conclusion concept (B)"; the middle nodes are "A has property P" and "B is a kind of A"; and the top node is "B has property P".
How do language models organize concepts and their properties? Do they use taxonomies to infer new properties, or infer based on concept similarities? Apparently, both!
🌟 New paper with my fantastic collaborators @amuuueller.bsky.social and @kanishka.bsky.social
If you are at #NAACL2025 @naaclmeeting.bsky.social catch @juand-r.bsky.social presenting our poster on the interplay between similarity and category membership in the property inferences of LMs @ Poster Session 1 on Wednesday!
Or if you're at home like me, read our paper: arxiv.org/abs/2410.22590
🚀Meet CRUST-Bench, a dataset for C-to-Rust transpilation of full codebases 🛠️
A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]
Work done with amazing collaborators: @zaynesprague.bsky.social, @cmalaviya.bsky.social, Philippe Laban, @jessyjli.bsky.social and @gregdnlp.bsky.social 🙌
📝 Read the full paper: arxiv.org/pdf/2504.15219
💻 You can also use our system to generate criteria: github.com/ManyaWadhwa/...
Also check out our 🎛️ UI to explore generated criteria + source URLs!
Why do we need this? If you’ve used an LLM to draft a paper intro, research talk, or blog post, you’ve likely noticed that while the facts are correct, something feels off. What might be missing are the subtle cues and unspoken expectations. EvalAgent helps uncover and address those hidden layers! 🔮
EvalAgent (EA-Web) criteria are often non-obvious to humans and not easily met by LLMs out of the box, making them valuable for evaluation. We also show that the criteria generated by EvalAgent are highly actionable (results in paper)!
We test criteria generated by EvalAgent across 9 datasets, from creative writing to technical reports, and compare against criteria generated by 2 other systems!
Results? We show that the criteria generated by EvalAgent (EA-Web) are 🎯 highly specific and 💭 implicit.
For example, EvalAgent generates the following criteria for the academic talk prompt:
The response should have:
🪄 A compelling opening/motivation
🧠 Clear research question it answers
🏁 A strong conclusion that restates findings
EvalAgent emulates how a human would seek advice: 🔍 searching for things like "how to write a compelling talk", reading expert tips from blogs and academic websites, and aggregating them into specific, useful evaluation criteria.
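The search-then-aggregate loop can be sketched as below. This is a hypothetical illustration only: the function names are stand-ins, and the search and distillation steps (stubbed here with canned snippets) would be a real search API and an LLM call in the actual system.

```python
def search_web(query):
    # Stand-in for a real search API: returns snippets of expert advice
    # found for the query. Hard-coded here for illustration.
    return [
        "Open with a motivating question the audience cares about.",
        "State your research question in one sentence early on.",
        "End by restating your main findings.",
    ]

def aggregate_criteria(snippets):
    # Stand-in for an LLM call that distills raw advice into
    # specific, checkable evaluation criteria.
    return ["The response should " + s[0].lower() + s[1:].rstrip(".") for s in snippets]

def eval_agent(task_description):
    # 1) Turn the task into advice-seeking queries,
    # 2) search and collect expert tips,
    # 3) aggregate tips into evaluation criteria.
    queries = ["how to write a compelling " + task_description]
    snippets = [s for q in queries for s in search_web(q)]
    return aggregate_criteria(snippets)

for criterion in eval_agent("academic talk"):
    print(criterion)
```

The design point is the separation of concerns: retrieval surfaces human expertise the model might not volunteer on its own, and aggregation turns it into criteria specific enough to check a response against.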