Posts by Chau Minh Pham

Short Little Difficult Books
Novels that challenge with style, story, or form that you can read in a day.

Short Little Difficult Books | Discussion

5 months ago 1 1 0 0

Why AI writing is mid
How the current way of training language models destroys any voice (and hope of good writing).

www.interconnects.ai/p/why-ai-wri...

5 months ago 86 11 8 15
Torching the Modern-Day Library of Alexandria
“Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

I curated some readings for class on "data tensions" and the list felt worth sharing. Come on a tour of datasets, books, the web, and AI with me...

We'll start with this piece on the Google Books project: the hopes, dreams, disasters, and aftermath of building a public library on the internet.

1/n

5 months ago 77 26 6 2

Excited to share our new paper, "DataRater: Meta-Learned Dataset Curation"!

We explore a fundamental question: How can we *automatically* learn which data is most valuable for training foundation models?

Paper: arxiv.org/pdf/2505.17895 to appear at @neuripsconf.bsky.social

Thread 👇

5 months ago 25 4 1 2
Screenshot that reads: 

Introducing the Anthology for Computers and the Humanities

Taylor Arnold, Maria Antoniak, Miguel Escobar Varela, Marie Puren, Mila Oiva, Amanda Regan, Lauren Tilton, and Melanie Walsh

1 Data Science and Statistics, University of Richmond, U.S.A.
2 Computer Science, University of Colorado Boulder, U.S.A.
3 Faculty of Arts and Social Sciences, National University of Singapore
4 Laboratoire de Recherche de l'EPITA, Paris, France
5 History and Archaeology, University of Turku, Finland
6 History and Geography, Clemson University, U.S.A.
7 Rhetoric and Communication Studies, University of Richmond, U.S.A.
8 Information School, University of Washington, U.S.A.

Permanent Link: https://doi.org/10.63744/HHsQG7hNWyxG

Published: 25 September 2025

As DH grows, it’s increasingly important to publish conference papers, but there hasn’t been a clear venue for that.

So I’m thrilled to share this new home for DH proceedings, which will include CHR papers & more.

Thanks to @taylor-arnold.bsky.social for leading this effort!

bit.ly/ach-anthology

5 months ago 127 65 6 2
A diagram illustrating pointwise scoring with a large language model (LLM). At the top is a text box containing instructions: 'You will see the text of a political advertisement about a candidate. Rate it on a scale ranging from 1 to 9, where 1 indicates a positive view of the candidate and 9 indicates a negative view of the candidate.' Below this is a green text box containing an example ad text: 'Joe Biden is going to eat your grandchildren for dinner.' An arrow points down from this text to an illustration of a computer with 'LLM' displayed on its monitor. Finally, an arrow points from the computer down to the number '9' in large teal text, representing the LLM's scoring output. This diagram demonstrates how an LLM directly assigns a numerical score to text based on given criteria

LLMs are often used for text annotation, especially in social science. In some cases, this involves placing text items on a scale: e.g., 1 for liberal and 9 for conservative.

There are a few ways to accomplish this task. Which work best? Our new EMNLP paper has some answers 🧵
arxiv.org/pdf/2507.00828

5 months ago 27 8 1 0

AI is already at work in American newsrooms.

We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea.

Here's what we learned about how AI is influencing local and national journalism:

5 months ago 56 29 5 2

"AI slop" seems to be everywhere, but what exactly makes text feel like "slop"?

In our new work (w/ @tuhinchakr.bsky.social, Diego Garcia-Olano, @byron.bsky.social ) we provide a systematic attempt at measuring AI "slop" in text!

arxiv.org/abs/2509.19163

🧵 (1/7)

6 months ago 33 17 1 1

Keynote at #COLM2025: Nicholas Carlini from Anthropic

"Are language models worth it?"

Explains that the prior decade of his work on adversarial images, while it taught us a lot, isn't very applied; it's unlikely anyone is actually altering images of cats in scary ways.

6 months ago 80 22 2 2

📢 New #COLM2025 paper 📢

Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴

Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.

🧵

7 months ago 41 10 3 1

What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?

Looking for practical methods for settings where human annotations are costly.

A few examples in thread ↴

8 months ago 79 23 13 3

I see this work as our answer to the "cultural alignment" and "cultural benchmarking" trends in NLP research. Instead of making decisions for people, we consider "culture" in a specific setting with specific people for a specific task, and we ask people directly about their cultural adaptations.

10 months ago 38 6 1 0
GitHub - chtmp223/Frankentext: Frankentext: Stitching random text fragments into long-form narratives

We release code to facilitate future research on fine-grained detection of mixed-origin texts and human-AI cowriting.

GitHub: github.com/chtmp223/Fra...
Paper: arxiv.org/abs/2505.18128

Work done with @jennajrussell, @dzungvietpham, and @MohitIyyer!

10 months ago 2 0 0 0

Room for improvement:

🔧 Frankentexts struggle with smooth narrative transitions and grammar, as noted by human annotators.
🔩 Non-fiction versions are coherent and faithful but tend to be overly anecdotal and lack factual accuracy.

10 months ago 1 0 1 0

Takeaway 2: Our controllable generation process provides a sandbox for human-AI co-writing research, with adjustable proportion, length, and diversity of human excerpts.

👫 Models can follow copy constraints, which is a proxy for % of human writing in co-authored texts.

10 months ago 1 0 1 0

Takeaway 1: Frankentexts don’t fit into the "AI vs. human" binary.

📉 Binary detectors misclassify them as human-written
👨‍👩‍👧 Humans can detect AI involvement more often
🔍 Mixed-authorship tools (Pangram) help, but still catch only 59%

We need better tools for this gray zone.

10 months ago 1 0 1 0

Automatic evaluation on 100 Frankentexts using LLM judges, text detectors, and a ROUGE-L-based metric shows that:

💪 Gemini-2.5-Pro, Claude-3.5-Sonnet, and R1 can generate Frankentexts that are up to 90% relevant, 70% coherent, and 75% traceable to the original human writings.

10 months ago 2 0 1 0

Frankentext generation presents an instruction-following task that challenges the limits of controllable generation, requiring each model to:

1️⃣ Produce a draft by selecting & combining human-written passages.
2️⃣ Iteratively revise the draft while maintaining a copy ratio.

10 months ago 3 0 1 0

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.

10 months ago 35 8 1 1

We find that LLMs (e.g. GPT-4o, LLaMA-3.1) consistently recall book content across languages, even for texts without official translation in pre-training data!

Great work led by undergrads at UMass NLP 🥳

10 months ago 2 0 0 0
A visualization of the generator-validator gap, where the LM likelihoods for the generator and discriminator forms of questions are poorly correlated.

Aligning the validator and generator rankings can fix it!

One of the ways that LLMs can be inconsistent is the "generator-validator gap," where LLMs deem their own answers incorrect.

🎯 We demonstrate that ranking-based discriminator training can significantly reduce this gap, and improvements on one task often generalize to others!

🧵👇

1 year ago 35 8 2 3
Racial and Ethnic Representation in Literature Taught in US High Schools | Published in Journal of Cultural Analytics By Li Lucy, Camilla Griffiths & 7 more. We quantify the representation, or presence, of characters of color in English Language Arts instruction in the United States to better understand possible raci...

📚 Check out the newest JCA article by Li Lucy (@lucy3.bsky.social), Camilla Griffiths, Claire Ying, JJ Kim-Ebio, Sabrina Baur, Sarah Levine, Jennifer L. Eberhardt, David Bamman (@dbamman.bsky.social), and Dorottya Demszky. culturalanalytics.org/article/1316...

1 year ago 47 23 1 0
Learning to Reason for Long-Form Story Generation Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to…

A very cool paper shows that you can use the RL loss to improve story generation by some clever setups on training on known texts (e.g. ground predictions versus a next chapter you know). RL starting to generalize already!

1 year ago 33 6 0 2
Leaderboard showing performance of language models on claim verification task over book-length input. o1-preview is the best model with 67.36% accuracy followed by Gemini 2.5 Pro with 64.17% accuracy.

We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models including #Gemini 2.5 Pro which shows massive improvements over the previous version! Congrats to #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)

1 year ago 11 4 0 0

New paper from our team @GoogleDeepMind!

🚨 We've put LLMs to the test as writing co-pilots – how good are they really at helping us write? LLMs are increasingly used for open-ended tasks like writing assistance, but how do we assess their effectiveness? 🤔

arxiv.org/pdf/2503.19711

1 year ago 20 8 1 1

Our lab had a #dogathon 🐕 yesterday where we analyzed NYC Open Data on dog licenses. We learned a lot of dog facts, which I’ll share in this thread 🧵

1) Geospatial trends: Cavalier King Charles Spaniels are common in Manhattan; the opposite is true for Yorkshire Terriers.

1 year ago 52 14 2 14
ArXiv Paper Feed

The high effort solution is to use an LLM to make a browser extension which tracks your academic reading and logs every paper you interact with to github, which builds and publishes a webapp to expose the data.

Which, clearly only a crazy weirdo would do.

dmarx.github.io/papers-feed/

1 year ago 39 9 3 5

💡 New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets.

Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵 1/

1 year ago 41 13 1 3

Ask OpenAI Operator for bus routes from your home in Vietnam to a university and it likely fails because it refuses to use Google Maps! Our new BEARCUBS 🐻 benchmark shows CU agents still struggle with seemingly straightforward multimodal questions.

1 year ago 1 0 0 0

Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇

1 year ago 14 5 1 3