
Posts by A. Feder Cooper

My work with @marklemley.bsky.social and others on extracting copyrighted books from language models was recently featured in the UK House of Lords Communications and Digital Committee report on AI, copyright, and the creative industries

publications.parliament.uk/pa/ld5901/ld...

6 days ago 15 5 0 0
Post image

New book on the idea shelf: Hasok Chang, _Inventing Temperature: Measurement and Scientific Progress_

What even is temperature? The history of science is a struggle both to make consistent measurements and to think clearly about what they are actually measuring.

james.grimmelmann.net/idea-shelf

2 weeks ago 9 1 3 0

it’s clarifying in a way (trying to find a silver lining). like it’s also clearer to me now than before who’s in it for love of research vs. love of the game

2 weeks ago 1 0 0 0

I’ve identified this kind of thing among some researchers I really used to respect. On top of the waste of time, research is starting to feel pretty lonely. Really just want to do little fun projects with people who also like to do little fun projects.

3 weeks ago 6 0 1 0

Yes, we hope to soon have both a more lay-oriented paper focused on the results and a paper that focuses on the legal implications for the definition of a copy

3 weeks ago 2 1 0 0

(Will be posting a big update to this one soon)

3 weeks ago 0 0 0 0
Extracting books from production language models Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized dat...

Here’s a follow-on paper from that one, which studies black-box systems (and so has to use very different methods): arxiv.org/abs/2601.02671

3 weeks ago 1 0 0 0
Extracting memorized pieces of (copyrighted) books from open-weight language models Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expr...

Here’s our prior paper, which uses the verbatim version of the metric, but is specifically about copyrighted text and open-weight models: arxiv.org/abs/2505.12546

3 weeks ago 5 1 2 0

For sure, there are really interesting questions here. For copyright, and also memorization more generally. We’re going to do something specifically for copyrighted text (the present paper is an ML methods paper, and written that way), and then there’s still other stuff to do (more on that later)

3 weeks ago 1 0 2 0
Estimating near-verbatim extraction risk in language models with decoding-constrained beam search Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the prob...

@afedercooper.bsky.social and I have a new paper that expands AI memorization tests to include near-verbatim copies, not just exact copies. We find that this significantly increases extraction/memorization of copyrighted content. arxiv.org/abs/2603.24917

3 weeks ago 33 4 1 0

Agree that near-verbatim is important. This was a nontrivial computer science problem for the notion of memorization we’ve been studying (different from the prior standard metric, which undercounts for additional reasons). Working on a blog post that explains the hard but interesting problem here

3 weeks ago 5 1 1 0

okay there is actually one more, and it is also about copyright. but it's a law review paper. this is my last ML* memorization paper (at least for a while)

1 month ago 3 0 0 0

i'm writing my last memorization paper (i say for the 10th time), and hopefully what is my last first author paper for a bit. this one also isn't about copyright. i'm excited to start thinking about other things.

if anyone is looking for a new research buddy, i'm down to clown.

1 month ago 6 0 1 0

please let me know if you ever want to chat about any of this. I can’t promise I have anything useful to say, but I do have plenty to say about this. And am of course always around to listen.

1 month ago 0 0 0 0
Microsoft deletes blog telling users to train AI on pirated Harry Potter books The now-deleted Harry Potter dataset was "mistakenly" marked public domain.

like sometimes life is art, specifically an absurdist Beckett play

arstechnica.com/tech-policy/...

1 month ago 1 0 0 0
Post image

someone sent me this from the other place and this timeline really is something else

1 month ago 4 0 1 0

Not a perfect fit for the exact query, I don't think, but I like this note as a starting place: lawreview.uchicago.edu/sites/defaul...

@jackbalkin.bsky.social

2 months ago 1 0 0 0

(lucky for everyone that I'm too lazy to write a blog post)

2 months ago 3 0 0 0

Yes, I have published at that track before, and related ones. But I'm not eager to again. Getting into that is maybe worth a blog post.

2 months ago 2 0 1 0

No I did not write/submit this paper to the ICML position paper track. Like many (but of course not all) papers submitted there, I think this is at most a blog post (where "at most" is a very generous upper bound, because the ~300 characters above almost certainly are enough).

2 months ago 3 0 1 0

Position: ML conferences should consider removing the position paper track

(...and just acknowledge that every scientific paper is articulating at least one position)

2 months ago 10 0 1 0

(This is all to say, I've been shocked at some of what I've heard coming out of industry. My assumption used to be that they knew a lot more about this than they seem to.)

2 months ago 1 0 0 0

I think partially yes. There definitely are full-time applied and research people working on data curation as a topic. But there are a ton of gaps/things that might seem surprising here. E.g., making corpus-level decisions doesn't always tell you much about the underlying training data examples.

2 months ago 2 0 1 0

Am also concerned about this, but it’s not clear to me that companies even know everything that’s included. I suppose “use it all” is an editorial decision, though.

2 months ago 2 0 1 0

I just had a paper I reviewed months ago be “desk rejected” by ICLR for this reason. (It’s arguably not a desk rejection after 3 reviewers already chimed in.) But, this seems to be where things are headed.

2 months ago 1 0 0 0

Even if chucking the papers outright is undesirable (hallucination checkers are not error-free), I'm disappointed there's no process at all other than "oops, you can go fix it if you care to."

2 months ago 3 0 0 0

(though going forward, I wouldn’t be sad if I had a bit more compute 🙃)

3 months ago 0 0 0 0

One of my favorite responses to questions about compute in my work this year is “it’s expensive, yes, but I had to develop some efficient algos and write some efficient code to make this possible. This work was done at odd hours on 4 A100s shared by a dozen people.”

3 months ago 0 0 1 0

note that i said “ML” and “copyright,” which are very specific things that i actually think have very little to do with the anger i’m referring to

3 months ago 1 0 0 0

it’s hard to work at the intersection of ML and copyright because “both sides” of the debate are angry and, in my experience, most haven’t done much of the background reading in ML or copyright to have an informed opinion. it’s just vibes and anger. i should probably write something up about this.

3 months ago 7 1 1 0