My work with @marklemley.bsky.social and others on extracting copyrighted books from language models was recently featured in the UK House of Lords Communications and Digital Committee report on AI, copyright, and the creative industries.
publications.parliament.uk/pa/ld5901/ld...
Posts by A. Feder Cooper
New book on the idea shelf: Hasok Chang: _Inventing Temperature: Measurement and Scientific Progress_
What even is temperature? The history of science is a struggle both to make consistent measurements and to think clearly about what they are actually measuring.
james.grimmelmann.net/idea-shelf
it’s clarifying in a way (trying to find a silver lining). it’s also clearer to me now than before who’s in it for the love of research vs. the love of the game
I’ve noticed this kind of thing among some researchers I used to really respect. On top of the waste of time, research is starting to feel pretty lonely. Really just want to do little fun projects with people who also like to do little fun projects.
Yes, we hope to have both a more lay-oriented paper focused on the results and a paper that focuses on the legal implications for the definition of a copy available soon.
(Will be posting a big update to this one soon)
Here’s a follow-on paper from that one, which studies black-box systems (and so has to use very different methods): arxiv.org/abs/2601.02671
Here’s our prior paper, which uses the verbatim version of the metric, but is specifically about copyrighted text and open-weight models: arxiv.org/abs/2505.12546
For sure, there are really interesting questions here. For copyright, and also memorization more generally. We’re going to do something specifically for copyrighted text (the present paper is an ML methods paper, and written that way), and then there’s still other stuff to do (more on that later)
@afedercooper.bsky.social and I have a new paper that expands AI memorization tests to include near-verbatim copies, not just exact copies. We find that this significantly increases extraction/memorization of copyrighted content. arxiv.org/abs/2603.24917
Agree that near-verbatim is important. This was a nontrivial computer science problem for the notion of memorization we’ve been studying (different from the prior standard metric, which undercounts for additional reasons). Working on a blog post that explains the hard but interesting problem here
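To give a flavor of the verbatim vs. near-verbatim distinction discussed above: here is a toy sketch of one simple way such a check could work, using normalized edit distance over sliding token windows. This is my own illustration, not the paper’s actual metric; the function names, window size, and threshold are all assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        curr = [i]
        for j, tb in enumerate(b, 1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_verbatim_match(source_tokens, generated_tokens, window=50, threshold=0.1):
    """Return True if any source window is within `threshold` normalized edit
    distance of any generated window. Exact (verbatim) matching is the special
    case threshold=0; raising the threshold admits near-verbatim copies."""
    for i in range(len(source_tokens) - window + 1):
        src = source_tokens[i:i + window]
        for j in range(len(generated_tokens) - window + 1):
            gen = generated_tokens[j:j + window]
            if levenshtein(src, gen) / window <= threshold:
                return True
    return False
```

Even this naive version hints at why the problem is computationally nontrivial: the brute-force comparison is quadratic in the number of windows, with an edit-distance computation inside each pair, which does not scale to corpus-sized source text without much more efficient algorithms.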
okay there is actually one more, and it is also about copyright. but it's a law review paper. this is my last ML* memorization paper (at least for a while)
i'm writing my last memorization paper (i say for the 10th time), and hopefully my last first-author paper for a bit. this one also isn't about copyright. i'm excited to start thinking about other things.
if anyone is looking for a new research buddy, i'm down to clown.
please let me know if you ever want to chat about any of this. I can’t promise I have anything useful to say, but I do have plenty to say about this. And am of course always around to listen.
someone sent me this from the other place and this timeline really is something else
Not a perfect fit to the exact query I don't think, but I like this note as a starting place: lawreview.uchicago.edu/sites/defaul...
@jackbalkin.bsky.social
(lucky for everyone that I'm too lazy to write a blog post)
Yes, I have published at that track before, and related ones. But I'm not eager to again. Getting into that is maybe worth a blog post.
No I did not write/submit this paper to the ICML position paper track. Like many (but of course not all) papers submitted there, I think this is at most a blog post (where "at most" is a very generous upper bound, because the ~300 characters above almost certainly are enough).
Position: ML conferences should consider removing the position paper track
(...and just acknowledge that every scientific paper is articulating at least one position)
(This is all to say, I've been shocked at some of what I've heard coming out of industry. My assumption used to be that they knew a lot more about this than they seem to.)
I think partially yes. There definitely are full-time applied and research people working on data curation as a topic. But there are a ton of gaps/things that might seem surprising here. E.g., making corpus-level decisions doesn't always tell you much about the underlying training data examples.
Am also concerned about this, but it’s not clear to me that companies even know everything that’s included. I suppose “use it all” is an editorial decision, though.
I just had a paper I reviewed months ago be “desk rejected” by ICLR for this reason. (It’s arguably not a desk rejection after 3 reviewers already chimed in.) But, this seems to be where things are headed.
Even if chucking the papers outright is undesirable (hallucination checkers are not error-free), I'm disappointed there's no process at all other than "oops, you can go fix it if you care to."
(though going forward, I wouldn’t be sad if I had a bit more compute 🙃)
One of my favorite responses to questions about compute in my work this year is “it’s expensive, yes, but I had to develop some efficient algos and write some efficient code to make this possible. This work was done at odd hours on 4 A100s shared by a dozen people.”
note that i said “ML” and “copyright,” which are very specific things that i actually think have very little to do with the anger i’m referring to
it’s hard to work at the intersection of ML and copyright because “both sides” of the debate are angry and, in my experience, most haven’t done much of the background reading in ML or copyright to have an informed opinion. it’s just vibes and anger. i should probably write something up about this.