
Posts by Robin Jia

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating? Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate...

Frontier LLMs don't debug, they regenerate.

We built PDB to measure that gap: GPT-5.1-Codex passes unit tests >76% of the time, but touches only <45% of the right lines.

Even Claude Code touches only ~50%.

📄 Paper: arxiv.org/abs/2604.17338
🌐 Project: precise-debugging-benchmark.github.io
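For concreteness, here is a minimal sketch of the kind of line-localization measurement those numbers refer to: diff the model's edit and the gold fix against the buggy file and compare which lines each touches. The function names are mine; the exact PDB metric may differ.

```python
# Hedged sketch of a line-localization metric; names are illustrative,
# not the PDB paper's actual implementation.
import difflib

def changed_lines(before: str, after: str) -> set[int]:
    """1-indexed line numbers of `before` that a diff against `after` touches."""
    touched = set()
    sm = difflib.SequenceMatcher(None, before.splitlines(), after.splitlines())
    for tag, i1, i2, _j1, _j2 in sm.get_opcodes():
        if tag != "equal":
            # `replace`/`delete` touch lines i1..i2-1 of the original;
            # treat a pure insertion as touching the line it lands before.
            touched.update(range(i1 + 1, max(i2, i1 + 1) + 1))
    return touched

def right_lines_touched(buggy: str, gold_fix: str, model_fix: str) -> float:
    """Fraction of gold-edited lines that the model's edit also touches."""
    gold = changed_lines(buggy, gold_fix)
    model = changed_lines(buggy, model_fix)
    return len(gold & model) / len(gold) if gold else 1.0
```

Note that regenerating the whole file trivially maxes out this recall-style number, so a precision counterpart (fraction of model-touched lines that are in the gold edit) is what actually separates debugging from regeneration; the post does not say which direction PDB reports.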

2 days ago

Hubble is finally out! We used 200k GPU hours from NAIRR and NVIDIA to build a comprehensive resource for the scientific study of LLM memorization. Fully open-source models & data up to 8B params + 500B tokens with controlled data insertion to study memorization risks 🔭✨

5 months ago
Hubble Suite logo (cloth patch with names of key organizations involved: USC, MPI, NVIDIA)

Announcing 🔭Hubble, a suite of open-source LLMs to advance the study of memorization!

Pretrained 1B/8B param models, with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination
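As a rough illustration of how memorization of the inserted texts can then be probed, here is the standard prefix-probe recipe: prompt with the first k tokens of an inserted passage and test whether greedy decoding reproduces the continuation verbatim. The checkpoint id below is a placeholder, not an actual Hubble model name, and this is a common recipe rather than the suite's official evaluation code.

```python
# Minimal prefix-probe for verbatim memorization (a standard recipe,
# not the Hubble suite's official evaluation code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/hubble-1b"  # placeholder id, not a real checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def is_memorized(passage: str, prefix_len: int = 32, suffix_len: int = 32) -> bool:
    """True iff greedy decoding of the suffix from a prefix_len-token prompt
    exactly matches the passage's true continuation."""
    ids = tokenizer(passage, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_len + suffix_len:
        return False
    prefix = ids[:prefix_len].unsqueeze(0)
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
    return torch.equal(out[0, prefix_len:prefix_len + suffix_len],
                       ids[prefix_len:prefix_len + suffix_len])
```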

5 months ago

I had a lot of fun contemplating memorization questions at the @l2m2workshop.bsky.social panel yesterday together with Niloofar Mireshghallah and Reza Shokri, moderated by @pietrolesci.bsky.social, who did a fantastic job!
#ACL2025

8 months ago
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by th...

Paper link: arxiv.org/abs/2501.14883

8 months ago

Automatic metrics for assessing factuality are easy to run and commonly used, but do they work? In < 1 hour, come find the answer at poster 349 in Hall X4, where I’ll be presenting @ameyagodbole.bsky.social’s work uncovering inconsistencies, errors, and biases of factuality metrics!
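One concrete flavor of "inconsistency" is sensitivity to meaning-preserving rewrites: a factuality metric should score a claim and its paraphrase (nearly) the same. A hypothetical probe, my illustration rather than the paper's protocol:

```python
# `factuality_score(source, claim) -> float` stands in for any off-the-shelf
# factuality metric; this probe itself is a hypothetical illustration.
def consistency_gap(factuality_score, source: str, claim: str, paraphrase: str) -> float:
    """Absolute score difference between a claim and a paraphrase of it;
    large gaps on many pairs flag an inconsistent metric."""
    return abs(factuality_score(source, claim) - factuality_score(source, paraphrase))
```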

8 months ago

I’ll be at ACL 2025 next week where my group has papers on evaluating evaluation metrics, watermarking training data, and mechanistic interpretability. I’ll also be co-organizing the first Workshop on LLM Memorization @l2m2workshop.bsky.social on Friday. Hope to see lots of folks there!

8 months ago
LLMs can propose plans and generate action semantics, but struggle with state tracking. Symbolic planners leverage specialized search algorithms, but require predefined action semantics for the environment. PSALM integrates the strengths of both.

Come by @naaclmeeting.bsky.social Poster 6 in Hall 3 from 4–5:30pm today to see @billzhu.bsky.social's and Ishika Singh's work with me and @robinjia.bsky.social on PSALM: autonomously inducing symbolic pre- and post-conditions of actions with LLMs, symbolic planning, and text environment interaction!
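For readers who want the shape of the method, here is a high-level sketch of the induce-plan-execute loop described above; the llm/planner/env interfaces are hypothetical stand-ins, not the paper's actual API:

```python
# PSALM-style loop, sketched with made-up interfaces: the LLM proposes action
# semantics, a symbolic planner searches with them, and execution feedback
# from the text environment drives revision.
def induce_action_semantics(llm, planner, env, max_rounds: int = 10):
    semantics = llm.propose_semantics(env.describe_actions())  # pre/post-conditions
    for _ in range(max_rounds):
        plan = planner.search(env.initial_state(), env.goal(), semantics)
        if plan is None:  # current semantics admit no plan
            semantics = llm.revise(semantics, feedback="planner found no plan")
            continue
        ok, trace = env.execute(plan)  # try the plan in the environment
        if ok:
            return semantics, plan  # semantics consistent with observed dynamics
        semantics = llm.revise(semantics, feedback=trace)  # learn from failures
    return semantics, None
```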

11 months ago

Check out @billzhu.bsky.social’s excellent work on combining LLMs with symbolic planners at NAACL on Thursday! I will also be at NAACL Friday–Sunday, looking forward to chatting about LLM memorization, interpretability, evaluation, and more

11 months ago

At @naaclmeeting.bsky.social this week! I’ll be presenting our work on LLM domain induction with @thomason.bsky.social on Thu (5/1) at 4pm in Hall 3, Section I.

Would love to connect and chat about LLM planning, reasoning, AI4Science, multimodal stuff, or anything else. Feel free to DM!

11 months ago
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compare...

Sounds like arxiv.org/abs/2102.07033

1 year ago

Excited to share that my intern work at Meta GenAI was accepted to @iclr-conf.bsky.social #ICLR2025

Introducing TLDR: Token-Level Detective Reward Model For Large Vision Language Models.

TLDR provides fine-grained annotations for each text token.

🔗arXiv: arxiv.org/abs/2410.04734
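For intuition, a minimal sketch of what a token-level reward head can look like: a linear probe scoring each token's hidden state. This illustrates the general design, not TLDR's exact architecture.

```python
# Token-level reward head sketch: per-token scores from a linear layer over a
# (vision-)language model's hidden states. Illustrative, not the paper's code.
import torch
import torch.nn as nn

class TokenRewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] from the backbone
        # returns: [batch, seq_len], one reward per text token
        return self.score(hidden_states).squeeze(-1)

# usage sketch: rewards = TokenRewardHead(4096)(outputs.last_hidden_state)
```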

1 year ago

Our workshop on LLM Memorization is coming to ACL 2025! The call for papers is out; please submit both archival and non-archival (work in progress or already published) papers!

1 year ago
Pre-trained Large Language Models Use Fourier Features to Compute Addition Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-tra...

Links & presentation times:
1. Fourier Features: arxiv.org/abs/2406.03445 Thu, 4:30pm
2. TF + ICL: arxiv.org/abs/2310.17086 Fri, 11am
3. Backdoor detection: arxiv.org/abs/2409.00399 Sat, 1:44pm at AdvML Frontiers
4. LLMs + PDDL: arxiv.org/abs/2406.02791 Sun, 2:30pm at OWA workshop
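A toy illustration of the Fourier-features finding in paper 1 (my code, not the paper's): representing n by phase angles 2πn/T turns addition into rotation, so each frequency tracks the sum modulo its period.

```python
# Why Fourier features make addition easy: phases add, and (a+b) mod T can be
# read off the summed phase at period T. Toy illustration only.
import numpy as np

PERIODS = [2, 5, 10, 100]  # a few example frequencies

def fourier_features(n: int) -> np.ndarray:
    """Stack cos and sin of 2*pi*n/T for each period T."""
    angles = 2 * np.pi * n / np.array(PERIODS, dtype=float)
    return np.concatenate([np.cos(angles), np.sin(angles)])

def add_in_feature_space(a: int, b: int) -> list[int]:
    """Recover (a + b) mod T per period purely from the features (a rotation)."""
    k = len(PERIODS)
    fa, fb = fourier_features(a), fourier_features(b)
    out = []
    for i, T in enumerate(PERIODS):
        # angle-addition formulas give cos/sin of the summed phase
        cos_sum = fa[i] * fb[i] - fa[k + i] * fb[k + i]
        sin_sum = fa[i] * fb[k + i] + fa[k + i] * fb[i]
        phase = np.arctan2(sin_sum, cos_sum) % (2 * np.pi)
        out.append(round(phase * T / (2 * np.pi)) % T)
    return out

assert add_in_feature_space(17, 25) == [(17 + 25) % T for T in PERIODS]
```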

1 year ago

I'll be at #NeurIPS2024! My group has papers analyzing how LLMs use Fourier Features for arithmetic and how TFs learn higher-order optimization for ICL (led by @deqing.bsky.social), plus workshop papers on backdoor detection and LLMs + PDDL (led by @billzhu.bsky.social)

1 year ago

A starter pack for #NLP #NLProc researchers! 🎉

go.bsky.app/SngwGeS

1 year ago

USC NLP folks are on Bluesky!
Follow my amazing colleagues here

go.bsky.app/KUwSZ6W

1 year ago

Started a SoCal AI/ML/NLP researchers starter pack! It's a bit sparse right now, and perhaps more NLP-heavy, but hey, nominate yourself and others! go.bsky.app/6QckPj9

1 year ago