
Posts by Yinglun Zhu


For more details, please check out our

Blog: yinglunz.com/blogs/ttm.html
Paper: arxiv.org/pdf/2510.07632
Code: github.com/yinglunz/tes...

Joint work with Jiancheng Zhang and Fuzhi Tang. Feedback and thoughts are very welcome!

5 months ago

Two takeaways:

1. Eval lies at the heart of AI progress.
2. Iterative, matching-based self-improvement works -- and should be explored beyond compositional reasoning!

5 months ago

TTM can also be extended to datasets without local groups -- by treating the entire dataset as a global assignment problem between all images and captions (solved in polynomial time).

The global TTM variant achieves up to 33.3% relative error reduction.
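A toy version of that global step, assuming a precomputed similarity matrix. Brute force over permutations is shown here for clarity; the global variant solves the same assignment problem in polynomial time (e.g., with a Hungarian-style solver), which is what makes it scale to whole datasets.

```python
from itertools import permutations

def best_global_matching(scores):
    """Return the caption permutation maximizing total similarity.

    scores[i][j] = similarity between image i and caption j.
    Brute force for illustration only; an assignment solver
    (e.g. the Hungarian algorithm) finds the same optimum in
    polynomial time.
    """
    n = len(scores)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

# Toy 3x3 similarity matrix with the correct pairs on the diagonal.
scores = [[9, 2, 1],
          [3, 8, 2],
          [1, 4, 7]]
print(best_global_matching(scores))  # -> ((0, 1, 2), 24)
```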

5 months ago

TTM isn’t limited to benchmarks with k-by-k groups.

For 1-by-k groups, GroupMatch = GroupScore, so metric change brings no benefit. Yet, TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.

5 months ago

TTM provides substantial improvements on top of SimpleMatch, without external supervision.

Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.

Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa

5 months ago

To push further, we develop Test-Time Matching (TTM), an iterative, self-improving algorithm with two key components:

(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold schedule to gradually expand coverage across the test set.

5 months ago

SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.

5 months ago

Because a correct GroupMatch also guarantees a perfect GroupScore, this creates an arbitrage opportunity via a two-step SimpleMatch procedure:

1. Select the most likely matching under GroupMatch.
2. Overfit to that matching at test time.
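A minimal sketch of the two steps, under hypothetical names (scores[i][j] is the model's image-caption similarity): pick the highest-scoring overall matching, then answer every pairwise comparison consistently with it, so a correct GroupMatch automatically yields a perfect GroupScore.

```python
from itertools import permutations

def simple_match(scores):
    """Two-step SimpleMatch sketch.

    Step 1: pick the highest-scoring overall matching (GroupMatch).
    Step 2: "overfit" -- answer every pairwise comparison consistently
    with that matching, so a correct matching gives a perfect GroupScore.
    """
    n = len(scores)
    perm = max(permutations(range(n)),
               key=lambda p: sum(scores[i][p[i]] for i in range(n)))
    return {(i, perm[i]) for i in range(n)}  # image i paired with caption perm[i]

print(sorted(simple_match([[2, 5], [4, 1]])))  # -> [(0, 1), (1, 0)]
```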

5 months ago

We introduce a new GroupMatch metric that evaluates the best overall matching instead of isolated pairwise comparisons.

This increases the random-guessing success rate to 1/k! (from 1/6 to 1/2 when k = 2).
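As a rough sketch (not the paper's implementation): a group counts as correct under GroupMatch iff the best overall matching is the ground-truth one. Under random scoring every one of the k! permutations is equally likely to win, hence the 1/k! rate.

```python
from itertools import permutations
from math import factorial

def group_match(scores):
    """GroupMatch sketch: the group is correct iff the highest-scoring
    overall matching is the ground-truth one (identity permutation,
    assuming image i belongs with caption i)."""
    n = len(scores)
    best = max(permutations(range(n)),
               key=lambda p: sum(scores[i][p[i]] for i in range(n)))
    return best == tuple(range(n))

print(group_match([[0.9, 0.1], [0.2, 0.8]]))  # -> True
# Random scoring makes each of the k! permutations equally likely to win:
print(1 / factorial(2))  # -> 0.5
```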

5 months ago

The widely used GroupScore metric requires winning every pairwise comparison between k images and k captions, without enforcing a consistent one-to-one matching -- a single collision means failure.

Under random guessing, the success rate is (k-1)! / (2k-1)! → only 1/6 when k = 2.
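To make the failure mode concrete, here is a minimal GroupScore-style check (illustrative, assuming image i's true caption is caption i): one lost comparison anywhere fails the whole group.

```python
def group_score(scores):
    """GroupScore sketch: every image must rank its own caption first AND
    every caption must rank its own image first; one collision fails."""
    n = len(scores)
    for i in range(n):
        # caption i must beat all other captions for image i ...
        if any(scores[i][j] >= scores[i][i] for j in range(n) if j != i):
            return False
        # ... and image i must beat all other images for caption i.
        if any(scores[j][i] >= scores[i][i] for j in range(n) if j != i):
            return False
    return True

print(group_score([[0.9, 0.1], [0.2, 0.8]]))   # -> True
print(group_score([[0.9, 0.95], [0.2, 0.8]]))  # -> False (one collision)
```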

5 months ago

Multimodal models, even frontier ones, have long been reported to perform at or below random guessing on compositional reasoning benchmarks.

Why does this happen?

We find that part of the difficulty lies in the evaluation metric itself.

5 months ago

Super excited to share Test-Time Matching (TTM), an iterative, self-improving algorithm that unlocks substantial compositional reasoning capabilities in multimodal models.

TTM enables SigLIP-B16 (~0.2B params) to outperform GPT-4.1 on MMVP-VLM, establishing a new SOTA.

5 months ago

Paper: yinglunz.com/pdfs/dtrl.pdf

Joint work with my student Junkai Luo.
Feedback welcome! 🙌

6 months ago

Our algorithm achieves SOTA performance across multiple benchmarks.

We hope these ideas also inspire improvements to GRPO for LLMs—especially in credit assignment.

6 months ago

💡 Building on this insight, we adapt GRPO to online finetuning of DTs, introducing:

• Sub-trajectory optimization → better credit assignment
• Sequence-level likelihood objectives (concurrent w/ GSPO) → stability & efficiency
• Active sampling → improved exploration in uncertain regions
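To illustrate the second bullet only, with made-up numbers (not the paper's objective): a token-level importance weight multiplies per-token ratios and drifts exponentially with sequence length, while a length-normalized sequence-level ratio stays in a stable range.

```python
from math import exp

def token_level_weight(logp_new, logp_old):
    """Token-level importance weight: product of per-token ratios,
    which grows or shrinks exponentially with sequence length."""
    return exp(sum(logp_new) - sum(logp_old))

def sequence_level_weight(logp_new, logp_old):
    """Sequence-level weight (GSPO-style length normalization),
    which stays in a stable range regardless of length."""
    return exp((sum(logp_new) - sum(logp_old)) / len(logp_new))

# A mild 0.1-nat-per-token shift over 20 tokens:
old = [-1.0] * 20
new = [-0.9] * 20
print(token_level_weight(new, old))     # exp(2.0) ~ 7.39
print(sequence_level_weight(new, old))  # exp(0.1) ~ 1.11
```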

6 months ago

🔍 We identify hindsight return relabeling as the key obstacle: while useful for supervised objectives, it destabilizes importance weights for RL methods like PPO and GRPO.

6 months ago

🚀Excited to share our new paper:

Online Finetuning Decision Transformers with Pure RL Gradients

RL drives reasoning in LLMs—but remains underexplored for online finetuning of Decision Transformers (DTs), where most methods still rely mainly on supervised objectives.

Why?

6 months ago

Paper: arxiv.org/pdf/2510.03247
Joint work with my student Jiancheng Zhang.
Feedback welcome!

3/3

6 months ago

Our algorithm combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. It delivers consistent gains over baselines across multiple benchmarks, including COCO and DataComp.
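One way to picture a linear-time acquisition rule mixing uncertainty and diversity (an illustrative sketch, not the paper's algorithm): keep the most uncertain candidate per feature cluster in a single pass, then spend the budget on the most uncertain representatives.

```python
def acquire(candidates, uncertainty, cluster_id, budget):
    """Illustrative linear-pass acquisition: keep the most uncertain
    candidate per cluster (diversity), then spend the budget on the
    most uncertain representatives (uncertainty)."""
    best_per_cluster = {}
    for x in candidates:  # single pass over the pool
        c = cluster_id(x)
        if c not in best_per_cluster or uncertainty(x) > uncertainty(best_per_cluster[c]):
            best_per_cluster[c] = x
    return sorted(best_per_cluster.values(), key=uncertainty, reverse=True)[:budget]

picked = acquire(range(10),
                 uncertainty=lambda x: x % 7,  # toy uncertainty proxy
                 cluster_id=lambda x: x % 3,   # toy cluster / modality bucket
                 budget=2)
print(picked)  # -> [6, 5]
```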

2/3

6 months ago

Sharing new paper: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

We extend classical unimodal active learning to the multimodal setting with unaligned data, enabling data-efficient finetuning and pretraining of vision-language models such as CLIP and SigLIP.

1/3

6 months ago

We hope this work inspires more research on adaptive, efficient deployment of LLMs—where compute is used strategically rather than blindly.

Joint work with my student Bowen Zuo 🙌
Feedback welcome!

9 months ago

Most methods allocate compute uniformly, ignoring variation in query difficulty.

We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically—just enough for easy queries and more for hard ones.
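A minimal sketch of that idea with made-up parameters (probe size, agreement threshold): spend a few samples first, stop early when they agree, and escalate when they do not.

```python
from collections import Counter

def adaptive_allocate(sample_answer, probe=4, max_budget=32, agree=0.75):
    """Difficulty-adaptive compute sketch (all parameters illustrative).

    Draw a few probe samples; if they already agree, answer early.
    Otherwise spend the rest of the budget and majority-vote.
    """
    answers = [sample_answer() for _ in range(probe)]
    top, count = Counter(answers).most_common(1)[0]
    if count / probe >= agree:            # easy query: stop early
        return top, probe
    answers += [sample_answer() for _ in range(max_budget - probe)]
    return Counter(answers).most_common(1)[0][0], max_budget

ans, used = adaptive_allocate(lambda: "42")  # a confident "model"
print(ans, used)  # -> 42 4
```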

📊 Example (avg. budget = 32):

(2/3)

9 months ago

🚀Excited to share our new paper: Strategic Scaling of Test-Time Compute: A Bandit Learning Approach.

We turn test-time compute allocation into a bandit learning problem, achieving:
✅ +11.10% on MATH-500
✅ +7.41% on LiveCodeBench

Paper: arxiv.org/pdf/2506.12721

(1/3)

9 months ago

There is a ton of interest in the question of whether AI can be funny: www.bbc.com/future/artic.... Our paper at NeurIPS investigates the humor generation capabilities of the latest and greatest AI models using one of the world's largest humor datasets! arxiv.org/pdf/2406.10522

1 year ago

I’m recruiting multiple PhD students for Fall 2025 at UCR! If you’re interested in working on efficient ML, RL, and LLMs, please apply to the UCR CS/EE PhD program.

Please visit yinglunz.com for detailed information on research directions and contact instructions.

1 year ago