
Posts by Aarash Feizi

[Link preview] PairBench: A Systematic Framework for Selecting Reliable Judge VLMs — arxiv.org/abs/2502.15210

🧵 7/7

📢 Shoutout to my amazing co-authors and to ServiceNow Research and Mila for making this happen! 🚀

📄 Read the full paper: arxiv.org/abs/2502.15210

#PairBench #LLMs #VLMs #GenAI #AutoEval


🧵 6/7

✅ Beyond benchmarking, PairBench can be used during VLM training & fine-tuning to detect biases early and improve evaluation methods!

This could lead to more trustworthy, consistent AI systems for real-world tasks. 🚀


🧵 5/7

✅ PairBench correlates strongly with existing benchmarks, so it can serve as a low-cost alternative to expensive human-annotated benchmarks!

This makes it easy to compare and rank models without excessive computational cost.


🧵 4/7

Instead of blindly picking a judge model, we should ask:
🔹 What task is being evaluated?
🔹 What metric matters most?

✅ PairBench helps match the right VLM to the right task, improving fairness & reliability in auto-evaluation.


🧵 3/7

🚨 No single VLM is best across the board! Models vary drastically across PairBench metrics.

Even models that align well with human judgements may struggle with symmetry, smoothness, or controllability, making their scores unreliable!

📄 More failure cases in our paper’s appendix!


🧵 2/7

✅ Surprising (and concerning) result: most VLMs lack symmetry! 🤯

In theory, sim(A, B) = sim(B, A), but in practice many models fail this check!

For example, simply swapping the order of the input images makes GPT-4o and Gemini 1.5 Pro change their decisions and scores drastically. 🔄
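The symmetry check above can be sketched in a few lines. This is a minimal illustration, not PairBench's actual metric: `judge` is a hypothetical stand-in for a VLM similarity call, and the toy judges are invented for demonstration — the real metric definitions are in the paper.

```python
def symmetry_gap(judge, pairs):
    """Mean absolute difference between judge(A, B) and judge(B, A).

    A perfectly symmetric judge scores 0; larger values mean the
    judge's score depends on input order.
    """
    gaps = [abs(judge(a, b) - judge(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)

# Toy judge that ignores input order: gap is 0
sym = lambda a, b: abs(len(a) - len(b)) / 10
print(symmetry_gap(sym, [("cat", "kitten"), ("dog", "puppy")]))  # prints 0.0

# Toy judge biased toward its first input: nonzero gap
biased = lambda a, b: len(a) / (len(a) + len(b))
print(symmetry_gap(biased, [("cat", "kitten"), ("dog", "puppy")]))
```

Swapping the pair order exposes the order bias directly, which is the same probe the thread describes for GPT-4o and Gemini 1.5 Pro.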


🧵 1/7

Vision language models (VLMs) are widely used as automated evaluators, but can they actually compare data reliably? 🤔

✅ PairBench systematically tests how well VLMs judge similarity across modalities, revealing key strengths & weaknesses in their decisions.


🚨 Excited to introduce PairBench! 🚨

💡 TL;DR: VLM judges can fail at data comparison!

✅ PairBench helps you pick the right one by testing alignment, symmetry, smoothness & controllability, ensuring reliable auto-evaluation.

📄 Paper: arxiv.org/abs/2502.15210

🧵 Thread: 👇
