Best LLM Eval Tools in 2026: 6 Options Tested
awesomeagents.ai/tools/best-llm-eval-tool...
#LlmEvaluation #AiTesting #Deepeval
Why Defense-Specific LLM Testing is a Game-Changer for AI Safety In an era where AI models are increasingly deployed in high-stakes environments, generic evaluation tools no longer cut it. That’s...
#aisafety #llmevaluation #defense #hallucinationdetection
Learn how to build an LLM-as-a-Judge pipeline with LangChain and Claude to score helpfulness and correctness at production scale. #llmevaluation
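A core step in any LLM-as-a-Judge pipeline is turning the judge model's free-text reply into structured scores. Below is a minimal sketch of that parsing step, assuming the judge (e.g. Claude called via LangChain) is prompted to reply with lines like `helpfulness: 4`; the prompt text and function names are illustrative, not from any specific tutorial.

```python
import re

# Hypothetical rubric prompt for the judge model; it is asked to answer in a
# fixed machine-parseable format so scores can be extracted reliably.
JUDGE_PROMPT = (
    "Rate the ASSISTANT answer on helpfulness and correctness (1-5).\n"
    "Reply with exactly two lines: 'helpfulness: <n>' and 'correctness: <n>'.\n\n"
    "QUESTION: {question}\nASSISTANT: {answer}"
)

def parse_judge_scores(reply: str) -> dict:
    """Extract integer scores from the judge's reply, raising on malformed output."""
    scores = {}
    for dim in ("helpfulness", "correctness"):
        m = re.search(rf"{dim}:\s*([1-5])", reply, re.IGNORECASE)
        if not m:
            raise ValueError(f"judge reply missing '{dim}' score")
        scores[dim] = int(m.group(1))
    return scores

# Example with a canned judge reply (no API call made here):
reply = "helpfulness: 4\ncorrectness: 5"
print(parse_judge_scores(reply))  # {'helpfulness': 4, 'correctness': 5}
```

Strict parsing with an explicit failure path matters at production scale: silently defaulting a malformed judge reply to a score skews aggregate metrics.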
Thanks to Ehud Reiter from @uniofaberdeen.bsky.social for sharing a great lecture on how to properly evaluate the quality of texts generated by modern large language models. A valuable exchange on #LLMEvaluation, human evaluation and model quality.
Bill Gold from Citizens Bank breaks down how to evaluate LLMs effectively, balancing benchmarks, human feedback, and real-world use cases. Here’s a clip!
📽️ Watch the full conference talk here: youtu.be/x87jPznuddo
#GenAI #LLMevaluation #databs
Evaluating LLMs is highly subjective! 🧑‍💻 Preferences and perceived performance differ significantly based on specific tasks & prompts. Broad generalizations are tough; nuanced assessments are crucial for your individual needs. #LLMevaluation 3/6
Bayesian Framework Replaces Pass@k for More Reliable LLM Evaluation
A Bayesian evaluation framework replaces Pass@k, providing credible intervals and principled scoring. Tests on AIME'24/25, HMMT'25 and BrUMO'25 show faster convergence and tighter bounds. getnews.me/bayesian-framework-repla... #bayesian #llmevaluation
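The core idea of reporting a credible interval instead of a point estimate like Pass@k can be sketched in a few lines: treat each attempt as a Bernoulli trial, put a Beta(1,1) prior on the pass rate, and read off posterior quantiles. This is a generic Bayesian stand-in under those assumptions, not the paper's actual method; the Monte Carlo sampling is just a dependency-free way to get Beta quantiles.

```python
import random

def pass_rate_interval(successes: int, trials: int, level: float = 0.9,
                       draws: int = 20000, seed: int = 0):
    """Monte Carlo credible interval for a pass rate under a Beta(1,1) prior.

    Posterior is Beta(1 + successes, 1 + failures); we sample it and
    return the central `level` quantile range.
    """
    rng = random.Random(seed)
    a, b = 1 + successes, 1 + trials - successes
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_i = int(draws * (1 - level) / 2)
    hi_i = int(draws * (1 + level) / 2)
    return samples[lo_i], samples[hi_i - 1]

lo, hi = pass_rate_interval(successes=7, trials=10)
print(f"90% credible interval for the pass rate: [{lo:.2f}, {hi:.2f}]")
```

With only 10 trials the interval is wide, which is exactly the information a bare Pass@k number hides.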
Unified LEGO-IRT Framework Enables Data-Efficient LLM Evaluation
The LEGO‑IRT framework can estimate LLM capabilities using only 3% of evaluation items, with up to a 10% error reduction when structural knowledge is added. Read more: getnews.me/unified-lego-irt-framewo... #legoirt #llmevaluation
BluePrint Dataset Enables LLM Evaluation of Social Media Personas
The BluePrint dataset, released on September 27, 2025, offers anonymized political personas from Bluesky with 12 interaction types for LLM next‑action prediction benchmarks. Read more: getnews.me/blueprint-dataset-enable... #blueprint #llmevaluation
Multi-Agent LLM Evaluation via a Social Laboratory Framework
Researchers built a social laboratory in which LLM agents debate, reaching agreement scores above 0.88 when a moderator guides outcomes. Submitted 1 Oct 2025 to NeurIPS 2025. Read more: getnews.me/multi-agent-llm-evaluati... #llmevaluation #neurips
SKYLENAGE Benchmark Launches Multi-Level Math Evaluation for LLMs
SKYLENAGE introduces a benchmark with 100 reasoning items and 150 contest problems; the top model reached 44% accuracy on contests and 81% on reasoning. Read more: getnews.me/skylenage-benchmark-laun... #skylenage #mathbenchmark #llmevaluation
New SPEED Framework Enhances Fair and Interpretable LLM Evaluation
SPEED, a self‑refining evaluation system, uses compact expert models to give descriptive analyses of LLMs instead of a single score; the study spans 16 pages. Read more: getnews.me/new-speed-framework-enha... #speedframework #llmevaluation
New CDT Framework Offers Holistic Evaluation for Large Language Models
Researchers released the Cognition‑Domain‑Task (CDT) framework, which yields up to 2‑point gains in benchmark scores, with fine‑tuned models reaching 44.3 and 45.4. Read more: getnews.me/new-cdt-framework-offers... #cdt #llmevaluation #ai
ByteSized32Refactored: Extensible Text‑Game Corpus for LLM Evaluation
ByteSized32Refactored provides 32 text‑game environments and cuts the codebase to ~10,000 lines with a shared GameBasic.py library. GPT‑4o tests showed mixed results. Read more: getnews.me/bytesized32refactored-ex... #textgames #llmevaluation
TrustJudge Reduces Evaluation Inconsistencies in LLM-as-a-Judge Systems
TrustJudge lowers score‑comparison inconsistency by 8.43% and pairwise transitivity errors by 10.82% using distribution‑sensitive scoring and likelihood‑aware aggregation. getnews.me/trustjudge-reduces-evalu... #trustjudge #llmevaluation
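The "distribution-sensitive scoring" idea can be illustrated in miniature: instead of taking the judge's single most likely rating, score with the expectation over the judge's rating distribution. This is my reading of the summary, not TrustJudge's actual implementation, and the probability tables below are invented for illustration.

```python
# Two hypothetical judge rating distributions (rating -> probability).
# Both share the same most-likely rating, so argmax scoring ties them,
# while the expectation separates them.

def argmax_score(dist: dict) -> int:
    """Conventional scoring: take the single most probable rating."""
    return max(dist, key=dist.get)

def expected_score(dist: dict) -> float:
    """Distribution-sensitive scoring: expectation over the rating distribution."""
    return sum(rating * p for rating, p in dist.items())

answer_a = {3: 0.40, 4: 0.35, 5: 0.25}  # mass leans high
answer_b = {1: 0.30, 2: 0.25, 3: 0.45}  # mass leans low

print(argmax_score(answer_a), argmax_score(answer_b))  # 3 3 (a tie)
print(round(expected_score(answer_a), 2), round(expected_score(answer_b), 2))  # 3.85 2.15
```

Resolving ties like this one is how distribution-aware scoring reduces the score-comparison inconsistencies the post mentions.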
Persona‑Based Prompting Reveals Style Gaps in LLM Benchmarks
The study on persona‑augmented prompting was accepted at EMNLP 2025; the initial version appeared in July 2025 and a revised version in September 2025. Read more: getnews.me/persona-based-prompting-... #llmevaluation #persona #emnlp2025
Evaluating LLM Progress with Benchmarks, Games, and Cognitive Tests
EMNLP 2025 Findings show interactive games reveal bigger performance gaps between LLMs than static benchmarks and capture social-emotional reasoning better. Read more: getnews.me/evaluating-llm-progress-... #llmevaluation #emnlp2025
TALEC Enables Custom LLM Evaluation with In‑House Criteria via ICL
TALEC lets firms set custom LLM evaluation criteria via in‑context learning, reaching over 80% correlation with human judgments. First posted June 2024, updated Sep 2025. Read more: getnews.me/talec-enables-custom-llm... #talec #llmevaluation
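Setting custom criteria via in-context learning boils down to assembling a judge prompt from in-house rules plus a few labeled demonstrations. The sketch below shows that assembly step in the spirit of the summary; the criteria, the example, and all names are invented placeholders, not TALEC's actual prompts.

```python
def build_icl_judge_prompt(criteria, shots, candidate) -> str:
    """Assemble an in-context-learning judge prompt.

    criteria:  list of in-house evaluation rules (strings)
    shots:     list of (input, output, verdict) few-shot demonstrations
    candidate: the output to be judged
    """
    parts = ["You are a domain evaluator. Apply ONLY these in-house criteria:"]
    parts += [f"{i}. {c}" for i, c in enumerate(criteria, 1)]
    for inp, out, verdict in shots:  # demonstrations teach the criteria by example
        parts += [f"\nInput: {inp}", f"Output: {out}", f"Verdict: {verdict}"]
    parts += [f"\nInput: {candidate}", "Verdict:"]
    return "\n".join(parts)

prompt = build_icl_judge_prompt(
    criteria=["Answers must cite an internal policy ID.",
              "Tone must stay formal."],
    shots=[("Refund request", "Approved per POL-12.", "pass")],
    candidate="Sure thing, no worries!",
)
print(prompt)
```

The judge model then completes the trailing `Verdict:` line, so no fine-tuning is needed when the in-house criteria change.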
Design Flaws in LLM Judge Benchmarks Cause Rankings to Turn Into Noise
LLM‑judged benchmarks such as Arena‑Hard Auto can produce noisy rankings; DeepSeek‑R1‑32B showed over 90% unexplained variance and factor correlations above 0.93. Read more: getnews.me/design-flaws-in-llm-judg... #llmevaluation #benchmark
SPEED Framework Enhances LLM Evaluation with Expert‑Driven Diagnostics
Introduced on 24 Sep 2025, the SPEED framework uses expert models to flag hallucinations, toxicity and context issues, offering feedback beyond static scores. Read more: getnews.me/speed-framework-enhances... #speedframework #llmevaluation
DeCE Framework Splits Precision and Recall for Better LLM Evaluation
DeCE separates precision and recall for LLM answers, achieving a Pearson r = 0.78 with expert scores versus BLEU/ROUGE’s r = 0.12. Only 11.95% of criteria needed review. Read more: getnews.me/dece-framework-splits-pr... #dece #llmevaluation
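The decomposition the post describes can be shown with a toy example: split answers into atomic claims, then measure precision (generated claims that are supported) and recall (reference claims that are covered) as separate numbers. The naive sentence-splitting "claim extraction" below is a deliberate simplification; DeCE's actual criteria generation is more involved.

```python
def claims(text: str) -> set:
    """Naive claim extraction: one claim per sentence, normalized."""
    return {s.strip().lower() for s in text.split(".") if s.strip()}

def precision_recall(generated: str, reference: str):
    """Claim-level precision and recall of a generated answer vs. a reference."""
    gen, ref = claims(generated), claims(reference)
    matched = gen & ref
    return len(matched) / len(gen), len(matched) / len(ref)

gen = "Aspirin thins blood. It cures colds."
ref = "Aspirin thins blood. Aspirin reduces fever. Aspirin lowers clot risk."
p, r = precision_recall(gen, ref)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.33
```

A single similarity score (BLEU/ROUGE-style) would blur these two failure modes together: the low precision flags the unsupported cold-cure claim, while the low recall flags the missing reference facts.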
NazoNazo Benchmark Evaluates Insight Reasoning in Large Language Models
The NazoNazo benchmark tests insight reasoning with Japanese riddles; humans scored 52.9% accuracy on a set of 120 riddles, and only GPT‑5 came close to that level. getnews.me/nazonazo-benchmark-evalu... #nazonazo #llmevaluation #riddles
📄 Paper: arxiv.org/abs/2411.15124
📦 Datasets: huggingface.co/collections/...
🎬 YouTube: www.youtube.com/watch?v=DPhq...
🎙 Spotify: open.spotify.com/episode/7aHP...
🎧 Apple: podcasts.apple.com/ca/podcast/o...
#WiAIR #LLMEvaluation #LLMs #OpenScience 8/8🧵
Instance-level Randomization Improves Stability of LLM Evaluations
Instance-level randomization (ILR) averages multiple runs per test case, cutting variance and using less than half the compute of fixed-setting benchmarks. Read more: getnews.me/instance-level-randomiza... #instancelevelrandomization #llmevaluation
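The mechanism is simple to sketch: draw a fresh random setting per instance per run (the seed below stands in for things like option order or prompt formatting), then average accuracy across runs. The simulated "model" here is invented purely to make the example self-contained; it is not from the paper.

```python
import random
import statistics

def run_eval(n_items: int, rng: random.Random) -> float:
    """One evaluation run: each item gets a fresh random setting and the
    simulated model passes it with probability 0.6."""
    return sum(rng.random() < 0.6 for _ in range(n_items)) / n_items

def ilr_score(n_items: int = 200, n_runs: int = 5, seed: int = 0) -> float:
    """ILR-style score: mean accuracy over several randomized runs."""
    rng = random.Random(seed)
    return statistics.mean(run_eval(n_items, rng) for _ in range(n_runs))

single = run_eval(200, random.Random(0))
averaged = ilr_score()
print(f"single run: {single:.3f}, ILR average of 5 runs: {averaged:.3f}")
```

Averaging over randomized runs shrinks the run-to-run noise that a single fixed-setting benchmark score carries, which is the stability gain the post refers to.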
Direct Judgement Preference Optimization Improves LLM Evaluation
A new study on Direct Judgement Preference Optimization shows the model beats GPT‑4o and other baselines on 10 of 13 benchmarks, reducing position and length bias. getnews.me/direct-judgement-prefere... #llmevaluation #gpt4o
🪜 Depth hurts performance.
Accuracy drops with longer chains; failures dominate at 16+ steps. Even "correct" labels can mask faulty chains (10.7%); proof-level checks catch these shortcuts.
#LLMEvaluation (5/7)
🧩 What’s P-FOLIO?
▪️ 1,430 expert proofs
▪️ Chains 0–20 steps
▪️ 31 inference rules (widely used + complex)
A natural-language stress test for multi-step logic in FOLIO stories.
#LLMs #LLMEvaluation (2/7)
🗝️ Finding #1:
Even state-of-the-art models such as OpenAI o3, DeepSeek R1, and Gemini Flash often rely on brute force when creative shortcuts are available. (4/8)
#LLMEvaluation #AIResearch
Five hard-earned lessons about Evals — Ankur Goyal, Braintrust
The talk shows that an eval system is an engineered, automated artifact; when it demonstrates clear business value—like a 1‑day model rollout—it speaks for itself. https://youtu.be/a4BV0gGmXgA #LLMEvaluation #ProductEngineering