It's about what's hidden, and what new deficiencies the tech carries with it. @victorojewale.bsky.social opines on the evolution of deployed AI and its limits. victorojewale.substack.com/p/from-exper...
Posts by Ryan Steed
It's been a journey of nearly 3 years, but I'm very excited to announce the CNTR AISLE Portal! 🚀 cntr-aisle.org It’s a new way to review and evaluate the 1,000+ AI bills introduced in the U.S. over the last three years. Check out the Bill Library and our Profiles. #AIPolicy #OpenData
GLMMs are just one approach — we’re looking forward to more work on statistical frameworks for AI evaluation. Send questions/comments to caisi-metrology@nist.gov.
Paper (w/ the talented Drew Keller, Kweku Kwegyir-Aggrey, Anita Rao, Julia Sharp, and Stevie Bergman): nvlpubs.nist.gov/nistpubs/ai/...
Fig. 6a from the paper (Distribution of estimated question difficulties by domain and labeled difficulty, GPQA-Diamond): distribution of random effects in each domain. Each dot indicates a GPQA-Diamond question’s GLMM-estimated difficulty (i.e., random effect value); box plots display quartiles and violin plots display estimated density. These estimates show that GPQA-Diamond’s chemistry questions were particularly difficult for the 22 tested LLMs.
Fig. 6b from the paper (Distribution of estimated question difficulties by domain and labeled difficulty, GPQA-Diamond): distribution of random effects for the 191 questions at the three most common writer-annotated difficulty levels. Each dot indicates a question’s GLMM-estimated difficulty (i.e., random effect value); box plots display quartiles and violin plots display estimated density. Question difficulty for the tested LLMs has only a weak relationship with writer-labeled difficulty, which may suggest that humans and the tested LLMs find different questions difficult, and/or could call into question whether writer annotations are accurate even for human difficulty.
GLMMs have other benefits, too:
- We can estimate question difficulties to identify problematic questions and other patterns in benchmarks.
- Variance decomposition (between- and within-questions) can highlight nuances in performance between tasks, languages, and other subsets of a benchmark.
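For the variance-decomposition point, one common summary for a logistic GLMM is the share of latent-scale (logit) variance attributable to differences between questions, an intraclass-correlation-style quantity. A tiny sketch, using a hypothetical variance-component value rather than anything from the report:

```python
import math

# Hypothetical between-question variance from a fitted logistic GLMM
# (variance of the per-question random intercepts, on the logit scale).
sigma2_question = 1.8

# Residual within-question variance on the latent scale for a logistic model.
sigma2_resid = math.pi ** 2 / 3

# Share of latent variance attributable to differences between questions.
icc = sigma2_question / (sigma2_question + sigma2_resid)
print(f"between-question share of latent variance: {icc:.2f}")
```

Values near 1 would suggest results are driven mostly by which questions appear in the benchmark; values near 0 would suggest mostly trial-to-trial noise.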
Ideally, evaluators should use a statistical model to explicitly define the estimand & other statistical assumptions.
We propose one approach using generalized linear mixed models. GLMMs can often estimate uncertainty more precisely than typical “regression-free” approaches.
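As a rough illustration (not the exact model specification from the report), here’s a minimal sketch of fitting such a model in Python with statsmodels: a logistic GLMM with a fixed effect per LLM and a random intercept per question. The data are simulated and all names are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)

# Simulated long-format eval results: one row per (llm, question) trial with a
# binary `correct` outcome. Real data would come from benchmark runs.
n_questions = 100
skill = {"model_a": 0.2, "model_b": 0.8, "model_c": 1.4}   # latent LLM effects
difficulty = rng.normal(0.0, 1.5, n_questions)             # latent question effects
rows = [
    {
        "llm": llm,
        "question": f"q{q}",
        "correct": rng.binomial(1, 1 / (1 + np.exp(-(skill[llm] - difficulty[q])))),
    }
    for llm in skill
    for q in range(n_questions)
]
df = pd.DataFrame(rows)

# Logistic GLMM: fixed effect per LLM plus a random intercept per question, so
# uncertainty reflects the sampling of questions as well as trial-level noise.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ 0 + C(llm)",
    {"question": "0 + C(question)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes fit
print(result.summary())
```

The summary reports posterior means and standard deviations for the per-LLM effects (on the logit scale) and for the question variance component; accuracy estimates and intervals follow from transforming those.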
Fig. 1 from the paper: Comparing accuracy estimates (GPQA-Diamond). Lower plots show the estimated accuracy of a selection of tested LLMs* with 95% confidence intervals. Upper plots show corresponding confidence interval (CI) widths. Generalized accuracy CIs are larger than benchmark accuracy CIs because they account for the selection of benchmark items from a superpopulation. Notably, some pairs of LLMs may have significantly different benchmark accuracy but not generalized accuracy. The simple average (pink) estimates reflect the average across all n benchmark questions with standard error calculated as standard deviation of results divided by √n. For estimates of benchmark accuracy, the simple average method results in under-confident CIs compared to a valid regression-free method (blue). For estimates of generalized accuracy, the simple average method provides valid CIs, but precision can be increased by running more trials per item (as in the regression-free method). Generalized linear mixed model (GLMM, orange) estimates require additional assumptions but further increase precision.
AI evals rarely specify which question is being answered — but the choice matters, especially when it comes to computing error bars. (Assuming error bars are included at all…)
In particular, error bars for generalized accuracy tend to be larger and may yield different rankings.
We identify two distinct questions about accuracy:
- Benchmark accuracy: How well does the LLM perform on this specific, fixed benchmark?
- Generalized accuracy: How well would the LLM perform across the larger population of questions similar to those in this benchmark?
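A toy numeric sketch of how these two targets lead to different error bars, under my own simplified assumptions (one LLM, several trials per question, plug-in variance estimates): benchmark-accuracy uncertainty comes only from trial-level noise on the fixed questions, while generalized-accuracy uncertainty also includes between-question variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: k trials of one LLM on each of n benchmark questions (simulated).
n_questions, k_trials = 200, 5
difficulty = rng.normal(0.0, 1.5, n_questions)
p_correct = 1 / (1 + np.exp(-(0.5 - difficulty)))
correct = rng.binomial(1, p_correct[:, None], (n_questions, k_trials))

per_question_acc = correct.mean(axis=1)
overall_acc = per_question_acc.mean()

# Benchmark accuracy: questions are fixed, so only trial-level noise contributes.
se_benchmark = np.sqrt(np.sum(per_question_acc * (1 - per_question_acc) / k_trials)) / n_questions

# Generalized accuracy: questions are themselves a sample from a larger
# population, so between-question variation contributes as well.
se_generalized = per_question_acc.std(ddof=1) / np.sqrt(n_questions)

print(f"accuracy: {overall_acc:.3f}")
print(f"95% CI half-width (benchmark):   {1.96 * se_benchmark:.3f}")
print(f"95% CI half-width (generalized): {1.96 * se_generalized:.3f}")
```

The generalized interval is wider because it has to cover questions the benchmark did not happen to include.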
AI benchmark evals commonly report "accuracy" metrics — but what’s really being measured? And how should we compute the error bars?
New NIST report from my team at CAISI outlines a better statistical framework for eval analysis: www.nist.gov/news-events/...
This has been a massive community project, and we need you all to participate!
See more: evalevalai.com/projects/eve...
Had the chance to give feedback on this project on CAISI’s behalf. I’m very excited to see this develop!
If this kind of work speaks to you, come work with us! My team at CAISI is hiring an Applied Systems AI Research Scientist, among many other roles. www.nist.gov/caisi/career...
CAISI invites input on any aspect of this draft, including from orgs that conduct AI evals and from users of eval reports (for decision-making, procurement, integration, etc.)
Public comment closes March 31 — details here: www.nist.gov/news-events/...
Table I.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
Automated benchmarks are not all you need, but they are popular tools in AI development. Hoping this doc is a foundation for future guidelines on field testing and other kinds of evals.
Table 3.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
Section 3 covers critical practices related to responsible and transparent reporting — including uncertainty quantification, reproducibility, and properly qualified claims.
Section 2 dives into the nitty-gritty operational details of setting up and running a benchmark — including helpful lists of relevant settings and design principles.
Table 1.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
I’m especially excited about the focus on practical measurement validity.
Section 1 describes ways to assess the relationship between the contents of a benchmark and what evaluators really want to measure.
Excited to co-author a new public draft from NIST CAISI on best practices for automated benchmark evals.
We want your feedback! Public comment open til March 31. Highlights below 🧵
www.nist.gov/news-events/...
NIST CAISI is also hiring post-docs — applications due Feb. 1.
Come work with our team on AI evaluations and metrology!
Apply here: ra.nas.edu/RAPLab10/Opp...
More detail: www.linkedin.com/posts/astevi...
Late notice, but: NIST has a multi-year, funded graduate fellowship + summer intern program (potentially with our team at CAISI!). Due January 15.
stemfellowships.org/applicants/
About the PhD

Audits and evaluation of AI systems — and the broader context that AI systems operate in — have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, and with little deliberation and reflection around conceptual rigour and methodological validity.

This PhD is for a candidate who is passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as: What does it mean to represent “ground truth” in proxies, synthetic data, or computational simulation? How do we reliably measure abstract and complex phenomena? What are the epistemological or methodological implications of the quantification and measurement approaches we choose to employ? In particular, what underlying presuppositions, values, or perspectives do they entail? How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies?

Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices, with the aim of applying it to advance shared standards and best practice in AI evaluation. The candidate is expected to integrate empirical (for example, through analysis or evaluation of existing benchmarks) or practical (for example, by executing evaluation of AI systems) components into the overall work.
Are you passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like? Come do a PhD with us.
Closing Date: 10 February 2026
Apply here aial.ie/hiring/phd-a...
US CAISI is hiring -- the internal govt name for the role is "IT Specialist" but it is effectively a research scientist role!
Salary is $120,579 to $195,200 per year, and you get to work on AI evaluation within government agencies!
Job posting (**closes EOD 12/28/2025**): lnkd.in/exJgkqr5
Note that this position requires a specially formatted resume, 2 pages max: help.usajobs.gov/faq/applicat...
Also, our team is hiring an AI Research Scientist!
www.usajobs.gov/job/851528400
Also, belated announcement that I joined @steviebergman.bsky.social’s wonderful Applied Systems team at CAISI — with @anitakrao.bsky.social, Drew Keller, & (formerly) @kwekuka.bsky.social. More to come!
"Building gold-standard AI systems requires gold-standard AI measurement science... Today, many evaluations of AI systems do not precisely articulate what has been measured, much less whether the measurements are valid."
We highlight open q's about construct validity, field studies, and more.
Our team at NIST's Center for AI Standards and Innovation (CAISI) just released a blog post with open questions for AI measurement science:
www.nist.gov/blogs/caisi-...
US CAISI (the equivalent of the US "AI Safety Institute") just put out their approach to AI measurement & there's such a significant portion on construct validity (nist.gov/blogs/caisi-...).
Great to see this after ongoing advocacy on this issue (arxiv.org/abs/2511.04703)!
After having such a great time at #CHI2025 and #FAccT2025, I wanted to share some of my favorite recent papers here!
I'll aim to post new ones throughout the summer and will tag all the authors I can find on Bsky. Please feel welcome to chime in with thoughts / paper recs / etc.!!
🧵⬇️:
Thank you :) Love this thread
A title slide with the paper title: "Legacy Procurement Practices Shape How U.S. Cities Govern AI". The title includes a small illustration that is a simple chart: A government provides an AI vendor with money, and in exchange the vendor provides the government with an AI system.
Q: What do school buses, desktop computers, and AI have in common?
A: The same decades-old laws and processes apply when governments go to purchase them.
Our new #FAccT2025 paper examines how these legacy public procurement norms apply to AI. 🧵