It's about what's hidden, and what new deficiencies the tech carries with it. @victorojewale.bsky.social opines on the evolution of deployed AI and its limits. victorojewale.substack.com/p/from-exper...
Posts by Ryan Steed
It's been a journey of nearly 3 years, but I'm very excited to announce the CNTR AISLE Portal! 🚀 cntr-aisle.org It’s a new way to review and evaluate the 1,000+ AI bills introduced in the U.S. over the last three years. Check out the Bill Library and our Profiles. #AIPolicy #OpenData
GLMMs are just one approach — we’re looking forward to more work on statistical frameworks for AI evaluation. Send questions/comments to caisi-metrology@nist.gov.
Paper (w/ the talented Drew Keller, Kweku Kwegyir-Aggrey, Anita Rao, Julia Sharp, and Stevie Bergman): nvlpubs.nist.gov/nistpubs/ai/...
Fig. 6a from the paper (Distribution of estimated question difficulties by domain and labeled difficulty, GPQA-Diamond): distribution of random effects in each domain. Each dot indicates a GPQA-Diamond question’s GLMM-estimated difficulty (i.e., random effect value); box plots display quartiles and violin plots display estimated density. These estimates show that GPQA-Diamond’s chemistry questions were particularly difficult for the 22 tested LLMs.
Fig. 6b from the paper (Distribution of estimated question difficulties by domain and labeled difficulty, GPQA-Diamond): distribution of random effects for the 191 questions at the three most common writer-annotated difficulty levels. Each dot indicates a question’s GLMM-estimated difficulty (i.e., random effect value); box plots display quartiles and violin plots display estimated density. Question difficulty for the tested LLMs has only a weak relationship with writer-labeled difficulty, which may suggest that humans and the tested LLMs find different questions difficult, and/or could call into question whether writer annotations are accurate even for human difficulty.
GLMMs have other benefits, too:
- We can estimate question difficulties to identify problematic questions and other patterns in benchmarks.
- Variance decomposition (between- and within-questions) can highlight nuances in performance between tasks, languages, and other subsets of a benchmark.
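For the variance-decomposition point, one common summary for a logistic GLMM is the share of latent-scale (logit) variance attributable to differences between questions, an intraclass-correlation-style quantity. A tiny sketch, using a hypothetical variance-component value rather than anything from the report:

```python
import math

# Hypothetical between-question variance from a fitted logistic GLMM
# (variance of the per-question random intercepts, on the logit scale).
sigma2_question = 1.8

# Residual within-question variance on the latent scale for a logistic model.
sigma2_resid = math.pi ** 2 / 3

# Share of latent variance attributable to differences between questions.
icc = sigma2_question / (sigma2_question + sigma2_resid)
print(f"between-question share of latent variance: {icc:.2f}")
```

Values near 1 would suggest results are driven mostly by which questions appear in the benchmark; values near 0 would suggest mostly trial-to-trial noise.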
Ideally, evaluators should use a statistical model to explicitly define the estimand & other statistical assumptions.
We propose one approach using generalized linear mixed models. GLMMs can often estimate uncertainty more precisely than typical “regression-free” approaches.
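As a rough illustration (not the exact model specification from the report), here’s a minimal sketch of fitting such a model in Python with statsmodels: a logistic GLMM with a fixed effect per LLM and a random intercept per question. The data are simulated and all names are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)

# Simulated long-format eval results: one row per (llm, question) trial with a
# binary `correct` outcome. Real data would come from benchmark runs.
n_questions = 100
skill = {"model_a": 0.2, "model_b": 0.8, "model_c": 1.4}   # latent LLM effects
difficulty = rng.normal(0.0, 1.5, n_questions)             # latent question effects
rows = [
    {
        "llm": llm,
        "question": f"q{q}",
        "correct": rng.binomial(1, 1 / (1 + np.exp(-(skill[llm] - difficulty[q])))),
    }
    for llm in skill
    for q in range(n_questions)
]
df = pd.DataFrame(rows)

# Logistic GLMM: fixed effect per LLM plus a random intercept per question, so
# uncertainty reflects the sampling of questions as well as trial-level noise.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ 0 + C(llm)",
    {"question": "0 + C(question)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes fit
print(result.summary())
```

The summary reports posterior means and standard deviations for the per-LLM effects (on the logit scale) and for the question variance component; accuracy estimates and intervals follow from transforming those.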
Fig. 1 from the paper: Comparing accuracy estimates (GPQA-Diamond). Lower plots show the estimated accuracy of a selection of tested LLMs* with 95% confidence intervals. Upper plots show corresponding confidence interval (CI) widths. Generalized accuracy CIs are larger than benchmark accuracy CIs because they account for the selection of benchmark items from a superpopulation. Notably, some pairs of LLMs may have significantly different benchmark accuracy but not generalized accuracy. The simple average (pink) estimates reflect the average across all n benchmark questions with standard error calculated as standard deviation of results divided by √n. For estimates of benchmark accuracy, the simple average method results in under-confident CIs compared to a valid regression-free method (blue). For estimates of generalized accuracy, the simple average method provides valid CIs, but precision can be increased by running more trials per item (as in the regression-free method). Generalized linear mixed model (GLMM, orange) estimates require additional assumptions but further increase precision.
AI evals rarely specify which question is being answered — but the choice matters, especially when it comes to computing error bars. (Assuming error bars are included at all…)
In particular, error bars for generalized accuracy tend to be larger and may yield different rankings.
We identify two distinct questions about accuracy:
- Benchmark accuracy: How well does the LLM perform on this specific, fixed benchmark?
- Generalized accuracy: How well would the LLM perform across the larger population of questions similar to those in this benchmark?
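A toy numeric sketch of how these two targets lead to different error bars, under my own simplified assumptions (one LLM, several trials per question, plug-in variance estimates): benchmark-accuracy uncertainty comes only from trial-level noise on the fixed questions, while generalized-accuracy uncertainty also includes between-question variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: k trials of one LLM on each of n benchmark questions (simulated).
n_questions, k_trials = 200, 5
difficulty = rng.normal(0.0, 1.5, n_questions)
p_correct = 1 / (1 + np.exp(-(0.5 - difficulty)))
correct = rng.binomial(1, p_correct[:, None], (n_questions, k_trials))

per_question_acc = correct.mean(axis=1)
overall_acc = per_question_acc.mean()

# Benchmark accuracy: questions are fixed, so only trial-level noise contributes.
se_benchmark = np.sqrt(np.sum(per_question_acc * (1 - per_question_acc) / k_trials)) / n_questions

# Generalized accuracy: questions are themselves a sample from a larger
# population, so between-question variation contributes as well.
se_generalized = per_question_acc.std(ddof=1) / np.sqrt(n_questions)

print(f"accuracy: {overall_acc:.3f}")
print(f"95% CI half-width (benchmark):   {1.96 * se_benchmark:.3f}")
print(f"95% CI half-width (generalized): {1.96 * se_generalized:.3f}")
```

The generalized interval is wider because it has to cover questions the benchmark did not happen to include.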
AI benchmark evals commonly report "accuracy" metrics — but what’s really being measured? And how should we compute the error bars?
New NIST report from my team at CAISI outlines a better statistical framework for eval analysis: www.nist.gov/news-events/...
This has been a massive community project, and we need you all to participate!
See more: evalevalai.com/projects/eve...
Had the chance to give feedback on this project on CAISI’s behalf. I’m very excited to see this develop!
If this kind of work speaks to you, come work with us! My team at CAISI is hiring an Applied Systems AI Research Scientist, among many other roles. www.nist.gov/caisi/career...
CAISI invites input on any aspect of this draft, including from orgs that conduct AI evals and from users of eval reports (for decision-making, procurement, integration, etc.)
Public comment closes March 31 — details here: www.nist.gov/news-events/...
Table I.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
Automated benchmarks are not all you need, but they are popular tools in AI development. Hoping this doc is a foundation for future guidelines on field testing and other kinds of evals.
Table 3.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
Section 3 covers critical practices related to responsible and transparent reporting — including uncertainty quantification, reproducibility, and properly qualified claims.
Section 2 dives into the nitty-gritty operational details of setting up and running a benchmark — including helpful lists of relevant settings and design principles.
Table 1.1 from https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-2.ipd.pdf
I’m especially excited about the focus on practical measurement validity.
Section 1 describes ways to assess the relationship between the contents of a benchmark and what evaluators really want to measure.
Excited to co-author a new public draft from NIST CAISI on best practices for automated benchmark evals.
We want your feedback! Public comment open til March 31. Highlights below 🧵
www.nist.gov/news-events/...
NIST CAISI is also hiring post-docs — applications due Feb. 1.
Come work with our team on AI evaluations and metrology!
Apply here: ra.nas.edu/RAPLab10/Opp...
More detail: www.linkedin.com/posts/astevi...
Late notice, but: NIST has a multi-year, funded graduate fellowship + summer intern program (potentially with our team at CAISI!). Due January 15.
stemfellowships.org/applicants/
About the PhD

Audits and evaluation of AI systems — and the broader context that AI systems operate in — have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, and with little deliberation and reflection around conceptual rigour and methodological validity.

This PhD is for a candidate who is passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as: What does it mean to represent “ground truth” in proxies, synthetic data, or computational simulation? How do we reliably measure abstract and complex phenomena? What are the epistemological or methodological implications of the quantification and measurement approaches we choose to employ? In particular, what underlying presuppositions, values, or perspectives do they entail? How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies?

Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices, with the aim of applying it to advance shared standards and best practice in AI evaluation. The candidate is expected to integrate empirical (for example, through analysis or evaluation of existing benchmarks) or practical (for example, by executing evaluation of AI systems) components into the overall work.
Are you passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like? Come do a PhD with us.
Closing Date: 10 February 2026
Apply here aial.ie/hiring/phd-a...
US CAISI is hiring -- the internal govt name for the role is "IT Specialist" but it is effectively a research scientist role!
Salary is $120,579 to $195,200 per year, and you get to work on AI evaluation within government agencies!
Job posting (**closes EOD 12/28/2025**): lnkd.in/exJgkqr5
Note that this position requires a specially formatted resume, 2 pages max: help.usajobs.gov/faq/applicat...
Also, our team is hiring an AI Research Scientist!
www.usajobs.gov/job/851528400
Also, belated announcement that I joined @steviebergman.bsky.social’s wonderful Applied Systems team at CAISI — with @anitakrao.bsky.social, Drew Keller, & (formerly) @kwekuka.bsky.social. More to come!
"Building gold-standard AI systems requires gold-standard AI measurement science... Today, many evaluations of AI systems do not precisely articulate what has been measured, much less whether the measurements are valid."
We highlight open q's about construct validity, field studies, and more.
Our team at NIST's Center for AI Standards and Innovation (CAISI) just released a blog post with open questions for AI measurement science:
www.nist.gov/blogs/caisi-...
US CAISI (the equivalent of the US "AI Safety Institute") just put out their approach to AI measurement & there's such a significant portion on construct validity (nist.gov/blogs/caisi-...).
Great to see this after ongoing advocacy on this issue (arxiv.org/abs/2511.04703)!
After having such a great time at #CHI2025 and #FAccT2025, I wanted to share some of my favorite recent papers here!
I'll aim to post new ones throughout the summer and will tag all the authors I can find on Bsky. Please feel welcome to chime in with thoughts / paper recs / etc.!!
🧵⬇️:
Thank you :) Love this thread
A title slide with the paper title: "Legacy Procurement Practices Shape How U.S. Cities Govern AI". The title includes a small illustration that is a simple chart: A government provides an AI vendor with money, and in exchange the vendor provides the government with an AI system.
Q: What do school buses, desktop computers, and AI have in common?
A: The same decades-old laws and processes apply when governments go to purchase them.
Our new #FAccT2025 paper examines how these legacy public procurement norms apply to AI. 🧵