
Posts by Avijit Ghosh

This has been a massive community project, and we need you all to participate!

See more: evalevalai.com/projects/eve...

1 month ago

This 100%

2 months ago

This has to be rage bait. Did we not see the South Park episode where ChatGPT suggested a business idea to convert fries to salad? (And I tried the prompt myself too)

3 months ago

Brisket for Thanksgiving >>>> turkey for Thanksgiving

4 months ago

Who is winning the open AI race?

Our new study, Economies of Open Intelligence, maps downloads of 851k @hf.co models from 2020→2025.

1) Power rebalance: US tech ↓; China + community ↑
2) Model size & efficiency ↑ (MoE, quant, multimodal)
3) Intermediary layers ↑ (adapters/quantizers)
4) Transparency ↓

/🧵

4 months ago

I used to love the word “key” until AI models decided to love it and now I cringe at “key takeaways” in text material :(

4 months ago

It’s that time of the year again! I’ll be at @neuripsconf.bsky.social this year too :) If you’re interested in Responsible AI, AI Evals ( @eval-eval.bsky.social ) or AI4Science (Hugging Science), say hi!

4 months ago

🚨 AI keeps scaling, but social impact evaluations aren’t, and the data proves it 🚨

Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)

4 months ago
Preview
AI Evaluation Dashboard: Professional AI system evaluation and assessment tool

A (very incomplete) frontend of Eval Cards can be found here: evalcards.evalevalai.com, and we are now collecting eval datasets (to show in eval cards) on github: github.com/evaleval/eve...

If you want to help see eval cards come alive, get in touch!

4 months ago

Finally, what's next from here? Almost every developer we spoke to said that what we need is a standardized way of reporting, aggregating and comparing all the evals done by both 1st and 3rd parties for a model. This is actually our next project: Eval Cards!
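To make the idea concrete, here is a purely illustrative sketch of what a standardized eval-card record might look like. The field names and structure are my own assumptions for illustration, not the Eval Cards project's actual schema:

```python
# Hypothetical sketch of a standardized eval-card record.
# All field names here are illustrative assumptions, not the real Eval Cards schema.
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    dimension: str   # e.g. "Bias & Harm", "Env. Costs & Emissions"
    benchmark: str   # name of the eval dataset or benchmark used
    score: float
    evaluator: str   # "first-party" or "third-party"


@dataclass
class EvalCard:
    model_id: str
    results: list[EvalResult] = field(default_factory=list)

    def coverage(self) -> set[str]:
        """Return the social-impact dimensions with at least one reported eval."""
        return {r.dimension for r in self.results}


card = EvalCard(model_id="org/some-model")
card.results.append(EvalResult("Bias & Harm", "BBQ", 0.71, "third-party"))
print(card.coverage())  # {'Bias & Harm'}
```

A shared record shape like this is what would make first- and third-party results aggregatable and comparable across models, which is the gap the post describes.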

4 months ago

Incredible work done with literally the smartest and most passionate researchers I am lucky to work with. Paper co-led with @ankareuel.bsky.social and Jenny Chim, and other co-authors!

4 months ago
Preview
Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluation...

Read the detailed results here: arxiv.org/abs/2511.05613

We also release the code, and the full annotated dataset on Hugging Face (link in paper).

4 months ago

This only strengthens our position that good-quality, independent third-party evaluations are paramount for AI safety.

4 months ago

First-party reports are also less transparent or lower quality. In interviews with eval practitioners, we found that companies have laid off or reassigned teams dedicated to documentation & social impact evals, or told those teams to focus more on capability reporting.

4 months ago

This is true even at the provider level. We find, for example, that Google used to report far more about its model evaluations in 2022 and 2023 but reduced reporting in the Gemini era, and the same can be seen for Meta over successive Llama versions.

4 months ago

We find that model developers have become less transparent about their eval results over time. For instance, environmental cost reporting in first-party reports (release docs, model cards, system cards) has drastically declined over time. Fewer than 15% mention labor or the environment!

4 months ago

We take a look at the entire eval landscape, specifically social impact evals across 7 dimensions: Bias & Harm, Sensitive Content, Performance Disparity, Env. Costs & Emissions, Privacy & Data, Financial Costs, and Moderation Labor. Who is reporting these evals?

4 months ago

Extremely thrilled to talk about our new paper: "Who Evaluates AI’s Social Impacts? Mapping Coverage And Gaps In First And Third Party Evaluations".

This is the first big project output from the
@eval-eval.bsky.social coalition! Thread below:

4 months ago

… this looks like the Nature font oh no

4 months ago

We have a call for posters out! Please submit your extended abstracts; it should be quick and easy. And just like last year, provocative work is especially encouraged, as it makes for such interesting conversation 😈

5 months ago

This. Copyright is a tool for protection, but it’s not everything. In fact, there’s research showing that it is possible to create competitive language models using public domain data only. The proliferation of copyright-respecting models would not solve the labor-impact policy problem.

5 months ago
2025 Workshop on Evaluating AI in Practice EvalEval, UK AI Security Institute (AISI), and UC San Diego (UCSD) are excited to announce the upcoming Evaluating AI in Practice workshop, happening on December 8, 2025, in San Diego, California.

Going to San Diego for Neurips? We at @eval-eval.bsky.social , along with the UK AISI, are hosting a closed door state of evals workshop at @ucsandiego.bsky.social on Dec 8th.

Request to join below! :)

evaleval.github.io/events/works...

5 months ago

The thing about non-survey papers is that they can still be problematic, fake science, etc., and arXiv needs a long-overdue, moderated comments section

5 months ago
Preview
Support scientific data formats · Issue #7804 · huggingface/datasets List of formats and libraries we can use to load the data in datasets: DICOMs: pydicom NIfTIs: nibabel WFDB: wfdb cc @zaRizk7 for viz Feel free to comment / suggest other formats and libs you'd lik...

Datasets are the backbone of AI for Science, and we want to support scientific data natively on Hugging Face. The amazing @lhoestq.hf.co started a discussion on GH for this! Please engage (better still, submit a PR) so we can start supporting your 🫵 dataset:

github.com/huggingface/...

5 months ago

Yes! The Science/Tech/Cyber committee is doing really good work too. Well intentioned folks there trying to actually engage with researchers and industry folks. Love MA

5 months ago

Random off the cuff observation about American AI: LLM folks seem to be concentrated in SF, but AI4Science folks seem to be concentrated in Boston. Meaning as the former gets oversaturated and the latter is only getting started, I expect Boston to be the next big AI epicenter! 💪

5 months ago

🌟 Weekly AI Evaluation Spotlight 🌟

🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?

This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!

5 months ago

Oof

5 months ago

I have started requesting that panel moderators provide a disclaimer at panels I am on that not all my opinions are endorsed by my employer. HF ppl largely believe in democratization of AI and open source, but we actually have intense, healthy debates internally on edge topics! It's great :)

5 months ago

+1000. I miss life pre-AI hype, when the discourse around AI was more scientific and people used to attribute papers and opinions to scientists instead of to their companies. Not all orgs block research papers or sanity-check them via legal teams, and HF especially is very distributed.

5 months ago