
Posts by SteeveUX

Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges are imperfect predictions for the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.


arXiv📈🤖
Efficient Inference for Noisy LLM-as-a-Judge Evaluation
By Chen, Lu, Li et al
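
For anyone unfamiliar with the two corrections named in the abstract, here is a minimal sketch of their textbook forms for a binary judge. It is not taken from the paper or the linked repository: the function names `rogan_gladen` and `ppi_mean` are hypothetical, and the judge's sensitivity and specificity are assumed to be given (in practice they would be estimated from the human-labelled subset).

```python
from statistics import mean


def rogan_gladen(judge_rate, sensitivity, specificity):
    """Rogan-Gladen-style correction of an observed positive rate.

    judge_rate is the fraction of items the judge marked positive;
    sensitivity and specificity describe the judge against human labels
    (assumed known here for illustration).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must beat chance: sensitivity + specificity > 1")
    corrected = (judge_rate + specificity - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the estimate outside.
    return min(1.0, max(0.0, corrected))


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """PPI-style mean estimate.

    Judge mean on the large unlabelled set, plus the average residual
    (human label minus judge score) on the small labelled set.
    """
    residuals = [y - f for y, f in zip(human_labeled, judge_labeled)]
    return mean(judge_unlabeled) + mean(residuals)


if __name__ == "__main__":
    # Toy numbers, made up purely to show the call shapes.
    print(rogan_gladen(judge_rate=0.5, sensitivity=0.9, specificity=0.8))
    print(ppi_mean([1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]))
```

Both functions use the same ingredients (a large judged set and a small human-labelled set) but apply the correction differently: the first inverts a misclassification model, the second adds back the measured average error.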

2 months ago 1 1 0 0
Laptop screen showing mobile app appointment review screens with ratings, service details, date/time, location, and a completion confirmation.


Completed health appointment flow. ❤️‍🩹

#designsky #buildinpublic #indiehackers #webdesign #design #webdesigner #website #websitedesign #promote #spotlight

2 months ago 3 2 1 0
The History of Web Design, 1993–2012: Season 5 Launch Introducing Cybercultural's history of web design, from the grey web pages of 1993 to the colorful, mobile-centric web designs of 2012. A celebration of the peak years of personal websites and blogs.

A history of web design, from the grey web pages of 1993 to the colorful, mobile-centric web designs of 2012. A celebration of the peak years of personal websites and blogs. By Richard MacManus.

cybercultural.com/p/history-of...

2 months ago 26 6 2 0
Introducing Cowork | Claude: Claude Code's agentic capabilities, now for everyone. Give Claude access to your files and let it organize, create, and edit documents while you focus on what matters.

Love the work @anthropic.com are doing! claude.com/blog/cowork-... such a solid set of tools! #ai #llm #ml

2 months ago 1 0 0 0

Need some love for @labourlewis.bsky.social on the list

2 months ago 1 0 0 0

We've done a Quiet Riot starter pack for all those (finally!) heading over from the other place.

bsky.app/starter-pack...

2 months ago 123 51 9 2
This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increases its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shape ChatGPT's outputs but also carry over into subsequent human-human communication.


arXiv📈🤖
How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness
By Zurich), Zurich), Zurich)

2 months ago 0 1 0 0
Elon Musk I
@elonmusk •2h
They want any excuse for censorship
X
Basil the Great • @BasilTheGreat • 4h
The UK Labour Government is threatening to block X but won't say a word about ChatGPT and Gemini
Why?
We know why.
X stands for freedom of speech.
They don't care about AI images, they care about people learning the truth.


If I’d built a noncing machine for free public use, I too would pretend that I don’t understand why people are so upset

2 months ago 81 16 6 0
European accessibility act The European accessibility act is a directive that aims to improve the functioning of the internal market for accessible products and services, by removing barriers created by divergent rules in Membe...

🧵5/5

commission.europa.eu/strategy-and...

9 months ago 0 0 0 0

🧵4/5

Despite its name, the EAA has global reach: it applies to any company, regardless of location (including the UK, US, etc.), that sells covered products or services to consumers within the European Union. This includes private-sector businesses, from manufacturers to service providers.

9 months ago 0 0 1 0

🧵3/5

The EAA covers a wide range of products and services, including:
Computers & OS
Smartphones
TV Equipment
Telephony Services
ATMs & Kiosks
Banking Services
E-books
E-commerce
Also includes transport services (air, bus, rail), and audio-visual media services.

9 months ago 0 0 1 0

🧵2/5

The primary goal is to improve the lives of persons with disabilities and older people by removing accessibility barriers in key digital products and services, ensuring they can participate fully in society.

9 months ago 0 0 1 0
Colourful patterns


🧵1/5

The full compliance deadline for the European Accessibility Act arrived today (28th June 2025).

#a11y #accessibility #eaa

9 months ago 1 0 1 0

I’ve been exploring how we trust AI, or rather why we don’t trust AI in the same way we trust people, as well as trying an exercise in AI co-creation.

#ai #llm #artificialintelligence #futureofai #aiethics #machinelearning #deeplearning #ux #userexperience


open.substack.com/pub/steevero...

9 months ago 2 0 0 0

This looks to be a neat feature in the accessibility settings on iOS. The puzzle is much larger than low vision, and there is much more to inclusive design, but features like these can change the conversation about how we approach accessibility. #ios26 #liquidglass #wwdc25

9 months ago 1 0 0 0
GitHub Copilot: Meet the new coding agent GitHub Copilot has a new feature: a coding agent that can implement a task or issue, run in the background with GitHub Actions, and more.

Really excited about this from @microsoft.com: github.blog/news-insight... Exciting times, particularly around vision capabilities.

#build2025

10 months ago 1 0 0 0

@bradleystacey.bsky.social

11 months ago 1 0 0 0
UK could save billions by ending hunger – not slashing benefits Researchers at Trussell have found that the UK government could save billions if it increased universal credit to help tackle hunger.

UK could save billions by ending hunger – not slashing benefits

www.bigissue.com/news/social-...

11 months ago 565 196 14 6

🚨 MAJOR: Google offers FREE Gemini Advanced AI tools to U.S. college students until June 2026 via Google One AI Premium plan!
🔹 Eligibility: .edu email, 15+ months free
🔹 Strategic move to dominate EdTech & convert students later
#google #geminiadvanced #veo2 #notebooklm

11 months ago 0 0 0 0

🚨 OpenAI's new models "think with images"! • o3 & o4-mini manipulate images during reasoning • All tools: web search, code, image gen • 91.6% on AIME 2024, 20% fewer errors • For ChatGPT Plus, Pro, Teams #openai #o4mini #chatgpt

11 months ago 1 0 0 0

@bradleystacey.bsky.social

11 months ago 1 0 0 0

I'm curious about workplace language.

What's your take on the term 'vibe coding'?

Would love to hear your thoughts!
#UXResearch #WorkplaceCulture #Tech #Coding #SoftwareDevelopment #AI

11 months ago 1 0 1 0

Wonder what this might be….
#AI

11 months ago 0 0 0 0
Screenshot of the press release. Full text in the link.


Here we go - EU Member States approve first batch of EU trade retaliation against Trump tariffs ec.europa.eu/commission/p...

11 months ago 131 56 6 2
How People with Disabilities Use the Web Introduces how people with disabilities, including people with age-related impairments, use the Web.

People who want to make the web accessible need to understand the many different ways that people with disabilities use the web. This W3C resource offers a good introduction to how disabled people navigate the web, and barriers they commonly encounter.

www.w3.org/WAI/people-u...

1 year ago 113 53 1 1
Santa jumping on a trampoline in winter


Santa jumping on a trampoline in Autumn


A meditating capybara in a Santa hat on a mountain


A capybara in a Santa hat meditating on a city street


It’s been a while since I used ImageFX from Google - and it’s pretty consistent

#AI #GenAI #ImageGen

1 year ago 0 0 0 0

What’s funny is that, in conversational AI, a slight delay might actually aid authenticity by mimicking natural thinking time, especially for complex queries. However, too long a delay could negatively impact experienced users who expect instant replies.

1 year ago 0 0 0 0

💯%. I was on a site earlier and the delay was excessive, to say the least. It wasn’t helped by the button having no states to even indicate something was happening.

1 year ago 0 0 0 0

7/7

But! If you can't make it faster, make it feel faster! Progress bars, animations, and feedback can hack user psychology to extend patience thresholds. ⚡

1 year ago 0 0 0 0


6/7

10+ seconds = ABANDONMENT 🚫
The psychological contract is broken. Users feel disrespected and frustrated. Your brain says "system failure" even if it's just slow.

1 year ago 0 0 1 0