Posts by Isabel Silva Corpus

Introducing skytrails by @kennypeng.bsky.social, a way to browse posts across Bluesky!

Follow trails to navigate the space of content/conversations here, and discover new interests beyond your usual feeds 🕸️

I'll say more about it tomorrow in my talk at #atscience #atmosphereconf!

3 weeks ago

New paper! The Linear Representation Hypothesis is a powerful intuition for how language models work, but lacks formalization. We give a mathematical framework in which we can ask and answer a basic question: how many features can be stored under the hypothesis? 🧵 arxiv.org/abs/2602.11246

2 months ago

"Community Notes" are reshaping how millions encounter information on social media--but what makes them work (or not)? We term these "Crowdsourced Context Systems" (CCS) and introduce a framework for designing and evaluating them in a new #CHI26 paper 🧵

2 months ago
Title + abstract of the preprint

Excited to share a new preprint with @nkgarg.bsky.social presenting usage statistics and observational findings from Paper Skygest in its first six months of deployment! 🎉📜

arxiv.org/abs/2601.04253

3 months ago

I spoke with @kattenbarge.bsky.social for this @wired.com piece about my research into Reddit moderators' experiences moderating AI-generated content. Moderators are working hard to keep Reddit "one of the most human spaces left on the internet," but it's a trying and often thankless task.

4 months ago

Has been lots of fun working on this with a great team, thanks to @eegilbert.org, @allisonkoe.bsky.social, and @informor.bsky.social!

4 months ago
Preview: Introducing AI to an Online Petition Platform Changed Outputs but not Outcomes
The rapid integration of AI writing tools into online platforms raises critical questions about their impact on content production and outcomes. We leverage a unique natural experiment on Change.org...

We speculate that the AI tool improved writer productivity and petition completion rates. However, was this worth the platform-level increase in text homogeneity and decrease in outcomes? The paper adds some reflections and potential explanations: arxiv.org/abs/2511.13949. WIP, comments welcome! 8/8

4 months ago

Again, petitions written with AI access were stylistically different (e.g. longer), but did not seem to have higher chances of positive outcomes (e.g. signatures). 7/

4 months ago

We confirmed these trends by comparing repeat-petition writers who wrote one petition before and one after gaining in-platform AI access against a baseline of writers with two petitions written either both before or both after AI access. If AI is helpful, we'd see a higher-than-expected success rate for the split writers' second petition. 6/

4 months ago

However, petition outcomes did not improve, and by some measures worsened: the share of petitions that reach minimum thresholds of comments and signatures decreased modestly when writers had access to the AI tool. 5/

4 months ago

Petition text also became more homogeneous: the average pairwise similarity of petition text increased. 4/

4 months ago
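An aside on the homogeneity measure described in the post above: the paper's exact text representation and similarity function aren't specified here, so the following is only a minimal sketch of one common way to compute average pairwise similarity, using TF-IDF vectors and cosine similarity (an assumption, not the paper's method).

```python
# Minimal sketch of an "average pairwise similarity" homogeneity metric.
# The paper's actual text representation and similarity measure may differ;
# TF-IDF + cosine similarity is just one common, illustrative choice.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def avg_pairwise_similarity(texts: list[str]) -> float:
    """Mean cosine similarity over all unordered pairs of documents."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(tfidf)        # n x n similarity matrix
    n = sims.shape[0]
    iu = np.triu_indices(n, k=1)           # upper triangle, excluding the diagonal
    return float(sims[iu].mean())

# Hypothetical usage: compare homogeneity before vs. after the AI tool launch,
# where `pre_texts` and `post_texts` are lists of petition strings.
# print(avg_pairwise_similarity(pre_texts), avg_pairwise_similarity(post_texts))
```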

Lexical features changed: petitions written with access to AI were longer, with more complex and varied vocabularies. Petition language shifted: in the pre-AI period petition titles tended to use short verbs like “let” and “stop”; after AI introduction we saw the rise of “implement” and “urge”. 3/

4 months ago
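To make the lexical features in the post above concrete, here is a rough sketch; word count and type-token ratio are illustrative stand-ins, since the paper's actual feature definitions and vocabulary-complexity measures aren't given here.

```python
# Rough sketch of simple lexical features like those mentioned above: length
# and vocabulary variety. These are illustrative stand-ins, not the paper's
# actual feature set.
import re

def lexical_features(text: str) -> dict[str, float]:
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    return {
        "word_count": float(n),
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,  # vocabulary variety
    }

# Hypothetical usage: average each feature over pre- vs. post-AI petitions.
# features = [lexical_features(t) for t in petition_texts]
```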

To do this, we scraped 1.5 million petitions and leveraged the delayed release of the AI tool in Australia to estimate the causal impact of access to in-platform AI on petition text and outcomes (as measured through signatures and comments). 2/

4 months ago
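The natural experiment in the post above (the AI tool arriving later in Australia than elsewhere) lends itself to a difference-in-differences comparison. The sketch below is not the paper's specification; the dataframe columns (`signed_threshold` as the outcome, `australia` and `post_launch` as 0/1 indicators) are hypothetical and used purely for illustration.

```python
# Minimal difference-in-differences sketch for the natural experiment described
# in the post above. NOT the paper's specification: `signed_threshold` (outcome),
# `australia`, and `post_launch` are hypothetical 0/1 indicator columns.
import pandas as pd
import statsmodels.formula.api as smf

def did_estimate(df: pd.DataFrame):
    """Linear probability model with a group-by-period interaction;
    the interaction coefficient is the difference-in-differences estimate."""
    model = smf.ols("signed_threshold ~ australia * post_launch", data=df)
    result = model.fit(cov_type="HC1")  # heteroskedasticity-robust standard errors
    return result.params["australia:post_launch"], result

# Hypothetical usage:
# effect, res = did_estimate(petitions_df)
# print(res.summary())
```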
Preview: Introducing AI to an Online Petition Platform Changed Outputs but not Outcomes
The rapid integration of AI writing tools into online platforms raises critical questions about their impact on content production and outcomes. We leverage a unique natural experiment on Change.org...

Excited to share a new working paper!

What happened when Change.org integrated an AI writing tool into their platform? We provide causal evidence that petition text changed significantly while outcomes did not improve. 1/

arxiv.org/abs/2511.13949

4 months ago
Preview: Elon Musk’s Grokipedia cites neo-Nazi website 42 times: study
An analysis by researchers at Cornell University is the first comprehensive look at Grokipedia since Musk launched his project last month.

@davidingram.bsky.social covered my work with @mantzarlis.com on Grokipedia citations for NBC! Much more to analyze here

www.nbcnews.com/news/amp/rcn...

5 months ago

@ziv-e.bsky.social's talk showed gaps in the extent to which users are recommended content aligned with their values. This work definitely points to the importance of user agency in feeds!
I couldn’t find the link to this paper, but some related work from Ziv here:
🔗: arxiv.org/pdf/2509.14434

5 months ago

They found that while increasingly accurate LLMs map to increasingly accurate responses, the effects plateau around 80%. Great insight into how decision-support tools require active design to best support people and their goals. Looking forward to reading this paper when it's out!!

5 months ago

@jennahgosciak.bsky.social showed how LLM assistance can improve government caseworker accuracy in the context of SNAP eligibility questions. It was really cool to see Jennah + team get ahead of the ever-shifting technical capacity of LLMs by varying chatbot accuracy.

5 months ago
Preview: A large-scale field experiment on participatory decision-making in China - Nature Human Behaviour
Wu et al. show that involving citizens in local decision-making (participatory budgeting) improves civic engagement in a Chinese context.

Sherry Jueyu Wu showed that when people participate in collective decision-making, they are more willing to express that the government needs improvement. Interesting to think about in the context of participation and accountability on online platforms...
🔗: www.nature.com/articles/s41...

5 months ago

This is really cool work demonstrating the value of human expertise in sociotechnical decision-making processes!
🔗: arxiv.org/pdf/2505.13325

5 months ago

@utopianturtle.top presented a causal framework for modeling algorithmically assisted decision-making, which the authors use to identify the ways academic advisors leverage non-algorithmic knowledge, and where advisors’ holistic approaches contribute to improved student outcomes.

5 months ago

Had a great time at CODE@MIT this weekend, and wanted to highlight a few (of the many) cool talks!

5 months ago

If you run conjoint experiments, you need to read this.

Most conjoints estimate average effects for each attribute.

But what if the effect of one attribute depends on the others?

This paper has got you covered!

11 months ago
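For readers curious what the "effect of one attribute depends on the others" point above looks like in practice, here is a bare-bones contrast between an additive conjoint specification and one with an interaction term; the attributes (`price`, `brand`) and outcome (`chosen`) are hypothetical, and this is not the estimator the paper itself proposes.

```python
# Toy contrast illustrating the point above: an additive conjoint model assumes
# each attribute has a constant average effect, while an interaction term lets
# one attribute's effect depend on another. Attribute and outcome names are
# hypothetical; this is not the paper's proposed estimator.
import pandas as pd
import statsmodels.formula.api as smf

def fit_conjoint(df: pd.DataFrame):
    # Additive specification: one average effect per attribute level.
    additive = smf.ols("chosen ~ C(price) + C(brand)", data=df).fit()
    # Interactive specification: the effect of price may differ by brand.
    interactive = smf.ols("chosen ~ C(price) * C(brand)", data=df).fit()
    return additive, interactive
```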
A screenshot of our paper's:

Title: A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms
Authors: Emma Harvey, Rene Kizilcec, Allison Koenecke
Abstract: Increasingly, individuals who engage in online activities are expected to interact with large language model (LLM)-based chatbots. Prior work has shown that LLMs can display dialect bias, which occurs when they produce harmful responses when prompted with text written in minoritized dialects. However, whether and how this bias propagates to systems built on top of LLMs, such as chatbots, is still unclear. We conduct a review of existing approaches for auditing LLMs for dialect bias and show that they cannot be straightforwardly adapted to audit LLM-based chatbots due to issues of substantive and ecological validity. To address this, we present a framework for auditing LLM-based chatbots for dialect bias by measuring the extent to which they produce quality-of-service harms, which occur when systems do not work equally well for different people. Our framework has three key characteristics that make it useful in practice. First, by leveraging dynamically generated instead of pre-existing text, our framework enables testing over any dialect, facilitates multi-turn conversations, and represents how users are likely to interact with chatbots in the real world. Second, by measuring quality-of-service harms, our framework aligns audit results with the real-world outcomes of chatbot use. Third, our framework requires only query access to an LLM-based chatbot, meaning that it can be leveraged equally effectively by internal auditors, external auditors, and even individual users in order to promote accountability. To demonstrate the efficacy of our framework, we conduct a case study audit of Amazon Rufus, a widely-used LLM-based chatbot in the customer service domain. Our results reveal that Rufus produces lower-quality responses to prompts written in minoritized English dialects.

I am so excited to be in 🇬🇷Athens🇬🇷 to present "A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms" by me, @kizilcec.bsky.social, and @allisonkoe.bsky.social, at #FAccT2025!!

🔗: arxiv.org/pdf/2506.04419

9 months ago
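The abstract above describes a query-access audit built on dynamically generated prompts and a comparison of response quality across dialects. The following only sketches that general shape: `generate_prompt`, `query_chatbot`, and `score_quality` are hypothetical placeholders, not the paper's components.

```python
# Shape of a query-access, dialect-comparative audit like the one described in
# the abstract above. Everything concrete here is a placeholder: the three
# callables stand in for components the paper defines, not its implementation.
from statistics import mean
from typing import Callable

def audit_by_dialect(
    dialects: list[str],
    n_prompts: int,
    generate_prompt: Callable[[str], str],   # hypothetical: generate a prompt in a given dialect
    query_chatbot: Callable[[str], str],     # hypothetical: query-access chatbot client
    score_quality: Callable[[str], float],   # hypothetical: response-quality scorer
) -> dict[str, float]:
    """Average response quality per dialect; gaps suggest quality-of-service harms."""
    return {
        d: mean(
            score_quality(query_chatbot(generate_prompt(d)))
            for _ in range(n_prompts)
        )
        for d in dialects
    }
```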
"Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments"

Conducting disparity assessments at regular time intervals is critical for surfacing potential biases in decision-making and improving outcomes across demographic groups. Because disparity assessments fundamentally depend on the availability of demographic information, their efficacy is limited by the availability and consistency of available demographic identifiers. While prior work has considered the impact of missing data on fairness, little attention has been paid to the role of delayed demographic data. Delayed data, while eventually observed, might be missing at the critical point of monitoring and action -- and delays may be unequally distributed across groups in ways that distort disparity assessments. We characterize such impacts in healthcare, using electronic health records of over 5M patients across primary care practices in all 50 states. Our contributions are threefold. First, we document the high rate of race and ethnicity reporting delays in a healthcare setting and demonstrate widespread variation in rates at which demographics are reported across different groups. Second, through a set of retrospective analyses using real data, we find that such delays impact disparity assessments and hence conclusions made across a range of consequential healthcare outcomes, particularly at more granular levels of state-level and practice-level assessments. Third, we find limited ability of conventional methods that impute missing race in mitigating the effects of reporting delays on the accuracy of timely disparity assessments. Our insights and methods generalize to many domains of algorithmic fairness where delays in the availability of sensitive information may confound audits, thus deserving closer attention within a pipeline-aware machine learning framework.

"Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments" Conducting disparity assessments at regular time intervals is critical for surfacing potential biases in decision-making and improving outcomes across demographic groups. Because disparity assessments fundamentally depend on the availability of demographic information, their efficacy is limited by the availability and consistency of available demographic identifiers. While prior work has considered the impact of missing data on fairness, little attention has been paid to the role of delayed demographic data. Delayed data, while eventually observed, might be missing at the critical point of monitoring and action -- and delays may be unequally distributed across groups in ways that distort disparity assessments. We characterize such impacts in healthcare, using electronic health records of over 5M patients across primary care practices in all 50 states. Our contributions are threefold. First, we document the high rate of race and ethnicity reporting delays in a healthcare setting and demonstrate widespread variation in rates at which demographics are reported across different groups. Second, through a set of retrospective analyses using real data, we find that such delays impact disparity assessments and hence conclusions made across a range of consequential healthcare outcomes, particularly at more granular levels of state-level and practice-level assessments. Third, we find limited ability of conventional methods that impute missing race in mitigating the effects of reporting delays on the accuracy of timely disparity assessments. Our insights and methods generalize to many domains of algorithmic fairness where delays in the availability of sensitive information may confound audits, thus deserving closer attention within a pipeline-aware machine learning framework.

Figure contrasting a conventional approach to conducting disparity assessments, which is static, to the analysis we conduct in this paper. Our analysis (1) uses comprehensive health data from over 1,000 primary care practices and 5 million patients across the U.S., (2) timestamped information on the reporting of race to measure delay, and (3) retrospective analyses of disparity assessments under varying levels of delay.

I am presenting a new 📝 “Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments” at @facct.bsky.social on Thursday, with @aparnabee.bsky.social, Derek Ouyang, @allisonkoe.bsky.social, @marzyehghassemi.bsky.social, and Dan Ho. 🔗: arxiv.org/abs/2506.13735
(1/n)

9 months ago
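A rough illustration of the core idea in the abstract above: a disparity assessment run at a monitoring date can differ depending on whether it uses only the race/ethnicity reported by that date or the eventually reported values. The column names and the max-minus-min disparity measure below are assumptions for illustration, not the paper's analysis.

```python
# Rough sketch of the core idea above: a disparity assessment computed at a
# monitoring date may change if race/ethnicity is reported with delay.
# Column names (`outcome`, `race`, `race_reported_at`) are hypothetical;
# this is not the paper's analysis.
import pandas as pd

def disparity(df: pd.DataFrame, group_col: str = "race", outcome_col: str = "outcome") -> float:
    """Max minus min group mean outcome, ignoring rows with missing group labels."""
    means = df.dropna(subset=[group_col]).groupby(group_col)[outcome_col].mean()
    return float(means.max() - means.min())

def assess_with_and_without_delay(df: pd.DataFrame, as_of: pd.Timestamp) -> tuple[float, float]:
    # "Timely" view: treat race as missing if it was reported after the assessment date.
    timely = df.copy()
    timely.loc[timely["race_reported_at"] > as_of, "race"] = None
    # "Eventual" view: use race as eventually observed.
    return disparity(timely), disparity(df)
```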
A screenshot of our paper: 

Title: Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Authors: Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach

Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments---even useful instruments---are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

📣 "Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems" is forthcoming at #ACL2025NLP - and you can read it now on arXiv!

🔗: arxiv.org/pdf/2506.04482
🧵: ⬇️

10 months ago

🎉 So excited to share that "Don't Forget the Teachers" has received a Best Paper Award at #CHI2025!!

@allisonkoe.bsky.social @kizilcec.bsky.social

1 year ago

*NEW DATASET AND PAPER* (CHI2025): How are online communities responding to AI-generated content (AIGC)? We study this by collecting and analyzing the public rules of 300,000+ subreddits in 2023 and 2024. 1/

1 year ago

Please repost to get the word out! @nkgarg.bsky.social and I are excited to present a personalized feed for academics! It shows posts about papers from accounts you're following: bsky.app/profile/pape...

1 year ago
A screenshot of our paper:

Title: “Don’t Forget the Teachers”: Towards an Educator-Centered Understanding of Harms from Large Language Models in Education

Authors: Emma Harvey, Allison Koenecke, Rene Kizilcec

Abstract: Education technologies (edtech) are increasingly incorporating new features built on LLMs, with the goals of enriching the processes of teaching and learning and ultimately improving learning outcomes. However, it is still too early to understand the potential downstream impacts of LLM-based edtech. Prior attempts to map the risks of LLMs have not been tailored to education specifically, even though it is a unique domain in many respects: from its population (students are often children, who can be especially impacted by technology) to its goals (providing the ‘correct’ answer may be less important than understanding how to arrive at an answer) to its implications for higher-order skills that generalize across contexts (e.g. critical thinking and collaboration). We conducted semi-structured interviews with six edtech providers representing leaders in the K-12 space, as well as a diverse group of 23 educators with varying levels of experience with LLM-based edtech. Through a thematic analysis, we explored how each group is anticipating, observing, and accounting for potential harms from LLMs in education. We find that, while edtech providers focus primarily on mitigating technical harms, i.e. those that can be measured based solely on LLM outputs themselves, educators are more concerned about harms that result from the broader impacts of LLMs, i.e. those that require observation of interactions between students, educators, school systems, and edtech to measure. Overall, we (1) develop an education-specific overview of potential harms from LLMs, (2) highlight gaps between conceptions of harm by edtech providers and those by educators, and (3) make recommendations to facilitate the centering of educators in the design and development of edtech tools.

✨New Work✨ by me, @allisonkoe.bsky.social, and @kizilcec.bsky.social forthcoming at #CHI2025:

"Don't Forget the Teachers": Towards an Educator-Centered Understanding of Harms from Large Language Models in Education

🔗: arxiv.org/pdf/2502.14592

1 year ago