Introducing skytrails by @kennypeng.bsky.social, a way to browse posts across Bluesky!
Follow trails to navigate the space of content/conversations here, and discover new interests beyond your usual feeds 🕸️
I'll talk more about it tomorrow in my talk at #atscience #atmosphereconf!
Posts by Isabel Silva Corpus
New paper! The Linear Representation Hypothesis is a powerful intuition for how language models work, but lacks formalization. We give a mathematical framework in which we can ask and answer a basic question: how many features can be stored under the hypothesis? 🧵 arxiv.org/abs/2602.11246
"Community Notes" are reshaping how millions encounter information on social media, but what makes them work (or not)? We term these "Crowdsourced Context Systems" (CCS) and introduce a framework for designing and evaluating them in a new #CHI26 paper 🧵
Title + abstract of the preprint
Excited to share a new preprint with @nkgarg.bsky.social presenting usage statistics and observational findings from Paper Skygest in its first six months of deployment! 🎉📜
arxiv.org/abs/2601.04253
I spoke with @kattenbarge.bsky.social for this @wired.com piece about my research into Reddit moderators' experiences moderating AI-generated content. Moderators are working hard to keep Reddit "one of the most human spaces left on the internet," but it's a trying and often thankless task.
Has been lots of fun working on this with a great team, thanks to @eegilbert.org, @allisonkoe.bsky.social, and @informor.bsky.social!
We speculate that the AI tool improved writer productivity and petition completion rates. However, was this worth the platform-level increase in text homogeneity and decrease in outcomes? The paper offers some reflections and potential explanations: arxiv.org/abs/2511.13949. WIP, comments welcome! 8/8
Again, petitions written with AI access were stylistically different (e.g. longer), but did not seem to have higher chances of positive outcomes (e.g. signatures). 7/
We confirmed these trends by comparing repeat writers who wrote one petition before and one after gaining in-platform AI access against a baseline of writers with two petitions both pre- or both post-AI. If AI is helpful, we’d expect a higher-than-baseline success rate for the split writers' second petition. 6/
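The split-writer comparison above can be sketched roughly as follows. The data shape and field names here are hypothetical, and the paper's actual matching and outcome measures are more involved; this is just a minimal illustration of the design:

```python
from statistics import mean

def second_petition_success(writers: dict[str, list[tuple[bool, bool]]]) -> dict[str, float]:
    """Compare second-petition success rates for 'split' writers
    (one petition pre-AI, one post-AI) against 'baseline' writers
    (both petitions in the same period).

    Each writer maps to two (wrote_post_ai, succeeded) records,
    ordered chronologically. Hypothetical schema for illustration.
    """
    split, baseline = [], []
    for petitions in writers.values():
        if len(petitions) != 2:
            continue  # only two-petition writers enter the comparison
        (first_post_ai, _), (second_post_ai, second_ok) = petitions
        if not first_post_ai and second_post_ai:
            split.append(second_ok)        # gained AI access between petitions
        elif first_post_ai == second_post_ai:
            baseline.append(second_ok)     # both pre-AI or both post-AI
    return {"split": mean(split), "baseline": mean(baseline)}
```

If in-platform AI access helps, the "split" rate should come out above the "baseline" rate; the thread reports it did not.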
However, petition outcomes did not improve, and by some measures worsened: the share of petitions that reach minimum thresholds of comments and signatures decreased modestly when writers had access to the AI tool. 5/
Petition text also became more homogeneous: the average pairwise similarity of petition text increased. 4/
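As a toy illustration of "average pairwise similarity": the thread doesn't specify the paper's actual text-similarity measure (it may well be embedding-based), so this bag-of-words cosine version is only a sketch of the idea:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def avg_pairwise_similarity(texts: list[str]) -> float:
    """Mean cosine similarity over all unordered pairs of texts.

    A rising value over time indicates the corpus is homogenizing.
    """
    pairs = list(combinations(texts, 2))
    return sum(cosine_sim(a, b) for a, b in pairs) / len(pairs)
```

Computing this separately over pre-AI and post-AI petition corpora and comparing the two averages captures the homogenization claim in this post.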
Lexical features changed: petitions written with access to AI were longer, with more complex and varied vocabularies. Petition language shifted: in the pre-AI period, petition titles tended to use short verbs like “let” and “stop”; after AI introduction, we saw the rise of “implement” and “urge”. 3/
To do this, we scraped 1.5 million petitions and leveraged the delayed release of the AI tool in Australia to estimate the causal impact of access to in-platform AI on petition text and outcomes (as measured through signatures and comments). 2/
Excited to share a new working paper!
What happened when Change.org integrated an AI writing tool into their platform? We provide causal evidence that petition text changed significantly while outcomes did not improve. 1/
arxiv.org/abs/2511.13949
@davidingram.bsky.social covered @mantzarlis.com's and my work on Grokipedia citations for NBC! Much more to analyze here
www.nbcnews.com/news/amp/rcn...
@ziv-e.bsky.social's talk showed the gaps between the content users are recommended and the content aligned with their values. This work definitely points to the importance of user agency in feeds!
I couldn’t find the link to this paper, but some related work from Ziv here:
🔗: arxiv.org/pdf/2509.14434
They found that while increasingly accurate LLMs map to increasingly accurate human responses, the effect plateaus around 80%. Great insight into how decision-support tools require active design to best support people and their goals. Looking forward to reading this paper when it’s out!!
@jennahgosciak.bsky.social showed how LLM assistance can improve government caseworker accuracy in the context of SNAP eligibility questions. It was really cool to see Jennah + team get ahead of the ever-shifting technical capacity of LLMs by varying chatbot accuracy.
Sherry Jueyu Wu showed that when people participate in collective decision-making, they are more willing to express that the government needs improvement. Interesting to think about in the context of participation and accountability on online platforms...
🔗: www.nature.com/articles/s41...
This is really cool work demonstrating the value of human expertise in sociotechnical decision-making processes!
🔗: arxiv.org/pdf/2505.13325
@utopianturtle.top presented a causal framework for modeling algorithmically assisted decision-making, which the authors use to identify the ways academic advisors leverage non-algorithmic knowledge, and where advisors’ holistic approaches contribute to improved student outcomes.
Had a great time at CODE@MIT this weekend, and wanted to highlight a few (of the many) cool talks!
If you run conjoint experiments, you need to read this.
Most conjoints estimate average effects for each attribute.
But what if the effect of one attribute depends on the others?
This paper has got you covered!
A screenshot of our paper's: Title: A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms Authors: Emma Harvey, Rene Kizilcec, Allison Koenecke Abstract: Increasingly, individuals who engage in online activities are expected to interact with large language model (LLM)-based chatbots. Prior work has shown that LLMs can display dialect bias, which occurs when they produce harmful responses when prompted with text written in minoritized dialects. However, whether and how this bias propagates to systems built on top of LLMs, such as chatbots, is still unclear. We conduct a review of existing approaches for auditing LLMs for dialect bias and show that they cannot be straightforwardly adapted to audit LLM-based chatbots due to issues of substantive and ecological validity. To address this, we present a framework for auditing LLM-based chatbots for dialect bias by measuring the extent to which they produce quality-of-service harms, which occur when systems do not work equally well for different people. Our framework has three key characteristics that make it useful in practice. First, by leveraging dynamically generated instead of pre-existing text, our framework enables testing over any dialect, facilitates multi-turn conversations, and represents how users are likely to interact with chatbots in the real world. Second, by measuring quality-of-service harms, our framework aligns audit results with the real-world outcomes of chatbot use. Third, our framework requires only query access to an LLM-based chatbot, meaning that it can be leveraged equally effectively by internal auditors, external auditors, and even individual users in order to promote accountability. To demonstrate the efficacy of our framework, we conduct a case study audit of Amazon Rufus, a widely-used LLM-based chatbot in the customer service domain. Our results reveal that Rufus produces lower-quality responses to prompts written in minoritized English dialects.
I am so excited to be in 🇬🇷Athens🇬🇷 to present "A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms" by me, @kizilcec.bsky.social, and @allisonkoe.bsky.social, at #FAccT2025!!
🔗: arxiv.org/pdf/2506.04419
"Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments" Conducting disparity assessments at regular time intervals is critical for surfacing potential biases in decision-making and improving outcomes across demographic groups. Because disparity assessments fundamentally depend on demographic information, their efficacy is limited by the availability and consistency of demographic identifiers. While prior work has considered the impact of missing data on fairness, little attention has been paid to the role of delayed demographic data. Delayed data, while eventually observed, might be missing at the critical point of monitoring and action, and delays may be unequally distributed across groups in ways that distort disparity assessments. We characterize such impacts in healthcare, using electronic health records of over 5M patients across primary care practices in all 50 states. Our contributions are threefold. First, we document the high rate of race and ethnicity reporting delays in a healthcare setting and demonstrate widespread variation in the rates at which demographics are reported across different groups. Second, through a set of retrospective analyses using real data, we find that such delays impact disparity assessments, and hence the conclusions drawn, across a range of consequential healthcare outcomes, particularly at the more granular levels of state-level and practice-level assessments. Third, we find limited ability of conventional methods that impute missing race to mitigate the effects of reporting delays on the accuracy of timely disparity assessments. Our insights and methods generalize to many domains of algorithmic fairness where delays in the availability of sensitive information may confound audits, thus deserving closer attention within a pipeline-aware machine learning framework.
Figure contrasting a conventional approach to conducting disparity assessments, which is static, to the analysis we conduct in this paper. Our analysis (1) uses comprehensive health data from over 1,000 primary care practices and 5 million patients across the U.S., (2) timestamped information on the reporting of race to measure delay, and (3) retrospective analyses of disparity assessments under varying levels of delay.
I am presenting a new 📝 “Bias Delayed is Bias Denied? Assessing the Effect of Reporting Delays on Disparity Assessments” at @facct.bsky.social on Thursday, with @aparnabee.bsky.social, Derek Ouyang, @allisonkoe.bsky.social, @marzyehghassemi.bsky.social, and Dan Ho. 🔗: arxiv.org/abs/2506.13735
(1/n)
A screenshot of our paper: Title: Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems Authors: Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments, even useful instruments, are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.
📣 "Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems" is forthcoming at #ACL2025NLP - and you can read it now on arXiv!
🔗: arxiv.org/pdf/2506.04482
🧵: ⬇️
🎉 So excited to share that "Don't Forget the Teachers" has received a Best Paper Award at #CHI2025!!
@allisonkoe.bsky.social @kizilcec.bsky.social
*NEW DATASET AND PAPER* (CHI2025): How are online communities responding to AI-generated content (AIGC)? We study this by collecting and analyzing the public rules of 300,000+ subreddits in 2023 and 2024. 1/
Please repost to get the word out! @nkgarg.bsky.social and I are excited to present a personalized feed for academics! It shows posts about papers from accounts you’re following bsky.app/profile/pape...
A screenshot of our paper: Title: “Don’t Forget the Teachers”: Towards an Educator-Centered Understanding of Harms from Large Language Models in Education Authors: Emma Harvey, Allison Koenecke, Rene Kizilcec Abstract: Education technologies (edtech) are increasingly incorporating new features built on LLMs, with the goals of enriching the processes of teaching and learning and ultimately improving learning outcomes. However, it is still too early to understand the potential downstream impacts of LLM-based edtech. Prior attempts to map the risks of LLMs have not been tailored to education specifically, even though it is a unique domain in many respects: from its population (students are often children, who can be especially impacted by technology) to its goals (providing the ‘correct’ answer may be less important than understanding how to arrive at an answer) to its implications for higher-order skills that generalize across contexts (e.g. critical thinking and collaboration). We conducted semi-structured interviews with six edtech providers representing leaders in the K-12 space, as well as a diverse group of 23 educators with varying levels of experience with LLM-based edtech. Through a thematic analysis, we explored how each group is anticipating, observing, and accounting for potential harms from LLMs in education. We find that, while edtech providers focus primarily on mitigating technical harms, i.e. those that can be measured based solely on LLM outputs themselves, educators are more concerned about harms that result from the broader impacts of LLMs, i.e. those that require observation of interactions between students, educators, school systems, and edtech to measure. Overall, we (1) develop an education-specific overview of potential harms from LLMs, (2) highlight gaps between conceptions of harm by edtech providers and those by educators, and (3) make recommendations to facilitate the centering of educators in the design and development of edtech tools.
✨New Work✨ by me, @allisonkoe.bsky.social, and @kizilcec.bsky.social forthcoming at #CHI2025:
"Don't Forget the Teachers": Towards an Educator-Centered Understanding of Harms from Large Language Models in Education
🔗: arxiv.org/pdf/2502.14592