Posts by Nick Byrd, Ph.D.

"Figure 2: Visual depiction of the theoretical reasoning behind the clustering algorithm. Figure A depicts the empirical pattern of healthcare visits for related symptoms before the index disease diagnosis (data pictured correspond to tuberculosis). There is an increase in symptomatic healthcare visits before diagnosis. The trend is estimated with two curves: The first segment (flatter) capturesthe period where clinical disease is unlikely to be present, the second segment (steeper) captures symptoms of clinical disease and potential missed opportunities. Figure B depicts how the k-means clustering algorithm is applied to the trends for each potential antecedent condition (note, this is an oversimplified depiction where only the two slope parameters are used to identify k=3 clusters). The central plot depicts examples of clusters of conditions identified based on the slope parameters. The plots on either side of the clustering graph depict examples of trends that might fit the patterns...."

"Figure 2: Visual depiction of the theoretical reasoning behind the clustering algorithm. Figure A depicts the empirical pattern of healthcare visits for related symptoms before the index disease diagnosis (data pictured correspond to tuberculosis). There is an increase in symptomatic healthcare visits before diagnosis. The trend is estimated with two curves: The first segment (flatter) capturesthe period where clinical disease is unlikely to be present, the second segment (steeper) captures symptoms of clinical disease and potential missed opportunities. Figure B depicts how the k-means clustering algorithm is applied to the trends for each potential antecedent condition (note, this is an oversimplified depiction where only the two slope parameters are used to identify k=3 clusters). The central plot depicts examples of clusters of conditions identified based on the slope parameters. The plots on either side of the clustering graph depict examples of trends that might fit the patterns...."

"Figure 4: Examples of trends in top antecedent conditions selected in the “cough” focal cluster prior to tuberculosis. The black dots depict 7-day average counts of visits with the given diagnosis relative to visit frequency. The linear piecewise model used to fit the trend and derive parameter estimates for the cluster analysis is depicted by the red line (see Supplemental Figure 3 for the remaining top 25 conditions)."

"Figure 4: Examples of trends in top antecedent conditions selected in the “cough” focal cluster prior to tuberculosis. The black dots depict 7-day average counts of visits with the given diagnosis relative to visit frequency. The linear piecewise model used to fit the trend and derive parameter estimates for the cluster analysis is depicted by the red line (see Supplemental Figure 3 for the remaining top 25 conditions)."

"Figure 6: Examples of selected trends in antecedent conditions contained in the “abdominal pain” focal cluster prior to appendicitis. The black dots depict 7-day average counts of visits with the given diagnosis relative to visit frequency. The linear piecewise model used to fit the trend and derive parameter estimates for the cluster analysis is depicted by the red line (see Supplemental Figure 6 for the remaining top 25 conditions)."

"Figure 6: Examples of selected trends in antecedent conditions contained in the “abdominal pain” focal cluster prior to appendicitis. The black dots depict 7-day average counts of visits with the given diagnosis relative to visit frequency. The linear piecewise model used to fit the trend and derive parameter estimates for the cluster analysis is depicted by the red line (see Supplemental Figure 6 for the remaining top 25 conditions)."

"Supplemental Figure 4 – Evaluation of focal cluster based on cough prior to tuberculosis, in terms of number of potential missed opportunities and patients with a diagnostic delay identified. The number of potential missed opportunities and patients that would be identified based on the biologically plausible antecedent conditions identified using the focal cluster are plotted for different values of k. The black line depicts the mean value while the grey shaded region represents values in between the 5 th and 95 th percentile of resulting clusters. The resulting number of potential missed opportunities identified decreases with greater values of k, yet across all cluster considerably more diagnostic delays may be identified compared to using the focal symptom of cough alone."

"Supplemental Figure 7 – Evaluation of focal cluster containing unspecified abdominal pain prior to appendicitis, in terms of number of potential missed opportunities and patients with a diagnostic delay...."

"Supplemental Figure 4 – Evaluation of focal cluster based on cough prior to tuberculosis, in terms of number of potential missed opportunities and patients with a diagnostic delay identified. The number of potential missed opportunities and patients that would be identified based on the biologically plausible antecedent conditions identified using the focal cluster are plotted for different values of k. The black line depicts the mean value while the grey shaded region represents values in between the 5 th and 95 th percentile of resulting clusters. The resulting number of potential missed opportunities identified decreases with greater values of k, yet across all cluster considerably more diagnostic delays may be identified compared to using the focal symptom of cough alone." "Supplemental Figure 7 – Evaluation of focal cluster containing unspecified abdominal pain prior to appendicitis, in terms of number of potential missed opportunities and patients with a diagnostic delay...."

Can #health systems spot missed diagnoses at scale?

Unsupervised clustering found symptoms that spike before a particular #diagnosis to consider clinically-plausible alternative diagnoses, which more than doubled detection of potential diagnostic delays.

doi.org/10.1515/dx-2...

14 hours ago
The scoring and other methods of the Retraction Risk calculator. Even “vague negative hint” is scored as -1, indicating the calculator may overestimate (rather than underestimate) retraction risk.

Learn more about #RetractionRisk methods at www.retractionrisk.com/about.html

The tool's authors admit only "8.3% of retracted articles [had] at least one [prior] critical tweet".

For non-retracted articles, it was 1.5% ...controlling for falsely labeled tweets?

doi.org/10.1002/asi....

#NLP #AI

3 days ago

Initial questions about this retraction risk calculator:

What about negative phrases in social media posts that are NOT actually about the linked article?

What about negative posts about a paper that don’t actually link to the paper? (Screenshots, “link in comment” posts, etc.)

3 days ago
A top-20 "Science Sleuths Leaderboard" ranked by "Total Negative Posts
about Articles".

A top-20 "Science Sleuths Leaderboard" ranked by "Total Negative Posts about Articles".

Some examples of allegedly "negative posts" from Nick Byrd about research articles, including his own research articles. The posts seem to be classified as negative due to negative phrases that describe things besides the article.

More examples of allegedly "negative posts" from Nick Byrd about research articles, including his own research articles. These posts also seem to be classified as negative due to negative phrases that describe things besides the article or research.

Some #RetractionRisk Scanner results that illustrate how the number of allegedly negative social media posts and PubPeer comments can increase retraction "risk level".

🤨 TIL I'm a leader in (wait for it) "negative posts" about research?
— including my posts about MY research?!
— in league with Rob Sica? ...and ED YONG?!!

Fortunately this #RetractionRisk calculator considers more than just our (false negative?) posts!

www.retractionrisk.c...

3 days ago

Thanks to feedback from this and other platforms, I've tentatively added some of the many-author papers that list me as author (consortia or otherwise) to my Google Scholar page.

Still feels weird, but people have shown me many easy ways to see through oft-misleading metrics to actual contribution.

6 days ago
"We built a custom video player that sent timestamped cues to the wearable whenever a claim appeared. At each marked misinformation timestamp, the watch vibrated and displayed a short text explanation, simulating an AI-driven fact-checker (Figure 3). This interaction loop continued as participants watched the video until playback ended. ... Figure 3: Prototype Interaction Flow, that starts with the smartwatch listening for signals, the video sends signals to the watch when certain claims occur, and finally the watch vibrates and displays an explanation."

"Instruction to the participants before the control condition was, '...Please watch it as you would naturally. Feel free to pause, rewind, skip, or use web browser (e.g., Google, ChatGPT) at any time'. [In] the wearable condition, an additional instruction was given, 'You will wear a smartwatch while watching the video .... The watch might vibrate and display a text message when a potential misinformation is detected.'"

"We built a custom video player that sent timestamped cues to the wearable whenever a claim appeared. At each marked misinformation timestamp, the watch vibrated and displayed a short text explanation, simulating an AI-driven fact-checker (Figure 3). This interaction loop continued as participants watched the video until playback ended. ... Figure 3: Prototype Interaction Flow, that starts with the smartwatch listening for signals, the video sends signals to the watch when certain claims occur, and finally the watch vibrates and displays an explanation." "Instruction to the participants before the control condition was, '...Please watch it as you would naturally. Feel free to pause, rewind, skip, or use web browser (e.g., Google, ChatGPT) at any time'. [In] the wearable condition, an additional instruction was given, 'You will wear a smartwatch while watching the video .... The watch might vibrate and display a text message when a potential misinformation is detected.'"

"As shown in Figure 7, there was no significant correlation between care score and belief change for false claims in the wearable condition (𝑟 = −0.13, 𝑝 = 0.26)."

"In contrast, a weak but statistically significant positive correlation was observed in the no-wearable condition (𝑟 = 0.19, 𝑝 = 0.025). This counterintuitive finding suggests that without the wearable intervention, participants who cared more about a topic showed a slight increase in their belief in false claims."

"As shown in Figure 8, in the wearable condition, there was no correlation between AOT and belief change (𝑟 = 0.004, 𝑝 = 0.966). However, in the no wearable condition, a significant negative correlation emerged (𝑟 = −0.219, 𝑝 = 0.010), indicating that without the wearable, individuals with higher AOT tended to think critically and reduce their belief in false claims..."

"Figure 9 indicates the wearable intervention was consistently effective at reducing [false] belief [regardless of prior exposure]."

"As shown in Figure 7, there was no significant correlation between care score and belief change for false claims in the wearable condition (𝑟 = −0.13, 𝑝 = 0.26)." "In contrast, a weak but statistically significant positive correlation was observed in the no-wearable condition (𝑟 = 0.19, 𝑝 = 0.025). This counterintuitive finding suggests that without the wearable intervention, participants who cared more about a topic showed a slight increase in their belief in false claims." "As shown in Figure 8, in the wearable condition, there was no correlation between AOT and belief change (𝑟 = 0.004, 𝑝 = 0.966). However, in the no wearable condition, a significant negative correlation emerged (𝑟 = −0.219, 𝑝 = 0.010), indicating that without the wearable, individuals with higher AOT tended to think critically and reduce their belief in false claims..." "Figure 9 indicates the wearable intervention was consistently effective at reducing [false] belief [regardless of prior exposure]."

Table 2. The list of 2 video clips used in the study, with 8 fact-check verdicts and explanations per video. Because all participants encountered both videos and all claims, the number of total observations could be over 1000, depending on the analysis.

Qualitative data about potential for self-reflection and over-reliance.

Can #wearables do real-time fact-checking?

34 people watched videos with and without a #factChecking #smartwatch. The watch's alerts reduced belief in #misinformation and increased manual fact-checking.

Some users said alerts made them "reflect more".

doi.org/10.48550/arX...

6 days ago

I guess my worry is that it’d misleadingly inflate my metrics.

But if it is easy to discern actual contribution from metrics that’re inflated by many-author papers, then perhaps I should add my consortium/many-author papers to my GScholar page. Otherwise, it misleadingly *deflates* my contribution.

1 week ago

I like the idea of divided contribution, but I think many papers probably should not divide contributions equally among authors. In my experience on many-author papers, very few people do the vast majority of the work. So their contribution would be diluted if we divided equally among all authors.

1 week ago
CRediT profile visualization. A stacked bar chart showing the level of contribution (lead versus other role) for each type of contribution (conception, funding, data collection, first draft, etc.)

A visualization of how much a scholar employs each type of open science practice (e.g., pre-registration, open materials, open data).

One reply to that objection might be RESQUE:

This system overcomes limitations of standard scholarly #metrics, #hiring, or #promotion materials, etc.

The demo shows how standardized and reproducible #dataViz quickly conveys a scholar’s varied contributions:

www.resque.info/includes/dem...

1 week ago

A potential objection:

1. Minimal contribution co-authorship occurs in more than just consortium papers; it probably occurs in most (all?) #teamScience papers.

2. Demarcating team #science papers from others may make team science contributions seem less valuable, disincentivizing team science.

1 week ago
An excerpt of a curriculum vitae with separate lists for "publications" and "publications supported (e.g., as analyst, forecaster)".

A screenshot from Google Scholar (taken April 2026) showing a prompt to add a many-author paper from the journal Nature to a scholar's profile simply because the scholar was listed among hundreds of other minor contributors (because he analyzed a tiny subset of the paper's data — work that was compensated with a small payment).

Researchers: should we separate our primary #authorship and #citations from consortia counterparts?

If I add little to a paper (e.g., a little analysis, data, editing), I feel odd letting it count toward MY #publications or MY #hIndex.

Just me?

Example: byrdnick.com/cv#Publ...

1 week ago
A table of macOS version history from Wikipedia.

How soon can you install a new operating system on #work device(s)?

I'll go first:
- #macOS 26 was published 2025-09-15: en.wikipedia.org/wik...
- My work device allowed me to install it 2026-03-31 — more than 6 months later.

I know #infoSec can be complicated. Insight welcome.

1 week ago
"4.2 Biting the bullet

Indeed, common-sense morality can sometimes be quite demanding. .... I illustrate
with some examples:

Perhaps you are at the airport, and terrorists attack. You are between your helpless
child and the door. You can see a terrorist start to approach. Presumably, you are
obliged to collect your child before running away, even if doing so puts your life at a
greater risk. Likewise, if your mother is hit by a car and paralyzed but still capable
of living a good life, then you will likely be obliged to help take care of her (at least if
no one else can), even if doing so is very costly to you. ...

The fact that a principle has very demanding implications doesn’t imply that it isn’t
commonsensical. And this isn’t just to say that anti-demandingness intuitions are
unreliable, as have some other philosophers (Braddock 2013; Berkey 2016). It’s to say
that these intuitions are demonstrably false from a common-sense perspective."

"4.2 Biting the bullet Indeed, common-sense morality can sometimes be quite demanding. .... I illustrate with some examples: Perhaps you are at the airport, and terrorists attack. You are between your helpless child and the door. You can see a terrorist start to approach. Presumably, you are obliged to collect your child before running away, even if doing so puts your life at a greater risk. Likewise, if your mother is hit by a car and paralyzed but still capable of living a good life, then you will likely be obliged to help take care of her (at least if no one else can), even if doing so is very costly to you. ... The fact that a principle has very demanding implications doesn’t imply that it isn’t commonsensical. And this isn’t just to say that anti-demandingness intuitions are unreliable, as have some other philosophers (Braddock 2013; Berkey 2016). It’s to say that these intuitions are demonstrably false from a common-sense perspective."

Can vegans eat dessert?

This paper offers two responses to an argument that vegans cannot, without some sort of contradiction, eat dessert. One option is to base one's veganism on alternative principles. Another is to bite the bullet (which I might do).

doi.org/10.1017/S095...

1 week ago

Glad you highlighted the potential mechanism: AI may have reduced Socratic discussion.

That raises questions about *how* people use AI (vs. *whether* they use it).

AI can be a Socratic discussant, but when people are less likely to interact Socratically with AI than with humans, then humans > AI.

1 week ago
"This randomized controlled pilot study involved 41 final-year medical students participating in a trauma simulation session. Students self-selected into teams of 4 to 6 and were randomized to either an LLM-assisted group (ChatGPT-4o mini) or a control group without LLM access. All teams completed 18 video-based trauma scenarios requiring time-sensitive clinical decisions. Prompting was unrestricted."

"Confidence in trauma management improved in both groups (P<.001), with larger gains in the non-LLM group (P=.02). LLM support did not enhance the decision accuracy or speed and was associated with longer response times in some complex cases. Teams without LLMs demonstrated more active discussion and scored higher in teamwork ratings (median 5.0 [IQR 5.0-5.0] vs median 3.5 [IQR 3.0-4.5]; P=.08). Students primarily used the LLM for fact-checking but reported vague or overly general responses."

"This randomized controlled pilot study involved 41 final-year medical students participating in a trauma simulation session. Students self-selected into teams of 4 to 6 and were randomized to either an LLM-assisted group (ChatGPT-4o mini) or a control group without LLM access. All teams completed 18 video-based trauma scenarios requiring time-sensitive clinical decisions. Prompting was unrestricted." "Confidence in trauma management improved in both groups (P<.001), with larger gains in the non-LLM group (P=.02). LLM support did not enhance the decision accuracy or speed and was associated with longer response times in some complex cases. Teams without LLMs demonstrated more active discussion and scored higher in teamwork ratings (median 5.0 [IQR 5.0-5.0] vs median 3.5 [IQR 3.0-4.5]; P=.08). Students primarily used the LLM for fact-checking but reported vague or overly general responses."

Will #chatGPT 4o-mini improve #medEd about time-sensitive #trauma decisions?

Not in this randomized controlled trial:
- Accuracy and speed no better than the control group?
- #AI group had *smaller* confidence gains?
- Student complaints: "vague and overly general"

doi.org/10.2196/79134

1 week ago
"After advertising our initiative of implementing a private instance of ChatGPT to the workforce, an introductory webinar was conducted to educate interested employees about GenAI models and their responsible use. This webinar attracted 560 participants, demonstrating the high level of interest in this technology. All employees were then invited to apply for access to our HIPAA compliant, private GenAI studio and experiment with data not allowed within the public instance, such as patient information and intellectual property."

"After advertising our initiative of implementing a private instance of ChatGPT to the workforce, an introductory webinar was conducted to educate interested employees about GenAI models and their responsible use. This webinar attracted 560 participants, demonstrating the high level of interest in this technology. All employees were then invited to apply for access to our HIPAA compliant, private GenAI studio and experiment with data not allowed within the public instance, such as patient information and intellectual property."

"From a technological perspective, 111 960 167 tokens (the atomic unit of large language models) were used in the first 6 months of implementation, costing the institution about $4200. On average, there were 60 users/week submitting about 671 queries/week. There were 34 new users per week on average. The highest number of new users occurred during the week of June 5, shortly after our introductory webinar, with another spike of new users in August around our first enterprise-wide prompt-a-thon and the following in December after subsequent prompt-a-thons (see Figure 2). These patterns highlight how influential educational programs were in catalyzing adoption within the health system...."

"From a technological perspective, 111 960 167 tokens (the atomic unit of large language models) were used in the first 6 months of implementation, costing the institution about $4200. On average, there were 60 users/week submitting about 671 queries/week. There were 34 new users per week on average. The highest number of new users occurred during the week of June 5, shortly after our introductory webinar, with another spike of new users in August around our first enterprise-wide prompt-a-thon and the following in December after subsequent prompt-a-thons (see Figure 2). These patterns highlight how influential educational programs were in catalyzing adoption within the health system...."

"When asked what they would use GenAI tools for if available, most stated they would frequently use it for writing documents, editing, summarizing information, and analyzing data. However, they did not perceive that using GenAI for these tasks would reduce the importance of maintaining their own skills in these areas, indicating that their experience suggested that GenAI could not automate these tasks. When asked about factors that would increase or decrease how frequently they use GenAI tools, participants indicated that public release of more capable AI tools, permanent access to the GenAI studio at NYULH, positive reports from colleagues, and studies showing productivity benefits of using AI tools would increase their usage."

"When asked what they would use GenAI tools for if available, most stated they would frequently use it for writing documents, editing, summarizing information, and analyzing data. However, they did not perceive that using GenAI for these tasks would reduce the importance of maintaining their own skills in these areas, indicating that their experience suggested that GenAI could not automate these tasks. When asked about factors that would increase or decrease how frequently they use GenAI tools, participants indicated that public release of more capable AI tools, permanent access to the GenAI studio at NYULH, positive reports from colleagues, and studies showing productivity benefits of using AI tools would increase their usage."

Secure, HIPAA compliant. Available to all 50,000 employees.
• Launched October 1.
• Deployed: 45k Windows devices, 3.7k Mac devices, and 12.5k smartphones
• Unique Users: 19,601
• New Chats Per Day: ~4,000

How much will a #business pay to give employees #AI tools?

The first 1000 people at #NYU #LangoneHealth to use #Azure GenAI Studio used 111,960,167 tokens in the first 6 months.

It cost just $4200: doi.org/10.1093/jami...

So they gave the #tech to everyone. About 20,000 users!
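Back-of-envelope arithmetic on the quoted figures (the numbers come from the article; the per-million-token framing and rounding are mine):

```python
tokens_used = 111_960_167  # tokens consumed in the first 6 months (from the article)
total_cost = 4200          # approximate cost in USD (from the article)

# Unit cost implied by those two figures:
cost_per_million_tokens = total_cost / tokens_used * 1_000_000
# ≈ $37.5 per million tokens
```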

2 weeks ago
"Figure 2: Architecture and training workflow of F/S-RM. (a) Adaptive reasoning task. (b) Reward signal generation modeled as an adaptive reasoning chain. (c) Two-stage training pipeline for optimizing fast-thinking judgment and slow-thinking CoT reasoning."

"Figure 2: Architecture and training workflow of F/S-RM. (a) Adaptive reasoning task. (b) Reward signal generation modeled as an adaptive reasoning chain. (c) Two-stage training pipeline for optimizing fast-thinking judgment and slow-thinking CoT reasoning."

"Table 1: Comparison on RewardBench, RM-Bench, JudgeBench, and average performance. Bold numbers indicate the best performance, Underlined numbers indicate the second best. ∆ shows hybrid performance change vs. full slow thinking; ↓ shows token reduction vs. full slow thinking. Detailed comparison results are provided in the Appendix A.7."

"Table 1: Comparison on RewardBench, RM-Bench, JudgeBench, and average performance. Bold numbers indicate the best performance, Underlined numbers indicate the second best. ∆ shows hybrid performance change vs. full slow thinking; ↓ shows token reduction vs. full slow thinking. Detailed comparison results are provided in the Appendix A.7."

Here's another one using reward models: doi.org/10.48550/arX...

Sidenote: "First token PERDITION" made me laugh — reinforcement learning meets Christian theology!

#cogSci #CompSci #religion

2 weeks ago

Good eye!

I can confirm that an Antibiotic Resistance proposal stalled while “pending” scientific review (that was scheduled for January).

When feeling pessimistic, I try to remind myself how government shutdowns can cause delays. Of course, I sometimes wonder whether something else is afoot.

2 weeks ago
"No current open opportunities."

"No current open opportunities."

"Priorities
This document has been prepared in furtherance of the AHRQ Director's responsibility to carry out AHRQ's mission, duties, and statutory responsibilities. See 42 U.S.C. § 299 et sec. Please note that this is not an exhaustive list of all agency priorities. This document is intended to clarify specific issues that currently require additional guidance. AHRQ continues to support projects across the spectrum of health services research. Through executive orders, including those on Gold-Standard Science and the Make America Healthy Again Commission Report (PDF, 4 MB), the President has directed HHS to close critical research gaps and guide efforts to better combat chronic disease in America and improve the health of all Americans through gold-standard science. To meet these requirements and fulfill our mission, AHRQ is prioritizing research in the following areas:

Patient Safety
Making healthcare delivery safer and more effective for all Americans will continue to be an important part of AHRQ's core function. Important areas of research include medical and hospital errors, including the development of performance measures, medication safety, and improving diagnosis with a focus on uncovering and implementing changes that can benefit large numbers of patients in significant ways or profoundly and substantially benefit smaller patient groups.

Preventing Antibiotic Resistance
Antibiotic resistance is a major health problem, with overuse and misuse of antibiotics as a prime driver. The development of strategies to encourage appropriate antibiotic use, including in children, is a fundamental challenge of the 21st century. AHRQ will continue to play a lead role in this area...."

"Priorities This document has been prepared in furtherance of the AHRQ Director's responsibility to carry out AHRQ's mission, duties, and statutory responsibilities. See 42 U.S.C. § 299 et sec. Please note that this is not an exhaustive list of all agency priorities. This document is intended to clarify specific issues that currently require additional guidance. AHRQ continues to support projects across the spectrum of health services research. Through executive orders, including those on Gold-Standard Science and the Make America Healthy Again Commission Report (PDF, 4 MB), the President has directed HHS to close critical research gaps and guide efforts to better combat chronic disease in America and improve the health of all Americans through gold-standard science. To meet these requirements and fulfill our mission, AHRQ is prioritizing research in the following areas: Patient Safety Making healthcare delivery safer and more effective for all Americans will continue to be an important part of AHRQ's core function. Important areas of research include medical and hospital errors, including the development of performance measures, medication safety, and improving diagnosis with a focus on uncovering and implementing changes that can benefit large numbers of patients in significant ways or profoundly and substantially benefit smaller patient groups. Preventing Antibiotic Resistance Antibiotic resistance is a major health problem, with overuse and misuse of antibiotics as a prime driver. The development of strategies to encourage appropriate antibiotic use, including in children, is a fundamental challenge of the 21st century. AHRQ will continue to play a lead role in this area...."

It seems like #AHRQ has issued early expirations for *all* of its NOFOs and it does not seem to have *any* forecasted NOFOs: www.ahrq.gov/funding/fund... 😬

However their website does report continued (and updated) funding priorities: www.ahrq.gov/funding/prio... 🤞

2 weeks ago
"PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants’ trust compared to the baseline. However, this increased caution did not translate to behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task."

Do scientists rely less on #AI after seeing it hallucinate citations, findings, etc.?

#PaperTrail revealed what #LLMs did and didn't accurately summarize (compared to source).

That reduced researchers' trust in AI, but not their *reliance* on AI.

doi.org/10.48550/arX...

#tech

2 weeks ago
"Fig. 3. Performance on conjunction tasks and base-rate task, before and after training."

"Participants (N = 56) were randomly assigned to a training group, which completed a series of conjunction judgment tasks with feedback, or a control group. Results showed that the training group improved on both trained and untrained conjunction tasks, including those based on real-world and clinical scenarios, while the control group showed no such improvement. No transfer effects were observed for unrelated base-rate tasks. Performance gains were gradual, suggesting that participants developed judgment strategies over time rather than immediately adopting normative rules. These findings demonstrate that conjunction fallacies can be mitigated through self-guided learning with minimal instruction, offering a promising approach to improving probabilistic reasoning."

"Training stimuli consisted of 30 judgment problems. A variety of tasks were used to keep participants attentive and ensure they read through the full task description. The training set comprised 15 Linda-like conjunction tasks from Andersson et al. (2020), six conjunction tasks comprising fictional situations paraphrased from the research literature, three conjunction tasks describing sequences of dice rolls (Tversky & Kahneman, 1983), and three conjunction tasks describing sequences of letters (Tversky & Kahneman, 1983).

Further, we included three Monty Hall tasks..."

"The control group instead completed 30 spelling tasks."

"Each block of decision tasks comprised 12 (pre- and post-test) or 10 (training) tasks in randomized order, which were completed one by one. As described under materials, a task included a brief description, a question or prompt, and the choice alternatives. This text was presented together with an image relating to the description in the task (Fig. 1A). There was no time limit on any task, and the task concluded once a choice was made by clicking on an alternative.

Participants in both the training and control groups were informed about correct or incorrect responses after each trial, but no further feedback on which strategy to use."

Can feedback on #cognitiveBias tests train people to overcome #bias?

56 people were randomized to practice either 30 cog bias tests or 30 spelling tests between 12 pre- and post-tests (about news, clinical psych, etc.). Decision practice/feedback helped!

doi.org/10.1016/j.ac...
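The normative rule these training tasks target is the conjunction rule: P(A and B) ≤ P(A), since every case satisfying both conjuncts also satisfies each one alone. A minimal Python sketch of why the rule holds (the attribute names and probabilities are hypothetical illustrations, not the paper's stimuli):

```python
import random

random.seed(0)

# Simulate a population with two binary attributes, e.g. "bank teller" (A)
# and "feminist" (B), sampled independently with hypothetical base rates.
N = 100_000
p_teller, p_feminist = 0.05, 0.30

teller = [random.random() < p_teller for _ in range(N)]
feminist = [random.random() < p_feminist for _ in range(N)]

p_a = sum(teller) / N
p_a_and_b = sum(t and f for t, f in zip(teller, feminist)) / N

# Conjunction rule: anyone counted in "teller AND feminist" is also
# counted in "teller", so the joint frequency can never exceed it.
assert p_a_and_b <= p_a
print(f"P(teller) = {p_a:.3f}, P(teller and feminist) = {p_a_and_b:.3f}")
```

Judging the conjunction as more probable than one of its constituents, as in the classic Linda problem, is exactly the error the trial-by-trial feedback trains people out of.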

2 weeks ago
"Fig. 3 Data ontology after clustering in the feedback aggregator component."

Anastasiou, L., Elguendouze, S., Efstathiou, I., Mariani, I., Cabrio, E., Villata, S., Karacapilidis, N., Concilio, G., Rizzo, F., Greco, S., Jermini, C., Coppola, C., Apostolopoulou, A., Domalis, G., Tsakalidis, D., & De Liddo, A. (2026). ORBIS: An AI-Enabled Architecture for Scaling Online Deliberation. In S. K. Aikins & T. Dimitrijevska-Markoski (Eds.), Artificial Intelligence and Government: Examining the Roles and Uses of AI in Enhancing Government Operations (pp. 89–118). Springer Nature Switzerland. https://doi.org/10.1007/978-3-032-12344-2_5

#AI can summarize massive bodies of text, but can it visualize their arguments?

#ORBIS performs #logic mining on deliberation sites like #BCause—identifying premises and predicting their relationships—to map arguments for and objections to each position.

doi.org/10.1007/978-...

2 weeks ago
Are you a good driver? The story of how a secret project at Google led to driverless cars on American roads. And, an answer to the question: are the robots actually safer drivers than we …

Part 1 of this series about autonomous driving (from @searchengine.show.web.brid.gy) does well to include the public-private partnerships fueled by @darpa.mil in its origin story. 🤓

Thanks @pjvogt.bsky.social et al.

www.searchengine.show/are-you-a-go...

#innovation #government #industry #science

2 weeks ago

🤓 March Methods > March Madness

2 weeks ago
"The main point of strategic reflectivism is that reflective reasoning (and its resources) should be allocated according to the system’s goals, optimally fulfilling competing goals wherever possible. ...the strategic reflectivist should acknowledge that when judgment aggregation is easy, more reflective inference may not be worth its opportunity costs."

Byrd, N. (2025). Strategic Reflectivism In Intelligent Systems. Lecture Notes In Computer Science. https://doi.org/10.48550/arXiv.2505.22987

"Figure 1: Adaptive compute allocation across difficulty levels on Qwen3-8B-Base. (a) Compared to GRPO, CODA dynamically allocates reasoning tokens by question difficulty, consuming substantially fewer tokens on easier problems while increasing compute for harder ones. (b) On easy tasks (GSM8K, MATH), extra tokens yield marginal gains, and CODA achieves optimal accuracy at minimal cost by avoiding unnecessary reasoning. On hard tasks (AIME24&25), additional tokens substantially improve performance, and CODA encourages deeper reasoning to maximize accuracy."

So most of the token savings are on easier tasks, indicating that the control models over-deploy reflective reasoning.

Wu, S., Xie, J., Zhang, Y., & Xiao, Y. (2026). CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning (arXiv:2603.08659). arXiv. https://doi.org/10.48550/arXiv.2603.08659

A new #AI paper confirms a component of #StrategicReflectivism: when accuracy is relatively easy for an intelligent system, reflective inference isn't worth its added tokens (doi.org/10.48550/arX...).

See CODA (Compute Allocation by Difficulty Awareness): doi.org/10.48550/arX...

2 weeks ago

I’ll ask about recording, but I can probably share just about every slide after the event.

And I’m always happy to chat. :)

3 weeks ago
Moral Measures Workshop - Consortium on Moral Decision-Making This event will bring together researchers to give short talks on methodology and measurement questions in the study of moral and ethical decision-making. Although many of our conference events have f...

Check out the full "Moral Measures" program at moralconsortium.psu.edu/events/moral...

3 weeks ago
Peek inside the black box of decision-making …at scale
- Thinking aloud (in-person or online) with audio annotation
- Thinking alone or together (chat) with text annotation

Both methods allow people to observe actual decision-making processes — not just final decisions.

To learn more, go to byrdnick.com

Join today's free webinar about "moral measurement"! I'm talking #psychometrics and data quality in the era of online experiments and AI. I review two problems, two solutions, and two resources.

Register at ssri.psu.edu/news/mo...

Thanks to Penn State for hosting!

3 weeks ago

If only @theverge.com had an Editor-at-Large in the DC area who could run these stories down to the origins. 😉

But seriously, David: day trips to NSA’s free Cryptologic museum or the (probably not free) International Spy Museum could be fun for the family and rife with Version History ideas.

3 weeks ago

The start and end of this episode are what I want more of in #tech journalism:

The #history of #innovation and deployment.

These origin stories often start with public-private #science partnerships trying to solve problems in #intelligence or #security — Bell, GE, IBM, MITRE, Rand, RCA, SRI. 🤓 🍿

3 weeks ago