Quantifying the Systematic Bias in the Accessibility and Inaccessibility of Web Scraping Content From URL-Logged Web-Browsing Digital Trace Data

Abstract: Social scientists and computer scientists are increasingly collecting observational digital trace data and analyzing these data post hoc to understand the content people are exposed to online. However, these content collection efforts may be systematically biased when the entirety of the data cannot be captured retroactively. We call this often-unstated assumption the problematic assumption of accessibility. To examine the extent to which this assumption is problematic, we identify 107k hard news and misinformation web pages visited by a representative panel of 1,238 American adults and record whether each visited page was accessible (the web scrape succeeded) or inaccessible (the scrape failed).
Note: Hard news webpages: χ²(3) = 745.6, p < .001; misinformation webpages: χ²(3) = 13.1, p = .005. Distribution of websites in each accessibility category. Accessible websites are those for which the web scrape is successful; inaccessible websites are those for which it is unsuccessful. Accessible websites are further categorized into: unrestricted websites, whose content is neither restricted nor replaced by an error; restricted websites, whose content sits behind a paywall, login page, or some other error message on the web page itself; and errors, for which the web server returns an HTTP status code of 400 or greater.
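The taxonomy in the note above can be expressed as a small decision rule. The sketch below is purely illustrative and is not the authors' actual pipeline: the function name, its inputs, and especially the keyword-based paywall/login detection are assumptions introduced here for clarity.

```python
def classify_scrape(status_code, page_text):
    """Classify one scrape attempt into the accessibility categories
    described above: 'inaccessible', 'error', 'restricted', or 'unrestricted'.

    status_code: the HTTP status code, or None if the request itself failed.
    page_text:   the scraped page text, or None if nothing was retrieved.
    """
    # Inaccessible: the scrape itself was unsuccessful (no response at all).
    if status_code is None or page_text is None:
        return "inaccessible"
    # Error: the server responded, but with an HTTP error status (4xx/5xx).
    if status_code >= 400:
        return "error"
    # Restricted: page loaded, but the content is gated. The marker list
    # below is a hypothetical heuristic, not the paper's method.
    markers = ("subscribe to continue", "log in to read", "create an account")
    if any(m in page_text.lower() for m in markers):
        return "restricted"
    # Unrestricted: content retrieved with no gate or error.
    return "unrestricted"
```

Applied over a corpus of logged URLs, tallying these labels would yield the per-category distribution that the figure note's chi-square tests compare across groups.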
Figure 1. Accessibility category metrics over time.
Figure 2
🚨 New from me, Kumar, Durumeric & Hancock: we use web-browsing data (N = 21M) to quantify the (in)accessibility of misinformation and news visits, finding that conservative misinformation is the most likely to be inaccessible to researchers via scraping. doi.org/10.1177/08944393231218214