Paper:
https://doi.org/10.1007/978-3-032-21289-4_25
Want to try it yourself?
#openwebsearcheu book chapter:
openwebsearcheu-public.pages.it4i.eu/ows-the-book/content/dnt...
Code repository for running remote queries using #DuckDB […]
Gijs Hendriksen presenting our work on "remote querying" to provide access to huge Web resources through de facto standard tech: Parquet files in S3 queried using #DuckDB to facilitate IR research at very acceptable latencies.
Run your ClueWeb experiment in 10 […]
I will be giving an invited talk at the #ECIR2026 IR4Good track about #OpenWebSearchEU: "Towards a shared infrastructure for assembling web search engines"
djoerdhiemstra.com/2026/towards-a-shared-in...
#ECIR2026 notifications were friendly to me
1. Full paper "Open Web Indexes for Remote Querying" with @gijs and @djoerd.
Can we let ppl query the Terabytes of Web Index we collect in #OpenWebSearch.EU in new ways, making good use of Parquet, S3, DuckDB?
Turns out the answer is a big YES! […]
The #OpenWebSearchEU project, coordinated by @mgrani.bsky.social and #OpenSearch Foundation, aims to strengthen Europe's digital sovereignty. With the launch of the #OpenWebIndex (OWI), it has reached a milestone for open internet search:
🧪
#NGISearch & #OpenWebSearchEU with #NGISargasso at the #NGIForum2025! ✨
Collaboration with the #NextGenerationInternet community is key to shape an #OpenInternet together
Discover more about NGI Sargasso: ngisargasso.eu
#NGIForum25 #DigitalSovereignty #NGI #OpenSource
@ngi4eu.bsky.social
#OpenWebSearchEU is a silver sponsor of #ECIR2025!
https://ecir2025.eu/sponsors/
Open web index #OWI update:
4 billion URLs crawled
185 different languages
28 million Hosts
750 TB crawled
1 TB crawled per day
147 WARC Datasets
17.5 TB size of Open Web Index
28.8 TB size of WARC datasets
346 public datasets
#OpenWebSearchEU #OpenWebSearch
https://ows.eu