Towards a shared infrastructure for assembling web search engines
Web search engines are essential for navigating the web. Suppose we look at the web as a service that is provided by public utility companies, a service similar to electricity, water or telephone. To make sure that everyone has access to the web, public utility companies have to be subject to public control and regulation. Without regulation, a single firm may abuse its natural monopoly, for instance by raising prices, by degrading the service, by delivering unequal quality to different groups, or by pushing advertisements and propaganda. Equity requires that all citizens can access the web at a fair price, and at a sufficient level of quality, via transparent, well-regulated, community-based or government-based control.
OpenWebSearch.eu is a European Union-funded project that researches what a transparent, well-regulated, community-based web search engine would look like. The project builds the index for a web search engine on open infrastructure that is distributed over four data centers in four different European countries. The data centers cooperatively crawl the web, cooperatively preprocess and enrich the web data, and cooperatively build an inverted index that is shared with the world. We envision a future where a search engine is "assembled" from parts provided by many different companies, based on public standards. I will discuss public standards for search engine indexes, such as the Common Index File Format (CIFF) and approaches based on open data formats like Parquet and open cloud object storage like S3. Furthermore, I will show how researchers can query the Open Web Index remotely using a low-cost local machine, without the need to download the full index, even though it currently consists of more than 10 billion web pages.
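To give a feel for why remote querying can work without downloading the full index, here is a minimal Python sketch. It is not the project's implementation and does not use CIFF or Parquet. It assumes a toy layout in which a small term dictionary maps each term to the byte range of its posting list inside one large blob, so a query fetches only the ranges it needs. All names (`build_index`, `RangeReader`, `lookup`, `search_and`) are hypothetical, and `RangeReader` is an in-memory stand-in for ranged reads against object storage.

```python
import struct
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy layout: term -> (byte offset, byte length) in the postings blob.
TermDictionary = Dict[str, Tuple[int, int]]


def build_index(docs: Dict[int, str]) -> Tuple[bytes, TermDictionary]:
    """Build a toy inverted index: one postings blob plus a small term dictionary."""
    postings: Dict[str, List[int]] = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)

    blob = bytearray()
    dictionary: TermDictionary = {}
    for term in sorted(postings):
        # Each posting list is stored as little-endian unsigned 32-bit document ids.
        encoded = struct.pack(f"<{len(postings[term])}I", *postings[term])
        dictionary[term] = (len(blob), len(encoded))
        blob += encoded
    return bytes(blob), dictionary


class RangeReader:
    """In-memory stand-in for a remote store that serves byte ranges."""

    def __init__(self, blob: bytes) -> None:
        self._blob = blob
        self.bytes_read = 0  # how much data the "client" actually transferred

    def read(self, offset: int, length: int) -> bytes:
        self.bytes_read += length
        return self._blob[offset:offset + length]


def lookup(term: str, dictionary: TermDictionary, reader: RangeReader) -> List[int]:
    """Fetch only the posting list of one term."""
    if term not in dictionary:
        return []
    offset, length = dictionary[term]
    raw = reader.read(offset, length)
    return list(struct.unpack(f"<{length // 4}I", raw))


def search_and(terms: List[str], dictionary: TermDictionary, reader: RangeReader) -> List[int]:
    """Return the ids of documents that contain all query terms."""
    result: Optional[Set[int]] = None
    for term in terms:
        ids = set(lookup(term.lower(), dictionary, reader))
        result = ids if result is None else result & ids
    return sorted(result or [])


if __name__ == "__main__":
    docs = {1: "open web search", 2: "open web index", 3: "search engine index"}
    blob, dictionary = build_index(docs)
    reader = RangeReader(blob)
    print(search_and(["web", "index"], dictionary, reader))
    print(f"read {reader.bytes_read} of {len(blob)} bytes")
```

In a real deployment the `read` call would be replaced by something like an HTTP range request against an object store, and a columnar format such as Parquet plays a comparable role by letting a client read file metadata first and then fetch only the parts it needs. How the Open Web Index organizes this in practice is not specified here.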
_To be presented at the European Conference on Information Retrieval (ECIR 2026), IR4Good track, on 30 March 2026 in Delft, the Netherlands_
I will be giving an invited talk at the #ECIR2026 IR4Good track about #OpenWebSearchEU: "Towards a shared infrastructure for assembling web search engines"