Remote Access to the Open Web Index

Remote Access to the Open Web Index#

The Open Web Index is not only distributed through collections in iRODS. We also provide a generic representation of the OWI that can be accessed and queried remotely, especially suitable for one-off queries (such as selecting a document subset of interest).

Representation#

The remote index consists of a Parquet-based inverted file representation, stored on S3 object storage. The index consists of two Parquet-based virtual tables: an index table and a documents table. For a detailed description of the remote index, we refer to our research paper on remote indexes [HHdV26].

Index#

The index table can be used to find document IDs that contain a certain term or set of terms. It also contains “normalized term frequencies”: the frequency of each term in the document if all document length were artificially scaled to be a certain constant length. This simplifies ranking models like BM25, which often use a length-based component: since all documents have the same length, the length-based component can be removed from the ranking formula and we no longer have to look up the lengths of each individual length.

An example index.parquet file would look like this:

term	docid	tf
fish	1	50
one	1	25
two	1	25

By storing the inverted file in the Parquet file format, we can very efficiently filter on term and skip reading the majority of the file.

Documents#

The documents table can be used to map the numeric, internal identifiers used in the index table back to the external, SHA2-based identifiers that we use throughout the Open Web Index. For convenience, we also include the URL so that users that only want to search a single page can make do with the remote index only.

The documents.parquet file would look like this:

docid	id	url
1	0a27e…	https://example.com/…

Given a docid returned by retrieval over the index table, we find the associated documents by filtering the documents table on the docid (again skipping a large majority of the file). Alternatively, when performing a larger lookup against the documents table (e.g. when finding all documents that contain a certain popular term), it may be beneficial to download the full documents table for local use – it is significantly smaller than the index table.

Usage#

We recommend using DuckDB to access the remote index. The examples in this section will be assuming you use DuckDB. We recommend to enable the parquet_metadata_cache setting, to speed up queries that re-use the same parts of the index.

SET parquet_metadata_cache = true;

The remote index is served from Allas, the S3-compliant object storage provided by our partner CSC. The data resides in a public S3 bucket named 2006391-owi-remote-index. To register the bucket in DuckDB and associate it with the right S3 endpoint, issue the following query:

CREATE PERSISTENT SECRET ows_remote_index (
    TYPE s3,
    ENDPOINT 'a3s.fi',
    URL_STYLE 'path',
    SCOPE 's3://2006391-owi-remote-index'
);

Following the organization of the main Open Web Index, the remote index is distributed as daily shards per data center. The path contains a reference to the collection ID of the associated collection in the Open Web Index. Say, for instance, we have the following path:

2006391-owi-remote-index/
    year=2025/
        month=12/
            day=25/
                collection=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/

The referenced collection ID refers the main collection produced for the 25th of December, 2025. It can be found in the LEXIS portal at https://portal.lexis.tech/publicDataSets/26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/details. Similarly, it can be accessed with Owilix as follows:

owilix remote ls 'all/id=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9'

At the time of writing (January 2025), the remote index contains almost 550 million documents, the number growing by the day.

Retrieval#

The index is stored in the index subdirectory of the collection directory. We provide pre-computed indices for three document representations:

main_content: the body of the Web page (with boilerplate content removed)
title: the HTML title of the page
description: the HTML meta description of the page

Like the regular Open Web Index, the remote index is then partitioned into language-based shards. The full directory structure of the index tables is the following:

collection=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/
    index/
        representation=main_content/
            language=eng/
                postings_0.parquet
                postings_1.parquet
                ...
            ...
        representation=title/
            ...
        representation=description/
            ...

To find all English documents from this specific daily index shard that contain the term “search” in the body, we would issue the following query:

SELECT docid
FROM 's3://2006391-owi-remote-index/year=2025/month=12/day=25/collection=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/index/representation=title/language=eng/*.parquet'
WHERE term = 'search';

We can define a VIEW to simplify access to a larger part of the index:

CREATE VIEW remote_index AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/index/*/*/*.parquet'
);

Note: queries with glob operators (including the CREATE VIEW above) may take longer to process, as DuckDB first needs to fetch the full list of files before it can prune away the files it does not need.

You can use this view with filters of your preference, consider for example:

-- Find all Dutch documents from December 2025
-- that contain "gezellig" in the title
SELECT docid
FROM remote_index
WHERE representation = 'title'
    AND language = 'nld'
    AND make_date(year, month, day) BETWEEN '2025-12-01' AND '2025-12-31'
    AND term = 'gezellig';

We support more complex queries as well:

Multi-term (disjunctive) queries#

SELECT docid
FROM remote_index
WHERE representation = 'main_content'
    AND term IN ('open', 'web', 'search')

Conjunctive queries#

WITH postings AS (
    SELECT docid
    FROM remote_index
    WHERE representation = 'main_content'
        AND term IN ('open', 'web', 'search')
)
SELECT docid
FROM postings
GROUP BY docid
HAVING count(*) = 3;

Ranking queries#

A simple ranking model based on (unsmoothed) unigram language models can be expressed easily:

SELECT docid, sum(log(tf)) AS score
FROM remote_index
WHERE representation = 'main_content'
    AND term IN ('open', 'web', 'search')
GROUP BY docid
ORDER BY score DESC
LIMIT 10;

The BM25 ranking model can also be expressed in terms of remote queries but becomes a bit more complicated. Refer to the repository of the remote querying paper for an example implementation (open-web-index-bm25.sql).

Document lookup#

Given a set of document identifiers, the next step is to fetch the URLs of the associated Web pages (e.g. for top-10 rankings), or the associated IDs that can be mapped back to the collections in the regular Open Web Index.

The numeric, internal document identifiers are unique within each shard of the index. This means it is not enough to fetch the docid column only: the collection and language columns are also part of the key that uniquely describes a document in the remote index.

We start off by creating another VIEW, similar to the one for the index itself.

CREATE VIEW documents AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/documents/*/*.parquet'
);

Then, imagine we ran a conjunctive ranking query to find the top 10 documents that reference “Open Web Search”:

SELECT collection, language, docid, sum(log(tf)) AS score
FROM remote_index
WHERE representation = 'main_content'
    AND language = 'eng'
    AND term IN ('open', 'web', 'search')
GROUP BY ALL
    HAVING count(*) = 3
ORDER BY score DESC
LIMIT 10;

The query returns the following results:

collection	language	docid	score
d118aa3e-d5d3-11f0-86db-f6a03915313d	eng	6794391	6.646564081665443
d8690524-f51b-11f0-89ba-02a47ca5d9fd	eng	366670	6.646564081665443
311ef39c-fbfa-11f0-ae4a-02a47ca5d9fd	eng	982722	6.646564081665443
311ef39c-fbfa-11f0-ae4a-02a47ca5d9fd	eng	4967644	5.341410781178393
cac6c1c8-d990-11f0-a6f8-f6a03915313d	eng	1888202	5.328018560809645
6b4eceb8-f6d0-11f0-b007-02a47ca5d9fd	eng	4235929	5.139381565823188
311ef39c-fbfa-11f0-ae4a-02a47ca5d9fd	eng	683210	5.139381565823188
0dd75a3a-f9e0-11f0-8bf1-02a47ca5d9fd	eng	354236	4.999726308254446
f680cb14-ee31-11f0-ae4a-02a47ca5d9fd	eng	1956213	4.999726308254446
b7786952-e190-11f0-85af-8ebf6bb2cab9	eng	1083842	4.999726308254446

We can (trivially) retrieve the document metadata for the top document:

SELECT docid, id, url
FROM documents
WHERE collection = 'd118aa3e-d5d3-11f0-86db-f6a03915313d'
    AND language = 'eng'
    AND docid = 6794391;

This gives the following result:

docid	id	url
6794391	f1cbb115ae323d89661da6652888a76ad96f67986b74248d42cd981579f969ab	https://notes.beauhilton.com/tags/web/

The complete query for ranking (using the inverted index) and document lookup (in the forward index) can be expressed as a JOIN query:

WITH results AS (
    -- Insert ranking query here
    SELECT ...
)
SELECT docid, id, url, score
FROM results
NATURAL JOIN documents
ORDER BY score DESC;

For the same search request as above, this gives the following results:

docid	id	url	score
366670	10a8c74a0af6dacef55aea3db93ce39fa540833a3e39f83f70cf3eefd57b876e	https://1day.dev/tags/Web/	6.646564081665443
982722	10a8c74a0af6dacef55aea3db93ce39fa540833a3e39f83f70cf3eefd57b876e	https://1day.dev/tags/Web/	6.646564081665443
6794391	f1cbb115ae323d89661da6652888a76ad96f67986b74248d42cd981579f969ab	https://notes.beauhilton.com/tags/web/	6.646564081665443
4967644	4f6cf7cb523103f09cc379731032e9fbc43a7766f1080a6bbd7d0590aa664692	https://search.n2sm.co.jp/	5.341410781178393
1888202	d2da2f62736ed38d317e9e0c9d24466f9fa0a76b122180e777d61abb878a55e5	https://www.searchenginefinder.com/Computer_and_WWW/Open_Source/	5.328018560809645
683210	0aa2aec48f12aae64ff0b5172d979a0bee098600233c5f6b5e811d23e34fbf5d	https://www.auwebdir.com/aulink/about.aspx	5.139381565823188
4235929	d42036b83a27a753ad199bcd18d962d187d26bbc6d7b269be985f1d81c343bfe	https://freewebsearch.org/en/	5.139381565823188
202836	0f7fcb462b82cbb7eb8c1dec939ef1f07c8983719efef5585fcfa72142d8b728	https://www.bamid.gov.tw/en/Default.aspx	4.999726308254446
1210356	599ee60ce0c3c532282f13ecb64b7129e15e323d78b2ee4c75518fed00c12e51	https://www.bamid.gov.tw/en/	4.999726308254446
1083842	599ee60ce0c3c532282f13ecb64b7129e15e323d78b2ee4c75518fed00c12e51	https://www.bamid.gov.tw/en/	4.999726308254446

Note that some URLs appear multiple times, with different docids; this happens when a page is crawled multiple times over the span of multiple days.

Data Lake representation#

While the remote index allows access to large amounts of crawled data, costs like the basic S3 file glob overhead remain annoying, especially when the size of the index grows or users want to access the index more than once in sequence. Whenever the remote index is used with the views defined earlier, resolving the glob in the S3 file path triggers rather slow listing operations. Also, DuckDB has to assume that the collection of Parquet files on S3 could change dynamically, so every query implies a HEAD request to check for changes.

Resolving these issues is possible by registering the remote index as part of a data lake metadata catalog (like Iceberg, Delta Lake or DuckLake). This simplifies queries significantly, as all files (and important metadata) are registered in a catalog, and can be assumed to be static. Whenever we issue a query to the remote index, the data lake catalog helps select the files that need to be processed, based on the query filters, and can immediately start processing the query (rather than waiting for listing operations and HEAD requests). Apart from these query processing advantages, using a data lake catalog would provide additional advantages as well, for example to handle deletions following from takedown requests, and modeling the curation process for specialised Web collections.

DuckLake example#

DuckLake is in active development, so integration with the remote index is experimental and certain features may not work as expected.

Let’s explore these advantages by creating a local DuckLake catalog, based on the remote index.

The huge amounts of data for the OWI remain hosted remotely, in S3 object storage, but the metadata is copied to your local machine and will be queried much more efficiently. To create a DuckLake, use the following statements:

ATTACH 'ducklake:ows_remote_index.ducklake' AS ows_remote_index;
USE ows_remote_index;

Import the metadata of the available OWI index files that make up the inverted index:

-- Create the remote_index table with the right schema
CREATE TABLE remote_index AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/index/*/*/*.parquet'
) WITH NO DATA;

-- Set the partition columns
ALTER TABLE remote_index SET PARTITIONED BY (
    year, month, day, collection, representation, language
);

-- Import the metadata from the existing files
CALL ducklake_add_data_files(
    'ows_remote_index',
    'remote_index',
    's3://2006391-owi-remote-index/*/*/*/*/index/*/*/*.parquet'
);

These commands fetch the metadata of the Parquet files in S3, and register them in DuckLake. Because the OWI comprises a large number of files, the final operation may take a while to complete.

Next, we proceed with analogous queries to download the metadata for the document records that constitute the forward index:

CREATE TABLE documents AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/documents/*/*.parquet'
) WITH NO DATA;

ALTER TABLE documents SET PARTITIONED BY (
    year, month, day, collection, language
);

CALL ducklake_add_data_files(
    'ows_remote_index',
    'documents',
    's3://2006391-owi-remote-index/*/*/*/*/documents/*/*.parquet'
);

We can now use the remote_index and documents table in the exact same way as the VIEWs we used earlier (but with reduced S3 overhead). Don’t forget to call ducklake_add_data_files for the new index partitions as they appear daily!