(dnt:remote-index)=
# Remote Access to the Open Web Index

The Open Web Index is not only distributed through collections in iRODS. We also provide a generic representation of the OWI that can be accessed and queried remotely, especially suitable for one-off queries (such as selecting a document subset of interest).

## Representation

The remote index consists of a Parquet-based inverted file representation, stored on S3 object storage. The index consists of two Parquet-based virtual tables: an `index` table and a `documents` table. For a detailed description of the remote index, we refer to [our research paper on remote indexes](https://gitlab.science.ru.nl/informagus/remote-querying) {cite}`hendriksen2025open`.

### Index

The `index` table can be used to find document IDs that contain a certain term or set of terms. It also contains "normalized term frequencies": the frequency of each term in the document if all document length were artificially scaled to be a certain constant length. This simplifies ranking models like BM25, which often use a length-based component: since all documents have the same length, the length-based component can be removed from the ranking formula and we no longer have to look up the lengths of each individual length.

An example `index.parquet` file would look like this:

| term | docid | tf |
| ---- | ----- | -- |
| fish | 1     | 50 |
| one  | 1     | 25 |
| two  | 1     | 25 |

By storing the inverted file in the Parquet file format, we can very efficiently filter on `term` and skip reading the majority of the file.

### Documents

The `documents` table can be used to map the numeric, internal identifiers used in the `index` table back to the external, SHA2-based identifiers that we use throughout the Open Web Index. For convenience, we also include the URL so that users that only want to search a single page can make do with the remote index only.

The `documents.parquet` file would look like this:

| docid | id       | url                     |
| ----- | -------- | ----------------------- |
| 1     | 0a27e... | https://example.com/... |

Given a `docid` returned by retrieval over the `index` table, we find the associated documents by filtering the `documents` table on the `docid` (again skipping a large majority of the file). Alternatively, when performing a larger lookup against the `documents` table (e.g. when finding all documents that contain a certain popular term), it may be beneficial to download the full `documents` table for local use -- it is significantly smaller than the `index` table.

## Usage

We recommend using [DuckDB](https://duckdb.org/) to access the remote index. The examples in this section will be assuming you use DuckDB. We recommend to enable the `parquet_metadata_cache` setting, to speed up queries that re-use the same parts of the index.

```sql
SET parquet_metadata_cache = true;
```

The remote index is served from [Allas](https://docs.csc.fi/data/Allas/), the S3-compliant object storage provided by our partner CSC. The data resides in a public S3 bucket named `2006391-owi-remote-index`. To register the bucket in DuckDB and associate it with the right S3 endpoint, issue the following query:

```sql
CREATE PERSISTENT SECRET ows_remote_index (
    TYPE s3,
    ENDPOINT 'a3s.fi',
    URL_STYLE 'path',
    SCOPE 's3://2006391-owi-remote-index'
);
```

Following the organization of the main Open Web Index, the remote index is distributed as daily shards per data center. The path contains a reference to the collection ID of the associated collection in the Open Web Index. Say, for instance, we have the following path:

```
2006391-owi-remote-index/
    year=2025/
        month=12/
            day=25/
                collection=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/
```

The referenced collection ID refers the `main` collection produced for the 25th of December, 2025. It can be found in the LEXIS portal at https://portal.lexis.tech/publicDataSets/26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/details. Similarly, it can be accessed with Owilix as follows:

```bash
owilix remote ls 'all/id=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9'
```

At the time of writing (January 2025), the remote index contains almost 550 million documents, the number growing by the day.

### Retrieval

The index is stored in the `index` subdirectory of the collection directory. We provide pre-computed indices for three document representations:
- `main_content`: the body of the Web page (with boilerplate content removed)
- `title`: the HTML title of the page
- `description`: the HTML meta description of the page

Like the regular Open Web Index, the remote index is then partitioned into language-based shards. The full directory structure of the `index` tables is the following:

```
collection=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/
    index/
        representation=main_content/
            language=eng/
                postings_0.parquet
                postings_1.parquet
                ...
            ...
        representation=title/
            ...
        representation=description/
            ...
```

To find all English documents from this specific daily index shard that contain the term "search" in the body, we would issue the following query:

```sql
SELECT docid
FROM 's3://2006391-owi-remote-index/year=2025/month=12/day=25/collection=26eb6c02-e237-11f0-b4b1-8ebf6bb2cab9/index/representation=title/language=eng/*.parquet'
WHERE term = 'search';
```

We can define a `VIEW` to simplify access to a larger part of the index:

```sql
CREATE VIEW remote_index AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/index/*/*/*.parquet'
);
```
**Note: queries with glob operators (including the `CREATE VIEW` above) may take longer to process, as DuckDB first needs to fetch the full list of files before it can prune away the files it does not need.**


You can use this view with filters of your preference, consider for example:

```sql
-- Find all Dutch documents from December 2025
-- that contain "gezellig" in the title
SELECT docid
FROM remote_index
WHERE representation = 'title'
    AND language = 'nld'
    AND make_date(year, month, day) BETWEEN '2025-12-01' AND '2025-12-31'
    AND term = 'gezellig';
```

We support more complex queries as well:

#### Multi-term (disjunctive) queries
```sql
SELECT docid
FROM remote_index
WHERE representation = 'main_content'
    AND term IN ('open', 'web', 'search')
```

#### Conjunctive queries
```sql
WITH postings AS (
    SELECT docid
    FROM remote_index
    WHERE representation = 'main_content'
        AND term IN ('open', 'web', 'search')
)
SELECT docid
FROM postings
GROUP BY docid
HAVING count(*) = 3;
```

#### Ranking queries
A simple ranking model based on (unsmoothed) unigram language models can be expressed easily:

```sql
SELECT docid, sum(log(tf)) AS score
FROM remote_index
WHERE representation = 'main_content'
    AND term IN ('open', 'web', 'search')
GROUP BY docid
ORDER BY score DESC
LIMIT 10;
```

The BM25 ranking model can also be expressed in terms of remote queries but becomes a bit more complicated. Refer to the [repository of the remote querying paper](https://gitlab.science.ru.nl/informagus/remote-querying) for an example implementation 
([`open-web-index-bm25.sql`](https://gitlab.science.ru.nl/informagus/remote-querying/-/raw/main/open-web-index-bm25.sql?ref_type=heads)).

### Document lookup

Given a set of document identifiers, the next step is to fetch the URLs of the associated Web pages (e.g. for top-10 rankings), or the associated IDs that can be mapped back to the collections in the regular Open Web Index.

The numeric, internal document identifiers are unique within each shard of the index. This means it is not enough to fetch the `docid` column only: the `collection` and `language` columns are also part of the key that uniquely describes a document in the remote index.

We start off by creating another `VIEW`, similar to the one for the index itself.

```sql
CREATE VIEW documents AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/documents/*/*.parquet'
);
```

Then, imagine we ran a conjunctive ranking query to find the top 10 documents that reference "Open Web Search":

```sql
SELECT collection, language, docid, sum(log(tf)) AS score
FROM remote_index
WHERE representation = 'main_content'
    AND language = 'eng'
    AND term IN ('open', 'web', 'search')
GROUP BY ALL
    HAVING count(*) = 3
ORDER BY score DESC
LIMIT 10;
```

The query returns the following results:

|              collection              | language |  docid  |       score       |
|--------------------------------------|----------|--------:|------------------:|
| d118aa3e-d5d3-11f0-86db-f6a03915313d | eng      | 6794391 | 6.646564081665443 |
| d8690524-f51b-11f0-89ba-02a47ca5d9fd | eng      | 366670  | 6.646564081665443 |
| 311ef39c-fbfa-11f0-ae4a-02a47ca5d9fd | eng      | 982722  | 6.646564081665443 |
| 311ef39c-fbfa-11f0-ae4a-02a47ca5d9fd | eng      | 4967644 | 5.341410781178393 |
| cac6c1c8-d990-11f0-a6f8-f6a03915313d | eng      | 1888202 | 5.328018560809645 |
| 6b4eceb8-f6d0-11f0-b007-02a47ca5d9fd | eng      | 4235929 | 5.139381565823188 |
| 311ef39c-fbfa-11f0-ae4a-02a47ca5d9fd | eng      | 683210  | 5.139381565823188 |
| 0dd75a3a-f9e0-11f0-8bf1-02a47ca5d9fd | eng      | 354236  | 4.999726308254446 |
| f680cb14-ee31-11f0-ae4a-02a47ca5d9fd | eng      | 1956213 | 4.999726308254446 |
| b7786952-e190-11f0-85af-8ebf6bb2cab9 | eng      | 1083842 | 4.999726308254446 |

We can (trivially) retrieve the document metadata for the top document:

```sql
SELECT docid, id, url
FROM documents
WHERE collection = 'd118aa3e-d5d3-11f0-86db-f6a03915313d'
    AND language = 'eng'
    AND docid = 6794391;
```

This gives the following result:

|  docid  |                                id                                |                               url                                |
|--------:|------------------------------------------------------------------|------------------------------------------------------------------|
| 6794391 | f1cbb115ae323d89661da6652888a76ad96f67986b74248d42cd981579f969ab | https://notes.beauhilton.com/tags/web/                           |

The complete query for ranking (using the inverted index) and document lookup (in the forward index) can be expressed as a `JOIN` query:

```sql
WITH results AS (
    -- Insert ranking query here
    SELECT ...
)
SELECT docid, id, url, score
FROM results
NATURAL JOIN documents
ORDER BY score DESC;
```

For the same search request as above, this gives the following results:

|  docid  |                                id                                |                               url                                |       score       |
|--------:|------------------------------------------------------------------|------------------------------------------------------------------|------------------:|
| 366670  | 10a8c74a0af6dacef55aea3db93ce39fa540833a3e39f83f70cf3eefd57b876e | https://1day.dev/tags/Web/                                       | 6.646564081665443 |
| 982722  | 10a8c74a0af6dacef55aea3db93ce39fa540833a3e39f83f70cf3eefd57b876e | https://1day.dev/tags/Web/                                       | 6.646564081665443 |
| 6794391 | f1cbb115ae323d89661da6652888a76ad96f67986b74248d42cd981579f969ab | https://notes.beauhilton.com/tags/web/                           | 6.646564081665443 |
| 4967644 | 4f6cf7cb523103f09cc379731032e9fbc43a7766f1080a6bbd7d0590aa664692 | https://search.n2sm.co.jp/                                       | 5.341410781178393 |
| 1888202 | d2da2f62736ed38d317e9e0c9d24466f9fa0a76b122180e777d61abb878a55e5 | https://www.searchenginefinder.com/Computer_and_WWW/Open_Source/ | 5.328018560809645 |
| 683210  | 0aa2aec48f12aae64ff0b5172d979a0bee098600233c5f6b5e811d23e34fbf5d | https://www.auwebdir.com/aulink/about.aspx                       | 5.139381565823188 |
| 4235929 | d42036b83a27a753ad199bcd18d962d187d26bbc6d7b269be985f1d81c343bfe | https://freewebsearch.org/en/                                    | 5.139381565823188 |
| 202836  | 0f7fcb462b82cbb7eb8c1dec939ef1f07c8983719efef5585fcfa72142d8b728 | https://www.bamid.gov.tw/en/Default.aspx                         | 4.999726308254446 |
| 1210356 | 599ee60ce0c3c532282f13ecb64b7129e15e323d78b2ee4c75518fed00c12e51 | https://www.bamid.gov.tw/en/                                     | 4.999726308254446 |
| 1083842 | 599ee60ce0c3c532282f13ecb64b7129e15e323d78b2ee4c75518fed00c12e51 | https://www.bamid.gov.tw/en/                                     | 4.999726308254446 |

Note that some URLs appear multiple times, with different `docid`s; this happens when a page is crawled multiple times over the span of multiple days.

## Data Lake representation

While the remote index allows access to large amounts of crawled data, costs like the basic S3 file glob overhead remain annoying, especially when the size of the index grows or users want to access the index more than once in sequence. Whenever the remote index is used with the views defined earlier, resolving the glob in the S3 file path triggers rather slow listing operations. Also, DuckDB has to assume that the collection of Parquet files on S3 could change dynamically, so every query implies a HEAD request to check for changes.

Resolving these issues is possible by registering the remote index as part of a data lake metadata catalog (like [Iceberg](https://iceberg.apache.org/), [Delta Lake](https://delta.io/) or [DuckLake](https://ducklake.select)). This simplifies queries significantly, as all files (and important metadata) are registered in a catalog, and can be assumed to be static. Whenever we issue a query to the remote index, the data lake catalog helps select the files that need to be processed, based on the query filters, and can immediately start processing the query (rather than waiting for listing operations and HEAD requests). Apart from these query processing advantages, using a data lake catalog would provide additional advantages as well, for example to handle deletions following from takedown requests, and modeling the curation process for specialised Web collections.

### DuckLake example

_DuckLake is in active development, so integration with the remote index is experimental and certain features may not work as expected._

Let's explore these advantages by creating a local DuckLake catalog, based on the remote index. 

The huge amounts of data for the OWI remain hosted remotely, in S3 object storage, but the metadata is copied to your local machine and will be queried much more efficiently. To create a DuckLake, use the following statements:

```sql
ATTACH 'ducklake:ows_remote_index.ducklake' AS ows_remote_index;
USE ows_remote_index;
```

Import the metadata of the available OWI index files that make up the inverted index:

```sql
-- Create the remote_index table with the right schema
CREATE TABLE remote_index AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/index/*/*/*.parquet'
) WITH NO DATA;

-- Set the partition columns
ALTER TABLE remote_index SET PARTITIONED BY (
    year, month, day, collection, representation, language
);

-- Import the metadata from the existing files
CALL ducklake_add_data_files(
    'ows_remote_index',
    'remote_index',
    's3://2006391-owi-remote-index/*/*/*/*/index/*/*/*.parquet'
);
```

These commands fetch the metadata of the Parquet files in S3, and register them in DuckLake. Because the OWI comprises a large number of files, the final operation may take a while to complete.

Next, we proceed with analogous queries to download the metadata for the document records that constitute the forward index:

```sql
CREATE TABLE documents AS (
    FROM 's3://2006391-owi-remote-index/*/*/*/*/documents/*/*.parquet'
) WITH NO DATA;

ALTER TABLE documents SET PARTITIONED BY (
    year, month, day, collection, language
);

CALL ducklake_add_data_files(
    'ows_remote_index',
    'documents',
    's3://2006391-owi-remote-index/*/*/*/*/documents/*/*.parquet'
);
```

We can now use the `remote_index` and `documents` table in the exact same way as the `VIEW`s we used earlier (but with reduced S3 overhead). Don't forget to call `ducklake_add_data_files` for the new index partitions as they appear daily!
