# OWI Access

```{admonition} TL;DR
- The OWI is delivered as daily index shards
- OWI shards are stored in a federated iRODS infrastructure via the LEXIS platform
  - Metadata per dataset allow better findability and selection of datasets
- OWI shards are structured in a date/language partitioned folder structure, with 
  - metadata in parquet files and the
  - index in CIFF files (common index file format)
- Several access options exist: direct download via LEXIS, scripting via py4lexis, or the OWI CLI tool `owilix`
```

As of September 2024, the OWI technology stack has been successfully deployed across multiple data centers, including BADW-LRZ, IT4I@VSB, and CSC.
The pipeline processes daily data to generate index shards — segments of the complete index created from daily processed data.
These index shards are crucial outputs published as [LEXIS](https://portal.lexis.tech/) Datasets on a federated [iRODS](https://irods.org/) infrastructure, which unifies data access across centers. 
To streamline operations, the preprocessing and indexing components of the workflow run within a single HPC workflow on the LEXIS platform. 

## List of Daily Index Shards and Dashboard

You can find the list of daily index shards either in the [LEXIS Portal](https://portal.lexis.tech/) after logging in or at the [OpenWebIndex.eu Dashboard](https://openwebindex.eu/owler/our_datasets). 

![](./figures/dashboard-datasets.png)

The Dashboard offers [further statistics](https://openwebindex.eu/owler/our_data), such as the amount of crawled data and the number of published datasets.

![](./figures/dashboard-statistics.png)




## Structure of Daily Index Shards

The final product of the full pipeline described above, the Open Web Index, consists of all public datasets produced by the federated data infrastructure and published through LEXIS.
The Open Web Index is provided as daily shards per data center in the form of so-called datasets.
We aim to have a maximum delay of 1 day from crawling the data to preprocessing / indexing.

A dataset follows a particular folder structure, as depicted in the image below, and has associated metadata containing elements of the DataCite vocabulary and application-specific metadata, particularly start date, end date and collection name. The `changelog.json` file records changes to an index partition, such as items removed due to take-down requests.
Each dataset contains CIFF and Parquet files, partitioned by language.
Parquet files contain the metadata, while CIFF files contain the usable index, currently an inverted index.

![](figures/owi-structure.png)

Access control to LEXIS (and thus, the OWI) is arranged through [EUDAT / B2ACCESS](https://b2access.eudat.eu/).
To use the Open Web Index, downstream search engines would download the CIFF files they are interested in (e.g. with a specific language or from a specific date range), and import them into a search engine of their choice.
Alongside the CIFF files, the metadata Parquet files can be downloaded for additional use in a search engine. 
For instance, the cleaned text can be used for snippet extraction, and the metadata fields can be used to enrich or filter the search results obtained by a full-text search on the index. 

![](figures/parquet-example.png)
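
To illustrate the snippet-extraction use case mentioned above, here is a minimal sketch that cuts a short window out of the `plain_text` column around the first matching query term. The function name `extract_snippet` and the `width` parameter are hypothetical and not part of the OWI tooling; real search engines typically use more elaborate highlighting.

```python
def extract_snippet(plain_text: str, query: str, width: int = 80) -> str:
    """Return up to `width` characters of context around the first query term found.

    Illustrative only: matching is a plain case-insensitive substring search,
    and offsets assume lower-casing does not change the text length.
    """
    lowered = plain_text.lower()
    for term in query.lower().split():
        pos = lowered.find(term)
        if pos != -1:
            # Centre the window roughly on the match.
            start = max(0, pos - width // 2)
            end = min(len(plain_text), start + width)
            return plain_text[start:end].strip()
    # No term matched: fall back to the beginning of the text.
    return plain_text[:width].strip()


text = "The Open Web Index is provided as daily shards per data center."
snippet = extract_snippet(text, "daily", width=20)
```

Here `text` stands in for one value of the `plain_text` column of a metadata Parquet file.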

### Folder Structures

Daily shards are stored on a remote iRODS server, which can be thought of as a remote filesystem.

#### Folder Structure for Daily Shards

As described above, we create a daily dataset for the OWI and for every configured OWI-CI.

```bash
/ZONE/public<projectid>/<uuid>               # zone and project id
    year={YYYY}                              # year of the slice
        month={MM}                           # month of the slice
            day={DD}                         # day of the slice
                language={LANG}              # language partitions
                    index.ciff.gz            # ciff file containing the index
                    metadata-{num?}.parquet  # parquet file containing the metadata of the index 
```
Zones usually refer to the data center the data is stored in. Currently we support two zones:

- IT4ILexisV2 at the IT4I data center
- OWSLRZZONE at the LRZ data center

Language is a three-letter language code according to ISO 639-3.
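
The partition layout above can be assembled programmatically. The following is a minimal sketch, assuming that month and day are zero-padded to two digits (as `{MM}` and `{DD}` suggest) and that the dataset root (`/ZONE/public<projectid>/<uuid>`) is already known; the helper name `shard_partition_path` is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def shard_partition_path(dataset_root: str, day: date, language: str) -> PurePosixPath:
    """Build the folder of one language partition inside a daily shard.

    Assumption: month and day are zero-padded to two digits.
    """
    return (
        PurePosixPath(dataset_root)
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
        / f"language={language}"
    )


# The root is kept as a placeholder; substitute the zone, project id and UUID
# of an actual dataset.
partition = shard_partition_path("/ZONE/public<projectid>/<uuid>", date(2024, 9, 5), "eng")
index_file = partition / "index.ciff.gz"
```

iRODS paths use forward slashes, which is why the sketch relies on `PurePosixPath` rather than the platform-dependent `Path`.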


#### Folder Structure for WARC Data

WARC data is stored in a folder structure similar to that of index shards.

```bash
/ZONE/proj<projectid>/<uuid>                 # zone and project id
    year={YYYY}                              # year of the slice
        month={MM}                           # month of the slice
            day={DD}                         # day of the slice
               crawler={_crawler_name_}/     # Name of the crawler 
                  HHmmss-{int}.warc.gz       # single WARC file; naming scheme subject to change
```
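
Because both layouts use `key=value` folder names, the partition values can be recovered from any path with a few lines of code. The sketch below is an illustration under that assumption; `parse_partitions` is a hypothetical helper and the crawler name in the example is made up.

```python
from pathlib import PurePosixPath


def parse_partitions(path: str) -> dict:
    """Collect all `key=value` path components into a dictionary."""
    partitions = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        # Only components of the form key=value are partition folders.
        if sep and key and value:
            partitions[key] = value
    return partitions


# Hypothetical example path; zone, project id and UUID are left as placeholders.
warc_path = "/ZONE/proj<projectid>/<uuid>/year=2024/month=09/day=05/crawler=owler/120000-1.warc.gz"
parts = parse_partitions(warc_path)
```

The same function works for index shards, where it would return a `language` key instead of `crawler`.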

#### Available Data Fields

Data and metadata for individual web pages are available in the `.parquet` files in a row/column structure.

Parquet files are created during preprocessing via the [resilipipe](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline) and contain the following fields:

<details>
  <summary><b>Schema Version 0.1.0</b></summary>

| Column                  | Description                                                                                                                        | Pyspark Datatype                      |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|
| id                      | Unique ID based on hash of the URL and crawling time                                                                               | `StringType()`                        |
| record_id               | UUID of the WARC record                                                                                                            | `StringType()`                        |
| title                   | Title from the HTML                                                                                                                | `StringType()`                        |
| plain_text              | Cleaned text from the HTML                                                                                                         | `StringType()`                        |
| json-ld                 | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents)                                        | `StringType()`                        |
| microdata               | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json)                                                               | `StringType()`                        |
| warc_date               | Date from the WARC header                                                                                                          | `StringType()`                        |
| warc_ip                 | IP Address from the WARC header                                                                                                    | `StringType()`                        |
| url                     | Full URL                                                                                                                           | `StringType()`                        |
| url_scheme              | URL scheme specifier                                                                                                               | `StringType()`                        |
| url_path                | Hierarchical path after TLD                                                                                                        | `StringType()`                        |
| url_params              | Parameters for last path element                                                                                                   | `StringType()`                        |
| url_query               | Query component                                                                                                                    | `StringType()`                        |
| url_fragment            | Fragment identifier                                                                                                                | `StringType()`                        |
| url_subdomain           | Subdomain of the network location                                                                                                  | `StringType()`                        |
| url_domain              | Domain of the network location                                                                                                     | `StringType()`                        |
| url_suffix              | Suffix according to the [Public Suffix List](https://publicsuffix.org/)                                                            | `StringType()`                        |
| url_is_private          | If the URL has a private suffix                                                                                                    | `BooleanType()`                       |
| mime_type               | MIME-Type from the HTTP Header                                                                                                     | `StringType()`                        |
| charset                 | charset from the HTTP Header                                                                                                       | `StringType()`                        |
| content_type_other      | List of key, value pairs from the content type that could not be parsed into MIME-type or charset                                  | `MapType(StringType(), StringType())` |
| http_server             | Server from the HTTP Header                                                                                                        | `StringType()`                        |
| language                | Language as identified by [language.py](preprocessing/parse/language.py); Code according to ISO-639 Part 3                         | `StringType()`                        |
| valid                   | `True`: The record is valid; `False`: The record is no longer valid and should not be processed.                                   | `BooleanType()`                       |
| warc_file               | Name of the original WARC-file that contained record                                                                               | `StringType()`                        |
| ows_canonical           | The canonical link if it exists                                                                                                    | `StringType()`                        |
| ows_resource_type       | Crawl from which the WARC-file originated; Files crawled by the University of Passau are labeled with "Owler"                      | `StringType()`                        |
| ows_curlielabel         | One of the 15 Curlie top level labels                                                                                              | `StringType()`                        |
| ows_index               | `True`: The content is allowed to be used for the purposes of web indexing/web search; `False`: The content cannot be used         | `BooleanType()`                       |
| ows_genai               | `True`: The content is allowed to be used for the purposes of developing Generative AI models; `False`: The content cannot be used | `BooleanType()`                       |
| ows_genai_details       | If `ows_genai=False`, this provides additional context                                                                             | `StringType()`                        |
| ows_fetch_response_time | Fetch time in ms                                                                                                                   | `IntegerType()`                       |
| ows_fetch_num_errors    | Number of errors while fetching (Timeout is the most prominent fetch error)                                                        | `StringType()`                        |
| schema_metadata         | List of key, value pairs that contain global settings like the `schema_version`                                                    | `MapType(StringType(), StringType())` |

</details>
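
As an illustration of how the flag columns in this schema might be used, the sketch below keeps only rows that are still valid and cleared for web indexing, optionally restricted to one language. It operates on plain dictionaries (one per Parquet row) so that it stays independent of any particular Parquet reader; the function name `usable_for_search` and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]


def usable_for_search(rows: Iterable[Row], language: Optional[str] = None) -> Iterator[Row]:
    """Yield rows with `valid` and `ows_index` set, optionally for one language."""
    for row in rows:
        # Skip records that were invalidated or may not be used for indexing.
        if not row.get("valid") or not row.get("ows_index"):
            continue
        if language is not None and row.get("language") != language:
            continue
        yield row


# Made-up sample rows that carry only the columns needed here.
sample_rows = [
    {"url": "https://example.org/a", "valid": True, "ows_index": True, "language": "eng"},
    {"url": "https://example.org/b", "valid": False, "ows_index": True, "language": "eng"},
    {"url": "https://example.org/c", "valid": True, "ows_index": False, "language": "deu"},
    {"url": "https://example.org/d", "valid": True, "ows_index": True, "language": "deu"},
]
english = [row["url"] for row in usable_for_search(sample_rows, language="eng")]
```

In practice the rows could come from a Parquet reader, for example `pyarrow.parquet.read_table(...).to_pylist()`; for large shards, a column-wise filter in the reader itself is likely to be more efficient than iterating over dictionaries.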

Additional columns can be added by providing modules as outlined in the respective [README](resilipipe/resilipipe/parse/README.md). 
<details>
<summary>Columns added by optional modules (for example, outgoing links detection)</summary>

| Column             | Description                                                                                                                                                                    | Pyspark Datatype                                                                             |
|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| outgoing_links     | List of all hyperlinks in the HTML that start with 'http'                                                                                                                      | `ArrayType(StringType())`                                                                    |
| image_links        | List of all links to images in the HTML that start with 'http'                                                                                                                 | See `get_spark_schema` in [links.py](resilipipe/resilipipe/parse/modules/links.py)           |
| video_links        | List of all links to videos in the HTML that start with 'http' or iframes with a video                                                                                         | See `get_spark_schema` in [links.py](resilipipe/resilipipe/parse/modules/links.py)           |
| iframes            | List of tuples for nodes that contain an iframe (and are not a video)                                                                                                          | See `get_spark_schema` in [links.py](resilipipe/resilipipe/parse/modules/links.py)           |
| curlielabels       | List of language specific domain labels according to [Curlie.org](https://curlie.org/)                                                                                         | `ArrayType(StringType())`                                                                    |
| curlielabels_en    | List of English domain labels according to [Curlie.org](https://curlie.org/). Mapping by [Lugeon, Sylvain; Piccardi, Tiziano](https://doi.org/10.6084/m9.figshare.19406693.v5) | `ArrayType(StringType())`                                                                    |
| address            | List of dictionaries containing extracted location and coordinates                                                                                                             | See `get_spark_schema` in [geoparsing.py](resilipipe/resilipipe/parse/modules/geoparsing.py) | 
| collection_indices | List of collection indices that a record belongs to. Collections are defined via `yaml` files on the S3 instance                                                               | `ArrayType(StringType())`                                                                    |
</details>

````{admonition} Contribute

- Resilipipe is modular and you can add your own modules. Be aware, however, that they need to scale properly!
- We are currently working on geo-coding {cite}`farzana2024towards`, trigger warnings {cite}`wiegmann2023trigger` and genre detection {cite}`stein2008retrieval` as additional metadata

````

### Metadata Used for Describing Datasets

iRODS allows metadata to be attached to every folder; this metadata is then used to serve datasets via {{ LEXIS }} and to make it easier to find suitable datasets.

We follow basic [Dublin Core Metadata](https://de.wikipedia.org/wiki/Dublin_Core) and extend it with application-specific metadata.

<details>
<summary>Metadata Fields Version 0.2.0 (currently in use)</summary>


| Attribute        | Value                                                                                               | Type         |
|------------------|-----------------------------------------------------------------------------------------------------|--------------|
| creator          | OpenWebSearch.eu Consortium                                                                         | `str`        |
| contributor      | A1 Slovenija                                                                                        | `list[str]`  |
| contributor      | Webis Group                                                                                         | `list[str]`  |
| contributor      | CERN - The European Organization for Nuclear Research                                               | `list[str]`  |
| contributor      | CSC - IT Center for Science Ltd                                                                     | `list[str]`  |
| contributor      | German Aerospace Center (DLR)                                                                       | `list[str]`  |
| contributor      | Graz University of Technology                                                                       | `list[str]`  |
| contributor      | Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities                    | `list[str]`  |
| contributor      | Open Search Foundation                                                                              | `list[str]`  |
| contributor      | Stichting Radboud University                                                                        | `list[str]`  |
| contributor      | University of Passau                                                                                | `list[str]`  |
| contributor      | VSB - TECHNICAL UNIVERSITY OF OSTRAVA                                                               | `list[str]`  |
| relatedSoftware  | [Resilipipe](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline/)               | `list[str]`  |
| relatedSoftware  | [OWI Indexer](https://opencode.it4i.eu/openwebsearcheu-public/spark-indexer/)                       | `list[str]`  |
| alternateIdentifier | abc                                                                                              | `list[str]`  |
| startDate        | "2023-01-01"                                                                                        | `str`        |
| endDate          | "2023-01-01"                                                                                        | `str`        |
| lastChanged      | "2023-01-01 00:00:00"                                                                               | `str`        |
| owner            | OpenWebSearch.eu Consortium                                                                         | `str`        |
| publicationYear  | 2024                                                                                                | `int`        |
| publisher        | OpenWebSearch.eu Consortium                                                                         | `str`        |
| resourceType     | "owi" (options: "owi", "warc", "owie", "owii", "owip", "unknown")                                  | `str`        |
| subResourceType  | "ciff+parquet"                                                                                      | `str`        |
| rights           | "Open Web Index License V1.0"                                                                       | `list[str]`  |
| rightsIdentifier | "OWIL V1.0"                                                                                         | `str`        |
| rightsURI        | https://ows.eu/owil/current                                                                         | `list[str]`  |
| publication      | https://doi.org/10.1002/asi.24818                                                                   | `str`        |
| provenance       | [owi://mgrani@import](owi://mgrani@import)                                                          | `list[str]`  |
| license          | https://ows.eu/owil/current                                                                         | `str`        |
| dataCenter       | "unknown" (options: "lrz", "it4i", "csc", "unknown")                                               | `str`        |
| collectionName   | "main"                                                                                              | `str`        |
| description      | "All OWI data indexed from {startDate} to (including) {endDate} at {dataCenter}"                    | `str`        |
| resourceTypeGeneral | "Dataset" (options: "Dataset")                                                                  | `str`        |
| encryption       | "no" (options: "no", "yes")                                                                        | `str`        |
| compression      | "no" (options: "no", "yes")                                                                        | `str`        |
| totalSize        | 0                                                                                                   | `int`        |
| fileCount        | 0                                                                                                   | `int`        |
| objectCount      | 0                                                                                                   | `int`        |
| title            | "OWI-Open Web Index-{collectionName}.{resourceType}@{dataCenter}-{startDate}:{endDate}"            | `str`        |
| metadataSource   | Software+Version                                                                                    | `str`        |

</details>

#### Details on Some Metadata Fields

**Title**

The title consists of the agreed index name `OWI-Open Web Index` and the collection name: either `main` or a named sub-collection, e.g. `curlie` or `legal`.
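
The title pattern listed in the metadata table uses the same placeholder syntax as Python's `str.format`, so a title can be derived directly from the other metadata fields. This is a minimal sketch; the helper `dataset_title` is hypothetical, and the sample values are taken from the defaults and options shown in the table.

```python
# Title pattern copied from the metadata table.
TITLE_TEMPLATE = "OWI-Open Web Index-{collectionName}.{resourceType}@{dataCenter}-{startDate}:{endDate}"


def dataset_title(metadata: dict) -> str:
    """Fill the title pattern from a metadata dictionary (extra keys are ignored)."""
    return TITLE_TEMPLATE.format(**metadata)


example_metadata = {
    "collectionName": "main",
    "resourceType": "owi",
    "dataCenter": "lrz",
    "startDate": "2023-01-01",
    "endDate": "2023-01-01",
}
title = dataset_title(example_metadata)
```

A missing field raises a `KeyError`, which makes incomplete metadata visible early.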

**Creator, Owner, Publisher**

Creator and contributor have been swapped: we should have a single creator, the consortium, and contributions by all partners. This is partly a user-interface consideration.

**Collection Name**

Index shards can be pre-filtered for a particular purpose; the resulting subsets are called collections. Currently we support

1. the `main` collection, forming the main index
2. the `legal` collection, containing all contact, GDPR and legal-notice pages found during our crawls.

Further collections can be added in the future. 

**resourceType**

Determines the kind of dataset, that is, which files it contains and which applications it supports:

- **owi**: contains parquet + ciff files (i.e. metadata plus index)
  - **owii**: contains only ciff files, but no metadata (not yet available)
  - **owip**: contains only parquet files with metadata 
- **owie**: contains vector-embeddings (not yet available)
- **warc**: contains raw crawl data in warc format

The statistics below count the number of URLs in the parquet files as well as the compressed file size and file count over time for every resource/collection pair.

**subResourceType**

Describes, at a more fine-grained level, which file types are available in the dataset:

- `ciff`: indicates that CIFF files are present
- `parquet`: indicates that Parquet files are available
- `emb`: indicates that embeddings are included
- `<algorithm>`: names the algorithm

Note that subResourceTypes can be combined, e.g. "ciff+parquet".
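
Since combined values are joined with `+`, a consumer can split the field to check which file types a dataset offers. The helper names below are hypothetical, and the sketch assumes `+` is the only separator in use.

```python
def split_sub_resource_type(value: str) -> list:
    """Split a combined subResourceType such as "ciff+parquet" into its parts."""
    return [part for part in value.split("+") if part]


def has_index_and_metadata(value: str) -> bool:
    """True if both CIFF index files and Parquet metadata files are announced."""
    parts = split_sub_resource_type(value)
    return "ciff" in parts and "parquet" in parts


components = split_sub_resource_type("ciff+parquet")
```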

**DataCenter**

Field indicating the data center the data set is stored in.

**Provenance**

Space-separated list of the sources (as URIs) the dataset has been derived from. Use `owi://uuid` to refer to OWI datasets.
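
A provenance value in this form can be taken apart with plain string operations. The following sketch extracts the identifiers of all OWI source datasets; the function name `owi_sources` and the sample value are hypothetical.

```python
OWI_PREFIX = "owi://"


def owi_sources(provenance: str) -> list:
    """Return the identifiers of all `owi://` entries in a provenance string."""
    return [
        uri[len(OWI_PREFIX):]
        for uri in provenance.split()  # entries are separated by whitespace
        if uri.startswith(OWI_PREFIX)
    ]


# Made-up provenance value with one OWI source and one external source.
sources = owi_sources("owi://1234-abcd https://example.org/source")
```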