(dnt:data-and-datasets)=
# Data and Data Sets

The Open Web Index (TOWI) will be published as daily snapshots per data center containing:
- The index in CIFF Format
- Metadata fields accompanying the index in the `parquet` format coming out of our preprocessing pipeline
- Auxiliary files for future developments (e.g. additional metadata, dense vector embeddings)

## Naming Conventions and Metadata

The datasets are hosted on the LEXIS Platform with a rich set of metadata fields that follow the naming convention below (to ease search):

- Title: `TOWI-The Open Web Index-<year>-<month>-<day>@<datacenter>-<language>`
- Creator/Contributor/Publisher/Owner
  - Datasets published within the OpenWebSearch.eu project are published through a joint effort of all partners.
  - We thus refer to all partners of the consortium as [`The OpenWebSearch.eu Consortium`](https://openwebsearch.eu/partners/).
- Rights / License: the datasets are licensed under the {{OWIL}}. Please note that the license will be continuously updated.
- Resource Type: `Dataset` with a special sub type `Open Web Index V1.0`
- Filename: `TOWI-<year>-<month>-<day>-<datacenter>-<language>-<form>.parquet`

Placeholders:

- `<year>`: the year of the snapshot
- `<month>`: the month of the snapshot
- `<day>`: the day of the snapshot
- `<datacenter>`: the data center where the snapshot was taken:
  - `lrz`: Leibniz Supercomputing Centre
  - `it4i`: IT4Innovations
  - `csc`: CSC
- `<language>`: the language of the snapshot:
  - `all` if all available languages are included
  - otherwise the language code in the three-character ISO format
- `<form>`: the form of the file, i.e. either `single` for a single file or `multi` for multiple files
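
The naming convention above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the helper names `towi_filename`, `towi_title` and `parse_towi_filename` are hypothetical, and the zero-padding of `<month>` and `<day>` to two digits is an assumption that the convention does not state explicitly.

```python
import re
from datetime import date

# Data center codes and file forms listed in the naming convention.
DATACENTERS = ("lrz", "it4i", "csc")
FORMS = ("single", "multi")

# Assumption: names are built with zero-padded month and day; the parser
# accepts one or two digits so that unpadded names are still recognised.
_FILENAME_RE = re.compile(
    r"^TOWI-(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})"
    r"-(?P<datacenter>[a-z0-9]+)-(?P<language>[a-z]{3})"
    r"-(?P<form>single|multi)\.parquet$"
)


def towi_filename(snapshot, datacenter, language="all", form="single"):
    """Build a file name following the documented pattern (hypothetical helper)."""
    if datacenter not in DATACENTERS:
        raise ValueError(f"unknown data center: {datacenter!r}")
    if form not in FORMS:
        raise ValueError(f"unknown form: {form!r}")
    return f"TOWI-{snapshot:%Y-%m-%d}-{datacenter}-{language}-{form}.parquet"


def towi_title(snapshot, datacenter, language="all"):
    """Build the dataset title used for searching (hypothetical helper)."""
    return f"TOWI-The Open Web Index-{snapshot:%Y-%m-%d}@{datacenter}-{language}"


def parse_towi_filename(name):
    """Split a file name into its placeholders; raise ValueError if it does not match."""
    match = _FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a TOWI file name: {name!r}")
    parts = match.groupdict()
    return {
        "snapshot": date(int(parts["year"]), int(parts["month"]), int(parts["day"])),
        "datacenter": parts["datacenter"],
        "language": parts["language"],
        "form": parts["form"],
    }


if __name__ == "__main__":
    print(towi_filename(date(2024, 3, 10), "lrz"))
    # prints: TOWI-2024-03-10-lrz-all-single.parquet
    print(towi_title(date(2024, 3, 10), "lrz"))
    # prints: TOWI-The Open Web Index-2024-03-10@lrz-all
```

Such helpers can be handy for generating search terms for a range of snapshot dates or for sorting downloaded files by date, data center and language.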

The files are provided in ta

### Download Facilities

Datasets can be downloaded via {{LEXIS}}, either using the portal app or via {{PY4LEXIS}}. Both require a login via [B2ACCESS](https://b2access.eudat.eu/home/).

The {{LEXIS}} Platform supports downloading full datasets or single files.

#### Download via the LEXIS Platform

1. Log in to {{LEXIS}}. If you don't have an account, you can create one via [B2ACCESS](https://b2access.eudat.eu/home/). B2ACCESS supports different identity providers (mostly from Europe and academia).
2. Navigate to `Data Sets --> Public` and select the dataset (e.g. search for `TOWI`, sort etc.).
3. Click `Details` (blue button) to get more information about the dataset.
4. Click `Download` to download the full dataset, or go to `File List`, select a file and right-click to select `Download`.
5. The download will be **prepared**. To download it, you need to wait until the download button at the top (`arrow-down`) shows the dataset.
6. You can watch the status of the operation by going to `Dashboard-->Data Operations-->Downloads`. Once preparation is finished, you can also download the dataset from the list there.


#### Download via {{PY4LEXIS}}

{{PY4LEXIS}} is a Python client for the LEXIS Platform. It allows you to download data from the LEXIS Platform directly from Python.

We will soon release a library and documentation on how to use py4lexis.

### Offload Facilities

Via the {{OWLERDASHBOAR}}, we also support more fine-grained filtering and the offloading of data, i.e. data will be pushed from our data centers to your data storage systems.

We currently support OpenSearch, Elasticsearch and S3 as target storage systems.


### Changes

- 2024-03-10: release of first data example

<!-- TODO: include links to download/ preprocessing, but use definition list /glossary to keep it consistent-->