# Tutorial 1. Data Download with owilix

The project provides the [`owilix`](https://opencode.it4i.eu/openwebsearcheu-public/owi-cli) command line tool for accessing OWI data.
`owilix` is best understood as `git` for the OWI: it is used to pull and push data.
This tutorial demonstrates how `owilix` can be used to access and use the Open Web Index.

## OWI Datasets

The Open Web Index is provided as a set of datasets, published on a daily basis per data center.

The following figure depicts the basic components of a dataset, which is essentially a hierarchical partitioning by date and language.

![](./figures/owi-dataset.png)

- `parquet` files contain the metadata obtained from [resiliparse/resilipipe](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline) as well as the plain text.
- `ciff` files contain the inverted index created via the [OWI Indexer](https://opencode.it4i.eu/openwebsearcheu-public/spark-indexer).

`owilix` synchronises remote and local datasets, similar to what `git` does for source code.

## Installing owilix

`owilix` depends on a large number of libraries, so we recommend installing
it into a separate Python environment managed by [pyenv](https://github.com/pyenv/pyenv)
or [conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html).
The CLI also requires Python 3.11 and either Linux or macOS (on Windows, WSL might work but has not been tested).

Usually running the following lines should give you a clean environment (when using conda):

```bash
# Create a new environment
conda create -n owi pip python=3.11
conda activate owi
# Install required packages
pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
# Verify installation
owilix --help
```

Note that you always need to activate the `owi` environment before using `owilix` (i.e. `conda activate owi`).

## First check if everything works

`owilix` supports simple listing of datasets. After installation, we recommend running a simple `owilix remote ls all` command to see whether everything is working.

In case of problems, you can use
- `owilix logs errors` to show the error logs,
- `owilix clean` to clean up potentially problematic files, and
- `owilix remote doctor` to check connection details.

Columns to be displayed and the sort order can be configured via profiles and CLI parameters, but that is beyond this tutorial.


### Looking at public datasets in the LEXIS Portal

To check whether you have access to the public datasets in the LEXIS Portal (and thus rule out any permission problems),
navigate to the `DataSets/Public` menu in the [LEXIS Portal](https://portal.beta.lexis.tech), where you should see the following list of datasets:

![](figures/lexis-public-datasets.png)

Under the menu item `Projects` you can request access to the "openwebsearch" project so that you can also access
`warc` datasets. However, this requires that you first get in touch with us via [the community platform](https://openwebsearch.eu/community/)
or via [email](mailto:join@openwebsearch.eu).

## Listing Datasets and Specifiers

### Specifying Sets of Datasets

`owilix` uses so-called dataset specifiers to specify sets of datasets. For example, `all` specifies all available
datasets, while `all:latest` gives you the latest dataset available in all data centers.

In general, specifiers have the following form: `{datacenter|all}[:{YYYY-MM-DD|latest}][#{duration}]\{key1=value1;key2=value2}` where
- `datacenter` is the data center abbreviation (currently `it4i`, `csc` or `lrz`), or `all` to select all data centers.
- `duration` is the number of days to go backward from the provided day.
- `key=value` pairs act as a filter for selecting datasets, i.e. all `key=value` pairs need to match for a dataset to be selected.
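
To make the structure of a specifier more concrete, the following minimal Python sketch splits a specifier into its data center, date and duration parts. It is only an illustration of the form described above and not the parser used by `owilix`; the names `SPECIFIER`, `Specifier` and `parse_specifier` are hypothetical, and the trailing `key=value` filter part is deliberately left out.

```python
import re
from typing import NamedTuple, Optional

# Illustrative pattern for {datacenter|all}[:{YYYY-MM-DD|latest}][#{duration}].
# Assumption: data center names consist of lowercase letters and digits.
SPECIFIER = re.compile(
    r"^(?P<datacenter>[a-z0-9]+)"
    r"(?::(?P<date>\d{4}-\d{2}-\d{2}|latest))?"
    r"(?:#(?P<duration>\d+))?$"
)


class Specifier(NamedTuple):
    datacenter: str
    date: Optional[str]
    duration: Optional[int]


def parse_specifier(text: str) -> Specifier:
    """Split a specifier string into its parts or raise ValueError."""
    match = SPECIFIER.match(text)
    if match is None:
        raise ValueError(f"not a valid specifier: {text!r}")
    duration = match.group("duration")
    return Specifier(
        datacenter=match.group("datacenter"),
        date=match.group("date"),
        duration=int(duration) if duration is not None else None,
    )


print(parse_specifier("all:latest#6"))
```

Parts that are omitted from the specifier, such as the date in `all`, are returned as `None`.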


### Listing Datasets


First we need to identify suitable datasets.

Let's inspect the latest datasets and those up to 6 days back:

```sh
owilix remote ls all:latest#6
```

which can give us something like:

![](figures/ls_latest_6.png)

In this example, the listing shows approximately 600 GB of data in two different data centers.

To be more specific, we now pick only the latest dataset from data center `it4i`, but also list all files available for German (language code `deu`):

```sh
owilix remote ls it4i:latest files=**/language=deu/*
```

![](figures/owilix-ls-files.png)


## Pulling datasets

Downloading datasets is easy: instead of the `ls` command, you use the `pull` command.

```sh
owilix remote pull it4i:latest files=**/language=deu/*
```

![](figures/owilix-pull-deu.png)

Note that pulling is file-based and synchronises with already existing files. So if we pull again, but now download all files, the
`deu` files will not be transferred again (unless you specify overwrite).
This also means that an interrupted download can be resumed.

![](figures/owilix-pull-all.png)
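
The idea behind file-based synchronisation can be sketched in a few lines of Python. This is not the `owilix` implementation: the function `files_to_transfer` is hypothetical, and the sketch assumes that a file counts as already present when a local file with the same relative path and the same size exists. How `owilix` actually compares files is not covered in this tutorial.

```python
from pathlib import Path
from typing import Dict, List


def files_to_transfer(
    remote_sizes: Dict[str, int], local_root: Path, overwrite: bool = False
) -> List[str]:
    """Return the relative paths that still need to be downloaded.

    remote_sizes maps a relative file path to its size in bytes.
    Assumption: equal path and equal size means the file is up to date.
    """
    pending = []
    for rel_path, size in sorted(remote_sizes.items()):
        local_file = local_root / rel_path
        up_to_date = local_file.is_file() and local_file.stat().st_size == size
        if overwrite or not up_to_date:
            pending.append(rel_path)
    return pending
```

Under this assumption, a partially written file has a different size than its remote counterpart and is fetched again, while complete files are skipped. This is what makes an interrupted transfer resumable.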

You can use `owilix remote help pull` to see which other options are available.

## Consuming datasets

Consuming datasets depends on your purpose and is not directly part of `owilix`. 

You can either use the index and parquet files via [our MOSAIC search engine](https://opencode.it4i.eu/openwebsearcheu-public/mosaic)
or directly access the files in the download directory (note that the default is `~/.owi/public/main/`).
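
As a starting point for working with the downloaded files directly, the following Python sketch counts the files per language partition in the download directory. It assumes that languages are stored in directories named `language=<code>`, as suggested by the file pattern `**/language=deu/*` used earlier; the function name `count_files_per_language` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_files_per_language(root: Path) -> Counter:
    """Count files below root, grouped by their language=<code> directory."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Use the first path component of the form language=<code>.
        for part in path.relative_to(root).parts:
            if part.startswith("language="):
                counts[part.split("=", 1)[1]] += 1
                break
    return counts


# Default download directory mentioned in this tutorial.
download_dir = Path.home() / ".owi" / "public" / "main"
if download_dir.is_dir():
    for language, n in sorted(count_files_per_language(download_dir).items()):
        print(f"{language}: {n} files")
```

Files that are not located below a `language=<code>` directory are ignored by this sketch.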

### Querying data sets

One possibility for accessing the data via `owilix` is by querying the data using `owilix query`.

```sh 
owilix query less --local all:latest select=url,title,domain_label "where=url_suffix='at'"
```

![](figures/owilix-query-less.png)


This lets you browse the results similar to the Linux command line tool `less`, though less fancy.
Here we used a more specific subset, namely all `.at` domains, and display only `url`, `title` and `domain_label`.
Feel free to play around and explore the index.

Note that for queries we need to specify which repositories the specifier applies to, i.e. local or remote (`--local` in the example above). This means queries can also run directly over remote datasets.

Under the hood, `owilix` uses [DuckDB](https://duckdb.org/) to execute SQL over parquet files, which can be either local or remote.
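
To give an impression of what such a query might look like, the following Python sketch builds a DuckDB SQL statement that reads parquet files directly via DuckDB's `read_parquet` table function. How `owilix` constructs its queries internally is not covered here, so this is only an illustration: the function `build_parquet_query` is hypothetical, and the path is a placeholder that you would need to replace with the location of your downloaded files.

```python
from typing import Optional, Sequence


def build_parquet_query(
    parquet_glob: str, columns: Sequence[str], where: Optional[str] = None
) -> str:
    """Build a DuckDB SQL statement that selects columns from parquet files."""
    # Escape single quotes so the path is a valid SQL string literal.
    escaped = parquet_glob.replace("'", "''")
    sql = f"SELECT {', '.join(columns)} FROM read_parquet('{escaped}')"
    if where:
        sql += f" WHERE {where}"
    return sql


query = build_parquet_query(
    "/path/to/dataset/**/*.parquet",  # placeholder path, adjust to your setup
    ["url", "title", "domain_label"],
    "url_suffix = 'at'",
)
print(query)

# With the duckdb Python package installed, the statement could be run like this:
#   import duckdb
#   duckdb.sql(query).show()
```

The column names and the filter mirror the `owilix query less` example above.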

