# OWI Access via LEXIS and py4lexis

````{admonition} TL;DR
- The [LEXIS Portal](https://portal.lexis.tech) manages dataset shards (and workflows).
- LEXIS contains a central dataset catalog and manages the decentralised repositories holding the data.
- Preprocessed and index data is publicly available, while WARC data access is restricted (contact us).
- Access to all dataset shards (including public ones) requires authentication via EUDAT for legal reasons.
- The LEXIS Portal supports HTTP download (which is not recommended).
- The Python library `py4lexis` provides a command-line interface as well as a Python API for access, but leaves dataset management to you.
````

The [LEXIS Portal](https://portal.lexis.tech) is the central portal for managing the datasets hosted by the different partners.
So naturally, it is the first place to start when looking for daily index shards.
Due to the potential size of index shards, the download has been realised as a multi-step process, outlined as follows.

## Authentication

LEXIS requires authenticated users even for public downloads.
For your convenience, we integrate the EUDAT authentication provider, which supports login from most European research institutions as well as common social logins with ORCID, GitHub etc.
**Please note that by logging into LEXIS and downloading the index you accept the current [Open Web Index License (OWIL)](https://openwebsearch.eu/owil-current/).**

## Navigating the Datasets 

The LEXIS platform lists the public OWI datasets under the menu option "Data Sets/Public" named `OWI-Open Web Index-{collection}-{type}@{datacenter}-{dataset}`. 
Please refer to section (TODO: Reference) for the template details and see an example screenshot below.

![](./figures/lexis-navigation.png)
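If you process dataset names in scripts, it can help to split them into the template fields. The following is a minimal sketch, not an official parser. It assumes the form seen in the CLI listing further down this page, such as `OWI-Open Web Index-main.owi@csc-2024-03-01`, where collection and type are joined by a dot and the datacenter name contains no hyphen. The names `NAME_RE` and `parse_dataset_name` are made up for this example.

```python
import re

# Assumption: names look like "OWI-Open Web Index-main.owi@csc-2024-03-01",
# i.e. "{collection}.{type}@{datacenter}-{dataset}" after the fixed prefix.
NAME_RE = re.compile(
    r"^OWI-Open Web Index-"
    r"(?P<collection>[^.@]+)\.(?P<type>[^@]+)"
    r"@(?P<datacenter>[^-]+)-(?P<dataset>.+)$"
)


def parse_dataset_name(name):
    """Return the template fields of a dataset name, or None if it does not match."""
    match = NAME_RE.match(name.strip())
    return match.groupdict() if match else None


print(parse_dataset_name("OWI-Open Web Index-main.owi@csc-2024-03-01"))
# {'collection': 'main', 'type': 'owi', 'datacenter': 'csc', 'dataset': '2024-03-01'}
```

Names that do not follow this form, for example older titles starting with `TOWI-`, yield `None` and can be handled separately.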

You have the option to filter and search for datasets. Note that the search is text-based, so dates etc. have to be entered as text.

![](./figures/lexis-navigation-filter.png)

## Listing Data Set Details

After identifying the relevant dataset, you can inspect its details and navigate its structure via the blue button and then "File List".
You can also download individual files by right-clicking a file and choosing either download or open.

![](./figures/lexis-file-list.png)

Of course, this step is optional.

## Preparing the Download

Regardless of whether you download a full dataset or a single file, the download needs to be prepared first. To do so, select the dataset (or file) and click "download".
You will get notified that the download is prepared.

![](./figures/lexis-prepare.png)

## Download

After the preparation has completed successfully, the download can be initiated, for example via the download arrow at the top of the screen.

![](./figures/lexis-download-button.png)

Preparation will take some time. You can find the current status of the preparation under `Dashboard/Downloads`

![](./figures/lexis-preparation-status.png)

Download progress:

![](./figures/lexis-download.png)

After the download, you can find the file in your download folder as `download.gz`.

## LEXIS Platform Documentation

For further details and documentation, please see [the LEXIS Documentation](https://docs.lexis.tech/).

## Scripting Downloads with py4lexis

Manual download via the LEXIS Portal is mainly useful for fetching a few example datasets.
It is also slow, because the data is not transferred in parallel but routed through the central LEXIS Portal.
We therefore recommend using scripts for regular download tasks.

For this purpose, the LEXIS Platform offers a Python client called [py4lexis](https://opencode.it4i.eu/lexis-platform/clients/py4lexis) for faster, parallel download and command-line (CLI) based dataset management.

We give some examples here on how to use py4lexis with the Open Web Index, but link to the [online documentation](https://opencode.it4i.eu/lexis-platform/clients/py4lexis) for further details.

For the examples here, we assume you have a Linux or macOS shell with Python 3.10 available.


### Starting a session

You first need to log in to a session, which gives you an inner console that is authenticated against the LEXIS Platform.

You can do this via:
```
python -m py4lexis.cli session login-url
```

### Listing datasets and content of a dataset

The `get-all-datasets` command allows you to list all datasets and filter them by project, zone etc.
Please note that LEXIS also hosts other projects besides OpenWebSearch, so you might want to apply a filter on the project.
You can find all options by using `get-all-datasets --help`.

```
> python -m py4lexis.cli datasets get-all-datasets --filter-project openwebsearch
Welcome to the Py4Lexis!
You have been successfully logged in LEXIS session.
Retrieving data of the datasets...
Converting HTTP content from JSON to pandas Dataframe...
Data of the datasets successfully retrieved (and converted)....
Formatting pandas DataFrame into ASCII table...
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| Title                                                    | Access   | Project        | Zone        | InternalID                           | CreationDate        |
+==========================================================+==========+================+=============+======================================+=====================+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/12/26        | project  | openwebsearch  | OWSLRZZONE  | 03d73c63-9524-4ecb-8b6e-f7f59b6a9676 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2024/1/20         | project  | openwebsearch  | OWSLRZZONE  | 043344e1-13ad-4ede-9a19-62ad918b3816 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/11/14        | project  | openwebsearch  | OWSLRZZONE  | 078d3a3f-fb33-4ce6-9142-2f83543cd231 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2024/1/21         | project  | openwebsearch  | OWSLRZZONE  | 07a3682f-bb3d-458f-8b81-06a9e71df96e | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/11/13        | project  | openwebsearch  | OWSLRZZONE  | 14a4bb5d-e6d3-4bc8-a00c-14c047950a78 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/11/21        | project  | openwebsearch  | OWSLRZZONE  | 1e0c1a10-cf36-4d2c-ad2d-40dd476e8c9c | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
```

The access level determines whether you need to be a member of the `openwebsearch` project (`Access` is `project`) or not (`Access` is `public`).
Note that all raw crawl data has access level `project` and thus is only available upon request.
Preprocessed and indexed datasets are available to every authenticated user.
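If you want to work with the listing in a script, the ASCII table can be turned into records. The sketch below assumes the table layout shown above (rows starting with `|`, separator lines starting with `+`) and is not part of `py4lexis`. The function name `parse_table` and the shortened `SAMPLE` text are made up for this example.

```python
def parse_table(text):
    """Parse the ASCII table printed by the CLI into a list of dicts."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log messages and +---+ / +===+ separator lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Shortened sample in the same shape as the CLI output above.
SAMPLE = """\
Formatting pandas DataFrame into ASCII table...
+-------+--------+---------+------+------------+--------------+
| Title | Access | Project | Zone | InternalID | CreationDate |
+=======+========+=========+======+============+==============+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/12/26 | project | openwebsearch | OWSLRZZONE | 03d73c63-9524-4ecb-8b6e-f7f59b6a9676 | 2024-04-12 07:30:54 |
| OWI-Open Web Index-main.owi@csc-2024-03-01 | public | openwebsearch | IT4ILexisV2 | c378f8da-5680-11ef-aad9-0242ac130005 | 2024-08-09 19:21:15 |
"""

# Keep only datasets that do not require project membership.
public = [row for row in parse_table(SAMPLE) if row["Access"] == "public"]
for row in public:
    print(row["InternalID"], row["Title"])
```

In practice, the text passed to `parse_table` would be the captured output of the `get-all-datasets` command.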

`py4lexis` does not support project-specific filtering, i.e. filtering based on the metadata of our index files.
However, you can use standard tools like `grep` to filter on text.


``` 
> python -m py4lexis.cli datasets get-all-datasets | grep OWI | grep 2024-03
| TOWI-The Open Web Index-2023-12-25@lrz-eng-multi         | public   | openwebsearch  | IT4ILexisV2 | f3ffccb4-dc63-11ee-a201-0242c0a87004 | 2024-03-07 09:20:36 |
| TOWI-The Open Web Index-2023-12-25@lrz-eng               | public   | openwebsearch  | IT4ILexisV2 | c46e4d62-dc5f-11ee-ad7e-0242c0a87004 | 2024-03-07 08:50:38 |
| OWI-Open Web Index-legal.owip@it4i-2024-03-12:2024-03-31 | public   | openwebsearch  | IT4ILexisV2 | 0dac12be-52f5-11ef-a60f-0242c0a81003 | 2024-08-05 06:36:32 |
| OWI-Open Web Index-main.owi@csc-2024-03-01               | public   | openwebsearch  | IT4ILexisV2 | c378f8da-5680-11ef-aad9-0242ac130005 | 2024-08-09 19:21:15 |
| OWI-Open Web Index-main.owi@csc-2024-03-05               | public   | openwebsearch  | IT4ILexisV2 | 439467a0-69f7-11ef-aad9-0242ac130005 | 2024-09-03 13:49:52 |
| OWI-Open Web Index-main.owi@csc-2024-03-02               | public   | openwebsearch  | IT4ILexisV2 | 4389c08a-691f-11ef-aad9-0242ac130005 | 2024-09-02 12:03:31 |
| OWI-Open Web Index-main.owi@csc-2024-03-04               | public   | openwebsearch  | IT4ILexisV2 | 3c19d116-69d6-11ef-aad9-0242ac130005 | 2024-09-03 09:56:45 |
| OWI-Open Web Index-main.owi@csc-2024-03-03               | public   | openwebsearch  | IT4ILexisV2 | 2cd1d28e-6941-11ef-aad9-0242ac130005 | 2024-09-02 16:06:12 |
```
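Note that `grep` matches anywhere in a line, including the `CreationDate` column. This is why the listing above also contains datasets for 2023-12-25 that were created in March 2024. The following sketch restricts the match to the `Title` column. It assumes the table layout shown above; the helper names `title_of` and `filter_by_title` are made up for this example.

```python
def title_of(row_line):
    """Return the first column (Title) of a table row, or None for other lines."""
    line = row_line.strip()
    if not line.startswith("|"):
        return None
    return line.strip("|").split("|")[0].strip()


def filter_by_title(lines, *needles):
    """Keep rows whose Title contains every needle (like chained grep, on one column)."""
    kept = []
    for line in lines:
        title = title_of(line)
        if title is not None and all(needle in title for needle in needles):
            kept.append(line)
    return kept


# Two rows copied from the listing above.
lines = [
    "| TOWI-The Open Web Index-2023-12-25@lrz-eng-multi | public | openwebsearch | IT4ILexisV2 | f3ffccb4-dc63-11ee-a201-0242c0a87004 | 2024-03-07 09:20:36 |",
    "| OWI-Open Web Index-main.owi@csc-2024-03-01 | public | openwebsearch | IT4ILexisV2 | c378f8da-5680-11ef-aad9-0242ac130005 | 2024-08-09 19:21:15 |",
]

for line in filter_by_title(lines, "OWI", "2024-03"):
    print(title_of(line))
# OWI-Open Web Index-main.owi@csc-2024-03-01
```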

### Working with datasets

Every dataset in LEXIS is identified via a UUID, referred to as `InternalID`.
So if you work with a dataset, you have to specify the `InternalID` of that dataset.
Working with groups of datasets is not supported yet, but you can use `owilix`, the OpenWebSearch.eu CLI that builds on py4lexis (see next section).
Besides the `InternalID`, `py4lexis` also requires the access level (mostly `public`) and the project name `openwebsearch` for dataset access.
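Since each call takes the same three arguments, a small helper can assemble the command lines for several datasets. This is only a sketch: it builds argument lists for the two CLI commands used on this page and validates the `InternalID`, but does not run anything. The function name `dataset_command` is made up, and whether the commands can be run unattended (for example via `subprocess.run`) depends on having an authenticated session, which is an assumption not verified here.

```python
import shlex
import sys
import uuid


def dataset_command(action, internal_id, access="public", project="openwebsearch"):
    """Build the argument list for one of the py4lexis dataset commands used on this page."""
    if action not in ("get-content-of-dataset", "download-dataset"):
        raise ValueError(f"unsupported action: {action}")
    uuid.UUID(internal_id)  # raises ValueError if the InternalID is not a valid UUID
    return [sys.executable, "-m", "py4lexis.cli", "datasets", action,
            internal_id, access, project]


# InternalIDs taken from the listing above.
dataset_ids = [
    "2cd1d28e-6941-11ef-aad9-0242ac130005",
    "c378f8da-5680-11ef-aad9-0242ac130005",
]

for dataset_id in dataset_ids:
    # Print the command instead of running it (dry run).
    print(shlex.join(dataset_command("download-dataset", dataset_id)))
```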

**Listing content of datasets**

The following command lists the content of a dataset:

```
> python -m py4lexis.cli datasets get-content-of-dataset 2cd1d28e-6941-11ef-aad9-0242ac130005 public openwebsearch
Welcome to the Py4Lexis!
You have been successfully logged in LEXIS session.
Retrieving data of files in the dataset...
Converting HTTP content from JSON to pandas Dataframe...
Content of the dataset was successfully retrieved (and converted)...
Formatting pandas DataFrame into ASCII table...
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| Dir/File-name       | Path                                                        | Type   |       Size | CreateTime          | Checksum   |
+=====================+=============================================================+========+============+=====================+============+
| index.ciff.gz       | year=2024/month=3/day=3/language=aar/index.ciff.gz          | file   |      43281 | 2024-09-02T15:50:13 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=aar/metadata_0.parquet     | file   |     116996 | 2024-09-02T15:50:12 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=abc/index.ciff.gz          | file   |       1451 | 2024-09-02T15:57:24 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=abc/metadata_0.parquet     | file   |      18836 | 2024-09-02T15:57:24 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=abk/index.ciff.gz          | file   |      14041 | 2024-09-02T15:48:03 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=abk/metadata_0.parquet     | file   |      54112 | 2024-09-02T15:48:03 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=ace/index.ciff.gz          | file   |       1438 | 2024-09-02T15:50:45 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=ace/metadata_0.parquet     | file   |      25103 | 2024-09-02T15:50:44 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=afr/index.ciff.gz          | file   |    3176754 | 2024-09-02T15:48:05 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=afr/metadata_0.parquet     | file   |   12290453 | 2024-09-02T15:48:04 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=aka/index.ciff.gz          | file   |       5236 | 2024-09-02T15:59:30 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=aka/metadata_0.parquet     | file   |      27690 | 2024-09-02T15:59:29 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=all/index.ciff.gz          | file   |      61807 | 2024-09-02T15:59:25 | None       |
....
```
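The paths in the listing follow a `key=value` directory layout (often called Hive-style partitioning), so year, month, day and language can be read directly from the path. The following sketch groups files by language. It assumes only the path format shown above; the function names are made up for this example.

```python
from collections import defaultdict


def parse_partition_path(path):
    """Split 'key=value/.../filename' into (partition dict, filename)."""
    *parts, filename = path.split("/")
    partitions = dict(part.split("=", 1) for part in parts if "=" in part)
    return partitions, filename


def files_by_language(paths):
    """Group file names by the value of the 'language' partition."""
    grouped = defaultdict(list)
    for path in paths:
        partitions, filename = parse_partition_path(path)
        grouped[partitions.get("language")].append(filename)
    return dict(grouped)


# Paths copied from the listing above.
paths = [
    "year=2024/month=3/day=3/language=aar/index.ciff.gz",
    "year=2024/month=3/day=3/language=aar/metadata_0.parquet",
    "year=2024/month=3/day=3/language=afr/index.ciff.gz",
]

print(parse_partition_path(paths[0])[0])
# {'year': '2024', 'month': '3', 'day': '3', 'language': 'aar'}
print(files_by_language(paths))
# {'aar': ['index.ciff.gz', 'metadata_0.parquet'], 'afr': ['index.ciff.gz']}
```

The same grouping can be used to select only the languages you need before downloading individual files.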

**Downloading datasets**

Downloading works similarly to listing the content of a dataset, just with the `download-dataset` command.
Please note that, as with the portal, `py4lexis` also needs to prepare the dataset and downloads it via HTTP, which can be slow.

```
python -m py4lexis.cli datasets download-dataset 2cd1d28e-6941-11ef-aad9-0242ac130005 public openwebsearch
Welcome to the Py4Lexis!
You have been successfully logged in LEXIS session.
Submitting download request on server...
Download submitted!
Checking the status of download request...
Download request not ready yet, 200/0 retries
Download request not ready yet, 200/1 retries
Download request not ready yet, 200/2 retries
```

Preparation currently takes quite some time, especially for large datasets.
We are working on a more efficient way to download.