# Starting Point: Working with Open Web Index Data using **owilix**

This tutorial introduces the **owilix** command-line tool for accessing, downloading, and querying datasets from the [Open Web Index](https://openwebsearch.eu/the-project/research-results/open-web-search-book/). 

---

## 1. Getting Started

Depending on your operating system, you can install **owilix** using either Docker or conda.

```{note}
owilix requires Python 3.11 and has only been tested on Linux and macOS.
```

### Using Docker


Pull the image:

```bash
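docker pull opencode.it4i.eu:5050/openwebsearcheu-public/owi-cli:latest
# tag the image with the short name used in the commands below
docker tag opencode.it4i.eu:5050/openwebsearcheu-public/owi-cli:latest owilix
# check that everything is working
docker run -it owilix --help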
```

Run commands: 

```bash
# use an owilix command using current configuration
docker run --rm -it -v ~/.owi/:/home/owi/.owi owilix remote ls
```

Set an alias: 
```bash 
# Add the alias at the end of your shell startup file (e.g. ~/.bashrc or ~/.zshrc)
alias owilixd='docker run --rm -it -v ~/.owi/:/home/owi/.owi owilix'
```
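
In bash, aliases are not expanded in non-interactive scripts by default, and `docker run -it` typically fails when no terminal is attached. If the wrapper should also work in scripts, a shell function is one alternative. The sketch below is not part of owilix; it assumes the image is available locally under the name `owilix`, as in the commands above. It also creates `~/.owi` first, because Docker would otherwise create the missing mount directory itself, usually owned by root.

```bash
# Function alternative to the owilixd alias (a sketch, not part of owilix).
owilixd() {
    local tty_flags="-i"
    # Allocate a terminal only when stdin is one, so the wrapper also works in scripts.
    if [ -t 0 ]; then
        tty_flags="-it"
    fi
    # Create the config directory up front so Docker does not create it as root.
    mkdir -p "$HOME/.owi" || return 1
    docker run --rm "$tty_flags" -v "$HOME/.owi/:/home/owi/.owi" owilix "$@"
}
```

Once defined, `owilixd remote ls` behaves like the alias in an interactive shell.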

### Using Python

#### Create a New Environment

We recommend using a dedicated environment:

```bash
conda create -n owi pip python=3.11
conda activate owi
```

#### Install Required Packages

Install the dependencies:

```bash
pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install owilix   --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
```
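
Before moving on, it can help to confirm that the expected tools are on the `PATH` of the active environment. The helper below is a generic shell sketch (the name `require_cmd` is made up for this tutorial); it only checks that a command exists, not that it works correctly.

```bash
# Hypothetical helper: report whether a command is available on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check the tools this tutorial relies on; a missing one is only reported.
for tool in python pip owilix; do
    require_cmd "$tool" || true
done
```

With the conda environment from above active, `python --version` should additionally report a 3.11 release.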

---

## 2. Pulling and Inspecting Datasets


1. List Available Datasets and check that everything works

```bash
owilix remote ls
```
2. Download a Small Dataset (only pages with impressum, imprint, contact, privacy policy and terms of use in their URL)

```bash
owilix remote pull all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905
```
```{note}
`pull` synchronizes local and remote datasets file by file, so interrupted downloads can be resumed.
Results are stored in the local directory `~/.owi/public/`.
```

3. View Some Entries through the CLI or as JSON

```bash
owilix query less --local all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905
```
```{note}
Datasets can be specified with the following syntax:
`datacenter:startdate#days/filter=a;filter=b`

For example: `owilix local all:latest#10/collectionName=main`
```
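
Because `;` separates commands in the shell, a specifier with several filters has to be quoted on the command line. The following sketch assembles a specifier from its parts so that the quoting happens in one place. The function `owi_spec` is hypothetical and not part of owilix; it only covers the full `datacenter:startdate#days/...` form shown in the note.

```bash
# Hypothetical helper (not part of owilix): assemble a dataset specifier
# of the form datacenter:startdate#days/filter=a;filter=b.
owi_spec() {
    local datacenter="$1" startdate="$2" days="$3"
    shift 3
    local spec="${datacenter}:${startdate}#${days}"
    if [ "$#" -gt 0 ]; then
        local IFS=';'          # join the remaining arguments with ';'
        spec="${spec}/$*"
    fi
    printf '%s\n' "$spec"
}

owi_spec all latest 10 collectionName=main
# prints: all:latest#10/collectionName=main
```

The result can then be passed as one quoted argument, for example `owilix remote ls "$(owi_spec all latest 10 collectionName=main)"`.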

To print the entries from step 3 as JSON, add `as_json=True` to the query:

```bash
owilix query less --local all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905 as_json=True
```

```{note}
`owilix` has built-in help that lists the available commands and parameters:

- `owilix --help`
- `owilix remote --help`
- `owilix remote help pull` (for command groups, `help` is a subcommand)
```

---

## 3. Using OWI Data in a Small Search Engine (requires Docker)

We provide a small Lucene-based search engine that can be used to query the OWI data.

This section shows how to set it up. For more details, see [the MOSAIC tutorial](c_mosaic.md).

### Download Data

We select a dataset and inspect its details, which takes a little longer than a plain listing. The ID is `7bfda93e-5f84-11f0-9374-528c047b29ff`; the command below lists its metadata files and groups them. Make sure to pick a dataset from the main collection, because other collections do not have index files yet.

```bash
owilix remote ls all/id=7bfda93e-5f84-11f0-9374-528c047b29ff details=True "files=**/metadata*parquet" groups=4
```
Next, download the data. To keep the download small, we fetch only the English files via a glob pattern. To speed up the transfer, we also increase the number of parallel threads.

```bash
owilix remote pull all/id=7bfda93e-5f84-11f0-9374-528c047b29ff files="**/language=eng/*" num_threads=10
```
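
After the pull finishes, one quick sanity check is to count the files that landed in a language partition. The sketch below relies only on `find`. It assumes the `language=<code>` directory naming implied by the glob pattern above and the default download location `~/.owi/public/`; the function name is made up for this tutorial.

```bash
# Hypothetical helper: count downloaded files in one language partition.
count_language_files() {
    local root="$1" lang="$2"
    # tr strips the padding that some wc implementations add.
    find "$root" -type f -path "*/language=${lang}/*" | wc -l | tr -d ' '
}

# Example: English files below the default download directory (if present).
if [ -d "$HOME/.owi/public" ]; then
    count_language_files "$HOME/.owi/public" eng
fi
```

A count of zero after a successful pull would suggest that the glob pattern did not match any files.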

#### (Optional) More Sophisticated Selection: Slice the Dataset into a New Dataset

Alternatively, you can narrow the data down further by slicing it into a new dataset:

```bash 
owilix query slice --local all/id=7bfda93e-5f84-11f0-9374-528c047b29ff "where=WHERE url_suffix='at'" collection_name="mycollection" creator="me"
```

This creates a dataset with a new ID. If you did not note the ID down, you can find it with `ls`:

```bash 
owilix query ls --local all/collectionName=mycollection
```

If you created a slice, use its new ID in place of the original dataset ID in the following steps.

#### Preparing Data for MOSAIC

Now change to a directory where you want to store the index data. The MOSAIC framework requires the data as a single CIFF file plus multiple Parquet files in a specific folder structure.

```bash 
cd ~/tmp/
mkdir -p data
# "$PWD" is the current directory; $(PWD) would try to run a command named PWD
owilix local export all/id=7bfda93e-5f84-11f0-9374-528c047b29ff outdir="$PWD/data"
```
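
The import step that follows expects `data/index.ciff.gz` and moves entries matching `*metadata_*`. A small pre-flight check can catch an incomplete export before Docker is started. This is a sketch: the two names are taken from the commands in this section, while `check_export` itself is a hypothetical helper and not part of owilix.

```bash
# Hypothetical pre-flight check for the export directory (not part of owilix).
check_export() {
    local dir="$1"
    if [ ! -f "$dir/index.ciff.gz" ]; then
        echo "missing: $dir/index.ciff.gz" >&2
        return 1
    fi
    # Expand the same *metadata_* pattern that the later mv step uses.
    set -- "$dir"/*metadata_*
    if [ ! -e "$1" ]; then
        echo "no *metadata_* entries in $dir" >&2
        return 1
    fi
    echo "ok: index.ciff.gz and $# metadata entries in $dir"
}

# Run from the directory that contains data/; a failure is only reported.
check_export "$PWD/data" || true
```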

We can now import the data into the MOSAIC framework.

```bash
mkdir -p data/serve/lucene
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/lucene-ciff \
    /data/index.ciff.gz \
    /data/serve/lucene/demo-index
mkdir -p data/serve/metadata/demo-index
mv data/*metadata_* data/serve/metadata/demo-index
```

Now start the MOSAIC framework.

```bash
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -p 8008:8008 \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/search-service \
    --lucene-dir-path /data/serve/lucene/ \
    --parquet-dir-path /data/serve/metadata/
```

You can now access the search engine at [http://localhost:8008/search?q=test](http://localhost:8008/search?q=test).
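
The same endpoint can be queried from the shell, which is handy for multi-word queries that need URL encoding. The helper below is a sketch using standard `curl` options. `mosaic_search` and the `MOSAIC_URL` variable are names invented for this tutorial, and nothing is assumed about the response format beyond what the service returns for the link above.

```bash
# Hypothetical helper: query the local search service from the shell.
mosaic_search() {
    # --get turns the url-encoded data into a query string (?q=...).
    curl --silent --show-error --get \
        --data-urlencode "q=$1" \
        "${MOSAIC_URL:-http://localhost:8008}/search"
}
```

With the service running, `mosaic_search "open web index"` sends the query to `/search` on port 8008.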



---

## 4. Working with Larger Datasets

### Pull Data Filtered by Language (German only)

Dataset downloads can be filtered at the file level. Since files are partitioned by language, you can filter by language with a glob pattern.

```bash
owilix remote pull all/internalID=e3cbb70c-8e52-11f0-b931-c687956b5905 "files=**/language=deu/*"
```
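
Large downloads are more likely to be interrupted. Since `pull` synchronizes file by file, rerunning the same command continues where it stopped, and that rerun can be automated. The wrapper below is a generic sketch: `retry` and the `RETRY_DELAY` variable are hypothetical names, and it assumes that owilix exits with a non-zero status when a pull fails, which is not confirmed here.

```bash
# Hypothetical helper: rerun a command until it succeeds or attempts run out.
retry() {
    local max_attempts="$1" attempt=1
    shift
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Wait before the next attempt (30 seconds unless RETRY_DELAY is set).
        sleep "${RETRY_DELAY:-30}"
    done
}
```

For example, `retry 5 owilix remote pull all/internalID=e3cbb70c-8e52-11f0-b931-c687956b5905 "files=**/language=deu/*"` would try the pull up to five times.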

### Write Results to JSON File

For structured output, you can write the results to a JSON file.

```bash
owilix query less --local all/internalID=e3fb8860-8e52-11f0-9ae7-c687956b5905 \
    as_json=True json_file=/Users/username/tmp/myjson.json
```

### Filter by Top-Level Domain (.at) and Select Specific Fields

```bash
owilix query less --local all/internalID=e3fb8860-8e52-11f0-9ae7-c687956b5905 \
    "select=title,url" "where=url_suffix='at'"
```

---

## 5. Advanced Queries

### Collect All Sites with Outgoing Links and Microdata

```bash
owilix query less --local all/internalID=e3cbb70c-8e52-11f0-b931-c687956b5905 \
    "select=url,title,outgoing_links,microdata,curlielabels_en" \
    "where=microdata is not NULL"
```

---

## 6. Data and Schema

* Datasets are stored under:

  ```text
  ~/.owi/public/
  ```

* You can work directly with **Parquet data** (metadata + plain text).

* CIFF data can be added to tools such as:

  * [PyTerrier](https://github.com/terrier-org/pyterrier)
  * Lucene
  * PISA

* Prototype integration with **MOSAIC RAG** is under development:
  [Mosaic RAG How-To](https://openwebsearcheu-public.pages.it4i.eu/ows-the-book/content/howto/c_mosaic.html)

Schema:


### Fixed columns
| Column             | Description                                                                                                | Pyspark Datatype                      |
|--------------------|------------------------------------------------------------------------------------------------------------|---------------------------------------|
| id                 | Unique ID based on the SHA256-hash of the URL                                                              | `StringType()`                        |
| record_id          | UUID of the WARC record                                                                                    | `StringType()`                        |
| title              | Title of the document                                                                                      | `StringType()`                        |
| description        | Description from the document metadata                                                                     | `StringType()`                        |
| keywords           | Keywords from the document metadata                                                                        | `StringType()`                        |
| author             | Author from the document metadata                                                                          | `StringType()`                        |
| main_content       | Main content of the HTML, formatted with minimal HTML tags (`h1-6`, `p`, `ul/ol/li`, `pre`, and `a` tags)  | `StringType()`                        |
| json-ld            | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents)                | `StringType()`                        |
| microdata          | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json)                                       | `StringType()`                        |
| opengraph          | String list of Open Graph data (https://ogp.me/)                                                           | `StringType()`                        |
| warc_date          | Date from the WARC header                                                                                  | `StringType()`                        |
| warc_ip            | IP Address from the WARC header                                                                            | `StringType()`                        |
| url                | Full URL                                                                                                   | `StringType()`                        |
| url_scheme         | URL scheme specifier                                                                                       | `StringType()`                        |
| url_path           | Hierarchical path after TLD                                                                                | `StringType()`                        |
| url_params         | Parameters for last path element                                                                           | `StringType()`                        |
| url_query          | Query component                                                                                            | `StringType()`                        |
| url_fragment       | Fragment identifier                                                                                        | `StringType()`                        |
| url_subdomain      | Subdomain of the network location                                                                          | `StringType()`                        |
| url_domain         | Domain of the network location                                                                             | `StringType()`                        |
| url_suffix         | Suffix according to the [Public Suffix List](https://publicsuffix.org/)                                    | `StringType()`                        |
| url_is_private     | If the URL has a private suffix                                                                            | `BooleanType()`                       |
| mime_type          | MIME-Type from the HTTP Header                                                                             | `StringType()`                        |
| charset            | charset from the HTTP Header                                                                               | `StringType()`                        |
| content_type_other | List of key, value pairs from the content type that could not be parsed into MIME-type or charset          | `MapType(StringType(), StringType())` |
| http_server        | Server from the HTTP Header                                                                                | `StringType()`                        |
| language           | Language as identified by [language.py](preprocessing/parse/language.py); Code according to ISO-639 Part 3 | `StringType()`                        |
| valid              | `True`: The record is valid; `False`: The record is not/no longer valid and should not be processed.       | `BooleanType()`                       |
| crawling_error     | Error message set by the crawler. Only set for records with `valid=False`                                  | `StringType()`                        | 
| warc_file          | Name of the original WARC-file that contained record                                                       | `StringType()`                        |
| warc_offset        | Offset of the record in `warc_file` in the (uncompressed) stream                                           | `IntegerType()`                       |
| schema_metadata    | List of key, value pairs that contain global settings like the `schema_version`                            | `MapType(StringType(), StringType())` |

### Columns from [modules](preprocessing/parse/html_modules)

| Column                  | Description                                                                                                                                                                     | Pyspark Datatype                                                                             |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| ows_canonical           | The canonical link if it exists                                                                                                                                                 | `StringType()`                                                                               |
| ows_fetch_response_time | Fetch time in ms                                                                                                                                                                | `IntegerType()`                                                                              |
| ows_fetch_num_errors    | Number of errors while fetching (Timeout is the most prominent fetch error)                                                                                                     | `StringType()`                                                                               |
| ows_genai               | `True`: The content is allowed to be used for the purposes of developing Generative AI models; `False`: The content cannot be used                                              | `BooleanType()`                                                                              |
| ows_genai_details       | If `ows_genai=False`, this provides additional context                                                                                                                          | `StringType()`                                                                               |
| ows_index               | `True`: The content is allowed to be used for the purposes of web indexing/web search; `False`: The content cannot be used                                                      | `BooleanType()`                                                                              |
| ows_referer             | The URL of the page that referred to the current one                                                                                                                            | `StringType()`                                                                               |
| ows_resource_type       | Crawl from which the WARC-file originated; Files crawled by the University of Passau are labeled with "Owler"                                                                   | `StringType()`                                                                               |
| ows_tags                | List of tags assigned by the OWS crawler                                                                                                                                        | `ArrayType(StringType())`                                                                    |
| outgoing_links          | List of all hyperlinks in the HTML that start with 'http'                                                                                                                       | `StructType` with `src` and `anchor_text`                                                    |
| image_links             | List of all links to images in the HTML that start with 'http'                                                                                                                  | `StructType` with `src`, `width`, and `height`                                               |
| video_links             | List of all links to videos in the HTML that start with 'http' or iframes with a video                                                                                          | `StructType` with `src`, `width`, and `height`                                               |
| iframes                 | List of tuples for nodes that contain an iframe (and are not a video)                                                                                                           | `StructType` with `src`, `width`, and `height`                                               |
| curlielabels            | List of language specific domain labels according to [Curlie.org](https://curlie.org/).                                                                                         | `ArrayType(StringType())`                                                                    |
| curlielabels_en         | List of English domain labels according to [Curlie.org](https://curlie.org/). Mapping by [Lugeon, Sylvain; Piccardi, Tiziano](https://doi.org/10.6084/m9.figshare.19406693.v5). | `ArrayType(StringType())`                                                                    |
| address                 | List of dictionaries containing extracted location and coordinates                                                                                                              | See `get_spark_schema` in [geoparsing.py](resilipipe/resilipipe/parse/modules/geoparsing.py) | 
| collection_indices      | List of collection indices that a record belongs to. Are defined via `yaml` files on the S3 instance                                                                            | `ArrayType(StringType())`                                                                    |



---

## 7. Data Selection

You can filter datasets by:

* Domain (e.g., `.de`)
* Topic (using [curlie.org](https://curlie.org/) hierarchy labels)
* Language (filename-based)
* Site list (CSV of URLs, domains, or TLDs)
* Structured data / microdata
* Outlinks
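
As an illustration of domain-based selection, the sketch below turns a list of suffixes into a `where=` argument like the one used in Section 4. The helper `suffix_where` is hypothetical. It assumes that the `where` expression accepts SQL-style `OR`, which this tutorial does not confirm, and it does no escaping, so it should only be fed trusted values.

```bash
# Hypothetical helper: build a where= argument matching several URL suffixes.
# Assumes the where expression supports OR (not confirmed for owilix).
suffix_where() {
    local clause="" suffix term
    for suffix in "$@"; do
        term="url_suffix='${suffix}'"
        if [ -z "$clause" ]; then
            clause="$term"
        else
            clause="$clause OR $term"
        fi
    done
    printf 'where=%s\n' "$clause"
}

suffix_where at de
# prints: where=url_suffix='at' OR url_suffix='de'
```

The output is meant to be passed as one quoted argument, in the same position as `"where=url_suffix='at'"` in the earlier query.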

---




