# Tutorial 0: How to create an index locally

This tutorial shows how to set up our indexing pipeline locally, based on local crawls and using Docker.


## Prerequisites

You should have [Docker](https://docs.docker.com/engine/install/) installed on your system such that the `docker` command is available from your command line. You also need [`wget`](https://www.gnu.org/software/wget/) or a similar tool for scraping websites.

:::{note}
Please note that you should use local crawling, as done in this tutorial, __only__ for testing purposes or in special cases. Crawling can put a heavy load on server systems and is considered impolite if it does not follow crawl etiquette.
:::
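
One way to keep a test crawl polite is to throttle `wget`. The following is only a sketch: the helper name `polite_crawl` and the concrete values (a one-second base delay, a 200 KB/s rate limit) are assumptions chosen for illustration, not project recommendations.

```bash
# Hypothetical helper: fetch a list of URLs into a WARC file while throttling requests.
# The delay and rate limit below are illustrative values, not project defaults.
polite_crawl() {
    local url_list="$1" warc_prefix="$2"
    wget --input-file "$url_list" \
        --wait=1 \
        --random-wait \
        --limit-rate=200k \
        --delete-after \
        --no-directories \
        --warc-file "$warc_prefix"
}
```

Calling `polite_crawl urls.txt data/crawl` would then fetch the listed URLs with randomized pauses between requests and write `data/crawl.warc.gz`, the same output file as the crawl in Step 1.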

## Step 1: Scraping a list of Webpages

As a first step, you should perform a simple crawl using `wget` or a similar tool:

```bash
# Create a directory that will be mounted in the Docker containers
mkdir -p data

wget --input-file urls.txt \
    --recursive \
    --level 2 \
    --delete-after \
    --no-directories \
    --warc-file data/crawl
```
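
The `wget` call above reads its seed URLs from `urls.txt`, one URL per line. A minimal sketch for creating such a file, assuming two placeholder URLs that you would replace with your own:

```bash
# Placeholder seed URLs (assumption); replace them with the pages you want to fetch.
cat > urls.txt <<'EOF'
https://example.org/
https://example.com/
EOF

# Print the number of seed URLs that wget will read
wc -l < urls.txt
```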

You can leave out the `--recursive` and `--level` arguments if you only want to fetch the list of URLs and don't want to perform recursive or explorative crawling. Note that `wget` includes angle brackets `<` and `>` around the URL. The preprocessing pipeline handles this correctly, but the final metadata files will still include these brackets in the column for the complete URL.
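
To see the angle brackets for yourself, you can look at the target URIs recorded in the WARC file. The helper name `list_target_uris` below is hypothetical, and the sketch assumes the crawl was written to `data/crawl.warc.gz` as in the command above:

```bash
# Hypothetical helper: print the first five target URIs stored in a gzipped WARC file.
list_target_uris() {
    local warc_file="$1"
    # grep -a treats the archive as text; tr strips the CR of the WARC line endings
    gzip -dc "$warc_file" | grep -a '^WARC-Target-URI:' | tr -d '\r' | head -n 5
}

# Only inspect the crawl if it already exists
if [ -f data/crawl.warc.gz ]; then
    list_target_uris data/crawl.warc.gz
fi
```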

```{note}
We assume all paths to be relative to the directory in which you run the commands, i.e., the parent of the `data` directory. If you have a different setup, you need to adjust the paths accordingly.
```

```{hint}
For large-scale crawling, you can also set up our [Open Web Index Crawler - short OWLER](https://opencode.it4i.eu/openwebsearcheu-public/owler) or explore one of our other options down below.
```

## Step 2: Preprocessing

Given the results of your crawler, you can run the preprocessing pipeline:

```bash
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/preprocessing-pipeline \
    /data/crawl.warc.gz \
    /data/metadata.parquet.gz
    
```

```{hint}
For more details and the source code take a look at our [preprocessing pipeline](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline). 
```


## Step 3: Indexing


Finally, you can index the preprocessed metadata to obtain the index for your crawl:

```bash
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    /data/metadata.parquet.gz \
    /data/index/
```

You should now have the following files:

```bash
data/crawl.warc.gz
data/metadata.parquet.gz
data/index/index.ciff.gz
```
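
If you want to verify this from the command line, a small check along the following lines may help. The helper name `check_outputs` is hypothetical; the file names are the ones listed above.

```bash
# Hypothetical helper: report which of the expected pipeline outputs exist and are non-empty.
check_outputs() {
    local base_dir="${1:-data}" missing=0 f
    for f in crawl.warc.gz metadata.parquet.gz index/index.ciff.gz; do
        if [ -s "$base_dir/$f" ]; then
            echo "ok       $base_dir/$f"
        else
            echo "missing  $base_dir/$f"
            missing=1
        fi
    done
    # Non-zero exit status if at least one file is missing or empty
    return "$missing"
}

check_outputs data || echo "Some expected files are missing; re-check Steps 1 to 3." >&2
```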

```{hint}
For more details on indexing, please take a look at our [spark indexer](https://opencode.it4i.eu/openwebsearcheu-public/spark-indexer), especially if you experience out-of-heap-memory errors.
```


## Step 4: Consuming the index

You can consume the index using [MOSAIC](https://opencode.it4i.eu/openwebsearcheu-public/mosaic) or other CIFF-compatible search engine libraries / frameworks like [PyTerrier PISA](https://github.com/terrierteam/pyterrier_pisa).

In short, you have to import the CIFF file into a Lucene index and then run the application using the imported index. Metadata is read from the Parquet file, and the name of the Lucene index must be the same as the name of the directory that contains the Parquet file.

First, run the container for the importer:

1. Import the CIFF file to a Lucene index
```bash
mkdir -p data/serve/lucene
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/lucene-ciff \
    /data/index/index.ciff.gz \
    /data/serve/lucene/demo-index
mkdir -p data/serve/metadata/demo-index
cp data/metadata.parquet.gz data/serve/metadata/demo-index/metadata.parquet.gz
```
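This will create the Lucene index at `/data/serve/lucene/demo-index` inside the container (`data/serve/lucene/demo-index` on the host) and copy the metadata into a directory of the same name. Then, serve the Lucene index: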

2. Run the search application
```bash
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -p 8008:8008 \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/search-service \
    --lucene-dir-path /data/serve/lucene/ \
    --parquet-dir-path /data/serve/metadata/
```
The application should now be running on `localhost:8008`, and you should be able to perform a search query (e.g., http://localhost:8008/search?q=europe). The parameters are:

- `--lucene-dir-path`: path of the directory that contains the Lucene index(es)
- `--parquet-dir-path`: path of the directory that contains the Parquet file(s)
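
The example query can also be sent from the command line. The sketch below assumes `curl` is installed and wraps the request in a hypothetical helper named `search`; only the `/search` endpoint and the `q` parameter are taken from the example URL above, and the format of the response is not assumed here.

```bash
# Hypothetical helper: send a query to the search service and print the raw response.
# The default base URL matches the port mapping used above.
search() {
    local query="$1" base_url="${2:-http://localhost:8008}"
    # --get appends the URL-encoded data as a query string: /search?q=...
    curl --silent --show-error --get \
        --data-urlencode "q=$query" \
        "$base_url/search"
}
```

With the search service running, `search europe` sends the same request as the example URL.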

```{hint}
For more details on starting the MOSAIC search service, please take a look at [available options](https://opencode.it4i.eu/openwebsearcheu-public/mosaic/-/blob/main/README.md#cli-options).
```

```{note}
The name of the Lucene index (i.e., the directory name) and the name of the directory that contains the Parquet file must match, following the pattern `{index-name}/metadata.parquet.gz` (e.g., if the name of the Lucene index is `wiki`, the associated metadata directory name must be `wiki`).
```
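
The naming rule can be checked before starting the service. The following is a sketch under the directory layout used in this step (`data/serve/lucene/` and `data/serve/metadata/`); the helper name `check_index_layout` is hypothetical.

```bash
# Hypothetical helper: verify that every Lucene index has a metadata directory of the same name.
check_index_layout() {
    local serve_dir="${1:-data/serve}" status=0 index_dir name
    for index_dir in "$serve_dir"/lucene/*/; do
        # Skip the unexpanded pattern if no index directory exists yet
        [ -d "$index_dir" ] || continue
        name=$(basename "$index_dir")
        if [ -f "$serve_dir/metadata/$name/metadata.parquet.gz" ]; then
            echo "ok: $name"
        else
            echo "no metadata for index: $name"
            status=1
        fi
    done
    return "$status"
}

check_index_layout data/serve || echo "At least one index has no matching metadata directory." >&2
```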

## Options: Other data sources for indexing


### Wikipedia: Simple Wikipedia Abstracts

Another interesting starting point is to index the [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) from their [dumps](https://dumps.wikimedia.org).

#### Simple Wikipedia Abstract

The [Simple Wikipedia abstracts](https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-abstract.xml.gz) are quite small and quick to process.

So instead of crawling your own files in Step 1, use [the wiki-to-ows tool](https://opencode.it4i.eu/openwebsearcheu-public/wiki-to-ows/) to create the WARC file as follows:

```bash
docker run \
    --rm \
    -v "$PWD/data":/data \
    opencode.it4i.eu:5050/openwebsearcheu-public/wiki-to-ows \
    --download=https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-abstract.xml.gz \
    --warc \
    --compress \
    -o /data/simple_wiki_abstracts
```

Note that indexing might require more heap memory (see below).

## Options for Indexing

### Setting more Heap Memory and other Spark Properties

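For larger crawls, indexing might run out of heap memory. You can either split the input into smaller chunks or increase the heap size. You can override any Spark property by mounting a properties file (here `spark-properties.conf`) as `/opt/spark/conf/spark-defaults.conf` in the container, as follows: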

```bash
# Write the Spark overrides; printf expands \n, which plain echo in bash does not
printf 'spark.driver.memory=8g\nspark.executor.memory=8g\n' > "$PWD/spark-properties.conf"

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -v "$PWD/spark-properties.conf":/opt/spark/conf/spark-defaults.conf \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    /data/metadata.parquet.gz \
    /data/index/
```

You can also override the entrypoint and pass the memory settings directly to `spark-submit`:

```bash
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -w /opt/spark/work-dir \
    --entrypoint /bin/bash \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    -c "/opt/spark/bin/spark-submit --driver-memory 8g --executor-memory 8g Indexer-assembly-1.0.jar index --description 'CIFF description'  --input-format parquet --output-format ciff --id-col record_id --content-col plain_text  /data/metadata.parquet.gz /data/index"
```

Alternatively, you can increase the number of partitions used during indexing with the `--num-partitions` option:

```bash
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    --num-partitions 2000 \
    /data/metadata.parquet.gz \
    /data/index/
```

