
# Tutorial 10. Hosting the OWI on your own S3

```{danger}
This tutorial is not yet complete or fully working.
Some parsing errors remain that could not be resolved via the command line.

```

## Scenario

- You want to host (parts of) the OWI index on your own S3 bucket for faster querying.
- You want to host slices of the OWI index on your own S3 bucket.

## Prerequisites

- Having `owilix` installed
- Having an S3 bucket configured

## 1. Configure OWILIX to use your S3 bucket

1. Go to the owilix config directory: `cd ~/.owilix`
2. Edit the `owilix.cfg` file: `vi ~/.owi/owilix.cfg`
3. Under `repositories.config`, add the following:
   ```yaml
    mys3: # name / abbreviation of the repo. Must not start with +
      options: # Note: config options follow fsspec conventions
        protocol: s3a
        key: <Your Key here> # your key to s3 if it is not public
        secret: <Your Secret here> # your secret to s3 if it is not public
        endpoint: <endpoint> # your endpoint to s3
        path: owi-public/{access} # path to the bucket. {access} is a placeholder and must be provided
        async: True
      repository: s3a
    ```
4. (Optional, but recommended) Under the `selected_remote` section, you can select the remotes that should be active by default.
   If you do not add your remote here, you always have to specify it on the command line with `--remotes +mys3`.
   The drawback of adding it here is that the more remotes you have, the longer queries can take, and you might end up with duplicate data.
5. Check availability: 
   ```bash
    >> owilix --remotes mys3 remote doctor
    Found 1 configured Remote Repositories
    mys3: connected
            user=False      project=False   public=False
    ```
   Since there are no datasets yet, it is expected that `user`, `project` and `public` are all `False`; these flags require a certain folder structure in the bucket.
   What matters is that the output reports `<abbreviation>: connected`, here `mys3: connected`.
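
If you want to script the availability check, the sketch below looks for the `<abbreviation>: connected` line in the doctor output. The helper name `remote_is_connected` is hypothetical and not part of `owilix`; it only assumes the output format shown above and reads that output from standard input.

```bash
# Hypothetical helper (not part of owilix): succeeds if the text on stdin
# contains a line of the form "<name>: connected", as in the output above.
# Assumes <name> contains no regular-expression metacharacters.
remote_is_connected() {
  local name="$1"
  grep -Eq "^[[:space:]]*${name}:[[:space:]]*connected[[:space:]]*$"
}

# Usage (requires owilix and the mys3 remote configured as above):
#   owilix --remotes mys3 remote doctor | remote_is_connected mys3 \
#     && echo "mys3 is reachable"
```

Matching on the whole line avoids false positives from lines that merely contain the word "connected".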
   
   
## 2. Pushing full datasets to your S3 bucket

To fill the bucket, run an `owilix` command that selects datasets and pushes them into the bucket (we use the `mys3` remote, assuming it was configured as above):

```bash
owilix --remotes +mys3,lrz,it4i remote pull all:2025-04-15#14/collectionName=main push_to_remote=mys3 num_threads=10
```

We use the remotes `lrz` and `it4i`, which are configured by default and host the main OWI index, and add our own S3 remote in addition (the little `+`).
We then pull all datasets from the main index for the 14 days before 15 April 2025 and push them to our own S3 bucket.
We use 10 threads to speed up the process. It should take about 30 minutes per dataset on a good internet connection.
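
To make the parts of the dataset selector easier to see, the following sketch splits `all:2025-04-15#14/collectionName=main` into its pieces with plain bash parameter expansion. The function name `parse_selector` and the field labels are hypothetical. The reading of the pieces is inferred only from the description above and is not `owilix`'s actual parser. The sketch handles only the full form with a date, a `#<days>` part and a filter.

```bash
# Hypothetical illustration: split a selector of the form
#   <selection>:<date>#<days>/<key>=<value>
# Field names are labels chosen for this sketch, not owilix terminology.
parse_selector() {
  local selector="$1"
  local head="${selector%%/*}"     # all:2025-04-15#14
  local filter="${selector#*/}"    # collectionName=main
  local selection="${head%%:*}"    # all
  local window="${head#*:}"        # 2025-04-15#14
  local end_date="${window%%\#*}"  # 2025-04-15
  local days="${window#*\#}"       # 14
  printf 'selection=%s end_date=%s days=%s filter=%s\n' \
    "$selection" "$end_date" "$days" "$filter"
}

parse_selector 'all:2025-04-15#14/collectionName=main'
```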

```{note}
As of now, long-running jobs might fail due [to a bug in the platform](https://opencode.it4i.eu/lexis-platform/clients/py4lexis/-/issues/53).
However, syncing can resume from the files not yet synced by simply re-running the command.
```
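
Since re-running the command resumes the sync, the re-run can be automated with a small retry wrapper. This is a minimal sketch: the function `retry` and the variable `RETRY_DELAY_SECONDS` are hypothetical names introduced here, not `owilix` features, and the sketch assumes that a failed job exits with a non-zero status.

```bash
# Hypothetical helper: run a command until it succeeds or the attempt
# budget is used up. Assumes failure is signalled by a non-zero exit code.
retry() {
  local max_attempts="$1"; shift
  local delay="${RETRY_DELAY_SECONDS:-30}"  # pause between attempts
  local attempt=1
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Usage (requires owilix and the remotes configured as above):
#   retry 5 owilix --remotes +mys3,lrz,it4i remote pull \
#     all:2025-04-15#14/collectionName=main push_to_remote=mys3 num_threads=10
```

A bounded number of attempts keeps a persistent failure, such as wrong credentials, from looping forever.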

## 3. Working with datasets in your S3 bucket

By specifying `--remotes mys3`, all `owilix` commands run against your S3 bucket. Some examples:

### Querying your S3 datasets

List all datasets:

`owilix --remotes mys3 remote ls`


### Slicing and creating new datasets
   
Slicing refers to the process of creating new datasets by materializing (several) queries over existing datasets. 

-  Step 1: Create slices on your local machine from existing datasets on your S3 bucket:
   
   `owilix --remotes mys3 query slice --remote all:latest/collectionName=main creator="Max Mustermann, MyS3Orga" "where=url_suffix='at'" prefetch=200 collection_name="all_of_austria"`
   
   By selecting only the remote `mys3` with `--remotes`, we limit the query, and thus the dataset creation, to the datasets stored there.
   Note that `--remote` (without the s) is part of the query command and lets you select the remote and local datasets to be queried / sliced.
- Step 1a (optional): Inspect the local slice
- Step 2: Push the slice to the remote (TODO)
- Step X: Delete the slices.
  If you are not happy with the slices, you can easily delete them, or your whole collection, with
  - locally: `owilix local remove all/collectionName=all_of_austria`
  - remotely: `owilix --remotes mys3 remote remove all/collectionName=all_of_austria`
  

```{note}
Currently, slicing requires creating the new dataset on a local machine and then uploading it.
This might change in the future.
```


