# Tutorial 12: Uploading Your Own Datasets

OpenWebSearch.eu allows you to upload and share your datasets through the [LEXIS Platform](), provided by our HPC partners.

---

## Prerequisites

* Install [py4lexis](https://opencode.it4i.eu/lexis-platform/clients/py4lexis) (`pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple`)
* Have an account with permission to use **OpenWebSearch.eu** project resources

👉 Please contact us if you need access and request access in the [Lexis Portal](https://portal.lexis.tech/) as follows:
1. Login to the Lexis Portal using the [B2Access Federator](https://b2access.eudat.eu/). You should be able to use your academic home login, ORCID or social login, although we encourage the former too.
2. Go to [Projects](https://portal.lexis.tech/projects) and hit Request Project Access. Fill in "openwebsearch" in the form
3. contact us separately via E-Mail / [or Mattermost](https://openwebsearch.eu/community/) to give us a reason to give you access.
---

## Step 1 – Login

Log in with **py4lexis**:

```bash
python -m py4lexis.cli session login-url
```

---

## Step 2 – Choose a Storage Location

Determine the available storages for the `openwebsearch` project:

```bash
python -m py4lexis.cli datasets get-project-storages openwebsearch
```

---

## Step 3 – Prepare Metadata (DataCite JSON)

Fill out a [DataCite](https://datacite.org/) JSON template. A minimal example looks like this:

```json
{
  "creators": [
    {
      "name": "THE CREATOR"
    }
  ],
  "titles": [
    {
      "lang": "en",
      "title": "THE TITLE"
    }
  ],
  "publisher": {
    "name": "THE CREATOR"
  },
  "types": {
    "resourceType": "Dataset",
    "resourceTypeGeneral": "Dataset"
  },
  "publicationYear": 2025
}
```

You can also generate a minimal JSON via Python:

```python
from py4lexis.models.datacite_model import Datacite
print(Datacite.get_basic_data("THE CREATOR", "THE TITLE"))
```

---

## Step 4 – (Optional) Provide a More Complete DataCite JSON

We recommend using a more complete DataCite JSON (this can also be edited later via the portal):

```json
{
  "creators": [
    {
      "name": "THE CREATOR"
    }
  ],
  "titles": [
    {
      "lang": "en",
      "title": "THE TITLE"
    }
  ],
  "publisher": {
    "name": "THE CREATOR"
  },
  "types": {
    "resourceType": "Dataset",
    "resourceTypeGeneral": "Dataset"
  },
  "publicationYear": 2025,
  "dates": [
    {
      "date": "2025-06-09T00:00:00",
      "dateType": "Updated"
    },
    {
      "date": "2025-06-09T00:00:00",
      "dateType": "Other",
      "dateInformation": "startDate"
    },
    {
      "date": "2025-06-09T00:00:00",
      "dateType": "Other",
      "dateInformation": "endDate"
    }
  ],
  "fundingReferences": [
    {
      "funderName": "European Commission",
      "funderIdentifier": "https://doi.org/10.13039/501100000780",
      "funderIdentifierType": "Crossref Funder ID",
      "awardNumber": "101070014",
      "awardUri": "https://cordis.europa.eu/project/id/101070014",
      "awardTitle": "Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty"
    }
  ]
}
```

👉 `objectCount` should be set to the number of rows if available.

---

## Step 5 – Create the Dataset

Add additional metadata for **findability**:

```json
"additionalMetadata": {
  "collectionName": "special",
  "totalSize": 192781845880,
  "fileCount": 3361,
  "objectCount": 59215037
}
```

Command template:

```bash
python -m py4lexis.cli datasets create-dataset project openwebsearch "iRODS IT4I" "iRODS LEXIS V2" --title "THE TITLE" --datacite '{
  "creators": [{"name": "THE CREATOR"}],
  "titles": [{"lang": "en", "title": "THE TITLE"}],
  "publisher": {"name": "THE CREATOR"},
  "types": {"resourceType": "owix", "resourceTypeGeneral": "Dataset"},
  "publicationYear": 2025,
  "dates": [
    {"date": "2025-06-09T00:00:00", "dateType": "Updated"},
    {"date": "2025-06-09T00:00:00", "dateType": "Other", "dateInformation": "startDate"},
    {"date": "2025-06-09T00:00:00", "dateType": "Other", "dateInformation": "endDate"}
  ],
  "fundingReferences": [
    {
      "funderName": "European Commission",
      "funderIdentifier": "https://doi.org/10.13039/501100000780",
      "funderIdentifierType": "Crossref Funder ID",
      "awardNumber": "101070014",
      "awardUri": "https://cordis.europa.eu/project/id/101070014",
      "awardTitle": "Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty"
    }
  ]
}' --additional-metadata '{
  "collectionName": "special",
  "totalSize": 192781845880,
  "fileCount": 3361,
  "objectCount": 59215037
}'
```

* Use `project` for restricted datasets (project members only).
* Use `public` to make the dataset visible to all users.

The command will return a unique dataset ID, e.g.:

```
The dataset was successfully created with dataset ID: 4322b10c-8fa6-11f0-b931-c687956b5905
```

You can also find the dataset ID in the [Lexis Portal → Datasets → Copy ID](https://portal.lexis.tech/uploadDataSets).

---

## Step 6 – Review and Edit Metadata

Go to the [Lexis Portal → Uploads](https://portal.lexis.tech/uploadDataSets) to review your dataset.

* Metadata can be edited there (e.g., license, related publications).
* Publications should be listed under `relatedIdentifiers`.

![](figures/py4lexis-datasets.png)

---

## Step 7 – Upload Files

Example for uploading a file:

```bash
python -m py4lexis.cli datasets tus-uploader-rewrite 4322b10c-8fa6-11f0-b931-c687956b5905 openwebsearch cc/myfile.parquet "iRODS IT4I" "iRODS LEXIS V2" --file-path . --destination-path /
```

* This works for archives and single files.
* Directory upload is under development, but you can script uploads (Python or shell).

---

## Step 8 – Upload Directories (Optional, Faster)

If your firewall allows iRODS connections, you can upload directories directly:

```bash
python -m py4lexis.cli lexis-irods upload-directory-to-dataset cc project openwebsearch 4322b10c-8fa6-11f0-b931-c687956b5905
```

Here we assume that `cc` is a subfolder of the current directory.


---

## Step 9 – Final Check

After upload:

* Inspect your dataset in the [Lexis Portal](https://portal.lexis.tech/uploadDataSets).
* Optionally ask the OpenWebSearch team to **star** your dataset at the OpenWebIndex portal.
* Public datasets can be linked via:

```
https://portal.lexis.tech/publicDataSets/<DATASETID>/details
```

---

# Example Datasets Uploaded

## Coral-NLP/German-corpus

The Coral-NLP/German-Corpus is a comprehensive collection of German-language text data under open licenses for training German language models. 
It has been uploaded to [Huggingface](https://huggingface.co/datasets/coral-nlp/german-commons), but is
also shared via the Lexis Plattform in OpenWebSearch.eu

```bash
# Download the datasets
hf download "coral-nlp/german-commons" --repo-type dataset --local-dir ./german_commons
```

Note that you need to have a login with [B2ACCESS](https://b2access.eudat.eu/home/authentication) or any other identity provider supported by Lexis. 
If you have none, B2ACCESS allows to create one via social login (ORCID, Github, etc.)

```bash
# Login to the Lexis Portal
python -m py4lexis.cli.session login-url
#....you will get an inner console....from there, create the dataset
python -m py4lexis.cli datasets create-dataset public openwebsearch "iRODS IT4I" "iRODS LEXIS V2" --title "German Commons" --datacite '{
  "creators": [
    {
      "name": "Lukas Gienapp",
      "givenName": "Lukas",
      "familyName": "Gienapp",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0001-5707-3751",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "University of Kassel"
        },
        {
          "name": "Hessian Center for Artificial Intelligence (hessian.AI)"
        },
        {
          "name": "Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)"
        }
      ]
    },
    {
      "name": "Christopher Schröder",
      "givenName": "Christopher",
      "familyName": "Schröder",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0002-7081-8495",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "Institut für Angewandte Informatik e.V. (InfAI)"
        },
        {
          "name": "Leipzig University"
        }
      ]
    },
    {
      "name": "Stefan Schweter",
      "givenName": "Stefan",
      "familyName": "Schweter",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0002-7190-2090",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ]
    },
    {
      "name": "Christopher Akiki",
      "givenName": "Christopher",
      "familyName": "Akiki",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0002-1634-5068",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "Leipzig University"
        },
        {
          "name": "Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)"
        }
      ]
    },
    {
      "name": "Ferdinand Schlatt",
      "givenName": "Ferdinand",
      "familyName": "Schlatt",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0002-6032-909X",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "Friedrich Schiller University Jena"
        }
      ]
    },
    {
      "name": "Arden Zimmermann",
      "givenName": "Arden",
      "familyName": "Zimmermann",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0001-6470-2384",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "Deutsche Nationalbibliothek"
        }
      ]
    },
    {
      "name": "Philippe Genêt",
      "givenName": "Philippe",
      "familyName": "Genêt",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0001-6470-2384",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "Deutsche Nationalbibliothek"
        }
      ]
    },
    {
      "name": "Martin Potthast",
      "givenName": "Martin",
      "familyName": "Potthast",
      "nameIdentifiers": [
        {
          "nameIdentifier": "0000-0003-2451-0665",
          "nameIdentifierScheme": "ORCID",
          "schemeUri": "https://orcid.org"
        }
      ],
      "affiliation": [
        {
          "name": "University of Kassel"
        },
        {
          "name": "Hessian Center for Artificial Intelligence (hessian.AI)"
        },
        {
          "name": "Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)"
        }
      ]
    }
  ],
  "publisher": {
    "name": "CORAL Project"
  },
  "titles": [
    {
      "title": "German Commons",
      "lang": "en"
    }
  ],
  "relatedIdentifiers": [     {
      "relatedIdentifier": "https://huggingface.co/datasets/coral-nlp/german-commons",
      "relatedIdentifierType": "URL",
      "relationType": "IsDocumentedBy",
      "resourceTypeGeneral": "Dataset"
      }],
  "publicationYear": 2025,
  "language": "de",
  "types": {
    "resourceType": "Dataset",
    "resourceTypeGeneral": "Dataset"
  },
  "rightsList": [
    {
      "rights": "Open Data Commons Attribution License v1.0",
      "rightsUri": "https://opendatacommons.org/licenses/by/1-0/"
    }
  ],
  "descriptions": [
    {
      "description": "Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of permissively-licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in permissively licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.",
      "descriptionType": "Abstract"
    }
  ],
  "fundingReferences": [
    {
      "funderName": "Federal Ministry of Education and Research (BMFTR)",
      "awardNumber": "01IS24077A",
      "awardTitle": "CORAL"
    },
    {
      "funderName": "Federal Ministry of Education and Research (BMFTR)",
      "awardNumber": "01IS24077B",
      "awardTitle": "CORAL"
    },
    {
      "funderName": "Federal Ministry of Education and Research (BMFTR)",
      "awardNumber": "01IS24077D",
      "awardTitle": "CORAL"
    }
  ]
}' --additional-metadata '{
  "collectionName": "corpora",
  "totalSize": 247839226,
  "fileCount": 3521
}'

```

Note that in the step before you get a dataset id, that has to be used in the next step.

```bash
python -m py4lexis.cli lexis-irods upload-directory-to-dataset german_commons/ public openwebsearch c93c60d0-d5a1-11f0-a6f8-f6a03915313d
```

After the upload, you are done and can find the dataset at the LEXIS plattform. 


### Accessing the dataset 

- Logging: `python -m py4lexis.cli.session login-url`
- Viewing the content: `python -m py4lexis.cli.datasets get-content-of-dataset c93c60d0-d5a1-11f0-a6f8-f6a03915313d`
- Download dataset (slow via HTTP): `python -m py4lexis.cli.datasets download-dataset c93c60d0-d5a1-11f0-a6f8-f6a03915313d`
- Download dataset (fast via RPC): `python -m py4lexis.cli lexis-irods download-dataset-as-directory public openwebsearch c93c60d0-d5a1-11f0-a6f8-f6a03915313d`


