{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "582a58a5",
   "metadata": {},
   "source": [
    "# Tutorial 14: Finetuning an LLM with OWI data using the LUMI supercomputer\n",
    "\n",
    "This tutorial demonstrates one way to use OWI data to finetune a large language model (LLM) on the [LUMI supercomputer](https://lumi-supercomputer.eu/). In this example, Finnish-language data is downloaded and used to finetune Meta's [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), with the aim of improving its performance on Finnish-language tasks.\n",
    "\n",
    "The tutorial consists of the following steps:\n",
    "1. Get started on using LUMI.\n",
    "2. Create a [Singularity](https://docs.lumi-supercomputer.eu/software/containers/singularity/#building-apptainersingularity-sif-containers) container for data downloading with the [owilix](https://opencode.it4i.eu/openwebsearcheu-public/owi-cli) command line tool and download the data using this container.\n",
    "3. Preprocess the data and prepare it for training using Jupyter notebooks within the Jupyter environment provided by the LUMI web interface.\n",
    "4. Create a second Singularity container optimized for training with the necessary Python packages for machine learning.\n",
    "5. Create a [batch job](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/) and write Python scripts for training and inference using the training container."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edec5b0a",
   "metadata": {},
   "source": [
    "**OWI License**\n",
    "\n",
    "Before using any data, you must review the terms of [the license](https://openwebsearch.eu/owil-current/).  \n",
    "**The model trained with OWI data may only be used for research purposes.**\n",
    "\n",
    "**Disclaimer**\n",
    "\n",
    "Please note that this is a technical guide only and does not constitute a legal assessment of whether or how you may use the data.\n",
    "\n",
    "This work uses index files as part of the index partition created by the OpenWebSearch.eu project that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070014 (OpenWebSearch.EU)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09d47029-c961-46fa-b64b-57786b7e3af4",
   "metadata": {},
   "source": [
    "# 1. Get started on using LUMI\n",
    "\n",
    "**To begin using the LUMI supercomputer, follow these steps (a valid project is required):**\n",
    "1. [Get a user account](https://docs.lumi-supercomputer.eu/firststeps/accessLUMI/)\n",
    "2. [Set up an SSH key pair to be able to use LUMI from a terminal](https://docs.lumi-supercomputer.eu/firststeps/SSH-keys/#__tabbed_2_1)\n",
    "3. [Log in to LUMI with an SSH client](https://docs.lumi-supercomputer.eu/firststeps/loggingin/)\n",
    "\n",
    "```bash\n",
    "ssh -i <path-to-private-key> <username>@lumi.csc.fi\n",
    "```\n",
    "\n",
    "## 1.1. Where to store data - disk areas\n",
    "\n",
    "Each user has a home directory (`$HOME`) that can contain up to 20 GB of data. Do not use it for data or code; use `/project` or `/scratch` instead. For more about the different disk areas, see https://docs.lumi-supercomputer.eu/storage/#where-to-store-data\n",
    "\n",
    "## 1.2. Installing Python packages\n",
    "\n",
    "Installing packages directly via `pip` or `conda` is not recommended, as it puts a lot of strain on LUMI's Lustre file system. Instead, users should use Singularity/Apptainer containers. Please also see the official guidance on installing Python packages in the [LUMI software guide](https://docs.lumi-supercomputer.eu/software/installing/python/)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "816ed5a6",
   "metadata": {},
   "source": [
    "# 2. Get the data\n",
    "\n",
    "Let's create a Singularity container using [`cotainr`](https://cotainr.readthedocs.io/en/stable/) so that we can download the data with the owilix command line tool.\n",
    "\n",
    "## 2.1. Container for owilix\n",
    "\n",
    "First, we specify the packages to install in a conda environment `.yml` file. Then we use `cotainr` to build a new container with those packages. In this case we need `owilix` (requires Python 3.10 or 3.11) and `py4lexis`.\n",
    "\n",
    "**Create the `owilix_env.yml` file:**\n",
    "```yml\n",
    "name: owilix_env\n",
    "channels:\n",
    "  - conda-forge\n",
    "dependencies:\n",
    "  - python=3.11\n",
    "  - pip=24.0\n",
    "  - pip:\n",
    "    - --extra-index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple\n",
    "    - --extra-index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple\n",
    "    - py4lexis\n",
    "    - owilix\n",
    "```\n",
    "\n",
    "\n",
    "**In a LUMI terminal (note: building the container takes several minutes):**\n",
    "```bash\n",
    "# Get needed modules\n",
    "module purge\n",
    "module load LUMI\n",
    "module load cotainr\n",
    "\n",
    "# Use cotainr to build the container\n",
    "cotainr build owilix_env.sif --system=lumi-g --conda-env=owilix_env.yml --accept-license\n",
    "\n",
    "# Add the required additional bindings\n",
    "module use /appl/local/containers/ai-modules/\n",
    "module load singularity-AI-bindings\n",
    "\n",
    "# Verify installation\n",
    "singularity exec owilix_env.sif bash -c 'pip list'\n",
    "```\n",
    "\n",
    "## 2.2. Use owilix to download the data\n",
    "\n",
    "In this example, we will download the latest Finnish data. We'll open a shell in the container and run owilix to download the data to the desired directory.\n",
    "\n",
    "1. Run a shell within the container:\n",
    "\n",
    "`singularity shell owilix_env.sif`\n",
    "\n",
    "2. Download the data using owilix (remember to set the target directory!): \n",
    "\n",
    "`owilix --target <target-directory-for-the-data> remote pull all:latest#30 files=\"**/language=fin/*\"`\n",
    "\n",
    "**Complete the authentication:**  \n",
    "\n",
    "You will be prompted to accept the terms by typing `yes`. Then copy the web address that appears in the terminal, open it in your browser, and log in to complete the authentication process."
   ]
  },
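  {
   "cell_type": "markdown",
   "id": "a7c1e4d2",
   "metadata": {},
   "source": [
    "Once the download finishes, it can be useful to confirm that Parquet files actually arrived before moving on. The following is a minimal sketch and not part of owilix: the helper name `summarize_parquet_files` is hypothetical, and it only assumes that the download directory contains `.parquet` files somewhere below it, as the loading code in section 3.2 also does.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "def summarize_parquet_files(target_dir):\n",
    "    # Hypothetical helper: count the .parquet files below target_dir and sum their sizes\n",
    "    files = sorted(Path(target_dir).rglob('*.parquet'))\n",
    "    total_bytes = sum(f.stat().st_size for f in files)\n",
    "    return len(files), total_bytes\n",
    "\n",
    "# Placeholder path: replace it with the directory passed to `owilix --target`\n",
    "count, total_bytes = summarize_parquet_files('<target-directory-for-the-data>')\n",
    "print(f'{count} parquet files, {total_bytes / 1e9:.2f} GB')\n",
    "```\n",
    "\n",
    "A count of zero usually means the path is wrong or the download did not complete."
   ]
  },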
  {
   "cell_type": "markdown",
   "id": "da282f60-9474-401b-92fb-9f38caa90dab",
   "metadata": {},
   "source": [
    "# 3. Preprocess data\n",
    "\n",
    "Now we are ready to preprocess the data using Jupyter. Note that the preprocessing demonstrated in this tutorial is minimal. **You should consider more thorough preprocessing for your specific use case!**\n",
    "\n",
    "## 3.1. Activate a Jupyter session\n",
    "We’ll use the Jupyter environment provided by the LUMI web interface:\n",
    "1. Navigate to **Apps** -> **Jupyter**\n",
    "2. Configure the session with the following settings:\n",
    "    - Project: project_XXXXXX\n",
    "    - Partition: small\n",
    "    - Number of CPU cores: 64\n",
    "    - Memory (GiB): 128\n",
    "    - Working directory: Select from the dropdown\n",
    "    - Python: pytorch\n",
    "3. Wait for your session to be ready, then click `Connect to Jupyter`\n",
    "\n",
    "Once connected, create a notebook and proceed with the preprocessing steps."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d74f2db-5c33-497f-bb21-f4c85c0630ad",
   "metadata": {},
   "source": [
    "## 3.2. Combine all data into a DataFrame\n",
    "\n",
    "Next, we’ll load and combine all downloaded data files into a single pandas DataFrame. After all the preprocessing steps, we'll save the result as a .parquet file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8c31740f-9b3c-40d0-9d01-8150091ba733",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Directory only: the recursive '**/*.parquet' pattern is appended in the glob call below\n",
    "path_to_owilix_data = '<target-directory-for-the-data-from-owilix>'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "16e54d2d-9aa4-49a2-944d-f385eb2ee71e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import glob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b2f2e56-7d5b-4d85-bac3-1d41b19adcd2",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Combined DataFrame shape: (1256577, 43)\n"
     ]
    }
   ],
   "source": [
    "parquet_files = glob.glob(os.path.join(path_to_owilix_data, \"**/*.parquet\"), recursive=True)\n",
    "\n",
    "dataframes = []\n",
    "for file in parquet_files:\n",
    "    try:\n",
    "        df = pd.read_parquet(file)\n",
    "        dataframes.append(df)\n",
    "    except Exception as e:\n",
    "        print(f\"Error reading {file}: {e}\")\n",
    "\n",
    "# Combine all DataFrames\n",
    "combined_df = pd.concat(dataframes, ignore_index=True)\n",
    "print(f\"Combined DataFrame shape: {combined_df.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "67019a05-875f-4687-8a17-e038e63eb35b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['id', 'record_id', 'title', 'main_content', 'json-ld', 'microdata',\n",
       "       'opengraph', 'warc_date', 'warc_ip', 'url', 'url_scheme', 'url_path',\n",
       "       'url_params', 'url_query', 'url_fragment', 'url_subdomain',\n",
       "       'url_domain', 'url_suffix', 'url_is_private', 'mime_type', 'charset',\n",
       "       'content_type_other', 'http_server', 'valid', 'warc_file',\n",
       "       'warc_offset', 'schema_metadata', 'ows_canonical', 'ows_resource_type',\n",
       "       'ows_curlielabel', 'ows_index', 'ows_genai', 'ows_genai_details',\n",
       "       'ows_fetch_response_time', 'ows_fetch_num_errors', 'outgoing_links',\n",
       "       'image_links', 'video_links', 'iframes', 'curlielabels',\n",
       "       'curlielabels_en', 'address', 'plain_text'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df.columns "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1cde09e-2497-4ea7-915d-47d8f42998d7",
   "metadata": {},
   "source": [
    "## 3.3. Filter the content\n",
    "\n",
    "In this step, we combine the `plain_text` and `main_content` columns, because the text column `plain_text` (schema 0.1.X) was renamed to `main_content` in schema version 0.2.X and the downloaded files may use either name. For more information about the columns, see the [Preprocessing Pipeline](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline) documentation.\n",
    "\n",
    "| Column | Description | Schema version |\n",
    "| ------ | ----------- | -------------- |\n",
    "| plain_text | Cleaned text from the HTML | 0.1.X |\n",
    "| main_content | Main content of the HTML, formatted with minimal HTML tags (`h1-6`, `p`, `ul/ol/li`, `pre`, and `a` tags) | 0.2.X |\n",
    "\n",
    "We will then proceed with the following steps:\n",
    "1. Keep only rows where `ows_genai == True`\n",
    "2. Remove duplicates based on `main_content` and `url`\n",
    "3. Filter and clean the `main_content`\n",
    "4. Drop duplicates again after cleaning\n",
    "5. Filter by word count\n",
    "6. Double-check the language with [langdetect](https://pypi.org/project/langdetect/)"
   ]
  },
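  {
   "cell_type": "markdown",
   "id": "b3f09c61",
   "metadata": {},
   "source": [
    "To make the first steps concrete, here is a minimal sketch on a toy DataFrame. The five rows and their values are invented for illustration; only the column names come from the OWI data. It merges `plain_text` into `main_content`, keeps rows flagged with `ows_genai`, and drops duplicates first by content and then by URL.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy rows (invented): row 0 only has plain_text, row 3 repeats the content of row 1,\n",
    "# row 2 repeats the URL of row 1, and row 4 is not flagged as GenAI-suitable\n",
    "toy = pd.DataFrame({\n",
    "    'url': ['https://a.example/1', 'https://a.example/2', 'https://a.example/2',\n",
    "            'https://b.example/', 'https://c.example/'],\n",
    "    'main_content': [None, '<p>Hei maailma.</p>', '<p>Toinen teksti.</p>',\n",
    "                     '<p>Hei maailma.</p>', '<p>Kolmas teksti.</p>'],\n",
    "    'plain_text': ['Vanha skeema.', None, None, None, None],\n",
    "    'ows_genai': [True, True, True, True, False],\n",
    "})\n",
    "\n",
    "# Fill gaps in main_content from plain_text, then drop the helper column\n",
    "toy['main_content'] = toy['main_content'].fillna(toy['plain_text'])\n",
    "toy = toy[toy['main_content'].notna()].drop(columns=['plain_text'])\n",
    "\n",
    "# Keep flagged rows and remove duplicates by content, then by URL\n",
    "toy = toy[toy['ows_genai'] == True]\n",
    "toy = toy.drop_duplicates(subset='main_content').drop_duplicates(subset='url')\n",
    "print(toy[['url', 'main_content']])\n",
    "```\n",
    "\n",
    "Only rows 0 and 1 remain. `drop_duplicates` keeps the first occurrence by default, so the order of the rows decides which duplicate survives."
   ]
  },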
  {
   "cell_type": "markdown",
   "id": "1084bba5-4e87-46c8-981b-b8407bf1f255",
   "metadata": {},
   "source": [
    "### 3.3.1. Use ows_genai and drop duplicates\n",
    "\n",
    "Prepare the downloaded OWI data for training by:\n",
    "- Combining content fields and removing empty entries\n",
    "- Filtering for GenAI-suitable content (`ows_genai = True`)  \n",
    "- Removing duplicates by content and URL\n",
    "- Selecting final columns: `title`, `url`, `main_content`\n",
    "\n",
    "Progress is tracked by printing dataset shape after each step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c39c37fe-4716-4c22-8b44-17d2af658bff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DataFrame shape before first steps: (1256577, 43)\n",
      "DataFrame shape after combining main_content and plain_text: (1256577, 42)\n",
      "DataFrame shape after ows_genai: (1248081, 42)\n",
      "DataFrame shape after dropping dups (main_content): (466222, 42)\n",
      "DataFrame shape after dropping dups (url): (220210, 42)\n",
      "DataFrame shape after all steps: (220210, 3)\n"
     ]
    }
   ],
   "source": [
    "print(f\"DataFrame shape before first steps: {combined_df.shape}\")\n",
    "\n",
    "# Fill missing values in 'main_content' with values from 'plain_text'\n",
    "combined_df['main_content'] = combined_df['main_content'].fillna(combined_df['plain_text'])\n",
    "\n",
    "# Drop rows where 'main_content' is still missing and remove the now-unneeded 'plain_text' column\n",
    "combined_df = combined_df[combined_df[\"main_content\"].notna()].drop(columns=[\"plain_text\"])\n",
    "print(f\"DataFrame shape after combining main_content and plain_text: {combined_df.shape}\")\n",
    "\n",
    "# Keep only rows where 'ows_genai' is True\n",
    "combined_df = combined_df[combined_df['ows_genai'] == True]\n",
    "print(f\"DataFrame shape after ows_genai: {combined_df.shape}\")\n",
    "\n",
    "# Remove duplicate rows based on 'main_content', then based on 'url'\n",
    "combined_df = combined_df.drop_duplicates(subset='main_content')\n",
    "print(f\"DataFrame shape after dropping dups (main_content): {combined_df.shape}\")\n",
    "combined_df = combined_df.drop_duplicates(subset='url')\n",
    "print(f\"DataFrame shape after dropping dups (url): {combined_df.shape}\")\n",
    "\n",
    "# Select only the relevant columns for further processing\n",
    "combined_df = combined_df[['title','url','main_content']]\n",
    "print(f\"DataFrame shape after all steps: {combined_df.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "50eae98c-9023-47bb-8393-b2c3c5b2ec46",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>main_content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>forum.bomber.fi - Omat asetukset - Käyttöehdot</td>\n",
       "      <td>https://www.bomber.fi/forums/user/terms?sid=cd...</td>\n",
       "      <td>&lt;h2&gt;forum.bomber.fi - Käyttöehdot&lt;/h2&gt;\\n\\n&lt;p&gt;K...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Yhteystiedot - Mustasaaren seurakuntayhtymä</td>\n",
       "      <td>https://www.mustasaarenseurakuntayhtyma.fi/yht...</td>\n",
       "      <td>&lt;h1&gt;Yhteystiedot&lt;/h1&gt;\\n\\n&lt;p&gt; &lt;/p&gt;\\n\\n&lt;h4&gt;Musta...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>VAELLUSNET - Vaellusturinat II - Omat asetukse...</td>\n",
       "      <td>http://www.vaellusnet.com/ucp.php?mode=terms&amp;s...</td>\n",
       "      <td>&lt;h2&gt;VAELLUSNET - Vaellusturinat II - Käyttöehd...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "0     forum.bomber.fi - Omat asetukset - Käyttöehdot   \n",
       "1        Yhteystiedot - Mustasaaren seurakuntayhtymä   \n",
       "2  VAELLUSNET - Vaellusturinat II - Omat asetukse...   \n",
       "\n",
       "                                                 url  \\\n",
       "0  https://www.bomber.fi/forums/user/terms?sid=cd...   \n",
       "1  https://www.mustasaarenseurakuntayhtyma.fi/yht...   \n",
       "2  http://www.vaellusnet.com/ucp.php?mode=terms&s...   \n",
       "\n",
       "                                        main_content  \n",
       "0  <h2>forum.bomber.fi - Käyttöehdot</h2>\\n\\n<p>K...  \n",
       "1  <h1>Yhteystiedot</h1>\\n\\n<p> </p>\\n\\n<h4>Musta...  \n",
       "2  <h2>VAELLUSNET - Vaellusturinat II - Käyttöehd...  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c718e618-b615-4f79-a7b2-5bf5613eb6eb",
   "metadata": {},
   "source": [
    "### 3.3.2. Filter HTML content\n",
    "\n",
    "This code performs minimal cleaning of the `main_content` field. You can list terms (such as policy-related keywords) in `POLICY_TERMS`; any page whose content contains one of them is excluded entirely.\n",
    "\n",
    "The function performs the following:\n",
    "* Removes `<a>` tags but keeps the inner text\n",
    "* Replaces block-level HTML tags (`<p>`, `<h1>–<h6>`, etc.) and `<br>` with newlines\n",
    "* Cleans up HTML entities and removes bullet symbols\n",
    "* Filters out short or incomplete lines (e.g. no punctuation, too few words)\n",
    "* Normalizes whitespace and joins the cleaned lines into a final text block\n",
    "* Returns `None` if no meaningful content remains"
   ]
  },
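  {
   "cell_type": "markdown",
   "id": "c5d28e7a",
   "metadata": {},
   "source": [
    "The line-level rule in the list above (keep a line only if it ends in terminal punctuation and is long enough) can be illustrated in isolation. This is a simplified sketch, not the cleaning function itself: `keep_line` is a hypothetical helper, and it uses `str.endswith` where the tutorial code uses a precompiled regular expression. The thresholds of 20 characters and 4 words are the ones used in `clean_html_min`.\n",
    "\n",
    "```python\n",
    "def keep_line(line, min_chars=20, min_words=4):\n",
    "    # Hypothetical helper mirroring the line filter: terminal punctuation plus minimum length\n",
    "    line = line.strip()\n",
    "    if not line.endswith(('.', '!', '?')):\n",
    "        return False\n",
    "    return len(line) >= min_chars and len(line.split()) >= min_words\n",
    "\n",
    "samples = [\n",
    "    'Siirry sisältöön',                                  # navigation link, no punctuation\n",
    "    'Kae Araki',                                         # heading, too short\n",
    "    'Voit auttaa Wikipediaa laajentamalla artikkelia.',  # full sentence\n",
    "    'Lue lisää.',                                        # punctuation, but too short\n",
    "]\n",
    "kept = [s for s in samples if keep_line(s)]\n",
    "print(kept)\n",
    "```\n",
    "\n",
    "Only the full sentence passes. The rule is deliberately strict: it also discards headings and list items that carry real content, which is one reason to consider more thorough preprocessing for your own use case."
   ]
  },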
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c9964b5c-c17d-4dfe-994b-9efb09dd00e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import html\n",
    "\n",
    "# Terms to exclude early (e.g., policy pages)\n",
    "POLICY_TERMS = [\"käyttöeh\"]\n",
    "\n",
    "# Precompiled regex patterns\n",
    "A_TAG = re.compile(r'<a\\b[^>]*?>(.*?)</a>', flags=re.IGNORECASE | re.DOTALL)\n",
    "BLOCK_TAGS = re.compile(r'</?(h[1-6]|p|pre|ul|ol|li|div)>', flags=re.IGNORECASE)\n",
    "BR_TAG = re.compile(r'<br\\s*/?>', flags=re.IGNORECASE)\n",
    "TAG_CLEANER = re.compile(r'<[^>]+>')  # fallback to remove leftover tags\n",
    "\n",
    "TERMINAL_PUNCT_PATTERN = re.compile(r'[.!?]\\s*$')\n",
    "WHITESPACE_PATTERNS = {\n",
    "    'multiple_newlines': re.compile(r\"\\n{3,}\"),\n",
    "    'spaces': re.compile(r\"[ \\t]+\"),\n",
    "    'trailing_spaces': re.compile(r\" +\\n\")\n",
    "}\n",
    "\n",
    "def clean_html_min(html_str: str):\n",
    "    if not html_str or not html_str.strip():\n",
    "        return None\n",
    "\n",
    "    # Early policy term check\n",
    "    html_lower = html_str.lower()\n",
    "    if any(term in html_lower for term in POLICY_TERMS):\n",
    "        return None\n",
    "\n",
    "    # Unwrap <a> tags but keep inner text\n",
    "    html_str = A_TAG.sub(r'\\1', html_str)\n",
    "\n",
    "    # Replace <br> and block-level tags with newlines\n",
    "    html_str = BR_TAG.sub('\\n', html_str)\n",
    "    html_str = BLOCK_TAGS.sub('\\n', html_str)\n",
    "\n",
    "    # Remove all remaining tags (non-block level)\n",
    "    html_str = TAG_CLEANER.sub('', html_str)\n",
    "\n",
    "    # Decode HTML entities (e.g. &quot; → \", &nbsp; → space)\n",
    "    html_str = html.unescape(html_str)\n",
    "    html_str = html_str.replace('\\xa0', ' ')  # additional non-breaking space cleanup\n",
    "\n",
    "    # Remove common bullet symbols\n",
    "    html_str = re.sub(r'[•◦\\u2022]', '', html_str)\n",
    "\n",
    "    # Normalize and filter lines\n",
    "    lines = [line.strip() for line in html_str.split('\\n') if line.strip()]\n",
    "    cleaned_lines = []\n",
    "\n",
    "    for line in lines:\n",
    "        # Must end in terminal punctuation\n",
    "        if not TERMINAL_PUNCT_PATTERN.search(line):\n",
    "            continue\n",
    "\n",
    "        # Must be long enough\n",
    "        if len(line) < 20 or len(line.split()) < 4:\n",
    "            continue\n",
    "\n",
    "        cleaned_lines.append(line)\n",
    "\n",
    "    if not cleaned_lines:\n",
    "        return None\n",
    "\n",
    "    # Join and normalize whitespace\n",
    "    cleaned_text = '\\n'.join(cleaned_lines)\n",
    "    cleaned_text = WHITESPACE_PATTERNS['multiple_newlines'].sub(\"\\n\\n\", cleaned_text)\n",
    "    cleaned_text = WHITESPACE_PATTERNS['spaces'].sub(\" \", cleaned_text)\n",
    "    cleaned_text = WHITESPACE_PATTERNS['trailing_spaces'].sub(\"\\n\", cleaned_text)\n",
    "\n",
    "    return cleaned_text.strip() if cleaned_text.strip() else None\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc7a9513-e08f-4244-963f-410c44c18a34",
   "metadata": {},
   "source": [
    "### 3.3.3. Example of a site before preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "0633497d-db45-489f-918f-2b6902d13aae",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<a href=\"#bodyContent\">Siirry sisältöön</a>\n",
      "\n",
      "<h1>Kae Araki</h1>\n",
      "\n",
      "Wikipediasta\n",
      "\n",
      "<p>Kae Araki (<a href=\"/wiki/Japanin_kieli\">jap.</a> 荒木香恵, oikealta nimeltään Kae Abe, s. <a href=\"/wiki/6._marraskuuta\">6. marraskuuta</a> <a href=\"/wiki/1966\">1966</a> <a href=\"/wiki/Osaka\">Osaka</a>) on <a href=\"/wiki/Japani\">japanilainen</a> <a href=\"/wiki/Seiy%C5%AB\">ääninäyttelijä</a>, <a href=\"/wiki/Seiy%C5%AB\">seiyū</a>, joka on näytellyt monissa <a href=\"/wiki/Anime\">anime</a>- ja <a href=\"/wiki/Televisio\">televisiosarjoissa</a>, muun muassa <a href=\"/wiki/Babar\">Babar</a>, <a href=\"/wiki/Cardcaptor_Sakura\">Cardcaptor Sakura</a>, <a href=\"/wiki/Digimon\">Digimon</a>, <a href=\"/w/index.php?title=Fushigi_y%C5%ABgi&amp;action=edit&amp;redlink=1\">Fushigi yūgi</a>, <a href=\"/wiki/Great_Teacher_Onizuka\">Great Teacher Onizuka</a>, <a href=\"/wiki/Kodomo_no_omocha\">Kodomo no omocha</a>, <a href=\"/w/index.php?title=Wakakusa_monogatari_%E2%80%93_Nan_to_Jo_no_sensei&amp;action=edit&amp;redlink=1\">Wakakusa monogatari – Nan to Jo no sensei</a> ja <a href=\"/wiki/Pok%C3%A9mon\">Pokémon</a>. Animesarjojen lisäksi hän on esiintynyt monissa peleissä. </p>\n",
      "\n",
      "<h2>Aiheesta muualla</h2>\n",
      "\n",
      "[<a href=\"/w/index.php?title=Kae_Araki&amp;veaction=edit&amp;section=1\">muokkaa</a> | <a href=\"/w/index.php?title=Kae_Araki&amp;action=edit&amp;section=1\">muokkaa wikitekstiä</a>]\n",
      "<ul>\n",
      "  <li><a href=\"https://www.imdb.com/name/nm0032890/\">Kae Araki</a> Internet Movie Databasessa. (englanniksi)</li>\n",
      "</ul>\n",
      "Tämä <a href=\"/wiki/N%C3%A4yttelij%C3%A4\">näyttelijään</a> liittyvä artikkeli on <a href=\"/wiki/Wikipedia:Tynk%C3%A4\">tynkä</a>. Voit auttaa Wikipediaa <a href=\"https://fi.wikipedia.org/w/index.php?title=Kae_Araki&amp;veaction=edit\">laajentamalla</a> artikkelia.<br>\n"
     ]
    }
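   ],
   "source": [
    "# Note: `df` is the loop variable left over from the loading cell (the last file read), not `combined_df`\n",
    "test = df['main_content'].iloc[10]\n",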
    "print(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fcb102c",
   "metadata": {},
   "source": [
    "### 3.3.4. Example of the site after preprocessing\n",
    "\n",
    "This short example illustrates how the HTML cleaning code works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "27b90989-69ce-453b-b272-0886b63e7ddc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Kae Araki (jap. 荒木香恵, oikealta nimeltään Kae Abe, s. 6. marraskuuta 1966 Osaka) on japanilainen ääninäyttelijä, seiyū, joka on näytellyt monissa anime- ja televisiosarjoissa, muun muassa Babar, Cardcaptor Sakura, Digimon, Fushigi yūgi, Great Teacher Onizuka, Kodomo no omocha, Wakakusa monogatari – Nan to Jo no sensei ja Pokémon. Animesarjojen lisäksi hän on esiintynyt monissa peleissä.\n",
      "Tämä näyttelijään liittyvä artikkeli on tynkä. Voit auttaa Wikipediaa laajentamalla artikkelia.\n"
     ]
    }
   ],
   "source": [
    "res = clean_html_min(test)\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "7a107ef7-4914-45f2-af30-38d613a4beaa",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Cleaning HTML content: 100%|██████████| 220210/220210 [00:53<00:00, 4142.45it/s] \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>main_content</th>\n",
       "      <th>cleaned_html_content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>forum.bomber.fi - Omat asetukset - Käyttöehdot</td>\n",
       "      <td>https://www.bomber.fi/forums/user/terms?sid=cd...</td>\n",
       "      <td>&lt;h2&gt;forum.bomber.fi - Käyttöehdot&lt;/h2&gt;\\n\\n&lt;p&gt;K...</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Yhteystiedot - Mustasaaren seurakuntayhtymä</td>\n",
       "      <td>https://www.mustasaarenseurakuntayhtyma.fi/yht...</td>\n",
       "      <td>&lt;h1&gt;Yhteystiedot&lt;/h1&gt;\\n\\n&lt;p&gt; &lt;/p&gt;\\n\\n&lt;h4&gt;Musta...</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>VAELLUSNET - Vaellusturinat II - Omat asetukse...</td>\n",
       "      <td>http://www.vaellusnet.com/ucp.php?mode=terms&amp;s...</td>\n",
       "      <td>&lt;h2&gt;VAELLUSNET - Vaellusturinat II - Käyttöehd...</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Gives me some privacy | Dekottaa</td>\n",
       "      <td>http://www.dekottaa.com/2014/01/gives-me-some-...</td>\n",
       "      <td>&lt;h2&gt;26.1.2014&lt;/h2&gt;\\n\\n&lt;a href=\"\"&gt;\\n\\n&lt;h1&gt; Give...</td>\n",
       "      <td>Liitutaulutarra kanan muodossa. Jos ei halua j...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Suomen Briard ry - Lähetä sähköpostia</td>\n",
       "      <td>http://www.suomenbriard.net/phpBB/memberlist.p...</td>\n",
       "      <td>&lt;h2&gt;Yhteystiedot käyttäjälle&lt;/h2&gt;\\n\\nYlläpitäj...</td>\n",
       "      <td>Tämä viesti lähetetään pelkkänä tekstinä. Älä ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1091706</th>\n",
       "      <td>Vastauspalvelu</td>\n",
       "      <td>https://vastauspalvelu.omataloyhtio.fi/</td>\n",
       "      <td>&lt;a href=\"https://jurinet.fi/\"&gt;Jurinet&lt;/a&gt;\\nKuv...</td>\n",
       "      <td>Taloyhtiömme on asennettu uusi juuri ilmanpois...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1091814</th>\n",
       "      <td>Sound Particles Studio-ohjelmistot - Pikalatau...</td>\n",
       "      <td>https://www.muziker.fi/sound-particles-studio-...</td>\n",
       "      <td>&lt;p&gt; Valitse maa, johon lähetys toimitetaan &lt;/p...</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1091907</th>\n",
       "      <td>Lattialämmityskaapelit - Hammarin Sähkö Oy</td>\n",
       "      <td>https://www.hammarinsahko.fi/sahkotarvikkeet/l...</td>\n",
       "      <td>Luotettavaa kauppaa yli 110 vuotta\\n\\n&lt;h2&gt;Latt...</td>\n",
       "      <td>Lattialämmityskaapelit varaavaan lattialämmity...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1091934</th>\n",
       "      <td>Kotitalousvähennyslaskuri 2025: Laske kotitalo...</td>\n",
       "      <td>https://vertaakorkoja.fi/kotitalousvahennyslas...</td>\n",
       "      <td>&lt;h1&gt;Kotitalousvähennyslaskuri&lt;/h1&gt;\\n\\n&lt;p&gt;Kotit...</td>\n",
       "      <td>Kotitalousvähennyslaskurin avulla voit laskea ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1092309</th>\n",
       "      <td>Ota meihin yhteyttä – Mothersusurrus.com</td>\n",
       "      <td>https://mothersusurrus.com/ota-meihin-yhteytta/</td>\n",
       "      <td>&lt;h1&gt;Ota meihin yhteyttä&lt;/h1&gt;\\n\\n&lt;h4&gt;Mikäli sin...</td>\n",
       "      <td>Mikäli sinulla on kysyttävää musiikista, tai h...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>220210 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                     title  \\\n",
       "0           forum.bomber.fi - Omat asetukset - Käyttöehdot   \n",
       "1              Yhteystiedot - Mustasaaren seurakuntayhtymä   \n",
       "2        VAELLUSNET - Vaellusturinat II - Omat asetukse...   \n",
       "3                         Gives me some privacy | Dekottaa   \n",
       "4                    Suomen Briard ry - Lähetä sähköpostia   \n",
       "...                                                    ...   \n",
       "1091706                                     Vastauspalvelu   \n",
       "1091814  Sound Particles Studio-ohjelmistot - Pikalatau...   \n",
       "1091907         Lattialämmityskaapelit - Hammarin Sähkö Oy   \n",
       "1091934  Kotitalousvähennyslaskuri 2025: Laske kotitalo...   \n",
       "1092309           Ota meihin yhteyttä – Mothersusurrus.com   \n",
       "\n",
       "                                                       url  \\\n",
       "0        https://www.bomber.fi/forums/user/terms?sid=cd...   \n",
       "1        https://www.mustasaarenseurakuntayhtyma.fi/yht...   \n",
       "2        http://www.vaellusnet.com/ucp.php?mode=terms&s...   \n",
       "3        http://www.dekottaa.com/2014/01/gives-me-some-...   \n",
       "4        http://www.suomenbriard.net/phpBB/memberlist.p...   \n",
       "...                                                    ...   \n",
       "1091706            https://vastauspalvelu.omataloyhtio.fi/   \n",
       "1091814  https://www.muziker.fi/sound-particles-studio-...   \n",
       "1091907  https://www.hammarinsahko.fi/sahkotarvikkeet/l...   \n",
       "1091934  https://vertaakorkoja.fi/kotitalousvahennyslas...   \n",
       "1092309    https://mothersusurrus.com/ota-meihin-yhteytta/   \n",
       "\n",
       "                                              main_content  \\\n",
       "0        <h2>forum.bomber.fi - Käyttöehdot</h2>\\n\\n<p>K...   \n",
       "1        <h1>Yhteystiedot</h1>\\n\\n<p> </p>\\n\\n<h4>Musta...   \n",
       "2        <h2>VAELLUSNET - Vaellusturinat II - Käyttöehd...   \n",
       "3        <h2>26.1.2014</h2>\\n\\n<a href=\"\">\\n\\n<h1> Give...   \n",
       "4        <h2>Yhteystiedot käyttäjälle</h2>\\n\\nYlläpitäj...   \n",
       "...                                                    ...   \n",
       "1091706  <a href=\"https://jurinet.fi/\">Jurinet</a>\\nKuv...   \n",
       "1091814  <p> Valitse maa, johon lähetys toimitetaan </p...   \n",
       "1091907  Luotettavaa kauppaa yli 110 vuotta\\n\\n<h2>Latt...   \n",
       "1091934  <h1>Kotitalousvähennyslaskuri</h1>\\n\\n<p>Kotit...   \n",
       "1092309  <h1>Ota meihin yhteyttä</h1>\\n\\n<h4>Mikäli sin...   \n",
       "\n",
       "                                      cleaned_html_content  \n",
       "0                                                     None  \n",
       "1                                                     None  \n",
       "2                                                     None  \n",
       "3        Liitutaulutarra kanan muodossa. Jos ei halua j...  \n",
       "4        Tämä viesti lähetetään pelkkänä tekstinä. Älä ...  \n",
       "...                                                    ...  \n",
       "1091706  Taloyhtiömme on asennettu uusi juuri ilmanpois...  \n",
       "1091814                                               None  \n",
       "1091907  Lattialämmityskaapelit varaavaan lattialämmity...  \n",
       "1091934  Kotitalousvähennyslaskurin avulla voit laskea ...  \n",
       "1092309  Mikäli sinulla on kysyttävää musiikista, tai h...  \n",
       "\n",
       "[220210 rows x 4 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Using apply with progress tracking:\n",
    "from tqdm import tqdm\n",
    "tqdm.pandas(desc=\"Cleaning HTML content\")\n",
    "combined_df['cleaned_html_content'] = combined_df['main_content'].progress_map(clean_html_min)\n",
    "combined_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf674f41-9688-4e6c-b92b-2a044fe6c94a",
   "metadata": {},
   "source": [
    "### 3.3.3. Drop duplicates and None-values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1365f882-1740-46a9-a956-a9104c3c2de2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Df shape before.: (220210, 4)\n",
      "Df shape after: (139074, 4)\n"
     ]
    }
   ],
   "source": [
    "print(f\"Df shape before.: {combined_df.shape}\")\n",
    "combined_df = combined_df.drop_duplicates(subset='cleaned_html_content')\n",
    "combined_df = combined_df.dropna(subset=['cleaned_html_content'])\n",
    "\n",
    "print(f\"Df shape after: {combined_df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fcb60f5-1d23-4b58-8848-1029d6e236d3",
   "metadata": {},
   "source": [
    "### 3.3.4. Filter by word count\n",
    "\n",
    "Next, we calculate the word count for each content entry and filter out any entries with fewer than 30 words. "
   ]
  },
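  {
   "cell_type": "markdown",
   "id": "sketch-word-count-threshold",
   "metadata": {},
   "source": [
    "Note that the comparison is strict: the filter applied in this section uses `> 30`, so a text with exactly 30 words is dropped and the shortest surviving text has 31 words. A minimal plain-Python sketch of that rule (the helper names `word_count` and `keep_entry` are illustrative and not part of the tutorial code):\n",
    "\n",
    "```python\n",
    "MIN_WORDS = 30  # threshold from the filtering step; texts must exceed it\n",
    "\n",
    "def word_count(text):\n",
    "    # Whitespace-delimited tokens, the same notion of a word as str.split()\n",
    "    return len(text.split())\n",
    "\n",
    "def keep_entry(text, min_words=MIN_WORDS):\n",
    "    # Strict comparison: a text with exactly min_words words is dropped\n",
    "    return word_count(text) > min_words\n",
    "\n",
    "samples = {\n",
    "    'exactly_30': ' '.join(['sana'] * 30),\n",
    "    'exactly_31': ' '.join(['sana'] * 31),\n",
    "}\n",
    "for name, text in samples.items():\n",
    "    print(name, word_count(text), keep_entry(text))\n",
    "```"
   ]
  },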
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "aabecb7b-ef23-4dd6-8092-f6735242a8cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate word count for each entry\n",
    "combined_df['word_count'] = combined_df['cleaned_html_content'].str.split().str.len()\n",
    "\n",
    "# Sort by word count and reset the index\n",
    "combined_df = combined_df.sort_values(by='word_count').reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "969e76d2-51da-4fb4-9d0c-32d26a86e53b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>main_content</th>\n",
       "      <th>cleaned_html_content</th>\n",
       "      <th>word_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Kyky – Welcome</td>\n",
       "      <td>https://kyky.today/</td>\n",
       "      <td>Kyky Kyky\\n  • Ota yhteyttä\\n  • Rekisteröidy\\...</td>\n",
       "      <td>Ostaja maksaa sinulle suoraan!</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Gluteeniton ruoka - Upbeat Intl. Trading Oy</td>\n",
       "      <td>https://www.east-asia-mart.fi/fi/tuoteryhma/23...</td>\n",
       "      <td>|\\n  • e-Lahjakortit ja Onnenkassit (Fukubukur...</td>\n",
       "      <td>300 g Laatikko, Singapore.</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Tietoja sivusta ”C. S. Lewis” – ApoWiki</td>\n",
       "      <td>https://apowiki.fi/index.php?action=info&amp;title...</td>\n",
       "      <td>Anonyymi\\n\\nEt ole kirjautunut\\n\\n  • Keskuste...</td>\n",
       "      <td>Katso tämän sivun suojausloki.</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         title  \\\n",
       "0                               Kyky – Welcome   \n",
       "1  Gluteeniton ruoka - Upbeat Intl. Trading Oy   \n",
       "2      Tietoja sivusta ”C. S. Lewis” – ApoWiki   \n",
       "\n",
       "                                                 url  \\\n",
       "0                                https://kyky.today/   \n",
       "1  https://www.east-asia-mart.fi/fi/tuoteryhma/23...   \n",
       "2  https://apowiki.fi/index.php?action=info&title...   \n",
       "\n",
       "                                        main_content  \\\n",
       "0  Kyky Kyky\\n  • Ota yhteyttä\\n  • Rekisteröidy\\...   \n",
       "1  |\\n  • e-Lahjakortit ja Onnenkassit (Fukubukur...   \n",
       "2  Anonyymi\\n\\nEt ole kirjautunut\\n\\n  • Keskuste...   \n",
       "\n",
       "             cleaned_html_content  word_count  \n",
       "0  Ostaja maksaa sinulle suoraan!           4  \n",
       "1      300 g Laatikko, Singapore.           4  \n",
       "2  Katso tämän sivun suojausloki.           4  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "79b37ced-55cc-4e17-bd70-842b58a6a466",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>main_content</th>\n",
       "      <th>cleaned_html_content</th>\n",
       "      <th>word_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>139071</th>\n",
       "      <td>Ortodoksinen oppi pelastuksesta – Tsasounan su...</td>\n",
       "      <td>https://www.tsasouna.net/FI/2024/09/07/ortodok...</td>\n",
       "      <td>Skip to content\\nTsasounan suunnalta\\n\\n  • Or...</td>\n",
       "      <td>Q &amp; A – kysy papilta!\\nQ &amp; A – Mikä ja miksi?\\...</td>\n",
       "      <td>69952</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139072</th>\n",
       "      <td>Vuosikirja 2021 - Cockerspanielit ry</td>\n",
       "      <td>https://cockerspanielit.org/vuosikirja-2022-2/</td>\n",
       "      <td>&lt;h1&gt;Vuosikirja 2021&lt;/h1&gt;\\n\\n&lt;p&gt;Koostanut Pirjo...</td>\n",
       "      <td>Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ...</td>\n",
       "      <td>73949</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139073</th>\n",
       "      <td>vierailija, tekijä sivustolla Hiiltä ja timanttia</td>\n",
       "      <td>https://blogit.metropolia.fi/hiilta-ja-timantt...</td>\n",
       "      <td>Hyppää sisältöön\\nMetropolian Blogit\\n  • Uusi...</td>\n",
       "      <td>Verkko-opetus on tullut jäädäkseen, mutta mite...</td>\n",
       "      <td>85067</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    title  \\\n",
       "139071  Ortodoksinen oppi pelastuksesta – Tsasounan su...   \n",
       "139072               Vuosikirja 2021 - Cockerspanielit ry   \n",
       "139073  vierailija, tekijä sivustolla Hiiltä ja timanttia   \n",
       "\n",
       "                                                      url  \\\n",
       "139071  https://www.tsasouna.net/FI/2024/09/07/ortodok...   \n",
       "139072     https://cockerspanielit.org/vuosikirja-2022-2/   \n",
       "139073  https://blogit.metropolia.fi/hiilta-ja-timantt...   \n",
       "\n",
       "                                             main_content  \\\n",
       "139071  Skip to content\\nTsasounan suunnalta\\n\\n  • Or...   \n",
       "139072  <h1>Vuosikirja 2021</h1>\\n\\n<p>Koostanut Pirjo...   \n",
       "139073  Hyppää sisältöön\\nMetropolian Blogit\\n  • Uusi...   \n",
       "\n",
       "                                     cleaned_html_content  word_count  \n",
       "139071  Q & A – kysy papilta!\\nQ & A – Mikä ja miksi?\\...       69952  \n",
       "139072  Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ...       73949  \n",
       "139073  Verkko-opetus on tullut jäädäkseen, mutta mite...       85067  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df.tail(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "ce5bcf6c-840a-4c15-909b-0f718e610691",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    139074.000000\n",
       "mean        339.775587\n",
       "std         962.370831\n",
       "min           4.000000\n",
       "25%          52.000000\n",
       "50%         148.000000\n",
       "75%         345.000000\n",
       "max       85067.000000\n",
       "Name: word_count, dtype: float64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df['word_count'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "214aef8e-ba23-4354-914a-650e2addeef6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Df shape before.: (139074, 5)\n",
      "Df shape after: (117133, 5)\n"
     ]
    }
   ],
   "source": [
    "print(f\"Df shape before.: {combined_df.shape}\")\n",
    "combined_df = combined_df[combined_df['word_count'] > 30]\n",
    "print(f\"Df shape after: {combined_df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c684f7f-1124-4d14-9444-f85be9fff331",
   "metadata": {},
   "source": [
    "### 3.3.5. Detect language\n",
    "\n",
    "Let’s use the langdetect library to double-check the language of each entry and keep only those written in Finnish. This can take few minutes.\n",
    "Then, filter the dataset to include only Finnish-language entries:"
   ]
  },
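  {
   "cell_type": "markdown",
   "id": "sketch-langdetect-fixed-seed",
   "metadata": {},
   "source": [
    "`langdetect` is non-deterministic by default, so short or mixed-language texts can receive different labels on different runs. The following is a minimal sketch, assuming reproducible labels are wanted: it fixes `DetectorFactory.seed` before the first detection call and uses `detect_langs` to inspect the probability of the best guess. The helper name `top_language` is illustrative only and is not used elsewhere in this tutorial.\n",
    "\n",
    "```python\n",
    "from langdetect import DetectorFactory, LangDetectException, detect_langs\n",
    "\n",
    "# Fix the seed before the first detection call so repeated runs agree\n",
    "DetectorFactory.seed = 0\n",
    "\n",
    "def top_language(text):\n",
    "    # Return (language code, probability) of the best guess, or None on failure\n",
    "    try:\n",
    "        candidates = detect_langs(text)\n",
    "    except LangDetectException:\n",
    "        return None\n",
    "    if not candidates:\n",
    "        return None\n",
    "    best = candidates[0]  # candidates are sorted by descending probability\n",
    "    return best.lang, best.prob\n",
    "\n",
    "print(top_language('Tämä on lyhyt suomenkielinen esimerkkilause kielen tunnistusta varten.'))\n",
    "print(top_language(''))\n",
    "```\n",
    "\n",
    "The probability can be used as an extra filter, for example by keeping only entries whose best guess is Finnish with high confidence."
   ]
  },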
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "6e3458e3-395d-4137-a470-71698075aa10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langdetect import detect, LangDetectException\n",
    "\n",
    "def detect_language_or_none(text):\n",
    "    try:\n",
    "        return detect(text)\n",
    "    except LangDetectException:\n",
    "        return None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "c98fe059-3153-4d94-b308-5a804db0b142",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "      <th>main_content</th>\n",
       "      <th>cleaned_html_content</th>\n",
       "      <th>word_count</th>\n",
       "      <th>language_detected</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>21941</th>\n",
       "      <td>4.12.2024 -Työturvallisuuskoulutus - CadSa</td>\n",
       "      <td>https://cadsa.fi/koulutuskalenteri/tyoturvalli...</td>\n",
       "      <td>&lt;h1&gt;4.12.2024 -Työturvallisuuskoulutus&lt;/h1&gt;\\n\\...</td>\n",
       "      <td>Työturvallisuuskoulutus on työturvakeskuksen k...</td>\n",
       "      <td>31</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21942</th>\n",
       "      <td>Huulipuna unohtu</td>\n",
       "      <td>https://huulipunaunohtu.blogspot.com/</td>\n",
       "      <td>Siirry pääsisältöön\\n\\nHuulipuna unohtu\\n\\nÄit...</td>\n",
       "      <td>Äiti on pitänyt meistä huolta, nyt me pidämme ...</td>\n",
       "      <td>31</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21943</th>\n",
       "      <td>maa-artisokkapikkelsi |  Olemme puutarhassa</td>\n",
       "      <td>http://olemmepuutarhassa.fi/tag/maa-artisokkap...</td>\n",
       "      <td>maa-artisokkapikkelsi | Olemme puutarhassa\\n\\n...</td>\n",
       "      <td>Heti kun maa on sulanut voi esiin kaivaa viime...</td>\n",
       "      <td>31</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21944</th>\n",
       "      <td>REIDEN LOITONTAJALAITE | Ironfit Store</td>\n",
       "      <td>https://store.ironfit.fi/product/265/ironfit-r...</td>\n",
       "      <td>&lt;p&gt;IRONFIT REIDEN LOITONTAJALAITE ST-6007&lt;/p&gt;\\...</td>\n",
       "      <td>Tilattavissa. Toimitusaika 21 päivää.\\nTilatta...</td>\n",
       "      <td>31</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21945</th>\n",
       "      <td>Työpenkki Henning, levyn leveys 1500 mm, hylly...</td>\n",
       "      <td>https://www.gerdmans.fi/varasto-ja-teollisuus/...</td>\n",
       "      <td>&lt;h1&gt; Työpenkki Henning, levyn leveys 1500 mm, ...</td>\n",
       "      <td>Työpenkki Henning, levyn leveys 1500 mm, hylly...</td>\n",
       "      <td>31</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139069</th>\n",
       "      <td>Sanatarkat istuntoselostukset - Keskiviikko 20...</td>\n",
       "      <td>https://www.europarl.europa.eu/doceo/document/...</td>\n",
       "      <td>\\nTakaisin Europarl-portaaliin\\n\\nChoisissez ...</td>\n",
       "      <td>Der Präsident. – Bevor wir zum Tätigkeitsprogr...</td>\n",
       "      <td>63237</td>\n",
       "      <td>de</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139070</th>\n",
       "      <td>SKVR</td>\n",
       "      <td>https://aineistot.finlit.fi/exist/apps/skvr/ru...</td>\n",
       "      <td>Esittely Runoluettelo / Metatietosuodatus Runo...</td>\n",
       "      <td>Tällä sivulla voit selata runotyyppejä ja luke...</td>\n",
       "      <td>69370</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139071</th>\n",
       "      <td>Ortodoksinen oppi pelastuksesta – Tsasounan su...</td>\n",
       "      <td>https://www.tsasouna.net/FI/2024/09/07/ortodok...</td>\n",
       "      <td>Skip to content\\nTsasounan suunnalta\\n\\n  • Or...</td>\n",
       "      <td>Q &amp; A – kysy papilta!\\nQ &amp; A – Mikä ja miksi?\\...</td>\n",
       "      <td>69952</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139072</th>\n",
       "      <td>Vuosikirja 2021 - Cockerspanielit ry</td>\n",
       "      <td>https://cockerspanielit.org/vuosikirja-2022-2/</td>\n",
       "      <td>&lt;h1&gt;Vuosikirja 2021&lt;/h1&gt;\\n\\n&lt;p&gt;Koostanut Pirjo...</td>\n",
       "      <td>Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ...</td>\n",
       "      <td>73949</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139073</th>\n",
       "      <td>vierailija, tekijä sivustolla Hiiltä ja timanttia</td>\n",
       "      <td>https://blogit.metropolia.fi/hiilta-ja-timantt...</td>\n",
       "      <td>Hyppää sisältöön\\nMetropolian Blogit\\n  • Uusi...</td>\n",
       "      <td>Verkko-opetus on tullut jäädäkseen, mutta mite...</td>\n",
       "      <td>85067</td>\n",
       "      <td>fi</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>117133 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    title  \\\n",
       "21941          4.12.2024 -Työturvallisuuskoulutus - CadSa   \n",
       "21942                                    Huulipuna unohtu   \n",
       "21943         maa-artisokkapikkelsi |  Olemme puutarhassa   \n",
       "21944              REIDEN LOITONTAJALAITE | Ironfit Store   \n",
       "21945   Työpenkki Henning, levyn leveys 1500 mm, hylly...   \n",
       "...                                                   ...   \n",
       "139069  Sanatarkat istuntoselostukset - Keskiviikko 20...   \n",
       "139070                                               SKVR   \n",
       "139071  Ortodoksinen oppi pelastuksesta – Tsasounan su...   \n",
       "139072               Vuosikirja 2021 - Cockerspanielit ry   \n",
       "139073  vierailija, tekijä sivustolla Hiiltä ja timanttia   \n",
       "\n",
       "                                                      url  \\\n",
       "21941   https://cadsa.fi/koulutuskalenteri/tyoturvalli...   \n",
       "21942               https://huulipunaunohtu.blogspot.com/   \n",
       "21943   http://olemmepuutarhassa.fi/tag/maa-artisokkap...   \n",
       "21944   https://store.ironfit.fi/product/265/ironfit-r...   \n",
       "21945   https://www.gerdmans.fi/varasto-ja-teollisuus/...   \n",
       "...                                                   ...   \n",
       "139069  https://www.europarl.europa.eu/doceo/document/...   \n",
       "139070  https://aineistot.finlit.fi/exist/apps/skvr/ru...   \n",
       "139071  https://www.tsasouna.net/FI/2024/09/07/ortodok...   \n",
       "139072     https://cockerspanielit.org/vuosikirja-2022-2/   \n",
       "139073  https://blogit.metropolia.fi/hiilta-ja-timantt...   \n",
       "\n",
       "                                             main_content  \\\n",
       "21941   <h1>4.12.2024 -Työturvallisuuskoulutus</h1>\\n\\...   \n",
       "21942   Siirry pääsisältöön\\n\\nHuulipuna unohtu\\n\\nÄit...   \n",
       "21943   maa-artisokkapikkelsi | Olemme puutarhassa\\n\\n...   \n",
       "21944   <p>IRONFIT REIDEN LOITONTAJALAITE ST-6007</p>\\...   \n",
       "21945   <h1> Työpenkki Henning, levyn leveys 1500 mm, ...   \n",
       "...                                                   ...   \n",
       "139069   \\nTakaisin Europarl-portaaliin\\n\\nChoisissez ...   \n",
       "139070  Esittely Runoluettelo / Metatietosuodatus Runo...   \n",
       "139071  Skip to content\\nTsasounan suunnalta\\n\\n  • Or...   \n",
       "139072  <h1>Vuosikirja 2021</h1>\\n\\n<p>Koostanut Pirjo...   \n",
       "139073  Hyppää sisältöön\\nMetropolian Blogit\\n  • Uusi...   \n",
       "\n",
       "                                     cleaned_html_content  word_count  \\\n",
       "21941   Työturvallisuuskoulutus on työturvakeskuksen k...          31   \n",
       "21942   Äiti on pitänyt meistä huolta, nyt me pidämme ...          31   \n",
       "21943   Heti kun maa on sulanut voi esiin kaivaa viime...          31   \n",
       "21944   Tilattavissa. Toimitusaika 21 päivää.\\nTilatta...          31   \n",
       "21945   Työpenkki Henning, levyn leveys 1500 mm, hylly...          31   \n",
       "...                                                   ...         ...   \n",
       "139069  Der Präsident. – Bevor wir zum Tätigkeitsprogr...       63237   \n",
       "139070  Tällä sivulla voit selata runotyyppejä ja luke...       69370   \n",
       "139071  Q & A – kysy papilta!\\nQ & A – Mikä ja miksi?\\...       69952   \n",
       "139072  Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ...       73949   \n",
       "139073  Verkko-opetus on tullut jäädäkseen, mutta mite...       85067   \n",
       "\n",
       "       language_detected  \n",
       "21941                 fi  \n",
       "21942                 fi  \n",
       "21943                 fi  \n",
       "21944                 fi  \n",
       "21945                 fi  \n",
       "...                  ...  \n",
       "139069                de  \n",
       "139070                fi  \n",
       "139071                fi  \n",
       "139072                fi  \n",
       "139073                fi  \n",
       "\n",
       "[117133 rows x 6 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df['language_detected'] = combined_df['cleaned_html_content'].map(detect_language_or_none)\n",
    "\n",
    "combined_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "984b5d54-a575-49c8-8ae6-de24942cb016",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "language_detected\n",
       "fi    103429\n",
       "en      9031\n",
       "sv       999\n",
       "id       497\n",
       "de       358\n",
       "it       353\n",
       "pl       284\n",
       "hr       255\n",
       "et       243\n",
       "nl       233\n",
       "fr       200\n",
       "lt       197\n",
       "es       140\n",
       "sl       119\n",
       "ca       100\n",
       "da        87\n",
       "tr        71\n",
       "cs        70\n",
       "pt        69\n",
       "no        56\n",
       "ro        55\n",
       "lv        51\n",
       "ru        46\n",
       "sk        41\n",
       "hu        26\n",
       "mk        24\n",
       "vi        18\n",
       "tl        14\n",
       "sq        14\n",
       "ko         8\n",
       "sw         8\n",
       "ar         6\n",
       "uk         4\n",
       "hi         4\n",
       "bn         4\n",
       "el         4\n",
       "te         2\n",
       "bg         2\n",
       "cy         2\n",
       "fa         2\n",
       "af         2\n",
       "he         2\n",
       "ne         2\n",
       "so         1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_df['language_detected'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "eb6faeb6-7458-41f6-ab7e-cd9d41b62da6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Df shape before.: (117133, 6)\n",
      "Df shape after: (103429, 6)\n"
     ]
    }
   ],
   "source": [
    "# Retain only detected finnish data\n",
    "print(f\"Df shape before.: {combined_df.shape}\")\n",
    "combined_df = combined_df[combined_df['language_detected'] == 'fi']\n",
    "print(f\"Df shape after: {combined_df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbb90f6a-5025-41a2-9b9a-b5ca653a82ef",
   "metadata": {},
   "source": [
    "## 3.4. Save the data to a parquet file\n",
    "\n",
    "Now that the data is cleaned, we’re ready to save it. We'll select only the necessary columns before saving."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "b48e803a-453d-4bb1-a2d9-18ad20d36914",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>cleaned_html_content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>21941</th>\n",
       "      <td>4.12.2024 -Työturvallisuuskoulutus - CadSa</td>\n",
       "      <td>Työturvallisuuskoulutus on työturvakeskuksen k...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21942</th>\n",
       "      <td>Huulipuna unohtu</td>\n",
       "      <td>Äiti on pitänyt meistä huolta, nyt me pidämme ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            title  \\\n",
       "21941  4.12.2024 -Työturvallisuuskoulutus - CadSa   \n",
       "21942                            Huulipuna unohtu   \n",
       "\n",
       "                                    cleaned_html_content  \n",
       "21941  Työturvallisuuskoulutus on työturvakeskuksen k...  \n",
       "21942  Äiti on pitänyt meistä huolta, nyt me pidämme ...  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop the unneccessary columns\n",
    "combined_df = combined_df[['title', 'cleaned_html_content']]\n",
    "combined_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ea68e00-850e-433c-b948-dba9a6b84cfe",
   "metadata": {},
   "outputs": [],
   "source": [
    "## Save the new dataframe with detected Finnish language\n",
    "path_to_save_the_data = '<path-here-ending-to-parquet-file-name>' # /scratch is recommended for data files\n",
    "combined_df.to_parquet(path_to_save_the_data, index= False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cbb8225-d2ae-4752-b23a-ee9518f79f06",
   "metadata": {},
   "source": [
    "**After saving data to a parquet file**, you may choose to exit the Jupyter environment or continue working within it to create the upcoming Python and batch job scripts while the session remains active."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32ca15f3-4ac9-44aa-b3dc-d745639e989d",
   "metadata": {},
   "source": [
    "# 4. Finetune the model\n",
    "\n",
    "In this step, we’ll train the model using a batch job and a Python script. You can create and edit these files either via the LUMI web interface or by using Visual Studio Code’s Remote SSH extension (for more details, see the documentation [here](https://docs.csc.fi/apps/vscode/)).\n",
    "\n",
    "The training scripts used in this tutorial are based on the [CSCfi/llm-fine-tuning-examples](https://github.com/CSCfi/llm-fine-tuning-examples/tree/master) repository.\n",
    "\n",
    "We will also use [MLflow](https://docs.csc.fi/support/tutorials/ml-workflows/) to track training metrics. For a practical example, see the [tutorial on using MLflow in Puhti and LUMI](https://github.com/CSCfi/puhti_mlflow_tutorial).\n",
    "\n",
    "You can create the necessary files under your project directory, e.g.:\n",
    "\n",
    "`/project/project_46XXXXXXXX/${USER}`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a590c624-5a4a-4f93-a5b1-2d3e814f9718",
   "metadata": {},
   "source": [
    "**Using LLama-models through transformers**  \n",
    "If you want to use LLaMA models via the [`transformers`](https://huggingface.co/docs/transformers/index) library, follow these steps:\n",
    "\n",
    "\n",
    "1. Create [Hugging Face ](https://huggingface.co/) account\n",
    "2. Locate the LLaMA models, read and accept their terms of use, and wait for approval\n",
    "3. Generate an access token on your Hugging Face account\n",
    "4. Set the access token in your HF cache directory (HF_HOME), for example:\n",
    "\n",
    "```bash\n",
    "export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache\n",
    "mkdir -p $HF_HOME\n",
    "\n",
    "cd <path-to-HF_HOME>\n",
    "echo <\"token-here\"> > token\n",
    "```"
   ]
  },
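  {
   "cell_type": "markdown",
   "id": "sketch-check-hf-token-file",
   "metadata": {},
   "source": [
    "To check that the token file is where it is expected before submitting a job, a small sketch like the following can be used. It assumes the token is stored in a file named `token` directly under `HF_HOME`, as in the commands above, and falls back to `~/.cache/huggingface` when `HF_HOME` is not set. `find_hf_token` is a hypothetical helper, not part of `transformers`, and the sketch never prints the token itself.\n",
    "\n",
    "```python\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "def find_hf_token(hf_home=None):\n",
    "    # Resolve the cache directory: explicit argument, then $HF_HOME, then the assumed default\n",
    "    base = Path(hf_home or os.environ.get('HF_HOME') or Path.home() / '.cache' / 'huggingface')\n",
    "    token_file = base / 'token'\n",
    "    if not token_file.is_file():\n",
    "        return None\n",
    "    # Strip the trailing newline that echo adds\n",
    "    token = token_file.read_text().strip()\n",
    "    return token or None\n",
    "\n",
    "token = find_hf_token()\n",
    "print('token found' if token else 'no token file found')\n",
    "```"
   ]
  },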
  {
   "cell_type": "markdown",
   "id": "de0ceb34",
   "metadata": {},
   "source": [
    "## 4.0. Container for training\n",
    "\n",
    "First we need to create a compatible environment for training. \n",
    "\n",
    "This [example](https://github.com/DeiC-HPC/cotainr/tree/main/examples/LUMI/conda_pytorch_rocm) shows how to use [`cotainr`](https://cotainr.readthedocs.io/en/stable/) to build a container with PyTorch configured for LUMI's AMD GPUs. We'll follow this approach to create our training container.\n",
    "\n",
    "**Create owilix_env.yml file:**\n",
    "```yml\n",
    "name: training_env\n",
    "channels:\n",
    "  - conda-forge\n",
    "dependencies:\n",
    "  - filelock=3.13.1\n",
    "  - fsspec=2024.2.0\n",
    "  - jinja2=3.1.3\n",
    "  - markupsafe=2.1.5\n",
    "  - mpmath=1.3.0\n",
    "  - networkx=3.2.1\n",
    "  - numpy=1.26.3\n",
    "  - pillow=10.2.0\n",
    "  - pip=24.0\n",
    "  - python=3.11.7\n",
    "  - sympy=1.12\n",
    "  - typing-extensions=4.9.0\n",
    "  - pip:\n",
    "    - --extra-index-url https://download.pytorch.org/whl/\n",
    "    - pytorch-triton-rocm==2.3.1\n",
    "    - torch==2.3.1+rocm6.0\n",
    "    - torchaudio==2.3.1+rocm6.0\n",
    "    - torchvision==0.18.1+rocm6.0\n",
    "    - langchain==0.3.27\n",
    "    - mlflow==2.22.0\n",
    "    - datasets==4.0.0\n",
    "    - peft==0.17.0\n",
    "    - transformers==4.55.0\n",
    "```\n",
    "\n",
    "**In the terminal of LUMI (note: building container takes several minutes):**\n",
    "```bash\n",
    "# Get needed modules\n",
    "module purge\n",
    "module load LUMI\n",
    "module load cotainr\n",
    "\n",
    "# Use cotainr to build the container \n",
    "cotainr build training_env.sif --system=lumi-g --conda-env=training_env.yml --accept-license\n",
    "\n",
    "## Add required additional bindings\n",
    "module use /appl/local/containers/ai-modules/\n",
    "module load singularity-AI-bindings \n",
    "\n",
    "# Verify installation\n",
    "singularity exec training_env.sif bash -c 'pip list'\n",
    "```"
   ]
  },
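  {
   "cell_type": "markdown",
   "id": "4c0f7a1e",
   "metadata": {},
   "source": [
    "As a complement to `pip list`, the pinned versions can also be checked programmatically. The following is a minimal sketch and not part of the original workflow: the helper name `check_pins` and the `pins` dictionary are illustrative assumptions, and only the standard-library `importlib.metadata` module is used. Saved under a hypothetical name such as `check_pins.py`, it could be run inside the container with `singularity exec training_env.sif python check_pins.py`.\n",
    "\n",
    "```python\n",
    "from importlib import metadata\n",
    "\n",
    "\n",
    "def check_pins(pins):\n",
    "    # Return {package: (expected, installed)} for every pin that is not met.\n",
    "    # `installed` is None when the package is missing from the environment.\n",
    "    mismatches = {}\n",
    "    for name, expected in pins.items():\n",
    "        try:\n",
    "            installed = metadata.version(name)\n",
    "        except metadata.PackageNotFoundError:\n",
    "            installed = None\n",
    "        if installed != expected:\n",
    "            mismatches[name] = (expected, installed)\n",
    "    return mismatches\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    # A few pins copied from training_env.yml; extend the list as needed.\n",
    "    pins = {\n",
    "        'torch': '2.3.1+rocm6.0',\n",
    "        'transformers': '4.55.0',\n",
    "        'peft': '0.17.0',\n",
    "        'datasets': '4.0.0',\n",
    "        'mlflow': '2.22.0',\n",
    "    }\n",
    "    for name, (expected, installed) in check_pins(pins).items():\n",
    "        print(f'{name}: expected {expected}, found {installed}')\n",
    "```\n",
    "\n",
    "If nothing is printed, every listed package is installed at the pinned version."
   ]
  },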
  {
   "cell_type": "markdown",
   "id": "72b430ca-5e17-4cd3-b430-05103ed6816a",
   "metadata": {},
   "source": [
    "## 4.1. Python scripts for training the model \n",
    "\n",
    "Below are the Python scripts used to finetune the Meta Llama-3.2-1B model. They include:\n",
    "\n",
    "* Training data preprocessing using a custom preprocess function that chunks and tokenizes the input text - implemented in **preprocessing.py**\n",
    "* Training setup using Hugging Face’s `Trainer` class - implemented in **train.py**\n",
    "* Metric tracking with MLflow - see **train.py**\n",
    "* Model saving and checkpointing - see **train.py**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ebdadee-32a3-42cf-82ed-83347451ccac",
   "metadata": {},
   "source": [
    "### 4.1.1. Python script: preprocessing.py\n",
    "\n",
    "This script handles text splitting using LangChain’s `RecursiveCharacterTextSplitter`. It breaks long text inputs into smaller overlapping chunks, optionally appending an end-of-sequence token to each chunk. The script also includes a preprocessing function that tokenizes these chunks with a Hugging Face tokenizer.\n",
    "\n",
    "\n",
    "```python\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "def chunk_text(text, chunk_size, overlap_size, eos_token):\n",
    "    \"\"\"Splits a single large text into smaller overlapping chunks.\"\"\"\n",
    "    splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=chunk_size,\n",
    "        chunk_overlap=overlap_size,\n",
    "    )\n",
    "\n",
    "    chunks = splitter.split_text(text)\n",
    "    if eos_token:\n",
    "        chunks = [chunk + f\" {eos_token}\" for chunk in chunks]\n",
    "\n",
    "    return chunks\n",
    "    \n",
    "    \n",
    "\n",
    "def preprocess(examples, tokenizer, max_tokens=4096, chunk_size=8192, overlap_size=200):\n",
    "    \"\"\"Preprocesses a batch of examples by splitting textcontent into chunks and tokenizing them.\"\"\"\n",
    "    all_chunks = []\n",
    "    for text in examples[\"cleaned_html_content\"]:\n",
    "        chunks = chunk_text(text, chunk_size=chunk_size, overlap_size=overlap_size, eos_token=tokenizer.eos_token)\n",
    "        all_chunks.extend(chunks)\n",
    "\n",
    "    tokenized_output = tokenizer(\n",
    "        all_chunks,\n",
    "        padding=False, \n",
    "        truncation=True,\n",
    "        max_length=max_tokens,  \n",
    "        add_special_tokens=True,\n",
    "        return_length=False, \n",
    "    )\n",
    "\n",
    "    return {\n",
    "        \"input_ids\": tokenized_output[\"input_ids\"],\n",
    "        \"attention_mask\": tokenized_output[\"attention_mask\"]\n",
    "    }\n",
    "```"
   ]
  },
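  {
   "cell_type": "markdown",
   "id": "9b2d6e53",
   "metadata": {},
   "source": [
    "To make the roles of `chunk_size` and `overlap_size` concrete, here is a simplified sketch of overlapping chunking in plain Python. It is an illustration only: `sliding_chunks` is a hypothetical helper that cuts fixed-size character windows, whereas `RecursiveCharacterTextSplitter` prefers to split at separators such as paragraph breaks and spaces, so its chunk boundaries will generally differ. Both measure sizes in characters by default, while `max_tokens` in `preprocess` is counted in tokens.\n",
    "\n",
    "```python\n",
    "def sliding_chunks(text, chunk_size, overlap_size):\n",
    "    # Cut `text` into windows of `chunk_size` characters, where each window\n",
    "    # starts `chunk_size - overlap_size` characters after the previous one.\n",
    "    if not 0 <= overlap_size < chunk_size:\n",
    "        raise ValueError('overlap_size must be non-negative and smaller than chunk_size')\n",
    "    step = chunk_size - overlap_size\n",
    "    chunks = []\n",
    "    for start in range(0, len(text), step):\n",
    "        chunks.append(text[start:start + chunk_size])\n",
    "        if start + chunk_size >= len(text):\n",
    "            break  # the last window already reached the end of the text\n",
    "    return chunks\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    for chunk in sliding_chunks('abcdefghij', chunk_size=4, overlap_size=1):\n",
    "        print(chunk)\n",
    "```\n",
    "\n",
    "Neighbouring chunks share `overlap_size` characters, which preserves some context across chunk boundaries. Because the chunk size is given in characters and the truncation limit in tokens, a chunk that tokenizes to more than `max_tokens` tokens loses its tail during tokenization."
   ]
  },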
  {
   "cell_type": "markdown",
   "id": "cf7591c2-eb4f-443a-b6c7-2bf4b4741299",
   "metadata": {},
   "source": [
    "### 4.1.2. Python script: train.py\n",
    "\n",
    "\n",
    "This is the main Python script used to finetune the **meta-llama/Llama-3.2-1B** model. \n",
    "It uses **MLFlow** to track training metrics, which are saved in the `mlruns` folder inside the specified --output-path.\n",
    "\n",
    "**Remember**: Make sure to provide the correct path to your training data in Parquet format via the `--parquet-file` argument, either here or in your batch job script.\n",
    "\n",
    "**Note!** By default, the script runs a small test training using only 1,000 samples. To train on the full dataset, comment out these lines and adjust the training parameters accordingly:\n",
    "\n",
    "```python\n",
    "    # comment these if you would like to use the whole dataset\n",
    "    tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))\n",
    "    tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))\n",
    "```\n",
    "\n",
    "**train.py**\n",
    "```python\n",
    "import argparse\n",
    "import os\n",
    "import sys\n",
    "import time\n",
    "import mlflow\n",
    "\n",
    "import torch\n",
    "from datasets import load_dataset\n",
    "from peft import LoraConfig, get_peft_model, AutoPeftModelForCausalLM\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments\n",
    "\n",
    "from functools import partial\n",
    "from preprocessing import preprocess\n",
    "\n",
    "from datasets.utils.logging import disable_progress_bar\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    #disable_progress_bar()  #  Disable progress bar during dataset processing\n",
    "\n",
    "    parser = argparse.ArgumentParser()  #  set up ArgumentParser \n",
    "    parser.add_argument(\n",
    "        \"--input-model\",\n",
    "        type=str,\n",
    "        default=\"meta-llama/Llama-3.2-1B\",\n",
    "        help=\"The pre-trained model from Hugging Face to use as basis: https://huggingface.co/models\",\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--output-path\",\n",
    "        type=str,\n",
    "        help=\"Directory where model checkpoints and outputs will be saved.\",\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--parquet-file\",\n",
    "        type=str,\n",
    "        #default='<path-to-training-data-in-one-parquet-file>',\n",
    "        help=\"Path to the input Parquet file containing training data.\",\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--model_output_name\",\n",
    "        type=str, \n",
    "        help=\"Name for the finetuned model to be saved under.\",\n",
    "    )\n",
    "    parser.add_argument(\"--batch_size\", \"-b\", type=int, default=1, help=\"Training batch size\")\n",
    "    parser.add_argument(\n",
    "        \"--num-workers\",\n",
    "        type=int,\n",
    "        default=1,\n",
    "        help=\"The number of CPU worker processes to use.\",\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--resume\",\n",
    "        default=False,\n",
    "        action=\"store_true\",\n",
    "        help=\"If set, continue from a previously interrupted run. Otherwise, overwrite existing checkpoints.\",\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--max-steps\",\n",
    "        type=int,\n",
    "        default=400,\n",
    "        help=\"The number of training steps.\",\n",
    "    )\n",
    "    parser.add_argument(\"--peft\", action=\"store_true\", help=\"Use PEFT: https://huggingface.co/docs/peft/index\")\n",
    "    parser.add_argument(\n",
    "        \"--4bit\",\n",
    "        dest=\"bnb_4bit\",\n",
    "        action=\"store_true\",\n",
    "        help=\"Use 4bit quantization with bitsandbytes: https://huggingface.co/docs/bitsandbytes/main/en/index\",\n",
    "    )\n",
    "    args, _ = parser.parse_known_args()\n",
    "\n",
    "    # Check for required arguments\n",
    "    if not args.model_output_name:\n",
    "        print(\"ERROR: --model_output_name must be specified.\")\n",
    "        sys.exit(1)\n",
    "\n",
    "    # Read the environment variables provided by torchrun\n",
    "    rank = int(os.environ[\"RANK\"])\n",
    "    local_rank = int(os.environ[\"LOCAL_RANK\"])\n",
    "    world_size = int(os.environ[\"WORLD_SIZE\"])\n",
    "    local_world_size = int(os.environ[\"LOCAL_WORLD_SIZE\"])\n",
    "\n",
    "\n",
    "    # Initialize MLflow only on the main process (rank 0) to prevent multi-process conflicts\n",
    "    if rank == 0:\n",
    "        # Set the MLflow tracking URI to save logs and artifacts under the specified output directory\n",
    "        mlflow_tracking_uri = os.path.join(args.output_path, \"mlruns\")\n",
    "        mlflow.set_tracking_uri(mlflow_tracking_uri)\n",
    "\n",
    "        # Use the model output name as the MLflow experiment name\n",
    "        mlflow.set_experiment(args.model_output_name)\n",
    "        print(f\"MLflow tracking URI: {mlflow_tracking_uri}\")\n",
    "    \n",
    "\n",
    "    # this is where trained model and checkpoints will go\n",
    "    output_model_dir = os.path.join(args.output_path, args.model_output_name)\n",
    "    \n",
    "    if rank == 0:\n",
    "        print(f\"Using {world_size} GPUs.\")\n",
    "        print(f\"Local {local_world_size} GPUs.\")\n",
    "\n",
    "    # Then we determine the device on which to train the model.\n",
    "    if rank == 0:\n",
    "        print(\"Using PyTorch version:\", torch.__version__)\n",
    "        print(f\"world_size: {world_size} GPUs.\")\n",
    "        print(f\"local_world_size {local_world_size}\")\n",
    "        print(f\"Number of available GPUs (visible to this process): {torch.cuda.device_count()}\")\n",
    "        print(f\"Rank: {rank}\")\n",
    "    if torch.cuda.is_available():\n",
    "        device = torch.device(\"cuda\", local_rank)\n",
    "        print(f\"Using GPU {local_rank}, device name: {torch.cuda.get_device_name(device)}\")\n",
    "    else:\n",
    "        print(f\"No GPU found, using CPU instead. (Rank: {local_rank})\")\n",
    "        device = torch.device(\"cpu\")\n",
    "\n",
    "    if rank == 0 and args.batch_size % world_size != 0:\n",
    "        print(f\"ERROR: batch_size={args.batch_size} has to be a multiple of the number of GPUs={world_size}!\")\n",
    "        sys.exit(1)\n",
    "\n",
    "\n",
    "    if rank == 0:\n",
    "        print(f\" output_model_dir: {output_model_dir}\")\n",
    "    \n",
    "    start = time.time()\n",
    "\n",
    "    tokenizer = AutoTokenizer.from_pretrained(args.input_model, use_fast=True)\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "    special_tokens = tokenizer.special_tokens_map\n",
    "    if rank == 0:\n",
    "        print(\"Loading input model and tokenizer\")\n",
    "\n",
    "    quantization_config = None\n",
    "    if args.bnb_4bit:\n",
    "        from transformers import BitsAndBytesConfig\n",
    "\n",
    "        print(\"Using bnb_4bit\")\n",
    "        bnb_config = BitsAndBytesConfig(\n",
    "            load_in_4bit=True,\n",
    "            bnb_4bit_quant_type=\"nf4\",\n",
    "            bnb_4bit_compute_dtype=torch.bfloat16,\n",
    "            bnb_4bit_use_double_quant=True,\n",
    "            bnb_4bit_quant_storage=torch.bfloat16,\n",
    "        )\n",
    "        quantization_config = bnb_config\n",
    "\n",
    "    model = AutoModelForCausalLM.from_pretrained(\n",
    "        args.input_model,\n",
    "        quantization_config=quantization_config,\n",
    "        torch_dtype=torch.bfloat16,\n",
    "        device_map=device,\n",
    "    )\n",
    "\n",
    "    if args.peft:\n",
    "        # peft_config = LoraConfig(\n",
    "        #     task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32,\n",
    "        #     lora_dropout=0.1\n",
    "        # )\n",
    "        # LoRA config from here:\n",
    "        # https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py#L128\n",
    "        peft_config = LoraConfig(\n",
    "            lora_alpha=8,\n",
    "            lora_dropout=0.05,\n",
    "            r=16,\n",
    "            bias=\"none\",\n",
    "            target_modules=\"all-linear\",\n",
    "            task_type=\"CAUSAL_LM\",\n",
    "            # modules_to_save = [\"lm_head\", \"embed_tokens\"] # add if you want to use the Llama 3 instruct template\n",
    "        )\n",
    "        model = get_peft_model(model, peft_config)\n",
    "        print(\"Using PEFT\")\n",
    "        model.print_trainable_parameters()\n",
    "\n",
    "    stop = time.time()\n",
    "    if rank == 0:\n",
    "        print(f\"Loading model and tokenizer took: {stop - start:.2f} seconds\")\n",
    "    \n",
    "    train_batch_size = args.batch_size\n",
    "    eval_batch_size = args.batch_size\n",
    "\n",
    "    if rank == 0:\n",
    "        print(f\"Global train and eval batch size : {args.batch_size}\")\n",
    "\n",
    "\n",
    "    training_args = TrainingArguments(\n",
    "        disable_tqdm=True,\n",
    "        output_dir=output_model_dir,\n",
    "        save_strategy=\"steps\",\n",
    "        save_steps=50, # MODIFY from quick testing to real training for eg. 50 -> 400!!\n",
    "        save_total_limit=3,\n",
    "        learning_rate=2e-5, #3e-5,\n",
    "        weight_decay=0.01,\n",
    "        bf16=True,  # use 16-bit floating point precision\n",
    "        per_device_train_batch_size=train_batch_size // world_size,\n",
    "        per_device_eval_batch_size=eval_batch_size,\n",
    "        dataloader_num_workers=args.num_workers,\n",
    "        ddp_find_unused_parameters=False, \n",
    "        dataloader_pin_memory=True, \n",
    "        metric_for_best_model=\"eval_loss\", \n",
    "        eval_strategy=\"steps\",\n",
    "        eval_steps=100, # MODIFY from quick testing to real training for eg. 100 -> 200!!\n",
    "        num_train_epochs=2,\n",
    "        max_steps=args.max_steps, # COMMENT THIS IF using bigger dataset\n",
    "        \n",
    "        # MLflow integration \n",
    "        report_to=[\"mlflow\"], \n",
    "        logging_steps=50, # MODIFY !!\n",
    "        logging_strategy=\"steps\",\n",
    "        \n",
    "        # Run name for MLflow — includes SLURM job ID  to indentify run \n",
    "        run_name=f\"{args.model_output_name}_{os.environ.get('SLURM_JOB_ID')}\",\n",
    "    )\n",
    "\n",
    "    #if rank == 0:\n",
    "        # print(f\"Training arguments : {training_args}\")\n",
    "\n",
    "    # Load parquet data\n",
    "    raw_dataset = load_dataset(\"parquet\", data_files=args.parquet_file)\n",
    "\n",
    "    # Split dataset into train and validation sets\n",
    "    split_dataset = raw_dataset[\"train\"].train_test_split(test_size=0.1, seed=42)\n",
    "    max_tokens = 2048\n",
    "    overlap_tokens = 50\n",
    "\n",
    "    if rank == 0:\n",
    "        print(\"Dataset columns:\", raw_dataset[\"train\"].column_names)\n",
    "        print(f\"Type of column_names: {type(raw_dataset['train'].column_names)}\")\n",
    "    \n",
    "    column_names = raw_dataset[\"train\"].column_names\n",
    "\n",
    "    preprocess_function = partial(\n",
    "        preprocess, tokenizer=tokenizer, max_tokens=max_tokens, chunk_size=8192, overlap_size=overlap_tokens\n",
    "    )\n",
    "\n",
    "    tokenized_train_dataset = split_dataset[\"train\"].map(\n",
    "        preprocess_function,\n",
    "        batched=True,\n",
    "        remove_columns=column_names,\n",
    "        num_proc=args.num_workers,\n",
    "    )\n",
    "\n",
    "    tokenized_val_dataset = split_dataset[\"test\"].map(\n",
    "        preprocess_function,\n",
    "        batched=True,\n",
    "        remove_columns=column_names,\n",
    "        num_proc=args.num_workers,\n",
    "    )\n",
    "    ####################################################\n",
    "    # comment these if you would like to use the whole dataset\n",
    "    tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))\n",
    "    tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))\n",
    "\n",
    "    # Print the sizes to verify\n",
    "    if rank == 0:\n",
    "        print(f\"Train dataset size: {len(tokenized_train_dataset)}\")\n",
    "        print(f\"Validation dataset size: {len(tokenized_val_dataset)}\")\n",
    "\n",
    "    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors=\"pt\")\n",
    "\n",
    "    # Initialize the Trainer\n",
    "    trainer = Trainer(\n",
    "        model=model,\n",
    "        args=training_args,\n",
    "        train_dataset=tokenized_train_dataset,\n",
    "        eval_dataset=tokenized_val_dataset,\n",
    "        tokenizer=tokenizer,\n",
    "        data_collator=data_collator,\n",
    "    )\n",
    "\n",
    "    start_train = time.time()\n",
    "    if rank== 0:\n",
    "        print(f\"Training starting...\")\n",
    "\n",
    "    # Train the model - MLflow will automatically log metrics\n",
    "    trainer.train(resume_from_checkpoint=args.resume)\n",
    "\n",
    "    stop_train = time.time()\n",
    "    if rank == 0:\n",
    "        elapsed = stop_train - start_train\n",
    "        hours = int(elapsed // 3600)\n",
    "        minutes = int((elapsed % 3600) // 60)\n",
    "        seconds = int(elapsed % 60)\n",
    "        print(f\"Finetuning model took: {hours}h {minutes}m {seconds}s\")\n",
    "\n",
    "    # Save the model\n",
    "    if trainer.is_fsdp_enabled:\n",
    "        trainer.accelerator.state.fsdp_plugin.set_state_dict_type(\"FULL_STATE_DICT\")\n",
    "    trainer.save_model(output_model_dir)\n",
    "    \n",
    "    if rank == 0:\n",
    "        print()\n",
    "        print(\"Training done, you can find the final model (and checkpoints) in\", output_model_dir)\n",
    "        print(f\"\\nMLflow experiment data stored in: {mlflow_tracking_uri}\")\n",
    "```"
   ]
  },
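  {
   "cell_type": "markdown",
   "id": "e5a81c07",
   "metadata": {},
   "source": [
    "The relationship between `--batch_size`, the number of GPUs, and `--max-steps` in the script above can be worked out with a little arithmetic. The sketch below is illustrative only: the helper name `training_budget` is an assumption, and it assumes one gradient accumulation step per optimizer step, which is the `TrainingArguments` default.\n",
    "\n",
    "```python\n",
    "def training_budget(global_batch_size, world_size, max_steps, train_samples):\n",
    "    # Mirror the check in train.py: the global batch must split evenly over the GPUs.\n",
    "    if global_batch_size % world_size != 0:\n",
    "        raise ValueError('batch size must be a multiple of the number of GPUs')\n",
    "    per_device = global_batch_size // world_size\n",
    "    samples_seen = global_batch_size * max_steps\n",
    "    return {\n",
    "        'per_device_batch_size': per_device,\n",
    "        'samples_seen': samples_seen,\n",
    "        'approx_epochs': samples_seen / train_samples,\n",
    "    }\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    # Quick-test values: --batch_size=8 on 8 GPUs, 400 steps, 900 training chunks.\n",
    "    print(training_budget(global_batch_size=8, world_size=8, max_steps=400, train_samples=900))\n",
    "```\n",
    "\n",
    "With the quick-test values, each GPU processes one sample per step and the run sees 3,200 samples in total, which is roughly 3.6 passes over the 900-chunk training subset. When `max_steps` is set, it takes precedence over `num_train_epochs`."
   ]
  },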
  {
   "cell_type": "markdown",
   "id": "a0ad01b2-b7b6-48a9-a45b-6d9061a849f2",
   "metadata": {},
   "source": [
    "## 4.2. Batch job script for training with 8GPUs\n",
    "\n",
    "To run training on LUMI using 8 GPUs, you need to submit a batch job via a SLURM script. Below is an example script named **run_train_8gpu.sh**.\n",
    "\n",
    "This script:\n",
    "* Requests resources from the GPU [partition](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/) (eg. dev-g / small-g) including 8 GPUs, 56 CPU cores, and 480 GB of memory.\n",
    "* Loads the necessary modules for Singularity container support.\n",
    "* Sets environment variables for Hugging Face cache and tokenizer behavior.\n",
    "* Defines an output directory for saving the trained model and logs.\n",
    "* Launches the training inside the container using `torchrun` with distributed training support.\n",
    "\n",
    "**Remember** to replace `<number-here>` with your project ID, `<path-to-training-data-parquet-file>` with the actual path to your preprocessed training data, and `<path-to-training-container>` with the path to your training container (e.g., training_env.sif).\n",
    "Also, consider switching the partition to `small-g` and adjusting the `--time` parameter for longer training runs.\n",
    "\n",
    "\n",
    "**run_train_8gpu.sh**\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH --account=project_<number-here>\n",
    "#SBATCH --partition=dev-g\n",
    "#SBATCH --ntasks=1\n",
    "#SBATCH --cpus-per-task=56\n",
    "#SBATCH --mem=480G\n",
    "#SBATCH --time=00:15:00 \n",
    "#SBATCH --gpus-per-node=8\n",
    "\n",
    "module use /appl/local/containers/ai-modules\n",
    "module load singularity-AI-bindings\n",
    "\n",
    "# This will store all the Hugging Face cache such as downloaded models\n",
    "# and datasets in the project's scratch folder\n",
    "export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache\n",
    "mkdir -p $HF_HOME\n",
    "\n",
    "# Path to where the trained model and logging data will go\n",
    "OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data\n",
    "mkdir -p $OUTPUT_DIR\n",
    "\n",
    "TRAINING_DATA_FILE=<path-to-training-data-parquet-file>\n",
    "\n",
    "# Disable internal parallelism of huggingface's tokenizer since we\n",
    "# want to retain direct control of parallelism options.\n",
    "export TOKENIZERS_PARALLELISM=false\n",
    "\n",
    "set -xv  # print the command so that we can verify setting arguments correctly from the logs\n",
    "\n",
    "CONTAINER=<path-to-training-container>\n",
    "     \n",
    "srun singularity exec  $CONTAINER \\\n",
    "    torchrun --standalone \\\n",
    "        --nnodes=1 \\\n",
    "        --nproc-per-node=$SLURM_GPUS_PER_NODE \\\n",
    "        train.py $* \\\n",
    "       --output-path $OUTPUT_DIR \\\n",
    "       --parquet-file $TRAINING_DATA_FILE \\\n",
    "       --model_output_name=\"Llama-3.2-1B-finetuned\" \\\n",
    "       --num-workers $SLURM_CPUS_PER_TASK \\\n",
    "       --batch_size=8\n",
    "```\n"
   ]
  },
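  {
   "cell_type": "markdown",
   "id": "2f7d3b9c",
   "metadata": {},
   "source": [
    "`torchrun` starts one process per GPU and tells each process its position in the job through environment variables, which **train.py** reads with `os.environ[...]`. As a hedged sketch (the helper `read_dist_env` and its single-process defaults are assumptions, not part of the tutorial scripts), the same variables can be read with fallbacks, so that a script also starts when launched with plain `python`, for example for a quick smoke test:\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "\n",
    "def read_dist_env(env=None):\n",
    "    # Read the variables set by torchrun, falling back to a single-process setup.\n",
    "    env = os.environ if env is None else env\n",
    "    return {\n",
    "        'rank': int(env.get('RANK', 0)),\n",
    "        'local_rank': int(env.get('LOCAL_RANK', 0)),\n",
    "        'world_size': int(env.get('WORLD_SIZE', 1)),\n",
    "        'local_world_size': int(env.get('LOCAL_WORLD_SIZE', 1)),\n",
    "    }\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    print(read_dist_env())\n",
    "```\n",
    "\n",
    "Under the batch script above, `world_size` and `local_world_size` would both be 8, since all processes run on a single node."
   ]
  },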
  {
   "cell_type": "markdown",
   "id": "e06cc994-34c4-4e8e-9092-2b18451ecf7e",
   "metadata": {},
   "source": [
    "### 4.3. Run training script\n",
    "\n",
    "To train the model on LUMI with 8 GPUs, submit the batch job using the SLURM script provided in **run_train_8gpu.sh**.\n",
    "\n",
    "Simply run the following command in the LUMI terminal:\n",
    "````\n",
    "sbatch run_train_8gpu.sh\n",
    "````\n",
    "\n",
    "Once the job starts, a SLURM job file named `slurm-{slurm_job_id}.job` will be created automatically.\n",
    "\n",
    "You can monitor the status of your jobs at any time using: `sacct`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9d1460f-5c61-427e-95a5-e26f8df68f9c",
   "metadata": {},
   "source": [
    "### 4.4. Use MLflow to check metrics \n",
    "\n",
    "After the training completes, you’ll find the logged data inside the `mlruns` folder located within your specified output directory. If you didn't change this part in the **run_train_8gpu.sh**, the lcoation for mlflow metrics is `/scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data/mlruns`.\n",
    "\n",
    "To visualize and monitor your training metrics, you can open an MLflow session via the LUMI web interface.\n",
    "Navigate to **Apps** -> **Mlflow**.\n",
    "\n",
    "Set the `Location where MLflow files are stored` to the full path where your `mlruns` folder is located. After launching the session, you can interactively browse training metrics, losses and parameters."
   ]
  },
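  {
   "cell_type": "markdown",
   "id": "c8e4015a",
   "metadata": {},
   "source": [
    "The logged metrics can also be inspected without the web interface. The sketch below assumes MLflow's local file store layout, in which each metric is a plain-text file at `mlruns/<experiment_id>/<run_id>/metrics/<metric_name>` with one `timestamp value step` triple per line. This layout is an implementation detail that may differ between MLflow versions, and the helper name `read_metric_file` is hypothetical. For anything beyond a quick look, MLflow's own Python API is the safer choice.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def read_metric_file(path):\n",
    "    # Parse lines of the form '<timestamp> <value> <step>' into (step, value) pairs.\n",
    "    points = []\n",
    "    for line in Path(path).read_text().splitlines():\n",
    "        parts = line.split()\n",
    "        if len(parts) != 3:\n",
    "            continue  # skip blank or unexpected lines\n",
    "        _timestamp, value, step = parts\n",
    "        points.append((int(step), float(value)))\n",
    "    return sorted(points)\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    # Hypothetical location; replace the placeholders with your own IDs.\n",
    "    metric_path = Path('mlruns/<experiment_id>/<run_id>/metrics/eval_loss')\n",
    "    if metric_path.exists():\n",
    "        for step, value in read_metric_file(metric_path):\n",
    "            print(step, value)\n",
    "```\n",
    "\n",
    "Sorting by step makes it easy to check at a glance whether the evaluation loss decreases over the course of the run."
   ]
  },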
  {
   "cell_type": "markdown",
   "id": "7477bf14-307d-433e-8d10-752c0c83d8f4",
   "metadata": {},
   "source": [
    "# 5. Test the model\n",
    "\n",
    "After finetuning, you can test the model using a Python script and a SLURM batch job. Inference results will be saved to a logging file for review.\n",
    "\n",
    "To run inference, simply submit the batch job with: `sbatch run_inference.sh`\n",
    "\n",
    "This will generate model outputs for your predefined prompts and log them for inspection.\n",
    "\n",
    "*Note! We don't need to use the container here since we don't need any additional packages.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "561fa025-10e8-40a2-8f87-dd7af32791f2",
   "metadata": {},
   "source": [
    "**run_inference.sh**\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH --account=project_XXXXXXXXXX\n",
    "#SBATCH --partition=dev-g\n",
    "#SBATCH --ntasks=1\n",
    "#SBATCH --cpus-per-task=7\n",
    "#SBATCH --mem=60G\n",
    "#SBATCH --time=0:15:00\n",
    "#SBATCH --gpus-per-node=1\n",
    "\n",
    "module purge\n",
    "module use /appl/local/csc/modulefiles/\n",
    "module load pytorch/2.5\n",
    "\n",
    "# This will store all the Hugging Face cache such as downloaded models\n",
    "# and datasets in the project's scratch folder\n",
    "export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache\n",
    "mkdir -p $HF_HOME\n",
    "\n",
    "export LOG_FILE_PATH=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/inference_logs\n",
    "mkdir -p $LOG_FILE_PATH\n",
    "export LOG_FILE=${LOG_FILE_PATH}/inference_prints.log\n",
    "\n",
    "# Path to where the trained model and logging data will go\n",
    "OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-data\n",
    "mkdir -p $OUTPUT_DIR\n",
    "\n",
    "# Disable internal parallelism of huggingface's tokenizer since we\n",
    "# want to retain direct control of parallelism options.\n",
    "export TOKENIZERS_PARALLELISM=false\n",
    "\n",
    "set -xv  # print the command so that we can verify setting arguments correctly from the logs\n",
    "\n",
    "MODEL_PATH_1=\"meta-llama/Llama-3.2-1B\"\n",
    "MODEL_PATH_2=\"</path/to/your/finetuned/model>\"\n",
    "\n",
    "# Define prompts as an array\n",
    "PROMPTS=(\n",
    "  \"Tekoälyn kehitys muuttaa maailmaa nopeasti ja siksi \"\n",
    "  \"Tervetuloa \"\n",
    ")\n",
    "# Run inference for each model and prompt combination\n",
    "for MODEL in \"$MODEL_PATH_1\" \"$MODEL_PATH_2\"; do\n",
    "  for PROMPT in \"${PROMPTS[@]}\"; do\n",
    "    srun python inference.py \\\n",
    "      --model \"$MODEL\" \\\n",
    "      --prompt \"$PROMPT\"\n",
    "  done\n",
    "done\n",
    "```\n",
    "\n",
    "**inference.py**\n",
    "```python\n",
    "import logging\n",
    "import argparse\n",
    "import torch\n",
    "import os\n",
    "\n",
    "from datetime import datetime\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "LOG_FILE = os.environ.get('LOG_FILE')\n",
    "slurmjob_id = os.environ['SLURM_JOBID']\n",
    "\n",
    "# logging file settings\n",
    "logging.basicConfig(\n",
    "    filename=LOG_FILE,\n",
    "    level=logging.INFO\n",
    ")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument(\n",
    "        \"--model\",\n",
    "        type=str,\n",
    "        help=\"Path to fine-tuned model directory\"\n",
    "    )\n",
    "    \n",
    "    parser.add_argument(\n",
    "        \"--prompt\",\n",
    "        type=str,\n",
    "        help=\"Prompt for the LLM to continue\"\n",
    "    )\n",
    "    args = parser.parse_args()\n",
    "\n",
    "\n",
    "    logging.info(f\"Slurmjob_ID : {slurmjob_id}\")\n",
    "    logging.info(f\"Model Path: {args.model}\")\n",
    "    logging.info(f\"Prompt: {args.prompt}\")\n",
    "\n",
    "    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n",
    "    print(f\"Using device {device}\")\n",
    "    if device.type == 'cuda':\n",
    "        print(f\"Device name is {torch.cuda.get_device_name(device)}\")\n",
    "\n",
    "    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "    model = AutoModelForCausalLM.from_pretrained(args.model)\n",
    "    model.to(device)\n",
    "\n",
    "\n",
    "    with torch.no_grad():\n",
    "        inputs = tokenizer(args.prompt, return_tensors='pt').to(device)\n",
    "        outputs = model.generate(**inputs, do_sample=True, max_length=200, num_return_sequences=2)\n",
    "        decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)\n",
    "\n",
    "        \n",
    "        print(\"Generated Outputs:\")\n",
    "        logging.info(\"Generated Outputs:\")\n",
    "        for i, text in enumerate(decoded_outputs):\n",
    "            print(f\"\\n--- Output {i + 1} ---\\n{text}\")\n",
    "            logging.info(f\"\\n--- Output {i + 1} ---\\n{text}\")\n",
    "\n",
    "    logging.info(\"-\" * 40)\n",
    "    \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11216550-b93c-40c9-a9df-61930fb016aa",
   "metadata": {},
   "source": [
    "**Thank you for following the tutorial — we hope you found it useful!**\n",
    "\n",
    "For more information on the OpenWebSearch.eu project see: https://openwebsearch.eu/\n",
    "\n",
    "For more information on the LUMI supercomputer and CSC, see: https://www.lumi-supercomputer.eu/, https://www.csc.fi/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (venv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}