{ "cells": [ { "cell_type": "markdown", "id": "582a58a5", "metadata": {}, "source": [ "# Tutorial 14: Finetuning an LLM with OWI data using the LUMI supercomputer\n", "\n", "This tutorial demonstrates one way to use OWI data to finetune a large language model (LLM) on the [LUMI supercomputer](https://lumi-supercomputer.eu/). In this example Finnish language data is downloaded and used to finetune Meta's [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), improving its performance on Finnish language tasks.\n", "\n", "This tutorial has these steps:\n", "1. Get started on using LUMI.\n", "2. Create a [Singularity](https://docs.lumi-supercomputer.eu/software/containers/singularity/#building-apptainersingularity-sif-containers) container for data downloading with the [owilix](https://opencode.it4i.eu/openwebsearcheu-public/owi-cli) command line tool and download the data using this container.\n", "3. Preprocess the data and prepare it for training using Jupyter notebooks within the Jupyter environment provided by the LUMI web interface.\n", "4. Create a second Singularity container optimized for training with the necessary Python packages for machine learning.\n", "5. Create a [batch job](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/) and write Python scripts for training and inference using the training container." ] }, { "cell_type": "markdown", "id": "edec5b0a", "metadata": {}, "source": [ "**OWI License**\n", "\n", "Before using any data, you must review the terms under [the license](https://openwebsearch.eu/owil-current/). 
\n", "**The model trained with OWI data may only be used for research purposes.**\n", "\n", "**Disclaimer**\n", "\n", "Please note that this is a technical guide only and does not constitute a legal assessment of whether or how you may use the data.\n", "\n", "This work uses index files as part of the index partition created by the OpenWebSearch.eu project that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070014 (OpenWebSearch.EU)." ] }, { "cell_type": "markdown", "id": "09d47029-c961-46fa-b64b-57786b7e3af4", "metadata": {}, "source": [ "# 1. Get started on using LUMI\n", "\n", "**To begin using the LUMI supercomputer, follow these steps (requires a valid project):**\n", "1. [Get a user account](https://docs.lumi-supercomputer.eu/firststeps/accessLUMI/)\n", "2. [Set up an SSH key pair to be able to use LUMI from a terminal](https://docs.lumi-supercomputer.eu/firststeps/SSH-keys/#__tabbed_2_1)\n", "3. [Log in to LUMI with an SSH client](https://docs.lumi-supercomputer.eu/firststeps/loggingin/)\n", "\n", "```bash\n", "ssh -i @lumi.csc.fi\n", "```\n", "\n", "## 1.1. Where to store data - disk areas\n", "\n", "Each user has a home directory (`$HOME`) that can contain up to 20 GB of data. Do not use it for data or code - use `/project` or `/scratch` instead. See more about the different disk areas here: https://docs.lumi-supercomputer.eu/storage/#where-to-store-data\n", "\n", "## 1.2. Installing Python packages\n", "\n", "Installing packages directly via `pip` or `conda` is not recommended, as it puts a lot of strain on the Lustre file system on LUMI. Instead, users should use Singularity/Apptainer containers. Please also see the official guidance on how to install new Python packages in the [LUMI software guide](https://docs.lumi-supercomputer.eu/software/installing/python/)." ] }, { "cell_type": "markdown", "id": "816ed5a6", "metadata": {}, "source": [ "# 2. 
Get the data\n", "\n", "Let's create a Singularity container using [`cotainr`](https://cotainr.readthedocs.io/en/stable/) in order to download the data with the owilix command line tool. \n", "\n", "## 2.1. Container for owilix \n", "\n", "\n", "First, we specify the packages to be installed in a conda environment .yml file. Then we use `cotainr` to build a new container with the defined packages. In this case we need `owilix` (requires Python 3.10 or 3.11) and `py4lexis`. \n", "\n", "**Create owilix_env.yml file:**\n", "```yml\n", "name: owilix_env\n", "channels:\n", "  - conda-forge\n", "dependencies:\n", "  - python=3.11\n", "  - pip=24.0\n", "  - pip:\n", "    - --extra-index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple\n", "    - --extra-index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple\n", "    - py4lexis\n", "    - owilix\n", "```\n", "\n", "\n", "**In the terminal of LUMI (note: building the container takes several minutes):**\n", "```bash\n", "# Get needed modules\n", "module purge\n", "module load LUMI\n", "module load cotainr\n", "\n", "# Use cotainr to build the container \n", "cotainr build owilix_env.sif --system=lumi-g --conda-env=owilix_env.yml --accept-license\n", "\n", "# Add required additional bindings\n", "module use /appl/local/containers/ai-modules/\n", "module load singularity-AI-bindings \n", "\n", "# Verify installation\n", "singularity exec owilix_env.sif bash -c 'pip list'\n", "```\n", "\n", "## 2.2. Use owilix to download the data\n", "\n", "In this example, we will download the latest Finnish data. We'll open a shell inside the container and run the commands that download the data to the desired directory. \n", "\n", "1. Run a shell within the container:\n", "\n", "`singularity shell owilix_env.sif`\n", "\n", "2. 
Download the data using owilix (remember to set the target directory!): \n", "\n", "`owilix --target remote pull all:latest#30 files=\"**/language=fin/*\"`\n", "\n", "**Complete the authentication:** \n", "\n", "You will be prompted to accept the terms by typing `yes`. Then copy the web address that appears in the terminal, open it in your browser, and log in to complete the authentication process." ] }, { "cell_type": "markdown", "id": "da282f60-9474-401b-92fb-9f38caa90dab", "metadata": {}, "source": [ "# 3. Preprocess data\n", "\n", "Now we are ready to preprocess the data using Jupyter. Note that the preprocessing demonstrated in this tutorial is minimal. **You should consider more thorough preprocessing for your specific use case!**\n", "\n", "## 3.1. Activate a Jupyter session\n", "We’ll use the Jupyter environment provided by LUMI: \n", "- 1. Navigate to **Apps** -> **Jupyter**\n", "- 2. Configure the session with the following settings:\n", "    - Project: project_XXXXXX\n", "    - Partition: small\n", "    - Number of CPU cores: 64\n", "    - Memory (GiB): 128\n", "    - Working directory: Select from the dropdown\n", "    - Python: pytorch \n", "- 3. Wait for your session to be ready, then click `Connect to Jupyter`\n", "\n", "Once connected, create a notebook and proceed with the preprocessing steps." ] }, { "cell_type": "markdown", "id": "0d74f2db-5c33-497f-bb21-f4c85c0630ad", "metadata": {}, "source": [ "## 3.2. Combine all data to df\n", "\n", "Next, we’ll load and combine all downloaded data files into a single pandas DataFrame. After all the preprocessing steps, we'll save the result as a .parquet file." 
] }, { "cell_type": "code", "execution_count": 1, "id": "8c31740f-9b3c-40d0-9d01-8150091ba733", "metadata": {}, "outputs": [], "source": [ "path_to_owilix_data = ''" ] }, { "cell_type": "code", "execution_count": 3, "id": "16e54d2d-9aa4-49a2-944d-f385eb2ee71e", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "import glob" ] }, { "cell_type": "code", "execution_count": null, "id": "9b2f2e56-7d5b-4d85-bac3-1d41b19adcd2", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Combined DataFrame shape: (1256577, 43)\n" ] } ], "source": [ "# Find every parquet file below the download directory\n", "parquet_files = glob.glob(os.path.join(path_to_owilix_data, \"**/*.parquet\"), recursive=True)\n", "\n", "# Read the files one by one; report unreadable files and continue\n", "dataframes = []\n", "for file in parquet_files:\n", "    try:\n", "        df = pd.read_parquet(file)\n", "        dataframes.append(df)\n", "    except Exception as e:\n", "        print(f\"Error reading {file}: {e}\")\n", "\n", "# Combine all DataFrames\n", "combined_df = pd.concat(dataframes, ignore_index=True)\n", "print(f\"Combined DataFrame shape: {combined_df.shape}\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "67019a05-875f-4687-8a17-e038e63eb35b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['id', 'record_id', 'title', 'main_content', 'json-ld', 'microdata',\n", " 'opengraph', 'warc_date', 'warc_ip', 'url', 'url_scheme', 'url_path',\n", " 'url_params', 'url_query', 'url_fragment', 'url_subdomain',\n", " 'url_domain', 'url_suffix', 'url_is_private', 'mime_type', 'charset',\n", " 'content_type_other', 'http_server', 'valid', 'warc_file',\n", " 'warc_offset', 'schema_metadata', 'ows_canonical', 'ows_resource_type',\n", " 'ows_curlielabel', 'ows_index', 'ows_genai', 'ows_genai_details',\n", " 'ows_fetch_response_time', 'ows_fetch_num_errors', 'outgoing_links',\n", " 'image_links', 'video_links', 'iframes', 'curlielabels',\n", " 'curlielabels_en', 'address', 'plain_text'],\n", " dtype='object')" ] }, "execution_count": 6, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "combined_df.columns " ] }, { "cell_type": "markdown", "id": "e1cde09e-2497-4ea7-915d-47d8f42998d7", "metadata": {}, "source": [ "## 3.3. Filter the content\n", "\n", "In this step, we will combine data from both the `plain_text` and `main_content` columns, because the `plain_text` column was renamed to `main_content` in schema version 0.2.X. For more information about the columns, see the [Preprocessing Pipeline](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline) documentation.\n", "\n", "| column | description | Schema version |\n", "| ------ | ----------- | -------------- |\n", "| plain_text | Cleaned text from the HTML | 0.1.X |\n", "| main_content | Main content of the HTML, formatted with minimal HTML tags (`h1-6`, `p`, `ul/ol/li`, `pre`, and `a` tags) | 0.2.X |\n", "\n", "We will then proceed with the following steps: \n", "1. Filter rows where `ows_genai == True`\n", "2. Remove duplicates based on `main_content` and `url`\n", "3. Filter and clean the `main_content`\n", "4. Drop duplicates again after cleaning\n", "5. Filter by word count\n", "6. Double-check the language with [langdetect](https://pypi.org/project/langdetect/)" ] }, { "cell_type": "markdown", "id": "1084bba5-4e87-46c8-981b-b8407bf1f255", "metadata": {}, "source": [ "### 3.3.1. Use ows_genai and drop duplicates\n", "\n", "Prepare the downloaded OWI data for training by:\n", "- Combining content fields and removing empty entries\n", "- Filtering for GenAI-suitable content (`ows_genai = True`) \n", "- Removing duplicates by content and URL\n", "- Selecting final columns: `title`, `url`, `main_content`\n", "\n", "Progress is tracked by printing the dataset shape after each step." 
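, "\n", "\n", "As a minimal sketch, the steps listed above can be tried on a tiny hand-made DataFrame before running them on the full dataset. The rows, titles and URLs below are hypothetical stand-ins, not OWI data; only the column names follow the schema shown earlier.

```python
import pandas as pd

# Hypothetical toy rows that mimic the relevant OWI columns (not real data)
toy = pd.DataFrame({
    'title': ['A', 'B', 'B again', 'A copy', 'Empty', 'E'],
    'url': ['a.fi', 'b.fi', 'b.fi', 'c.fi', 'd.fi', 'e.fi'],
    'main_content': ['text A', None, 'text C', 'text A', None, 'text E'],
    'plain_text': [None, 'text B', None, None, None, None],
    'ows_genai': [True, True, True, True, True, False],
})

# 1) Fall back to plain_text where main_content is missing, drop empty rows
toy['main_content'] = toy['main_content'].fillna(toy['plain_text'])
toy = toy[toy['main_content'].notna()].drop(columns=['plain_text'])

# 2) Keep only rows flagged as suitable for GenAI use
toy = toy[toy['ows_genai']]

# 3) Drop duplicates by content first, then by URL (first occurrence wins)
toy = toy.drop_duplicates(subset='main_content')
toy = toy.drop_duplicates(subset='url')

# 4) Keep only the columns needed later
toy = toy[['title', 'url', 'main_content']]
print(toy)
```

Note that the order of the two deduplication passes matters: a row removed by the content pass can no longer shadow another row in the URL pass.
"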
] }, { "cell_type": "code", "execution_count": 7, "id": "c39c37fe-4716-4c22-8b44-17d2af658bff", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame shape before first steps: (1256577, 43)\n", "DataFrame shape after combining main_content and plain_text: (1256577, 42)\n", "DataFrame shape after ows_genai: (1248081, 42)\n", "DataFrame shape after dropping dups (main_content): (466222, 42)\n", "DataFrame shape after dropping dups (url): (220210, 42)\n", "DataFrame shape after all steps: (220210, 3)\n" ] } ], "source": [ "print(f\"DataFrame shape before first steps: {combined_df.shape}\")\n", "\n", "# Fill missing values in 'main_content' with values from 'plain_text'\n", "combined_df['main_content'] = combined_df['main_content'].fillna(combined_df['plain_text']) \n", "\n", "# Drop rows where 'main_content' is still missing and remove the now-unneeded 'plain_text' column\n", "combined_df = combined_df[combined_df[\"main_content\"].notna()].drop(columns=[\"plain_text\"])\n", "print(f\"DataFrame shape after combining main_content and plain_text: {combined_df.shape}\")\n", "\n", "# Keep only rows where 'ows_genai' is True\n", "combined_df = combined_df[combined_df['ows_genai'] == True]\n", "print(f\"DataFrame shape after ows_genai: {combined_df.shape}\")\n", "\n", "# Remove duplicate rows based on 'main_content', then remove duplicates based on 'url'\n", "combined_df= combined_df.drop_duplicates(subset='main_content') # .drop_duplicates(subset='url')\n", "print(f\"DataFrame shape after dropping dups (main_content): {combined_df.shape}\")\n", "combined_df= combined_df.drop_duplicates(subset='url')\n", "print(f\"DataFrame shape after dropping dups (url): {combined_df.shape}\")\n", "\n", "# Select only the relevant columns for further processing\n", "combined_df = combined_df[['title','url','main_content']]\n", "print(f\"DataFrame shape after all steps: {combined_df.shape}\")" ] }, { "cell_type": "code", "execution_count": 9, "id": 
"50eae98c-9023-47bb-8393-b2c3c5b2ec46", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>title</th><th>url</th><th>main_content</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>forum.bomber.fi - Omat asetukset - Käyttöehdot</td><td>https://www.bomber.fi/forums/user/terms?sid=cd...</td><td>&lt;h2&gt;forum.bomber.fi - Käyttöehdot&lt;/h2&gt;\\n\\n&lt;p&gt;K...</td></tr>\n",
"<tr><th>1</th><td>Yhteystiedot - Mustasaaren seurakuntayhtymä</td><td>https://www.mustasaarenseurakuntayhtyma.fi/yht...</td><td>&lt;h1&gt;Yhteystiedot&lt;/h1&gt;\\n\\n&lt;p&gt; &lt;/p&gt;\\n\\n&lt;h4&gt;Musta...</td></tr>\n",
"<tr><th>2</th><td>VAELLUSNET - Vaellusturinat II - Omat asetukse...</td><td>http://www.vaellusnet.com/ucp.php?mode=terms&amp;s...</td><td>&lt;h2&gt;VAELLUSNET - Vaellusturinat II - Käyttöehd...</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " title \\\n", "0 forum.bomber.fi - Omat asetukset - Käyttöehdot \n", "1 Yhteystiedot - Mustasaaren seurakuntayhtymä \n", "2 VAELLUSNET - Vaellusturinat II - Omat asetukse... \n", "\n", " url \\\n", "0 https://www.bomber.fi/forums/user/terms?sid=cd... \n", "1 https://www.mustasaarenseurakuntayhtyma.fi/yht... \n", "2 http://www.vaellusnet.com/ucp.php?mode=terms&s... \n", "\n", " main_content \n", "0

forum.bomber.fi - Käyttöehdot

\\n\\n

K... \n", "1

Yhteystiedot

\\n\\n

 

\\n\\n

Musta... \n", "2

VAELLUSNET - Vaellusturinat II - Käyttöehd... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df.head(3)" ] }, { "cell_type": "markdown", "id": "c718e618-b615-4f79-a7b2-5bf5613eb6eb", "metadata": {}, "source": [ "### 3.3.2. Filter HTML content\n", "\n", "This code performs minimal cleaning of the `main_content` field. You can define terms (such as policy-related keywords) in `POLICY_TERMS` to exclude pages entirely.\n", "\n", "The function performs the following:\n", "* Removes `<a>` tags but keeps the inner text\n", "* Replaces block-level HTML tags (`<p>`, `<h1>`, etc.) and `<br>` with newlines\n", "* Cleans up HTML entities and removes bullet symbols\n", "* Filters out short or incomplete lines (e.g. no terminal punctuation, too few words)\n", "* Normalizes whitespace and joins the cleaned lines into a final text block\n", "* Returns `None` if no meaningful content remains" ] }, { "cell_type": "code", "execution_count": 10, "id": "c9964b5c-c17d-4dfe-994b-9efb09dd00e9", "metadata": {}, "outputs": [], "source": [ "import re\n", "import html\n", "\n",
"# Terms to exclude early (e.g., policy pages)\n",
"POLICY_TERMS = [\"käyttöeh\"]\n",
"\n",
"# Precompiled regex patterns\n",
"# NOTE: the three tag patterns below are a reconstruction (assumption) based on\n",
"# the tags listed for `main_content`; adjust the tag list to your data.\n",
"A_TAG = re.compile(r'<a\\b[^>]*?>(.*?)</a>', flags=re.IGNORECASE | re.DOTALL)\n",
"BLOCK_TAGS = re.compile(r'</?(?:p|div|h[1-6]|ul|ol|li|pre)\\b[^>]*>', flags=re.IGNORECASE)\n",
"BR_TAG = re.compile(r'<br\\s*/?>', flags=re.IGNORECASE)\n",
"TAG_CLEANER = re.compile(r'<[^>]+>') # fallback to remove leftover tags\n",
"\n",
"TERMINAL_PUNCT_PATTERN = re.compile(r'[.!?]\\s*$')\n",
"WHITESPACE_PATTERNS = {\n",
"    'multiple_newlines': re.compile(r\"\\n{3,}\"),\n",
"    'spaces': re.compile(r\"[ \\t]+\"),\n",
"    'trailing_spaces': re.compile(r\" +\\n\")\n",
"}\n",
"\n",
"def clean_html_min(html_str: str):\n",
"    if not html_str or not html_str.strip():\n",
"        return None\n",
"\n",
"    # Early policy term check\n",
"    html_lower = html_str.lower()\n",
"    if any(term in html_lower for term in POLICY_TERMS):\n",
"        return None\n",
"\n",
"    # Unwrap <a> tags but keep inner text\n",
"    html_str = A_TAG.sub(r'\\1', html_str)\n",
"\n",
"    # Replace <br> and block-level tags with newlines\n",
"    html_str = BR_TAG.sub('\\n', html_str)\n",
"    html_str = BLOCK_TAGS.sub('\\n', html_str)\n",
"\n",
"    # Remove all remaining tags (non-block level)\n",
"    html_str = TAG_CLEANER.sub('', html_str)\n",
"\n",
"    # Decode HTML entities (e.g. &quot; → \", &nbsp; → space)\n",
"    html_str = html.unescape(html_str)\n",
"    html_str = html_str.replace('\\xa0', ' ') # additional non-breaking space cleanup\n",
"\n",
"    # Remove common bullet symbols\n",
"    html_str = re.sub(r'[•◦\\u2022]', '', html_str)\n",
"\n",
"    # Normalize and filter lines\n",
"    lines = [line.strip() for line in html_str.split('\\n') if line.strip()]\n",
"    cleaned_lines = []\n",
"\n",
"    for line in lines:\n",
"        # Must end in terminal punctuation\n",
"        if not TERMINAL_PUNCT_PATTERN.search(line):\n",
"            continue\n",
"\n",
"        # Must be long enough\n",
"        if len(line) < 20 or len(line.split()) < 4:\n",
"            continue\n",
"\n",
"        cleaned_lines.append(line)\n",
"\n",
"    if not cleaned_lines:\n",
"        return None\n",
"\n",
"    # Join and normalize whitespace\n",
"    cleaned_text = '\\n'.join(cleaned_lines)\n",
"    cleaned_text = WHITESPACE_PATTERNS['multiple_newlines'].sub(\"\\n\\n\", cleaned_text)\n",
"    cleaned_text = WHITESPACE_PATTERNS['spaces'].sub(\" \", cleaned_text)\n",
"    cleaned_text = WHITESPACE_PATTERNS['trailing_spaces'].sub(\"\\n\", cleaned_text)\n",
"\n",
"    return cleaned_text.strip() if cleaned_text.strip() else None\n" ] }, { "cell_type": "markdown", "id": "bc7a9513-e08f-4244-963f-410c44c18a34", "metadata": {}, "source": [ "### 3.3.3. Example of a site before preprocessing" ] }, { "cell_type": "code", "execution_count": 11, "id": "0633497d-db45-489f-918f-2b6902d13aae", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
Siirry sisältöön\n", "\n", "

Kae Araki

\n", "\n", "Wikipediasta\n", "\n", "

Kae Araki (jap. 荒木香恵, oikealta nimeltään Kae Abe, s. 6. marraskuuta 1966 Osaka) on japanilainen ääninäyttelijä, seiyū, joka on näytellyt monissa anime- ja televisiosarjoissa, muun muassa Babar, Cardcaptor Sakura, Digimon, Fushigi yūgi, Great Teacher Onizuka, Kodomo no omocha, Wakakusa monogatari – Nan to Jo no sensei ja Pokémon. Animesarjojen lisäksi hän on esiintynyt monissa peleissä.

\n", "\n", "

Aiheesta muualla

\n", "\n", "[muokkaa | muokkaa wikitekstiä]\n", "
    \n", "
  • Kae Araki Internet Movie Databasessa. (englanniksi)
  • \n", "
\n", "Tämä näyttelijään liittyvä artikkeli on tynkä. Voit auttaa Wikipediaa laajentamalla artikkelia.
\n" ] } ], "source": [ "test = df['main_content'].iloc[10] \n", "print(test)" ] }, { "cell_type": "markdown", "id": "8fcb102c", "metadata": {}, "source": [ "### 3.3.4. Example of the site after preprocessing\n", "\n", "This short example illustrates how the HTML cleaning code works." ] }, { "cell_type": "code", "execution_count": 12, "id": "27b90989-69ce-453b-b272-0886b63e7ddc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Kae Araki (jap. 荒木香恵, oikealta nimeltään Kae Abe, s. 6. marraskuuta 1966 Osaka) on japanilainen ääninäyttelijä, seiyū, joka on näytellyt monissa anime- ja televisiosarjoissa, muun muassa Babar, Cardcaptor Sakura, Digimon, Fushigi yūgi, Great Teacher Onizuka, Kodomo no omocha, Wakakusa monogatari – Nan to Jo no sensei ja Pokémon. Animesarjojen lisäksi hän on esiintynyt monissa peleissä.\n", "Tämä näyttelijään liittyvä artikkeli on tynkä. Voit auttaa Wikipediaa laajentamalla artikkelia.\n" ] } ], "source": [ "res = clean_html_min(test)\n", "print(res)" ] }, { "cell_type": "code", "execution_count": 13, "id": "7a107ef7-4914-45f2-af30-38d613a4beaa", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Cleaning HTML content: 100%|██████████| 220210/220210 [00:53<00:00, 4142.45it/s] \n" ] }, { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>title</th><th>url</th><th>main_content</th><th>cleaned_html_content</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>forum.bomber.fi - Omat asetukset - Käyttöehdot</td><td>https://www.bomber.fi/forums/user/terms?sid=cd...</td><td>&lt;h2&gt;forum.bomber.fi - Käyttöehdot&lt;/h2&gt;\\n\\n&lt;p&gt;K...</td><td>None</td></tr>\n",
"<tr><th>1</th><td>Yhteystiedot - Mustasaaren seurakuntayhtymä</td><td>https://www.mustasaarenseurakuntayhtyma.fi/yht...</td><td>&lt;h1&gt;Yhteystiedot&lt;/h1&gt;\\n\\n&lt;p&gt; &lt;/p&gt;\\n\\n&lt;h4&gt;Musta...</td><td>None</td></tr>\n",
"<tr><th>2</th><td>VAELLUSNET - Vaellusturinat II - Omat asetukse...</td><td>http://www.vaellusnet.com/ucp.php?mode=terms&amp;s...</td><td>&lt;h2&gt;VAELLUSNET - Vaellusturinat II - Käyttöehd...</td><td>None</td></tr>\n",
"<tr><th>3</th><td>Gives me some privacy | Dekottaa</td><td>http://www.dekottaa.com/2014/01/gives-me-some-...</td><td>&lt;h2&gt;26.1.2014&lt;/h2&gt;\\n\\n&lt;a href=\"\"&gt;\\n\\n&lt;h1&gt; Give...</td><td>Liitutaulutarra kanan muodossa. Jos ei halua j...</td></tr>\n",
"<tr><th>4</th><td>Suomen Briard ry - Lähetä sähköpostia</td><td>http://www.suomenbriard.net/phpBB/memberlist.p...</td><td>&lt;h2&gt;Yhteystiedot käyttäjälle&lt;/h2&gt;\\n\\nYlläpitäj...</td><td>Tämä viesti lähetetään pelkkänä tekstinä. Älä ...</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>1091706</th><td>Vastauspalvelu</td><td>https://vastauspalvelu.omataloyhtio.fi/</td><td>&lt;a href=\"https://jurinet.fi/\"&gt;Jurinet&lt;/a&gt;\\nKuv...</td><td>Taloyhtiömme on asennettu uusi juuri ilmanpois...</td></tr>\n",
"<tr><th>1091814</th><td>Sound Particles Studio-ohjelmistot - Pikalatau...</td><td>https://www.muziker.fi/sound-particles-studio-...</td><td>&lt;p&gt; Valitse maa, johon lähetys toimitetaan &lt;/p...</td><td>None</td></tr>\n",
"<tr><th>1091907</th><td>Lattialämmityskaapelit - Hammarin Sähkö Oy</td><td>https://www.hammarinsahko.fi/sahkotarvikkeet/l...</td><td>Luotettavaa kauppaa yli 110 vuotta\\n\\n&lt;h2&gt;Latt...</td><td>Lattialämmityskaapelit varaavaan lattialämmity...</td></tr>\n",
"<tr><th>1091934</th><td>Kotitalousvähennyslaskuri 2025: Laske kotitalo...</td><td>https://vertaakorkoja.fi/kotitalousvahennyslas...</td><td>&lt;h1&gt;Kotitalousvähennyslaskuri&lt;/h1&gt;\\n\\n&lt;p&gt;Kotit...</td><td>Kotitalousvähennyslaskurin avulla voit laskea ...</td></tr>\n",
"<tr><th>1092309</th><td>Ota meihin yhteyttä – Mothersusurrus.com</td><td>https://mothersusurrus.com/ota-meihin-yhteytta/</td><td>&lt;h1&gt;Ota meihin yhteyttä&lt;/h1&gt;\\n\\n&lt;h4&gt;Mikäli sin...</td><td>Mikäli sinulla on kysyttävää musiikista, tai h...</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>220210 rows × 4 columns</p>
" ], "text/plain": [ " title \\\n", "0 forum.bomber.fi - Omat asetukset - Käyttöehdot \n", "1 Yhteystiedot - Mustasaaren seurakuntayhtymä \n", "2 VAELLUSNET - Vaellusturinat II - Omat asetukse... \n", "3 Gives me some privacy | Dekottaa \n", "4 Suomen Briard ry - Lähetä sähköpostia \n", "... ... \n", "1091706 Vastauspalvelu \n", "1091814 Sound Particles Studio-ohjelmistot - Pikalatau... \n", "1091907 Lattialämmityskaapelit - Hammarin Sähkö Oy \n", "1091934 Kotitalousvähennyslaskuri 2025: Laske kotitalo... \n", "1092309 Ota meihin yhteyttä – Mothersusurrus.com \n", "\n", " url \\\n", "0 https://www.bomber.fi/forums/user/terms?sid=cd... \n", "1 https://www.mustasaarenseurakuntayhtyma.fi/yht... \n", "2 http://www.vaellusnet.com/ucp.php?mode=terms&s... \n", "3 http://www.dekottaa.com/2014/01/gives-me-some-... \n", "4 http://www.suomenbriard.net/phpBB/memberlist.p... \n", "... ... \n", "1091706 https://vastauspalvelu.omataloyhtio.fi/ \n", "1091814 https://www.muziker.fi/sound-particles-studio-... \n", "1091907 https://www.hammarinsahko.fi/sahkotarvikkeet/l... \n", "1091934 https://vertaakorkoja.fi/kotitalousvahennyslas... \n", "1092309 https://mothersusurrus.com/ota-meihin-yhteytta/ \n", "\n", " main_content \\\n", "0

forum.bomber.fi - Käyttöehdot

\\n\\n

K... \n", "1

Yhteystiedot

\\n\\n

 

\\n\\n

Musta... \n", "2

VAELLUSNET - Vaellusturinat II - Käyttöehd... \n", "3

26.1.2014

\\n\\n\\n\\n

Give... \n", "4

Yhteystiedot käyttäjälle

\\n\\nYlläpitäj... \n", "... ... \n", "1091706
Jurinet\\nKuv... \n", "1091814

Valitse maa, johon lähetys toimitetaan Latt... \n", "1091934

Kotitalousvähennyslaskuri

\\n\\n

Kotit... \n", "1092309

Ota meihin yhteyttä

\\n\\n

Mikäli sin... \n", "\n", " cleaned_html_content \n", "0 None \n", "1 None \n", "2 None \n", "3 Liitutaulutarra kanan muodossa. Jos ei halua j... \n", "4 Tämä viesti lähetetään pelkkänä tekstinä. Älä ... \n", "... ... \n", "1091706 Taloyhtiömme on asennettu uusi juuri ilmanpois... \n", "1091814 None \n", "1091907 Lattialämmityskaapelit varaavaan lattialämmity... \n", "1091934 Kotitalousvähennyslaskurin avulla voit laskea ... \n", "1092309 Mikäli sinulla on kysyttävää musiikista, tai h... \n", "\n", "[220210 rows x 4 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Apply the cleaning function to every row, with a progress bar:\n", "from tqdm import tqdm\n", "tqdm.pandas(desc=\"Cleaning HTML content\")\n", "combined_df['cleaned_html_content'] = combined_df['main_content'].progress_map(clean_html_min)\n", "combined_df" ] }, { "cell_type": "markdown", "id": "bf674f41-9688-4e6c-b92b-2a044fe6c94a", "metadata": {}, "source": [ "### 3.3.3. Drop duplicates and None-values" ] }, { "cell_type": "code", "execution_count": 14, "id": "1365f882-1740-46a9-a956-a9104c3c2de2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Df shape before.: (220210, 4)\n", "Df shape after: (139074, 4)\n" ] } ], "source": [ "print(f\"Df shape before.: {combined_df.shape}\")\n", "combined_df = combined_df.drop_duplicates(subset='cleaned_html_content')\n", "combined_df = combined_df.dropna(subset=['cleaned_html_content'])\n", "\n", "print(f\"Df shape after: {combined_df.shape}\")" ] }, { "cell_type": "markdown", "id": "2fcb60f5-1d23-4b58-8848-1029d6e236d3", "metadata": {}, "source": [ "### 3.3.4. Filter by word count\n", "\n", "Next, we calculate the word count for each content entry and keep only the entries with more than 30 words. 
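
A minimal sketch of the word-count filter on toy strings (the sample texts and the `MIN_WORDS` name are illustrative assumptions, not part of the tutorial data). It uses plain Python `len(text.split())`, which counts whitespace-separated tokens in the same way as a pandas `.str.split().str.len()` chain, and shows that a strict `>` comparison also drops an entry with exactly 30 words.

```python
# Toy stand-ins for cleaned page texts (hypothetical, not taken from the OWI data)
texts = {
    'short': 'Ostaja maksaa sinulle suoraan!',
    'exactly_30': ' '.join(['sana'] * 30),
    'long_enough': ' '.join(['sana'] * 31),
}

MIN_WORDS = 30  # assumed threshold, same value as used in this tutorial

# Whitespace-based word count, the plain-Python analogue of
# Series.str.split().str.len()
word_counts = {name: len(text.split()) for name, text in texts.items()}

# A strict comparison keeps only entries with MORE than MIN_WORDS words
kept = [name for name, count in word_counts.items() if count > MIN_WORDS]

print(word_counts)
print(kept)
```

Use `>=` instead if entries with exactly the threshold number of words should be kept.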
" ] }, { "cell_type": "code", "execution_count": 16, "id": "aabecb7b-ef23-4dd6-8092-f6735242a8cb", "metadata": {}, "outputs": [], "source": [ "# Calculate word count for each entry\n", "combined_df['word_count'] = combined_df['cleaned_html_content'].str.split().str.len()\n", "\n", "# Sort by word count and reset the index\n", "combined_df = combined_df.sort_values(by='word_count').reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 17, "id": "969e76d2-51da-4fb4-9d0c-32d26a86e53b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>title</th><th>url</th><th>main_content</th><th>cleaned_html_content</th><th>word_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Kyky – Welcome</td><td>https://kyky.today/</td><td>Kyky Kyky\\n • Ota yhteyttä\\n • Rekisteröidy\\...</td><td>Ostaja maksaa sinulle suoraan!</td><td>4</td></tr>\n",
"<tr><th>1</th><td>Gluteeniton ruoka - Upbeat Intl. Trading Oy</td><td>https://www.east-asia-mart.fi/fi/tuoteryhma/23...</td><td>|\\n • e-Lahjakortit ja Onnenkassit (Fukubukur...</td><td>300 g Laatikko, Singapore.</td><td>4</td></tr>\n",
"<tr><th>2</th><td>Tietoja sivusta ”C. S. Lewis” – ApoWiki</td><td>https://apowiki.fi/index.php?action=info&amp;title...</td><td>Anonyymi\\n\\nEt ole kirjautunut\\n\\n • Keskuste...</td><td>Katso tämän sivun suojausloki.</td><td>4</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " title \\\n", "0 Kyky – Welcome \n", "1 Gluteeniton ruoka - Upbeat Intl. Trading Oy \n", "2 Tietoja sivusta ”C. S. Lewis” – ApoWiki \n", "\n", " url \\\n", "0 https://kyky.today/ \n", "1 https://www.east-asia-mart.fi/fi/tuoteryhma/23... \n", "2 https://apowiki.fi/index.php?action=info&title... \n", "\n", " main_content \\\n", "0 Kyky Kyky\\n • Ota yhteyttä\\n • Rekisteröidy\\... \n", "1 |\\n • e-Lahjakortit ja Onnenkassit (Fukubukur... \n", "2 Anonyymi\\n\\nEt ole kirjautunut\\n\\n • Keskuste... \n", "\n", " cleaned_html_content word_count \n", "0 Ostaja maksaa sinulle suoraan! 4 \n", "1 300 g Laatikko, Singapore. 4 \n", "2 Katso tämän sivun suojausloki. 4 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df.head(3)" ] }, { "cell_type": "code", "execution_count": 18, "id": "79b37ced-55cc-4e17-bd70-842b58a6a466", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>title</th><th>url</th><th>main_content</th><th>cleaned_html_content</th><th>word_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>139071</th><td>Ortodoksinen oppi pelastuksesta – Tsasounan su...</td><td>https://www.tsasouna.net/FI/2024/09/07/ortodok...</td><td>Skip to content\\nTsasounan suunnalta\\n\\n • Or...</td><td>Q &amp; A – kysy papilta!\\nQ &amp; A – Mikä ja miksi?\\...</td><td>69952</td></tr>\n",
"<tr><th>139072</th><td>Vuosikirja 2021 - Cockerspanielit ry</td><td>https://cockerspanielit.org/vuosikirja-2022-2/</td><td>&lt;h1&gt;Vuosikirja 2021&lt;/h1&gt;\\n\\n&lt;p&gt;Koostanut Pirjo...</td><td>Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ...</td><td>73949</td></tr>\n",
"<tr><th>139073</th><td>vierailija, tekijä sivustolla Hiiltä ja timanttia</td><td>https://blogit.metropolia.fi/hiilta-ja-timantt...</td><td>Hyppää sisältöön\\nMetropolian Blogit\\n • Uusi...</td><td>Verkko-opetus on tullut jäädäkseen, mutta mite...</td><td>85067</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " title \\\n", "139071 Ortodoksinen oppi pelastuksesta – Tsasounan su... \n", "139072 Vuosikirja 2021 - Cockerspanielit ry \n", "139073 vierailija, tekijä sivustolla Hiiltä ja timanttia \n", "\n", " url \\\n", "139071 https://www.tsasouna.net/FI/2024/09/07/ortodok... \n", "139072 https://cockerspanielit.org/vuosikirja-2022-2/ \n", "139073 https://blogit.metropolia.fi/hiilta-ja-timantt... \n", "\n", " main_content \\\n", "139071 Skip to content\\nTsasounan suunnalta\\n\\n • Or... \n", "139072

Vuosikirja 2021

\\n\\n

Koostanut Pirjo... \n", "139073 Hyppää sisältöön\\nMetropolian Blogit\\n • Uusi... \n", "\n", " cleaned_html_content word_count \n", "139071 Q & A – kysy papilta!\\nQ & A – Mikä ja miksi?\\... 69952 \n", "139072 Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ... 73949 \n", "139073 Verkko-opetus on tullut jäädäkseen, mutta mite... 85067 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df.tail(3)" ] }, { "cell_type": "code", "execution_count": 19, "id": "ce5bcf6c-840a-4c15-909b-0f718e610691", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 139074.000000\n", "mean 339.775587\n", "std 962.370831\n", "min 4.000000\n", "25% 52.000000\n", "50% 148.000000\n", "75% 345.000000\n", "max 85067.000000\n", "Name: word_count, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df['word_count'].describe()" ] }, { "cell_type": "code", "execution_count": 20, "id": "214aef8e-ba23-4354-914a-650e2addeef6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Df shape before.: (139074, 5)\n", "Df shape after: (117133, 5)\n" ] } ], "source": [ "print(f\"Df shape before.: {combined_df.shape}\")\n", "combined_df = combined_df[combined_df['word_count'] > 30]\n", "print(f\"Df shape after: {combined_df.shape}\")" ] }, { "cell_type": "markdown", "id": "6c684f7f-1124-4d14-9444-f85be9fff331", "metadata": {}, "source": [ "### 3.3.5. Detect language\n", "\n", "Let’s use the langdetect library to double-check the language of each entry and keep only those written in Finnish. 
This can take a few minutes.\n", "Then, filter the dataset to include only Finnish-language entries:" ] }, { "cell_type": "code", "execution_count": 21, "id": "6e3458e3-395d-4137-a470-71698075aa10", "metadata": {}, "outputs": [], "source": [ "from langdetect import detect, LangDetectException\n", "\n",
"def detect_language_or_none(text):\n",
"    # Return the detected language code, or None if detection fails\n",
"    try:\n",
"        return detect(text)\n",
"    except LangDetectException:\n",
"        return None" ] }, { "cell_type": "code", "execution_count": 22, "id": "c98fe059-3153-4d94-b308-5a804db0b142", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "21941 4.12.2024 -Työturvallisuuskoulutus - CadSa \n", "21942 Huulipuna unohtu \n", "21943 maa-artisokkapikkelsi | Olemme puutarhassa \n", "21944 REIDEN LOITONTAJALAITE | Ironfit Store \n", "21945 Työpenkki Henning, levyn leveys 1500 mm, hylly... \n", "... ... \n", "139069 Sanatarkat istuntoselostukset - Keskiviikko 20... \n", "139070 SKVR \n", "139071 Ortodoksinen oppi pelastuksesta – Tsasounan su... \n", "139072 Vuosikirja 2021 - Cockerspanielit ry \n", "139073 vierailija, tekijä sivustolla Hiiltä ja timanttia \n", "\n", " url \\\n", "21941 https://cadsa.fi/koulutuskalenteri/tyoturvalli... \n", "21942 https://huulipunaunohtu.blogspot.com/ \n", "21943 http://olemmepuutarhassa.fi/tag/maa-artisokkap... \n", "21944 https://store.ironfit.fi/product/265/ironfit-r... \n", "21945 https://www.gerdmans.fi/varasto-ja-teollisuus/... \n", "... ... \n", "139069 https://www.europarl.europa.eu/doceo/document/... \n", "139070 https://aineistot.finlit.fi/exist/apps/skvr/ru... \n", "139071 https://www.tsasouna.net/FI/2024/09/07/ortodok... \n", "139072 https://cockerspanielit.org/vuosikirja-2022-2/ \n", "139073 https://blogit.metropolia.fi/hiilta-ja-timantt... \n", "\n", " main_content \\\n", "21941

4.12.2024 -Työturvallisuuskoulutus

\\n\\... \n", "21942 Siirry pääsisältöön\\n\\nHuulipuna unohtu\\n\\nÄit... \n", "21943 maa-artisokkapikkelsi | Olemme puutarhassa\\n\\n... \n", "21944

IRONFIT REIDEN LOITONTAJALAITE ST-6007

\\... \n", "21945

Työpenkki Henning, levyn leveys 1500 mm, ... \n", "... ... \n", "139069  \\nTakaisin Europarl-portaaliin\\n\\nChoisissez ... \n", "139070 Esittely Runoluettelo / Metatietosuodatus Runo... \n", "139071 Skip to content\\nTsasounan suunnalta\\n\\n • Or... \n", "139072

Vuosikirja 2021

\\n\\n

Koostanut Pirjo... \n", "139073 Hyppää sisältöön\\nMetropolian Blogit\\n • Uusi... \n", "\n", " cleaned_html_content word_count \\\n", "21941 Työturvallisuuskoulutus on työturvakeskuksen k... 31 \n", "21942 Äiti on pitänyt meistä huolta, nyt me pidämme ... 31 \n", "21943 Heti kun maa on sulanut voi esiin kaivaa viime... 31 \n", "21944 Tilattavissa. Toimitusaika 21 päivää.\\nTilatta... 31 \n", "21945 Työpenkki Henning, levyn leveys 1500 mm, hylly... 31 \n", "... ... ... \n", "139069 Der Präsident. – Bevor wir zum Tätigkeitsprogr... 63237 \n", "139070 Tällä sivulla voit selata runotyyppejä ja luke... 69370 \n", "139071 Q & A – kysy papilta!\\nQ & A – Mikä ja miksi?\\... 69952 \n", "139072 Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ... 73949 \n", "139073 Verkko-opetus on tullut jäädäkseen, mutta mite... 85067 \n", "\n", " language_detected \n", "21941 fi \n", "21942 fi \n", "21943 fi \n", "21944 fi \n", "21945 fi \n", "... ... \n", "139069 de \n", "139070 fi \n", "139071 fi \n", "139072 fi \n", "139073 fi \n", "\n", "[117133 rows x 6 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df['language_detected'] = combined_df['cleaned_html_content'].map(detect_language_or_none)\n", "\n", "combined_df" ] }, { "cell_type": "code", "execution_count": 23, "id": "984b5d54-a575-49c8-8ae6-de24942cb016", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "language_detected\n", "fi 103429\n", "en 9031\n", "sv 999\n", "id 497\n", "de 358\n", "it 353\n", "pl 284\n", "hr 255\n", "et 243\n", "nl 233\n", "fr 200\n", "lt 197\n", "es 140\n", "sl 119\n", "ca 100\n", "da 87\n", "tr 71\n", "cs 70\n", "pt 69\n", "no 56\n", "ro 55\n", "lv 51\n", "ru 46\n", "sk 41\n", "hu 26\n", "mk 24\n", "vi 18\n", "tl 14\n", "sq 14\n", "ko 8\n", "sw 8\n", "ar 6\n", "uk 4\n", "hi 4\n", "bn 4\n", "el 4\n", "te 2\n", "bg 2\n", "cy 2\n", "fa 2\n", "af 2\n", "he 2\n", "ne 2\n", "so 1\n", "Name: count, dtype: int64" ] }, "execution_count": 23, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df['language_detected'].value_counts()" ] }, { "cell_type": "code", "execution_count": 25, "id": "eb6faeb6-7458-41f6-ab7e-cd9d41b62da6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Df shape before.: (117133, 6)\n", "Df shape after: (103429, 6)\n" ] } ], "source": [ "# Retain only detected finnish data\n", "print(f\"Df shape before.: {combined_df.shape}\")\n", "combined_df = combined_df[combined_df['language_detected'] == 'fi']\n", "print(f\"Df shape after: {combined_df.shape}\")" ] }, { "cell_type": "markdown", "id": "bbb90f6a-5025-41a2-9b9a-b5ca653a82ef", "metadata": {}, "source": [ "## 3.4. Save the data to a parquet file\n", "\n", "Now that the data is cleaned, we’re ready to save it. We'll select only the necessary columns before saving." ] }, { "cell_type": "code", "execution_count": 26, "id": "b48e803a-453d-4bb1-a2d9-18ad20d36914", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "21941 4.12.2024 -Työturvallisuuskoulutus - CadSa \n", "21942 Huulipuna unohtu \n", "\n", " cleaned_html_content \n", "21941 Työturvallisuuskoulutus on työturvakeskuksen k... \n", "21942 Äiti on pitänyt meistä huolta, nyt me pidämme ... " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the unneccessary columns\n", "combined_df = combined_df[['title', 'cleaned_html_content']]\n", "combined_df.head(2)" ] }, { "cell_type": "code", "execution_count": null, "id": "5ea68e00-850e-433c-b948-dba9a6b84cfe", "metadata": {}, "outputs": [], "source": [ "## Save the new dataframe with detected Finnish language\n", "path_to_save_the_data = '' # /scratch is recommended for data files\n", "combined_df.to_parquet(path_to_save_the_data, index= False)" ] }, { "cell_type": "markdown", "id": "6cbb8225-d2ae-4752-b23a-ee9518f79f06", "metadata": {}, "source": [ "**After saving data to a parquet file**, you may choose to exit the Jupyter environment or continue working within it to create the upcoming Python and batch job scripts while the session remains active." ] }, { "cell_type": "markdown", "id": "32ca15f3-4ac9-44aa-b3dc-d745639e989d", "metadata": {}, "source": [ "# 4. Finetune the model\n", "\n", "In this step, we’ll train the model using a batch job and a Python script. You can create and edit these files either via the LUMI web interface or by using Visual Studio Code’s Remote SSH extension (for more details, see the documentation [here](https://docs.csc.fi/apps/vscode/)).\n", "\n", "The training scripts used in this tutorial are based on the [CSCfi/llm-fine-tuning-examples](https://github.com/CSCfi/llm-fine-tuning-examples/tree/master) repository.\n", "\n", "We will also use [MLflow](https://docs.csc.fi/support/tutorials/ml-workflows/) to track training metrics. 
For a practical example, see the [tutorial on using MLflow in Puhti and LUMI](https://github.com/CSCfi/puhti_mlflow_tutorial).\n", "\n", "You can create the necessary files under your project directory, e.g.:\n", "\n", "`/project/project_46XXXXXXXX/${USER}`" ] }, { "cell_type": "markdown", "id": "a590c624-5a4a-4f93-a5b1-2d3e814f9718", "metadata": {}, "source": [ "**Using Llama models through transformers** \n", "If you want to use Llama models via the [`transformers`](https://huggingface.co/docs/transformers/index) library, follow these steps:\n", "\n", "\n", "1. Create a [Hugging Face](https://huggingface.co/) account\n", "2. Locate the Llama models, read and accept their terms of use, and wait for approval\n", "3. Generate an access token in your Hugging Face account\n", "4. Store the access token in your HF cache directory (HF_HOME), for example:\n", "\n", "```bash\n", "export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache\n", "mkdir -p $HF_HOME\n", "\n", "cd \n", "echo <\"token-here\"> > token\n", "```" ] }, { "cell_type": "markdown", "id": "de0ceb34", "metadata": {}, "source": [ "## 4.0. Container for training\n", "\n", "First we need to create a compatible environment for training. \n", "\n", "This [example](https://github.com/DeiC-HPC/cotainr/tree/main/examples/LUMI/conda_pytorch_rocm) shows how to use [`cotainr`](https://cotainr.readthedocs.io/en/stable/) to build a container with PyTorch configured for LUMI's AMD GPUs. 
We'll follow this approach to create our training container.\n", "\n", "**Create the training_env.yml file:**\n", "```yml\n", "name: training_env\n", "channels:\n", "  - conda-forge\n", "dependencies:\n", "  - filelock=3.13.1\n", "  - fsspec=2024.2.0\n", "  - jinja2=3.1.3\n", "  - markupsafe=2.1.5\n", "  - mpmath=1.3.0\n", "  - networkx=3.2.1\n", "  - numpy=1.26.3\n", "  - pillow=10.2.0\n", "  - pip=24.0\n", "  - python=3.11.7\n", "  - sympy=1.12\n", "  - typing-extensions=4.9.0\n", "  - pip:\n", "    - --extra-index-url https://download.pytorch.org/whl/\n", "    - pytorch-triton-rocm==2.3.1\n", "    - torch==2.3.1+rocm6.0\n", "    - torchaudio==2.3.1+rocm6.0\n", "    - torchvision==0.18.1+rocm6.0\n", "    - langchain==0.3.27\n", "    - mlflow==2.22.0\n", "    - datasets==4.0.0\n", "    - peft==0.17.0\n", "    - transformers==4.55.0\n", "```\n", "\n", "**In the LUMI terminal (note: building the container takes several minutes):**\n", "```bash\n", "# Get needed modules\n", "module purge\n", "module load LUMI\n", "module load cotainr\n", "\n", "# Use cotainr to build the container\n", "cotainr build training_env.sif --system=lumi-g --conda-env=training_env.yml --accept-license\n", "\n", "# Add required additional bindings\n", "module use /appl/local/containers/ai-modules/\n", "module load singularity-AI-bindings\n", "\n", "# Verify installation\n", "singularity exec training_env.sif bash -c 'pip list'\n", "```" ] }, { "cell_type": "markdown", "id": "72b430ca-5e17-4cd3-b430-05103ed6816a", "metadata": {}, "source": [ "## 4.1. Python scripts for training the model \n", "\n", "Below are the Python scripts used to finetune the Meta Llama-3.2-1B model. 
They include:\n", "\n", "* Training data preprocessing using a custom preprocess function that chunks and tokenizes the input text - implemented in **preprocessing.py**\n", "* Training setup using Hugging Face’s `Trainer` class - implemented in **train.py**\n", "* Metric tracking with MLflow - see **train.py**\n", "* Model saving and checkpointing - see **train.py**" ] }, { "cell_type": "markdown", "id": "7ebdadee-32a3-42cf-82ed-83347451ccac", "metadata": {}, "source": [ "### 4.1.1. Python script: preprocessing.py\n", "\n", "This script handles text splitting using LangChain’s `RecursiveCharacterTextSplitter`. It breaks long text inputs into smaller overlapping chunks, optionally appending an end-of-sequence token to each chunk. The script also includes a preprocessing function that tokenizes these chunks with a Hugging Face tokenizer. Note that `chunk_size` and `overlap_size` are measured in characters, while `max_tokens` limits the length of each tokenized chunk.\n", "\n", "\n", "```python\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "\n", "def chunk_text(text, chunk_size, overlap_size, eos_token):\n", "    \"\"\"Splits a single large text into smaller overlapping chunks.\"\"\"\n", "    # chunk_size and overlap_size are measured in characters, not tokens\n", "    splitter = RecursiveCharacterTextSplitter(\n", "        chunk_size=chunk_size,\n", "        chunk_overlap=overlap_size,\n", "    )\n", "\n", "    chunks = splitter.split_text(text)\n", "    if eos_token:\n", "        chunks = [chunk + f\" {eos_token}\" for chunk in chunks]\n", "\n", "    return chunks\n", "\n", "\n", "def preprocess(examples, tokenizer, max_tokens=4096, chunk_size=8192, overlap_size=200):\n", "    \"\"\"Preprocesses a batch of examples by splitting the text content into chunks and tokenizing them.\"\"\"\n", "    all_chunks = []\n", "    for text in examples[\"cleaned_html_content\"]:\n", "        chunks = chunk_text(text, chunk_size=chunk_size, overlap_size=overlap_size, eos_token=tokenizer.eos_token)\n", "        all_chunks.extend(chunks)\n", "\n", "    tokenized_output = tokenizer(\n", "        all_chunks,\n", "        padding=False,\n", "        truncation=True,\n", "        max_length=max_tokens,\n", "        add_special_tokens=True,\n", "        return_length=False,\n", "    )\n", "\n", "    return {\n", "        \"input_ids\": tokenized_output[\"input_ids\"],\n", "        \"attention_mask\": tokenized_output[\"attention_mask\"]\n", "    }\n", "```" ] }, { "cell_type": "markdown", "id": "cf7591c2-eb4f-443a-b6c7-2bf4b4741299", "metadata": {}, "source": [ "### 4.1.2. Python script: train.py\n", "\n", "\n", "This is the main Python script used to finetune the **meta-llama/Llama-3.2-1B** model. \n", "It uses **MLflow** to track training metrics, which are saved in the `mlruns` folder inside the directory given by `--output-path`.\n", "\n", "**Remember**: Make sure to provide the correct path to your training data in Parquet format via the `--parquet-file` argument, either here or in your batch job script.\n", "\n", "**Note!** By default, the script runs a small test training using only 1,000 samples (900 for training and 100 for validation). To train on the full dataset, comment out these lines and adjust the training parameters accordingly:\n", "\n", "```python\n", " # comment these if you would like to use the whole dataset\n", " tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))\n", " tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))\n", "```\n", "\n", "**train.py**\n", "```python\n", "import argparse\n", "import os\n", "import sys\n", "import time\n", "import mlflow\n", "\n", "import torch\n", "from datasets import load_dataset\n", "from peft import LoraConfig, get_peft_model, AutoPeftModelForCausalLM\n", "from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments\n", "\n", "from functools import partial\n", "from preprocessing import preprocess\n", "\n", "from datasets.utils.logging import disable_progress_bar\n", "\n", "if __name__ == \"__main__\":\n", " #disable_progress_bar() # Disable progress bar during dataset processing\n", "\n", " parser = argparse.ArgumentParser() # set up ArgumentParser \n", " parser.add_argument(\n", " 
\"--input-model\",\n", " type=str,\n", " default=\"meta-llama/Llama-3.2-1B\",\n", " help=\"The pre-trained model from Hugging Face to use as basis: https://huggingface.co/models\",\n", " )\n", " parser.add_argument(\n", " \"--output-path\",\n", " type=str,\n", " help=\"Directory where model checkpoints and outputs will be saved.\",\n", " )\n", " parser.add_argument(\n", " \"--parquet-file\",\n", " type=str,\n", " #default='',\n", " help=\"Path to the input Parquet file containing training data.\",\n", " )\n", " parser.add_argument(\n", " \"--model_output_name\",\n", " type=str, \n", " help=\"Name for the finetuned model to be saved under.\",\n", " )\n", " parser.add_argument(\"--batch_size\", \"-b\", type=int, default=1, help=\"Training batch size\")\n", " parser.add_argument(\n", " \"--num-workers\",\n", " type=int,\n", " default=1,\n", " help=\"The number of CPU worker processes to use.\",\n", " )\n", " parser.add_argument(\n", " \"--resume\",\n", " default=False,\n", " action=\"store_true\",\n", " help=\"If set, continue from a previously interrupted run. 
Otherwise, overwrite existing checkpoints.\",\n", " )\n", " parser.add_argument(\n", " \"--max-steps\",\n", " type=int,\n", " default=400,\n", " help=\"The number of training steps.\",\n", " )\n", " parser.add_argument(\"--peft\", action=\"store_true\", help=\"Use PEFT: https://huggingface.co/docs/peft/index\")\n", " parser.add_argument(\n", " \"--4bit\",\n", " dest=\"bnb_4bit\",\n", " action=\"store_true\",\n", " help=\"Use 4bit quantization with bitsandbytes: https://huggingface.co/docs/bitsandbytes/main/en/index\",\n", " )\n", " args, _ = parser.parse_known_args()\n", "\n", " # Check for required arguments\n", " if not args.model_output_name:\n", " print(\"ERROR: --model_output_name must be specified.\")\n", " sys.exit(1)\n", "\n", " # Read the environment variables provided by torchrun\n", " rank = int(os.environ[\"RANK\"])\n", " local_rank = int(os.environ[\"LOCAL_RANK\"])\n", " world_size = int(os.environ[\"WORLD_SIZE\"])\n", " local_world_size = int(os.environ[\"LOCAL_WORLD_SIZE\"])\n", "\n", "\n", " # Initialize MLflow only on the main process (rank 0) to prevent multi-process conflicts\n", " if rank == 0:\n", " # Set the MLflow tracking URI to save logs and artifacts under the specified output directory\n", " mlflow_tracking_uri = os.path.join(args.output_path, \"mlruns\")\n", " mlflow.set_tracking_uri(mlflow_tracking_uri)\n", "\n", " # Use the model output name as the MLflow experiment name\n", " mlflow.set_experiment(args.model_output_name)\n", " print(f\"MLflow tracking URI: {mlflow_tracking_uri}\")\n", " \n", "\n", " # this is where trained model and checkpoints will go\n", " output_model_dir = os.path.join(args.output_path, args.model_output_name)\n", " \n", " if rank == 0:\n", " print(f\"Using {world_size} GPUs.\")\n", " print(f\"Local {local_world_size} GPUs.\")\n", "\n", " # Then we determine the device on which to train the model.\n", " if rank == 0:\n", " print(\"Using PyTorch version:\", torch.__version__)\n", " print(f\"world_size: {world_size} 
GPUs.\")\n", " print(f\"local_world_size {local_world_size}\")\n", " print(f\"Number of available GPUs (visible to this process): {torch.cuda.device_count()}\")\n", " print(f\"Rank: {rank}\")\n", " if torch.cuda.is_available():\n", " device = torch.device(\"cuda\", local_rank)\n", " print(f\"Using GPU {local_rank}, device name: {torch.cuda.get_device_name(device)}\")\n", " else:\n", " print(f\"No GPU found, using CPU instead. (Rank: {local_rank})\")\n", " device = torch.device(\"cpu\")\n", "\n", " if rank == 0 and args.batch_size % world_size != 0:\n", " print(f\"ERROR: batch_size={args.batch_size} has to be a multiple of the number of GPUs={world_size}!\")\n", " sys.exit(1)\n", "\n", "\n", " if rank == 0:\n", " print(f\" output_model_dir: {output_model_dir}\")\n", " \n", " start = time.time()\n", "\n", " tokenizer = AutoTokenizer.from_pretrained(args.input_model, use_fast=True)\n", " tokenizer.pad_token = tokenizer.eos_token\n", " special_tokens = tokenizer.special_tokens_map\n", " if rank == 0:\n", " print(\"Loading input model and tokenizer\")\n", "\n", " quantization_config = None\n", " if args.bnb_4bit:\n", " from transformers import BitsAndBytesConfig\n", "\n", " print(\"Using bnb_4bit\")\n", " bnb_config = BitsAndBytesConfig(\n", " load_in_4bit=True,\n", " bnb_4bit_quant_type=\"nf4\",\n", " bnb_4bit_compute_dtype=torch.bfloat16,\n", " bnb_4bit_use_double_quant=True,\n", " bnb_4bit_quant_storage=torch.bfloat16,\n", " )\n", " quantization_config = bnb_config\n", "\n", " model = AutoModelForCausalLM.from_pretrained(\n", " args.input_model,\n", " quantization_config=quantization_config,\n", " torch_dtype=torch.bfloat16,\n", " device_map=device,\n", " )\n", "\n", " if args.peft:\n", " # peft_config = LoraConfig(\n", " # task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32,\n", " # lora_dropout=0.1\n", " # )\n", " # LoRA config from here:\n", " # 
https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py#L128\n", " peft_config = LoraConfig(\n", " lora_alpha=8,\n", " lora_dropout=0.05,\n", " r=16,\n", " bias=\"none\",\n", " target_modules=\"all-linear\",\n", " task_type=\"CAUSAL_LM\",\n", " # modules_to_save = [\"lm_head\", \"embed_tokens\"] # add if you want to use the Llama 3 instruct template\n", " )\n", " model = get_peft_model(model, peft_config)\n", " print(\"Using PEFT\")\n", " model.print_trainable_parameters()\n", "\n", " stop = time.time()\n", " if rank == 0:\n", " print(f\"Loading model and tokenizer took: {stop - start:.2f} seconds\")\n", " \n", " train_batch_size = args.batch_size\n", " eval_batch_size = args.batch_size\n", "\n", " if rank == 0:\n", " print(f\"Global train and eval batch size : {args.batch_size}\")\n", "\n", "\n", " training_args = TrainingArguments(\n", " disable_tqdm=True,\n", " output_dir=output_model_dir,\n", " save_strategy=\"steps\",\n", " save_steps=50, # MODIFY from quick testing to real training for eg. 50 -> 400!!\n", " save_total_limit=3,\n", " learning_rate=2e-5, #3e-5,\n", " weight_decay=0.01,\n", " bf16=True, # use 16-bit floating point precision\n", " per_device_train_batch_size=train_batch_size // world_size,\n", " per_device_eval_batch_size=eval_batch_size,\n", " dataloader_num_workers=args.num_workers,\n", " ddp_find_unused_parameters=False, \n", " dataloader_pin_memory=True, \n", " metric_for_best_model=\"eval_loss\", \n", " eval_strategy=\"steps\",\n", " eval_steps=100, # MODIFY from quick testing to real training for eg. 
100 -> 200!!\n", " num_train_epochs=2,\n", " max_steps=args.max_steps, # COMMENT THIS IF using bigger dataset\n", " \n", " # MLflow integration \n", " report_to=[\"mlflow\"], \n", " logging_steps=50, # MODIFY !!\n", " logging_strategy=\"steps\",\n", " \n", " # Run name for MLflow; includes the SLURM job ID to identify the run \n", " run_name=f\"{args.model_output_name}_{os.environ.get('SLURM_JOB_ID')}\",\n", " )\n", "\n", " #if rank == 0:\n", " # print(f\"Training arguments : {training_args}\")\n", "\n", " # Load parquet data\n", " raw_dataset = load_dataset(\"parquet\", data_files=args.parquet_file)\n", "\n", " # Split dataset into train and validation sets\n", " split_dataset = raw_dataset[\"train\"].train_test_split(test_size=0.1, seed=42)\n", " max_tokens = 2048\n", " overlap_tokens = 50 # passed as overlap_size, which the text splitter measures in characters\n", "\n", " if rank == 0:\n", " print(\"Dataset columns:\", raw_dataset[\"train\"].column_names)\n", " print(f\"Type of column_names: {type(raw_dataset['train'].column_names)}\")\n", " \n", " column_names = raw_dataset[\"train\"].column_names\n", "\n", " preprocess_function = partial(\n", " preprocess, tokenizer=tokenizer, max_tokens=max_tokens, chunk_size=8192, overlap_size=overlap_tokens\n", " )\n", "\n", " tokenized_train_dataset = split_dataset[\"train\"].map(\n", " preprocess_function,\n", " batched=True,\n", " remove_columns=column_names,\n", " num_proc=args.num_workers,\n", " )\n", "\n", " tokenized_val_dataset = split_dataset[\"test\"].map(\n", " preprocess_function,\n", " batched=True,\n", " remove_columns=column_names,\n", " num_proc=args.num_workers,\n", " )\n", " ####################################################\n", " # comment these if you would like to use the whole dataset\n", " tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))\n", " tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))\n", "\n", " # Print the sizes to verify\n", " if rank == 0:\n", " print(f\"Train dataset size: 
{len(tokenized_train_dataset)}\")\n", " print(f\"Validation dataset size: {len(tokenized_val_dataset)}\")\n", "\n", " data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors=\"pt\")\n", "\n", " # Initialize the Trainer\n", " trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_train_dataset,\n", " eval_dataset=tokenized_val_dataset,\n", " tokenizer=tokenizer,\n", " data_collator=data_collator,\n", " )\n", "\n", " start_train = time.time()\n", " if rank == 0:\n", " print(\"Training starting...\")\n", "\n", " # Train the model - MLflow will automatically log metrics\n", " trainer.train(resume_from_checkpoint=args.resume)\n", "\n", " stop_train = time.time()\n", " if rank == 0:\n", " elapsed = stop_train - start_train\n", " hours = int(elapsed // 3600)\n", " minutes = int((elapsed % 3600) // 60)\n", " seconds = int(elapsed % 60)\n", " print(f\"Finetuning model took: {hours}h {minutes}m {seconds}s\")\n", "\n", " # Save the model\n", " if trainer.is_fsdp_enabled:\n", " trainer.accelerator.state.fsdp_plugin.set_state_dict_type(\"FULL_STATE_DICT\")\n", " trainer.save_model(output_model_dir)\n", " \n", " if rank == 0:\n", " print()\n", " print(\"Training done, you can find the final model (and checkpoints) in\", output_model_dir)\n", " print(f\"\\nMLflow experiment data stored in: {mlflow_tracking_uri}\")\n", "```" ] }, { "cell_type": "markdown", "id": "a0ad01b2-b7b6-48a9-a45b-6d9061a849f2", "metadata": {}, "source": [ "## 4.2. Batch job script for training with 8 GPUs\n", "\n", "To run training on LUMI using 8 GPUs, you need to submit a batch job via a SLURM script. Below is an example script named **run_train_8gpu.sh**.\n", "\n", "This script:\n", "* Requests resources from the GPU [partition](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/) (e.g. 
dev-g / small-g) including 8 GPUs, 56 CPU cores, and 480 GB of memory.\n", "* Loads the necessary modules for Singularity container support.\n", "* Sets environment variables for Hugging Face cache and tokenizer behavior.\n", "* Defines an output directory for saving the trained model and logs.\n", "* Launches the training inside the container using `torchrun` with distributed training support.\n", "\n", "**Remember** to replace `` with your project ID, `` with the actual path to your preprocessed training data, and `` with the path to your training container (e.g., training_env.sif).\n", "Also, consider switching the partition to `small-g` and adjusting the `--time` parameter for longer training runs.\n", "\n", "\n", "**run_train_8gpu.sh**\n", "```bash\n", "#!/bin/bash\n", "#SBATCH --account=project_\n", "#SBATCH --partition=dev-g\n", "#SBATCH --ntasks=1\n", "#SBATCH --cpus-per-task=56\n", "#SBATCH --mem=480G\n", "#SBATCH --time=00:15:00 \n", "#SBATCH --gpus-per-node=8\n", "\n", "module use /appl/local/containers/ai-modules\n", "module load singularity-AI-bindings\n", "\n", "# This will store all the Hugging Face cache such as downloaded models\n", "# and datasets in the project's scratch folder\n", "export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache\n", "mkdir -p $HF_HOME\n", "\n", "# Path to where the trained model and logging data will go\n", "OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data\n", "mkdir -p $OUTPUT_DIR\n", "\n", "TRAINING_DATA_FILE=\n", "\n", "# Disable internal parallelism of huggingface's tokenizer since we\n", "# want to retain direct control of parallelism options.\n", "export TOKENIZERS_PARALLELISM=false\n", "\n", "set -xv # print the command so that we can verify setting arguments correctly from the logs\n", "\n", "CONTAINER=\n", " \n", "srun singularity exec $CONTAINER \\\n", " torchrun --standalone \\\n", " --nnodes=1 \\\n", " --nproc-per-node=$SLURM_GPUS_PER_NODE \\\n", " train.py $* \\\n", " --output-path 
$OUTPUT_DIR \\\n", " --parquet-file $TRAINING_DATA_FILE \\\n", " --model_output_name=\"Llama-3.2-1B-finetuned\" \\\n", " --num-workers $SLURM_CPUS_PER_TASK \\\n", " --batch_size=8\n", "```\n" ] }, { "cell_type": "markdown", "id": "e06cc994-34c4-4e8e-9092-2b18451ecf7e", "metadata": {}, "source": [ "## 4.3. Run training script\n", "\n", "To train the model on LUMI with 8 GPUs, submit the batch job using the SLURM script provided in **run_train_8gpu.sh**.\n", "\n", "Simply run the following command in the LUMI terminal:\n", "````\n", "sbatch run_train_8gpu.sh\n", "````\n", "\n", "Once the job starts, a SLURM output file named `slurm-{slurm_job_id}.out` will be created automatically in the directory from which you submitted the job.\n", "\n", "You can monitor the status of your jobs at any time using: `sacct`." ] }, { "cell_type": "markdown", "id": "a9d1460f-5c61-427e-95a5-e26f8df68f9c", "metadata": {}, "source": [ "## 4.4. Use MLflow to check metrics \n", "\n", "After the training completes, you’ll find the logged data inside the `mlruns` folder located within your specified output directory. If you didn't change this part in **run_train_8gpu.sh**, the location of the MLflow metrics is `/scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data/mlruns`.\n", "\n", "To visualize and monitor your training metrics, you can open an MLflow session via the LUMI web interface.\n", "Navigate to **Apps** -> **Mlflow**.\n", "\n", "Set the `Location where MLflow files are stored` to the full path where your `mlruns` folder is located. After launching the session, you can interactively browse training metrics, losses and parameters." ] }, { "cell_type": "markdown", "id": "7477bf14-307d-433e-8d10-752c0c83d8f4", "metadata": {}, "source": [ "# 5. Test the model\n", "\n", "After finetuning, you can test the model using a Python script and a SLURM batch job. 
Inference results will be saved to a logging file for review.\n", "\n", "To run inference, simply submit the batch job with: `sbatch run_inference.sh`\n", "\n", "This will generate model outputs for your predefined prompts and log them for inspection.\n", "\n", "*Note! We don't need to use the container here since we don't need any additional packages.*" ] }, { "cell_type": "markdown", "id": "561fa025-10e8-40a2-8f87-dd7af32791f2", "metadata": {}, "source": [ "**run_inference.sh**\n", "```bash\n", "#!/bin/bash\n", "#SBATCH --account=project_XXXXXXXXXX\n", "#SBATCH --partition=dev-g\n", "#SBATCH --ntasks=1\n", "#SBATCH --cpus-per-task=7\n", "#SBATCH --mem=60G\n", "#SBATCH --time=0:15:00\n", "#SBATCH --gpus-per-node=1\n", "\n", "module purge\n", "module use /appl/local/csc/modulefiles/\n", "module load pytorch/2.5\n", "\n", "# This will store all the Hugging Face cache such as downloaded models\n", "# and datasets in the project's scratch folder\n", "export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache\n", "mkdir -p $HF_HOME\n", "\n", "export LOG_FILE_PATH=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/inference_logs\n", "mkdir -p $LOG_FILE_PATH\n", "export LOG_FILE=${LOG_FILE_PATH}/inference_prints.log\n", "\n", "# Path to where the trained model and logging data will go\n", "OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-data\n", "mkdir -p $OUTPUT_DIR\n", "\n", "# Disable internal parallelism of huggingface's tokenizer since we\n", "# want to retain direct control of parallelism options.\n", "export TOKENIZERS_PARALLELISM=false\n", "\n", "set -xv # print the command so that we can verify setting arguments correctly from the logs\n", "\n", "MODEL_PATH_1=\"meta-llama/Llama-3.2-1B\"\n", "MODEL_PATH_2=\"\"\n", "\n", "# Define prompts as an array\n", "PROMPTS=(\n", " \"Tekoälyn kehitys muuttaa maailmaa nopeasti ja siksi \"\n", " \"Tervetuloa \"\n", ")\n", "# Run inference for each model and prompt combination\n", "for MODEL in \"$MODEL_PATH_1\" \"$MODEL_PATH_2\"; 
do\n", " for PROMPT in \"${PROMPTS[@]}\"; do\n", " srun python inference.py \\\n", " --model \"$MODEL\" \\\n", " --prompt \"$PROMPT\"\n", " done\n", "done\n", "```\n", "\n", "**inference.py**\n", "```python\n", "import logging\n", "import argparse\n", "import torch\n", "import os\n", "\n", "from datetime import datetime\n", "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "\n", "LOG_FILE = os.environ.get('LOG_FILE')\n", "slurmjob_id = os.environ['SLURM_JOBID']\n", "\n", "# logging file settings\n", "logging.basicConfig(\n", " filename=LOG_FILE,\n", " level=logging.INFO\n", ")\n", "\n", "if __name__ == \"__main__\":\n", " parser = argparse.ArgumentParser()\n", " parser.add_argument(\n", " \"--model\",\n", " type=str,\n", " help=\"Path to fine-tuned model directory\"\n", " )\n", " \n", " parser.add_argument(\n", " \"--prompt\",\n", " type=str,\n", " help=\"Prompt for the LLM to continue\"\n", " )\n", " args = parser.parse_args()\n", "\n", "\n", " logging.info(f\"Slurmjob_ID : {slurmjob_id}\")\n", " logging.info(f\"Model Path: {args.model}\")\n", " logging.info(f\"Prompt: {args.prompt}\")\n", "\n", " device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n", " print(f\"Using device {device}\")\n", " if device.type == 'cuda':\n", " print(f\"Device name is {torch.cuda.get_device_name(device)}\")\n", "\n", " tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)\n", " tokenizer.pad_token = tokenizer.eos_token\n", " model = AutoModelForCausalLM.from_pretrained(args.model)\n", " model.to(device)\n", "\n", "\n", " with torch.no_grad():\n", " inputs = tokenizer(args.prompt, return_tensors='pt').to(device)\n", " outputs = model.generate(**inputs, do_sample=True, max_length=200, num_return_sequences=2)\n", " decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)\n", "\n", " \n", " print(\"Generated Outputs:\")\n", " logging.info(\"Generated Outputs:\")\n", " for i, text in 
enumerate(decoded_outputs):\n", " print(f\"\\n--- Output {i + 1} ---\\n{text}\")\n", " logging.info(f\"\\n--- Output {i + 1} ---\\n{text}\")\n", "\n", " logging.info(\"-\" * 40)\n", " \n", "```" ] }, { "cell_type": "markdown", "id": "11216550-b93c-40c9-a9df-61930fb016aa", "metadata": {}, "source": [ "**Thank you for following the tutorial — we hope you found it useful!**\n", "\n", "For more information on the OpenWebSearch.eu project see: https://openwebsearch.eu/\n", "\n", "For more information on the LUMI supercomputer and CSC, see: https://www.lumi-supercomputer.eu/, https://www.csc.fi/" ] }, { "cell_type": "markdown", "id": "1265f4f7", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "7d408975", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0rc1" } }, "nbformat": 4, "nbformat_minor": 5 }