## Creating an Index for Searching Arguments on Controversial Topics (Webis)


### Abstract
TODO 

### Use Case
<!-- Key functionality: High-level description from the end-user point of view -->
<!-- Context and benefit: For whom is the application important and why -->
Argument search engines like [args.me](https://www.args.me) let anyone explore the arguments in a controversial debate or gather arguments for a debate of their own. They provide an interface for searching arguments on controversial topics, such as the adoption of school uniforms or the abolition of the death penalty. The results are presented in a contrastive view of arguments supporting and attacking a specific claim and provide a starting point for further research.
Unlike a traditional web search on a controversial topic, the argument search engine gives an overview of the relevant arguments in a debate at a glance and offers additional features such as an argument quality score and the option to search for specific sub-topics.
Unlike asking an LLM for arguments in a debate, args.me lets users trace the source of each argument and assess its relevance in terms of convincingness or popularity.

![Argument Search Engine](figures/args_interface.png)
Figure: Argument Search Engine


### Application
<!-- Conceptual description, software architecture, short technical description, Screenshots  -->

#### Conceptual Description and Architecture
We mainly aim to collect argumentative data from the OWI in order to complement and extend the database of the args.me search engine. The data collection pipeline consists of the following steps:

1. Downloading English webpages from the curlie-collection
2. Filtering for potentially argumentative webpages
3. Extracting arguments from the filtered webpages
4. Creating an argument index

#### Step 1: Download Webpages from the OWI
The download is done with the [owilix client](https://opencode.it4i.eu/openwebsearcheu-public/owi-cli), which lets us specify the collection to download from as well as the language of the webpages.

#### Step 2: Filter for Argumentative Webpages
We use the set of 31 controversial topics from the [ArgKP dataset of IBM](https://research.ibm.com/haifa/dept/vst/debating_data.shtml#Key_Point_Analysis) ([Friedman et al. 2021](https://aclanthology.org/2021.argmining-1.16/)). We chose this dataset because it provides several features that are necessary for further processing: each topic is formulated as a controversial statement rather than as general keywords, each argument carries stance information, and each topic comes with a set of key points.
To process the data efficiently, we apply a keyword-based approach for collecting topic-related webpages. For each of the 31 topics, we create a list of keywords with synonyms and inflected word forms, for example 'school clothing', 'school uniform', 'school attire', and 'student uniforms' for the topic 'We should abandon the use of school uniforms'. A webpage is collected if any of these keywords occurs in its title or in its curlielabel.
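
The keyword matching described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the record field names `title` and `curlielabel`, the helper names, and the single-topic keyword dictionary are assumptions made for the example.

```python
import re

# Hypothetical keyword list for one topic; the project's full lists
# for all 31 topics are not reproduced here.
TOPIC_KEYWORDS = {
    "We should abandon the use of school uniforms": [
        "school clothing", "school uniform", "school attire", "student uniforms",
    ],
}


def compile_patterns(topic_keywords):
    """Build one case-insensitive regex per topic from its keyword list."""
    patterns = {}
    for topic, keywords in topic_keywords.items():
        alternatives = "|".join(re.escape(k) for k in keywords)
        # Word boundary only at the start, so "school uniform" also
        # matches the inflected form "school uniforms".
        patterns[topic] = re.compile(rf"\b(?:{alternatives})", re.IGNORECASE)
    return patterns


def matching_topics(record, patterns, fields=("title", "curlielabel")):
    """Return the topics whose keywords occur in any of the given fields."""
    texts = [str(record.get(field) or "") for field in fields]
    return [
        topic
        for topic, pattern in patterns.items()
        if any(pattern.search(text) for text in texts)
    ]


patterns = compile_patterns(TOPIC_KEYWORDS)
page = {"title": "Why School Uniforms matter", "curlielabel": None}
print(matching_topics(page, patterns))
```

Checking each field separately avoids accidental matches that span the end of the title and the start of the label.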

#### Step 3: Extract Arguments
For the topic-related webpages, we segment the main text (column `main_content`) either into paragraphs (based on the minimal HTML information provided in the `main_content` field) or into sentences using [NLTK](https://www.nltk.org/)'s sentence tokenizer ([Bird et al. 2009](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/steven-bird-evan-klein-and-edward-loper-natural-language-processing-with-python-oreilly-media-inc2009-isbn-9780596516499/CB47911C9ACC7E9E499289B568AA8786)). Afterwards, a [fine-tuned BERT model](https://github.com/UKPLab/acl2019-BERT-argument-classification-and-clustering/blob/master/argument-classification/README.md) ([Reimers et al. 2019](https://aclanthology.org/P19-1054/)) classifies each segment as pro (supporting the topic claim), con (attacking the topic claim), or non (not argumentative with respect to the topic). Furthermore, we estimate the argument quality of each segment with a fine-tuned [argument quality model](https://aclanthology.org/2025.argmining-1.17/) (based on the work of [Gretz et al. 2020](https://ojs.aaai.org/index.php/AAAI/article/view/6285)), which can be used as an additional filter or in subsequent analysis tasks like key point generation.
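
The control flow of this step can be sketched as below. The sketch makes several assumptions: paragraphs are taken to be separated by blank lines (the actual markers in `main_content` may differ), and `classify_stance` and `score_quality` are hypothetical callables standing in for the fine-tuned BERT classifier and the argument quality model, which are not loaded here. Only the labels pro, con, and non come from the description above.

```python
import re
from typing import Callable


def split_paragraphs(main_content: str) -> list[str]:
    """Split text into paragraphs on blank lines (assumed paragraph marker)."""
    parts = re.split(r"\n\s*\n", main_content)
    return [part.strip() for part in parts if part.strip()]


def extract_arguments(
    main_content: str,
    topic: str,
    classify_stance: Callable[[str, str], str],
    score_quality: Callable[[str, str], float],
    min_quality: float = 0.0,
) -> list[dict]:
    """Keep segments labelled pro or con and attach a quality score."""
    arguments = []
    for segment in split_paragraphs(main_content):
        stance = classify_stance(topic, segment)
        if stance == "non":
            continue  # segment is not argumentative for this topic
        quality = score_quality(topic, segment)
        if quality >= min_quality:  # optional quality filter
            arguments.append(
                {"topic": topic, "text": segment, "stance": stance, "quality": quality}
            )
    return arguments


# Toy stand-ins for the two models, for illustration only.
def toy_stance(topic: str, segment: str) -> str:
    return "pro" if "because" in segment else "non"


def toy_quality(topic: str, segment: str) -> float:
    return 0.8


text = "Uniforms help because they reduce bullying.\n\nClick here to subscribe."
print(extract_arguments(text, "We should abandon the use of school uniforms", toy_stance, toy_quality))
```

Passing the models in as callables keeps the segmentation and filtering logic testable without downloading any model weights.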

#### Step 4: Create the Argument Index
The arguments collected from the OWI are fed into an [Elasticsearch](https://www.elastic.co/elasticsearch) index. This index stores the controversial topic as the claim and all supporting and attacking text segments as premises. Moreover, for each argument it provides the curlielabels and keywords of the original text, the calculated argument quality score, and the link to the original text. In the [new args-demo](https://args-reloaded.web.webis.de/), users can search not only for general arguments in a debate but also for specific aspects or sub-topics within the debate, for example arguments related to 'deterrence' in the discussion 'We should fight for the abolition of nuclear weapons'.
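
One possible shape for such an index is sketched below. The index name, all field names, and the field types are assumptions chosen to mirror the description above; the actual schema of the project is not documented here. The sketch only builds plain dictionaries and does not contact a cluster.

```python
import hashlib

INDEX_NAME = "arguments"  # hypothetical index name

# Assumed mapping: one document per premise, with the topic as claim.
MAPPING = {
    "mappings": {
        "properties": {
            "claim": {"type": "text"},
            "premise": {"type": "text"},
            "stance": {"type": "keyword"},       # "pro" or "con"
            "quality": {"type": "float"},
            "curlielabels": {"type": "keyword"},
            "keywords": {"type": "keyword"},
            "source_url": {"type": "keyword"},
        }
    }
}


def to_bulk_action(argument: dict, index_name: str = INDEX_NAME) -> dict:
    """Turn one extracted argument into a bulk-indexing action."""
    # Deterministic ID, so re-running the pipeline overwrites instead of duplicating.
    raw_id = f"{argument['source_url']}\n{argument['premise']}"
    doc_id = hashlib.sha1(raw_id.encode("utf-8")).hexdigest()
    fields = MAPPING["mappings"]["properties"]
    return {
        "_index": index_name,
        "_id": doc_id,
        "_source": {field: argument.get(field) for field in fields},
    }


example = {
    "claim": "We should fight for the abolition of nuclear weapons",
    "premise": "Deterrence has prevented large-scale wars.",
    "stance": "con",
    "quality": 0.71,
    "curlielabels": ["Society/Issues"],
    "keywords": ["nuclear weapons"],
    "source_url": "https://example.org/post",
}
print(to_bulk_action(example)["_id"])
```

Actions of this shape are what the bulk helper of the official Elasticsearch Python client consumes; deriving the document ID from source URL and premise text is a design choice of this sketch, not something stated by the project.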

#### Demo Application
The original args.me is based on data crawled from different debate portals and is publicly accessible under the following link: [www.args.me](https://www.args.me).
A demo of the args.me search engine with OWS data can be tested here: [args-reloaded.web.webis.de](https://args-reloaded.web.webis.de/). In the top search box, you can search for one of the controversial topics from the ArgKP dataset. The lower search box can be used to specify an aspect or sub-topic of the debate for which you want to find arguments. It can also be left empty to get a general overview of the arguments in the debate. The `source` link in each search result leads to the webpage from which the argument was taken.
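
How the two search boxes could translate into an Elasticsearch query is sketched below. This is an assumption about one plausible implementation, not a description of the demo's actual backend; the field names `claim`, `premise`, and `stance` are hypothetical.

```python
def build_query(topic, aspect=None, size=20):
    """Build a bool query: topic is required, aspect is an optional extra filter."""
    must = [{"match": {"claim": topic}}]
    if aspect and aspect.strip():
        # The lower search box narrows the results to one aspect of the debate.
        must.append({"match": {"premise": aspect.strip()}})
    return {"size": size, "query": {"bool": {"must": must}}}


def split_by_stance(hits):
    """Group result documents into pro and con lists for a contrastive view."""
    grouped = {"pro": [], "con": []}
    for hit in hits:
        stance = hit.get("stance")
        if stance in grouped:
            grouped[stance].append(hit)
    return grouped


print(build_query("We should fight for the abolition of nuclear weapons", "deterrence"))
```

Leaving `aspect` empty yields a query on the topic alone, which corresponds to the general overview described above.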

#### Statistics
We extracted arguments from the curlie-collection from July until September.
- number of related texts: 24,359
- number of extracted arguments (paragraph segmentation): 158,675
- number of related texts and arguments for exemplary topics:
    - 'We should abolish the right to keep and bear arms': 352 texts, 4,116 arguments
    - 'We should fight for the abolition of nuclear weapons': 111 texts, 1,260 arguments
    - 'We should abolish capital punishment': 102 texts, 805 arguments
    - 'We should legalize cannabis': 975 texts, 13,319 arguments
    - 'We should abandon the use of school uniform': 43 texts, 184 arguments


### Index Data
<!-- Which data is used/needed by the application
    How has the index data been compiled -->
We use data from the `curlie_full` collection index, where the [curlielabels](https://curlie.org/) provide a categorization of web pages, which allows us to filter for potentially interesting pages. On the one hand, we search for forum pages, assuming that discussions there provide argumentative data. On the other hand, we search for webpages related to specific controversial topics in order to extend existing data on these topics.
For both scenarios, the curlielabels provide a valuable categorization. For example, we can extract forum pages based on the curlielabel `Chats_and_forums` and train a classifier to also identify forum pages that have no curlielabel. More specific categories such as 'Abortion' (path in curlie-tree: 'Society/ Issues/ Abortion'), 'Homeschooling' or 'Vegetarianism' ('Society/ Lifestyle Choices/') can help to find webpages related to controversial topics.

### Evaluation
<!-- results of the evaluation (if applicable) -->
TODO ?

### Sustainability and ELSA 
<!-- statement about ethical and legal aspects
    Source code and installation (links)
    Publications related to application 
    Future and outlook, use in the organisation -->

#### Ethical and Legal Aspects
The OWI data contain a flag that indicates whether the content of a page may be used for web indexing or web search purposes. During the collection of argumentative texts, we remove records where this flag is False.
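
A minimal sketch of this filtering step is shown below. The flag name `index_allowed` is a placeholder, since the actual column name is not given here. Treating a missing flag as not permitted is an additional conservative assumption of the sketch; the description above only states that records with the flag set to False are removed.

```python
def drop_non_indexable(records, flag_field="index_allowed"):
    """Keep only records whose usage flag is explicitly True.

    `flag_field` is a hypothetical name for the OWI usage flag.
    Records with the flag set to False, or without the flag, are dropped.
    """
    return [record for record in records if record.get(flag_field) is True]


records = [
    {"url": "https://example.org/a", "index_allowed": True},
    {"url": "https://example.org/b", "index_allowed": False},
    {"url": "https://example.org/c"},  # flag missing
]
print([r["url"] for r in drop_non_indexable(records)])
```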

#### Source Code and Installation

TODO

#### Publications

##### Original args.me
[Building an Argument Search Engine for the Web](https://webis.de/publications.html#wachsmuth_2017f) (Wachsmuth et al. 2017) <br>
[Data Acquisition for Argument Search: The args.me Corpus](https://webis.de/publications.html#ajjour_2019a) (Ajjour et al. 2019)

##### Recent
[Segmentation of Argumentative Texts by Key Statements for Argument Mining from the Web](https://webis.de/publications.html#zelch_2025a) (Zelch et al. 2025)  <br>
[Reproducing the Argument Quality Prediction of Project Debater](https://webis.de/publications.html#zelch_2025b) (Zelch et al. 2025)

#### Future and Outlook

We plan to extend the current argument index with more data from the OWI. In addition, we want to enrich the index with more argument-related information such as key points. We are also working on a forum classifier for finding forum threads. These threads could prove to be a valuable data source for collecting arguments on various topics, especially since they already provide some kind of reference between the different posts. The forum classifier can then be implemented as a preprocessing module and added to the index creation pipeline.
