Architecture and Infrastructure

Architecture and Infrastructure#

Cluster Tier Architecture#

The figure below outlines a first version for cluster tiers and their interplay (see also issue #43) . A cluster tier is defined as a well-defined set of services (incl. the corresponding software-stack) that must run in the same cluster / a set of virtual machines. A cluster tier comes with an installation procedure, a software stack and hardware/cluster requirements and has a responsible operator.

Currently we have foreseen four cluster tiers:

Crawling Cluster Tier (CCT) contains a crawler responsible for crawling the web thereby creating WARC-files (i.e. collection of HTTP streams from the crawling process). WARC Files are stored in an S3 Bucket which can be accessed by the other tiers. There can be multiple CCT, but there should be only one CCT per data center, as the CCT can scale with the amount of resources provided.
Crawler Frontier Tier (CFT) will be a singleton tier (i.e. there is only one instance across all participating data centers) coordinating the crawls. It stores crawled urls including access statistics and cache digest. From the crawler frontier we will derive the website registry and provide an interface for the interaction from webmasters with the frontier (e.g. an interface for take-down requests).
Preprocessing and Enrichment Tier (PET) takes the WARC files from the S3 Bucket filled by the CCT and extracts cleaned HTML as well as further metadata. Following the partitioning of the CCT, each data center should have a dedicated PET for the WARC files stored at that data center. The metadata extracted by the PET will be stored in Parquet format. Responsible WP: WP 2
Index and Storage Tier (IST) turns the cleaned content from the PET into 1) a usable index/inverted file, and 2) a web graph (including anchor texts). Similar to the CCT, each data center should have its own IST (and only one), responsible for creating an index for the documents crawled by that data center. The ISTs of each data center might need to communicate (parts of) the web graph to each other, to account for hyperlinks between pages crawled by different data centers. The inverted files will be partitioned by certain types of metadata, such as topic, language, TLD, etc. Aside from the inverted file and web graph, the IST will also contain a metadata store in Parquet format. The inverted files, web graphs and metadata stores can be downloaded through an API. The API provides filtering options, to allow third parties to download only those parts of the index that are relevant to their use cases. The inverted files are distributed as CIFF files, the web graphs and metadata stores as Parquet files.
Preprocessing Plugins Evaluation Tier (PPET) enables the evaluation of plugins to the content analysis library. In order to expand enrichment capabilities, both project members as well as third parties may develop plugins used in the PET. To ensure good quality and sufficient throughput, candidate plugins will perform their enrichment task on benchmarking data. This evaluation will be performed using the TIRA platform hosted at a data center.

Federated Data Storage#

../../_images/federated_storage_architecture.png

The data infrastructure exposes a file hierarchy / object hierarchy. Data written to iRODS will be federated across the data centers. MinIO will be the S3 compliant access protocol with one bucket per data center. The bucket will access the iRODS data folder of that particular zone.

Data exchange between individual tiers is organised according to an agreed folder structure. Zonenames/bucket-names form the root entry and specify the access location and guide data-locality. If the tier does not run in the same zone, data access can be cost intensive.

Below are specifications examples for the folder hierarchy (path-on-system). Replication needs to be specified on a per-folder basis.

Example Folder Structure#

> mc du -r -d 3 owseu-lrz/lrz/
3.0GiB	683 objects	lrz/index/2023-05
38GiB	3156 objects	lrz/index/2023-06
60GiB	4302 objects	lrz/index/2023-07
102GiB	8141 objects	lrz/index
543GiB	32751 objects	lrz/preprocessed/2023-09
4.5GiB	554 objects	lrz/preprocessed/2023-10
32MiB	121 objects	lrz/preprocessed/log
547GiB	33426 objects	lrz/preprocessed
16GiB	69 objects	lrz/spark-logs
162GiB	1661 objects	lrz/warc/2023-08
3.1TiB	33188 objects	lrz/warc/2023-09
553GiB	6019 objects	lrz/warc/2023-10
3.8TiB	40868 objects	lrz/warc
4.5TiB	82509 objects	lrz

Folder Structure Specification#

bucket-name/                           # zone to enable fast access to data for all tiers in the zone
  warc/                                # folder containing ALL WARC files crawled
    YYYY-MM/                           # folder with WARC files crawled in year YYYY and month MM
      DD/                              # WARC files for day DD. 
        {_crawler_name_}/              # Name of the crawler 
          HHmmss-{int}.warc.gz         # single warc file. Namen scheme subject to change 
  preprocessed/	                       # folder containing output from preprocessing based access.
    YYYY-MM/                           # folder with preprocessing data in year YYYY and month MM
      DD/                              # daily batch of results. 
        Date-{class}-{split}.parquet   # main result of preprocessing. naming scheme to be defined
        Date-{class}-{split}.log.gz    # zipped log file for written parquet files (when + stats)
  index/                               # folder containing output from indexing (i.e. CIFF files)
    YYYY-MM/                           # folder with preprocessing data in year YYYY and month MM
      DD/                              # sub-daily batch of results. 
        {TLD}/{LANG}/                  # Index partitioning scheme. Subject to change based on experiments	
          index-{num?}.ciff            # main result of indexing. naming scheme to be defined. 
          metadata-{num?}.parquet      # main result of indexing/preprocessing. naming scheme to be defined. 

Other File Structures:#

shared/ 
  auxiliary/	        # auxiliary data which is replicated among all nodes
	models/	            # machine learning models, preprocessing models
	plugins/ 	        # plugins for preprocessing to be shared
	filters/	        # data for filters to be shared. 
	logs.csv	        # log file of updates. Potentially useful for notification
  on_demand/ 	        # folder for on demand data crawling / indexing requests This is data which is not
	uuid-on-demand/     # folder with configuration files of a specific on demand request
	   config.json      # json file with the configuration except for the urls
	   url_patterns.txt # patterns of urls to be provided
  public/ 	            # folder containing public files (e.g. downloads)
	[ABCDEF]/ 	 	    # probably provision of sub-indices or delta updates with a
  community/ 	        # folder for communities and third parties doing their own
	                    # processing / computing. This can be either interested third
	 	                # parties or third parties funded by us.
	[ABCDEF]/ 	 	    # one folder per community / third party

Crawling data will be preprocessed on a daily basis, i.e. the crawler guarantees to write WARCs on a daily basis. Nevertheless, the crawler cannot guarantee it on the second, so DD/ directories should be only preprocessed after a certain amount of time (e.g. 2 hours after DD+1).
Preprocessing results will be written to the file in batches, potentially on a sub-daily basis, i.e. the preprocessor converts the daily WARCs and puts them into a directory for that day.