The KBK-1M Dataset (‘Koninklijke Bibliotheek Kranten – 1 Miljoen’) is a collection of 1,603,396 images and accompanying captions from the period 1922-1994. We extracted the images from digitised newspapers that are stored in the Newspaper Archive of the National Library of the Netherlands (KB) and that are publicly accessible via www.delpher.nl. On Delpher, visitors can search and browse several collections, including Dutch newspapers. One way to narrow down the retrieved results is by selecting facets. One of these facets is ‘illustraties met onderschrift’ (illustrations with caption), which covers photographs (black-and-white and colour), comic strips, political cartoons and weather forecasts. The KBK-1M dataset contains the captioned illustrations from all newspapers in the period 1922-1994 that were available on Delpher when we crawled them in August 2015.
In the newspaper archive of the KB, each issue is stored as a set of scanned pages, with one JPEG per newspaper page. Each page is associated with a set of metadata files that describe the location of each image, caption and article on that page. During the digitisation of the newspapers, these locations were manually annotated by trained workers. The article and caption texts are available as automatically generated OCR output. We took these data as the starting point when we built the harvester that created the KBK-1M dataset. The harvester, written in Python, prepared and extracted the images and captions using KB-internal RESTful APIs. Figure 1 below shows how we transformed the raw source material into the dataset, which contains JPEG files for the images and JSON files for the metadata.
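To make the transformation more concrete, the following is a minimal sketch of the kind of step such a harvester might perform: turning annotated regions on a page into crop boxes and per-illustration JSON records. It is not the actual KB harvester. The metadata layout, the field names (`page_id`, `illustrations`, `caption`), the file naming scheme and the sample values are all assumptions made for illustration, since the KB-internal formats and APIs are not described here.

```python
import json

# Hypothetical page metadata. The real KB metadata format is not described
# in this text, so every field name and value here is an assumption.
page = {
    "page_id": "example-page-001",
    "date": "1950-01-01",
    "illustrations": [
        {"x": 120, "y": 340, "width": 400, "height": 300,
         "caption": "  Example caption text taken from OCR output  "},
        {"x": 600, "y": 100, "width": 250, "height": 180,
         "caption": "Second example caption"},
    ],
}


def crop_box(region):
    """Return the (left, upper, right, lower) pixel bounds of a region."""
    left, upper = region["x"], region["y"]
    return (left, upper, left + region["width"], upper + region["height"])


def build_records(page):
    """Build one metadata record per annotated illustration on a page."""
    records = []
    for index, region in enumerate(page["illustrations"]):
        # Hypothetical naming scheme: page identifier plus a running index.
        stem = f"{page['page_id']}_{index:03d}"
        records.append({
            "image_file": stem + ".jpg",
            "metadata_file": stem + ".json",
            "source_page": page["page_id"],
            "date": page["date"],
            "crop_box": list(crop_box(region)),
            # OCR text often carries stray whitespace, so trim it.
            "caption": region["caption"].strip(),
        })
    return records


records = build_records(page)
print(json.dumps(records[0], indent=2))
```

The crop box uses the (left, upper, right, lower) convention, so an imaging library such as Pillow could cut the illustration out of the page JPEG with it; that step is left out to keep the sketch free of third-party dependencies.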