The KB Europeana Newspapers NER dataset was created for training and evaluating named entity recognition (NER) software. The original OCR of a selection of KB newspapers has been manually annotated with named entity information to provide a 'perfect' result, also known as ground truth.

Each page has an OCR result in ALTO format and a corresponding BIO file that contains the manually annotated entities for the text. For more information on the BIO format, please go here.
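As a rough illustration of how a BIO file can be consumed, the following Python sketch groups tagged tokens into entity spans. It assumes one token per line followed by its tag, separated by whitespace, with tags of the form `B-TYPE`, `I-TYPE` and `O`; the exact column layout and tag names used in this dataset may differ, so check a real file first. The function name `parse_bio` and the sample sentence are made up for this example and are not taken from the dataset.

```python
def parse_bio(lines):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Assumed layout (not confirmed for this dataset): each line holds a
    token followed by its tag, separated by whitespace.
    """
    entities = []
    current_type = None
    current_tokens = []

    def flush():
        # Store the entity collected so far, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Blank or malformed line: close any open entity.
            flush()
            continue
        token, tag = parts[0], parts[-1]
        if tag.startswith("B-"):
            # Beginning of a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        elif tag.startswith("I-"):
            # I- tag without a matching open entity: treat it as a start.
            flush()
            current_type, current_tokens = tag[2:], [token]
        else:
            # "O" or any other tag: outside an entity.
            flush()
    flush()
    return entities


# Invented sample, only to show the assumed layout.
sample = """De O
heer O
Jan B-PER
de I-PER
Vries I-PER
woont O
in O
Den B-LOC
Haag I-LOC
. O""".splitlines()

print(parse_bio(sample))
# [('PER', 'Jan de Vries'), ('LOC', 'Den Haag')]
```

To read a downloaded file, the same function can be applied to an open file handle, for example `parse_bio(open(path, encoding="utf-8"))`, where `path` points to one of the BIO files.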

The Dutch Europeana Newspapers NER dataset is based on a sample of the KB newspaper collection and was created during the Europeana Newspapers project, a European research project aimed at improving access to digital newspapers online. The full dataset is available under CC0. You can download the individual pages as zipped archives from here:

The software for NER processing of digital newspapers can be found here: