The IMPACT KB dataset was created for evaluating and training OCR software. The original OCR and layout recognition of a selection of KB material has been manually corrected to 99.95% accuracy to provide a 'perfect' result, also known as ground truth. The set consists of:
Each page has a master image in TIF format and a corresponding PAGE XML file that contains the ground truth for both the text and the layout. For more information on the PAGE format, please go here.
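As an illustration of how the ground-truth text might be read from one of the PAGE XML files, here is a minimal Python sketch. It assumes the usual PAGE layout, in which each `TextRegion` element carries its transcription in a `TextEquiv/Unicode` child. The sample document, its namespace URI, the file name `example.tif` and the helper names `local_name` and `read_text_regions` are illustrative assumptions, not taken from the dataset. Because the schema version (and therefore the namespace) can differ between files, the sketch compares local tag names only.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified PAGE document used only for illustration.
# Real files contain coordinates, metadata and many more regions.
SAMPLE = b"""<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="200">
    <TextRegion id="r1">
      <TextEquiv><Unicode>Eerste regio</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextEquiv><Unicode>Tweede regio</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_regions(xml_source):
    """Return a list of (region_id, text) pairs from a PAGE XML file or file object."""
    root = ET.parse(xml_source).getroot()
    regions = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        text = ""
        # Inspect direct children only, so that any line- or word-level
        # TextEquiv elements nested deeper in the region are skipped.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    text = sub.text or ""
        regions.append((elem.get("id"), text))
    return regions


regions = read_text_regions(io.BytesIO(SAMPLE))
```

The same function can be pointed at a path to one of the downloaded XML files instead of the in-memory sample. The text it returns can then be compared with the output of an OCR engine run on the matching TIF image.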
The IMPACT KB dataset is a representation of the KB collections and was compiled during the IMPACT project, a European research project aimed at improving access to historical text. The project ran from 2008 to 2012 and was coordinated by the KB.
The full dataset is being made available in the Public Domain. You can download the individual sets as zipped archives from here:
In addition, this spreadsheet provides a concordance between the IMPACT file names and the corresponding KB-IDs.
TIF (15.1 GB), XML (9.0 MB)
TIF (16.7 GB), XML (12.8 MB)
- Parliamentary Proceedings:
TIF (5.3 GB), XML (8.5 MB)
- Radio Bulletins:
TIF (0.9 GB), XML (0.5 MB)