Deep learning OCR post-correction

Evaluation and post-correction of OCR of digitised historical newspapers

Humanities research makes extensive use of digital archives. Most of these archives, including the KB newspaper data, consist of digitized text. One of the major challenges of using these collections for research is the fact that Optical Character Recognition (OCR) on scanned historical documents is far from perfect. Although it is hard to quantify the impact of OCR mistakes on humanities research, it is known that these mistakes have a negative impact on basic text processing techniques such as sentence boundary detection, tokenization, and part-of-speech tagging. As these basic techniques are often used prior to performing more advanced techniques and most advanced techniques use words as features, it is likely that OCR mistakes have a negative impact on more advanced text mining tasks humanities researchers are interested in, such as named entity recognition, topic modeling, and sentiment analysis.

The goal of this research is to bring the digitized text closer to the original newspaper articles by applying post-correction. Post-correction improves the quality of digitized text by manipulating the textual output of the OCR process directly. The idea is that better quality data boosts eHumanities research. Although the quality of the KB newspaper data would definitely benefit from improving the OCR process itself (improved image recognition), post-correction will still be necessary, because the quality of historical newspapers is suboptimal for OCR (for example, due to poor paper and print quality).

Existing approaches for OCR post-correction generally make use of extensive dictionaries to replace words in the OCRed text that do not occur in the dictionary with words that do. Based on the assumption that a number of characters in every word will be identified correctly, words not in the dictionary are replaced with alternatives that are as similar as possible to the recognized text, possibly taking into account word frequencies to break ties. The main problem with these existing approaches is that they do not take into account the context in which words occur.
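The dictionary-based approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes Levenshtein edit distance as the similarity measure, and the function names and the toy word frequencies are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_word(word: str, frequencies: Counter) -> str:
    """Keep known words; replace unknown words with the closest dictionary word."""
    if word in frequencies:
        return word
    # Smallest edit distance wins; a higher word frequency breaks ties.
    return min(frequencies,
               key=lambda w: (edit_distance(word, w), -frequencies[w]))


# Hypothetical toy dictionary with made-up frequencies (assumption).
toy_frequencies = Counter({"krant": 50, "kraan": 5, "brand": 20})
corrected = [correct_word(w, toy_frequencies) for w in ["krant", "krani", "hrand"]]
```

Each word is corrected in isolation here, which is exactly the limitation noted above: the surrounding words play no role in choosing the replacement.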

Deep learning techniques provide an opportunity to take this context into account. This project aims to learn a character-based language model of Dutch newspaper articles. This is a model of the character sequences occurring in the text of a corpus. OCR mistakes can be viewed as deviations from this model. Mistakes can be fixed by intervening when text deviates too much from the model.
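To make the idea of "deviating from the model" concrete, the sketch below flags character transitions that a character model finds unlikely. It deliberately swaps in a simple character bigram model with add-one smoothing for the deep learning model the project aims to learn. The class name, the toy corpus, and the threshold value are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram model with add-one smoothing (a stand-in, not a neural model)."""

    def __init__(self, corpus: str):
        self.alphabet = set(corpus)
        # Count how often each character follows each other character.
        self.pair_counts = Counter(zip(corpus, corpus[1:]))
        # Count how often each character occurs as the first of a pair.
        self.char_counts = Counter(corpus[:-1])

    def log_prob(self, prev: str, char: str) -> float:
        """Smoothed log probability of `char` given the preceding character."""
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        return math.log((self.pair_counts[(prev, char)] + 1)
                        / (self.char_counts[prev] + vocab))

    def suspicious_positions(self, text: str, threshold: float) -> list[int]:
        """Indices whose transition from the previous character scores below threshold."""
        return [i for i in range(1, len(text))
                if self.log_prob(text[i - 1], text[i]) < threshold]


# Hypothetical toy corpus and threshold (assumptions for illustration only).
model = CharBigramModel("de krant " * 10)
clean_flags = model.suspicious_positions("de krant", threshold=-2.0)
garbled_flags = model.suspicious_positions("de krxnt", threshold=-2.0)
```

With this toy corpus, the clean string produces no flags, while the transitions into and out of the unexpected "x" are flagged. A correction step would then intervene at the flagged positions. A model that conditions on longer character contexts could use more of the surrounding text than a bigram model can.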

Participating organisations

Koninklijke Bibliotheek
Netherlands eScience Center
Social Sciences & Humanities

Team

Hans Jansen
Principal Investigator
Koninklijke Bibliotheek
Janneke van der Zwaan
eScience Research Engineer
Netherlands eScience Center
Jisk Attema
Programme Manager
Netherlands eScience Center
Adriënne Mendrik
eScience Coordinator
Netherlands eScience Center

Related projects

Bridging the gap

Digital humanities and the Arabic-Islamic corpus

Updated 15 months ago
Finished

TICCLAT

Text-induced corpus correction and lexical assessment tool

Updated 12 months ago
Finished

GlamMap

Visual analytics for the world’s library data

Updated 11 months ago
Finished

Mining Shifting Concepts Through Time (ShiCo)

Word vector text mining change and continuity in conceptual history

Updated 15 months ago
Finished

Related software

ochre

A tool to clean up text generated by OCR using individual words as well as their context.

Updated 20 months ago