ochre

Janneke van der Zwaan

Description

Train character-based language models/LSTMs for OCR post-correction
Ready-to-use workflows for data preprocessing, training correction models, doing the post-correction, and analyzing (remaining) errors
Compare (corrected) OCR text to the gold standard based on character error rate (CER), word error rate (WER), and order-independent word error rate
Analyze OCR errors on the word level
Discover OCR post-correction data sets
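
The evaluation metrics named above can be illustrated with a small, self-contained sketch. This is not ochre's own implementation: the function names are hypothetical, CER and WER are computed with a plain Levenshtein distance, and the order-independent word error rate is assumed here to be a bag-of-words comparison (word counts are compared, word positions are ignored). The gold standard text is assumed to be non-empty.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete an item from ref
                            curr[j - 1] + 1,      # insert an item from hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(gold, ocr):
    """Character error rate: character edits divided by gold length."""
    return edit_distance(gold, ocr) / len(gold)


def wer(gold, ocr):
    """Word error rate: word edits divided by number of gold words."""
    gold_words = gold.split()
    return edit_distance(gold_words, ocr.split()) / len(gold_words)


def order_independent_wer(gold, ocr):
    """Bag-of-words error rate (assumed definition, ignores word order)."""
    gold_counts = Counter(gold.split())
    ocr_counts = Counter(ocr.split())
    missing = sum((gold_counts - ocr_counts).values())  # gold words not found
    extra = sum((ocr_counts - gold_counts).values())    # spurious OCR words
    # Each missing word paired with an extra word counts as one substitution;
    # the unpaired remainder counts as insertions or deletions.
    return max(missing, extra) / sum(gold_counts.values())


gold = "the quick brown fox"
ocr = "tbe quick brovvn fox"
print(cer(gold, ocr))                    # 3 character edits over 19 characters
print(wer(gold, ocr))                    # 0.5
print(order_independent_wer(gold, ocr))  # 0.5
```

In this sketch, a text whose words are all correct but appear in a different order (for example, because of column or reading-order problems) gets a non-zero WER but an order-independent word error rate of zero.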

Ochre is experimental software for cleaning up text that contains OCR mistakes. It was developed to investigate whether character-based language models can be used to correct such mistakes. In addition, ochre provides functionality for analyzing the kinds of OCR mistakes in a corpus. This enables researchers to compare different OCR post-correction methods and find out which kinds of mistakes each method corrects well.

Keywords

Machine learning

Text analysis & natural language processing

Programming languages