ochre

A tool to clean up text generated by OCR using individual words as well as their context.

2 mentions, 1 contributor

What ochre can do for you

  • Train character-based language models/LSTMs for OCR post-correction
  • Ready-to-use workflows for data preprocessing, training correction models, performing the post-correction, and analyzing (remaining) errors
  • Compare (corrected) OCR text to the gold standard based on character error rate (CER), word error rate (WER), and order-independent word error rate
  • Analyze OCR errors on the word level
  • Discover OCR post-correction data sets
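
The three evaluation metrics named in the list above can be illustrated with a minimal Python sketch. This is not ochre's own implementation: the function names are hypothetical, words are split on whitespace, and the order-independent variant is assumed to compare bags of words (word counts regardless of position), which may differ from the exact definition ochre uses.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(gold, ocr):
    """Character error rate: character edits divided by gold length."""
    if not gold:
        raise ValueError("gold standard text must not be empty")
    return edit_distance(gold, ocr) / len(gold)


def wer(gold, ocr):
    """Word error rate: word edits divided by number of gold words."""
    gold_words = gold.split()
    if not gold_words:
        raise ValueError("gold standard text must not be empty")
    return edit_distance(gold_words, ocr.split()) / len(gold_words)


def order_independent_wer(gold, ocr):
    """Assumed definition: compare word counts, ignoring word order."""
    gold_counts = Counter(gold.split())
    ocr_counts = Counter(ocr.split())
    if not gold_counts:
        raise ValueError("gold standard text must not be empty")
    missing = sum((gold_counts - ocr_counts).values())
    extra = sum((ocr_counts - gold_counts).values())
    # A missing word paired with an extra word counts as one substitution.
    return max(missing, extra) / sum(gold_counts.values())
```

In this sketch, swapping two words raises the WER but leaves the order-independent WER at zero, which is why the latter can be useful when the OCR reading order differs from the gold standard.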

Ochre is experimental software for cleaning up text with OCR mistakes. The software was developed to investigate whether character-based language models can be used to remove OCR mistakes. In addition, ochre provides functionality to analyze the kinds of OCR mistakes in a corpus. This enables researchers to compare different OCR post-correction methods and find out which kinds of mistakes each method corrects well.
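
As a rough illustration of word-level error analysis, the sketch below aligns a gold standard sentence with its OCR output and tallies the kinds of differences. It relies on Python's standard `difflib.SequenceMatcher` rather than on ochre's own workflows, and the function names `word_level_errors` and `summarize` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_level_errors(gold, ocr):
    """Return (kind, gold_words, ocr_words) for each differing span."""
    gold_words, ocr_words = gold.split(), ocr.split()
    matcher = SequenceMatcher(None, gold_words, ocr_words, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete', or 'insert'
            errors.append((tag,
                           " ".join(gold_words[i1:i2]),
                           " ".join(ocr_words[j1:j2])))
    return errors


def summarize(errors):
    """Count how often each kind of word-level difference occurs."""
    return Counter(tag for tag, _, _ in errors)


errors = word_level_errors("the quick brown fox jumps",
                           "tne quick fox jumps over")
for kind, gold_span, ocr_span in errors:
    print(f"{kind}: gold={gold_span!r} ocr={ocr_span!r}")
```

Running the same tally on the output of two post-correction methods would show which kinds of differences each one leaves behind.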

Programming languages
  • Common Workflow Language 43%
  • Jupyter Notebook 39%
  • Python 16%
  • HTML 1%
  • Shell 1%

Participating organisations

Social Sciences & Humanities
Koninklijke Bibliotheek
Netherlands eScience Center

Contributors

Janneke van der Zwaan

Related projects

Deep learning OCR post-correction

Evaluation and post-correction of OCR of digitised historical newspapers

Updated 20 months ago
Finished

Related software

nlppln

A flexible solution to build text mining workflows that allows you to quickly combine Natural Language Processing tools from different sources.

Updated 28 months ago