ochre

A tool to clean up text generated by OCR using individual words as well as their context.

2 mentions, 1 contributor

What ochre can do for you

  • Train character-based language models/LSTMs for OCR post-correction
  • Ready-to-use workflows for data preprocessing, training correction models, performing the post-correction, and analyzing (remaining) errors
  • Compare (corrected) OCR text to the gold standard based on character error rate (CER), word error rate (WER), and order-independent word error rate
  • Analyze OCR errors on the word level
  • Discover OCR post-correction data sets
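
To make the evaluation measures listed above concrete, here is a minimal sketch of how CER and WER are commonly computed from the Levenshtein (edit) distance. It is not ochre's own implementation: the function names are hypothetical, and the order-independent variant is only one possible interpretation (it compares bags of words, so word order is ignored), since this page does not define that measure. The sketch assumes a non-empty gold standard text.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(gold, ocr):
    """Character error rate: character edits divided by gold length."""
    return edit_distance(gold, ocr) / len(gold)


def wer(gold, ocr):
    """Word error rate: word edits divided by number of gold words."""
    gold_words = gold.split()
    return edit_distance(gold_words, ocr.split()) / len(gold_words)


def order_independent_wer(gold, ocr):
    """Assumed definition: bag-of-words mismatch divided by number of gold words."""
    gold_counts = Counter(gold.split())
    ocr_counts = Counter(ocr.split())
    missing = sum((gold_counts - ocr_counts).values())  # gold words not in OCR
    extra = sum((ocr_counts - gold_counts).values())    # OCR words not in gold
    return max(missing, extra) / sum(gold_counts.values())
```

With this sketch, a text whose words are all recognized correctly but appear in a different order gets a non-zero WER and an order-independent WER of zero, which is the kind of distinction the third measure is meant to capture.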

Ochre is experimental software for cleaning up text with OCR mistakes. The software was developed to investigate whether character-based language models can be used to remove OCR mistakes. In addition, ochre provides functionality to analyze the kinds of OCR mistakes in a corpus. This enables researchers to compare different OCR post-correction methods and find out which kinds of mistakes each method corrects well.
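
As an illustration of what a word-level error analysis can look like, the sketch below aligns the gold standard with the OCR output using Python's standard `difflib.SequenceMatcher` and counts every mismatching span. This is an assumption-laden example, not the method ochre uses: the function name `word_level_errors` is hypothetical, and the error labels are simply difflib's opcode tags (`replace`, `delete`, `insert`).

```python
from collections import Counter
from difflib import SequenceMatcher


def word_level_errors(gold, ocr):
    """Count mismatching word spans between gold standard and OCR text.

    Keys are (tag, gold_span, ocr_span), where tag is a difflib opcode.
    """
    gold_words, ocr_words = gold.split(), ocr.split()
    # autojunk=False disables difflib's popularity heuristic for long inputs
    matcher = SequenceMatcher(a=gold_words, b=ocr_words, autojunk=False)
    errors = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # correctly recognized words are not errors
        gold_span = " ".join(gold_words[i1:i2])
        ocr_span = " ".join(ocr_words[j1:j2])
        errors[(tag, gold_span, ocr_span)] += 1
    return errors
```

Running this over a corpus before and after post-correction and comparing the two counters gives a rough picture of which mistakes a correction method removes and which remain.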

Programming languages
  • Common Workflow Language 43%
  • Jupyter Notebook 39%
  • Python 16%
  • HTML 1%
  • Shell 1%
License
  • Apache-2.0

Participating organisations

Social Sciences & Humanities
Koninklijke Bibliotheek
Netherlands eScience Center

Contributors

Contact person

Janneke van der Zwaan

Netherlands eScience Center

Related projects

Deep learning OCR post-correction

Evaluation and post-correction of OCR of digitised historical newspapers

Updated 12 months ago
Finished

Related software

nlppln

A flexible solution to build text mining workflows that allows you to quickly combine Natural Language Processing tools from different sources.

Updated 21 months ago