A package to determine the quality of a digitized text from handwritten script or scanned print (HTR/OCR output).



What htr-quality-classifier can do for you

Text Quality


The current pipeline is tuned for (historic) Dutch and will not perform well on other languages. However, the underlying model has also been used for other (Germanic) languages, and it can be adapted and applied to texts in other languages and time periods.


Good quality (not necessarily perfect):

Malacca den 29 maart 1.
door zoo veel ruijmer handen te hebben,
Siac van waar op den 5=e deeser,
na onse verschijde adhortaties, is over
eeen gekomen
zoo meede van Siac

Bad quality:

uijtkoops --
winst suijverevense versis
e ee
,, 19
1 oe
na aftrek van
5 p:s C: Commiss:s
t 1a per 't geheel t p=s lb. off @'t geheeke

What's Missing

  • Pipelines for languages other than historic Dutch
  • An automatic training procedure for creating and updating pipelines
  • Additional features, such as publication year

See this notebook for a semi-automated pipeline creation process.

How to use text_quality

After installation, use the command-line script to classify PageXML or plain text files. For instance, to classify all *.xml files in the page/ directory, use the --glob argument: --glob "page/*.xml" --output classifications.csv --output-scores

One output line per input file is written in CSV format, containing the classification result:

  1. Good quality
  2. Medium quality
  3. Bad quality
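The CSV output can easily be processed downstream. A minimal sketch in Python, mapping the numeric classes above back to their labels; note that the column names `filename` and `quality` are assumptions here, so check the header of your actual output file:

```python
import csv
import io

# Example output in the CSV format described above; the column names
# "filename" and "quality" are assumed, not the documented header.
sample = io.StringIO(
    "filename,quality\n"
    "page/scan_001.xml,1\n"
    "page/scan_002.xml,3\n"
)

# The three classes produced by the classifier:
LABELS = {1: "Good quality", 2: "Medium quality", 3: "Bad quality"}

for row in csv.DictReader(sample):
    label = LABELS[int(row["quality"])]
    print(f"{row['filename']}: {label}")
```

With `--output-scores`, additional columns with scores and text statistics appear after the classification column.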

All supported parameters:

$ --help
usage: Classify the quality of a (digitized) text. [-h] [--input [FILE ...]] [--pagexml [FILE ...]] [--pagexml-glob PATTERN] [--output FILE] [--output-scores]

  -h, --help            show this help message and exit
  --output FILE, -o FILE
                        Output file; defaults to stdout.
  --output-scores       Output scores and text statistics.

  --input [FILE ...], -i [FILE ...]
                        Plain text file(s) to classify. Use '-' for stdin.
  --pagexml [FILE ...]  Input file(s) in PageXML format.
  --pagexml-glob PATTERN, --glob PATTERN
                        A pattern to find a set of PageXML files, e.g. 'pagexml/*.xml'.


The pipeline might emit warnings like this:

UserWarning: X does not have valid feature names, but MLPClassifier was fitted with feature names

This is due to the internals of the Scikit-Learn Pipeline object, and can safely be ignored.
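Since the warning is harmless, you may want to silence it when running the classifier in batch jobs. A sketch using Python's standard warnings module; the message argument is a regular expression matched against the start of the warning text quoted above:

```python
import warnings

# Ignore the specific scikit-learn feature-name warning; other
# UserWarnings are still shown.
warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names",
    category=UserWarning,
)

# A warning matching the filter is now suppressed:
warnings.warn(
    "X does not have valid feature names, but MLPClassifier was fitted "
    "with feature names",
    UserWarning,
)
print("no warning shown")
```

Place the filter before the classification code runs; it only affects warnings raised after it is registered.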

The dependencies are pinned to specific versions. While this prevents implicit updates of required libraries, even at the patch level, it also prevents misleading warnings emitted by varying Scikit-Learn versions. If you are aware of these issues, the pinned dependencies can be changed manually.



To install the text_quality package:

pip install -U text-quality

Alternatively, install the package from GitHub repository:

git clone
cd htr-quality-classifier
python3 -m pip install -U .
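To verify the installation, you can query the installed distribution version. A small sketch using only the standard library; the distribution name `text-quality` is taken from the pip command above:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(pkg: str) -> Optional[str]:
    """Return the installed version of pkg, or None if it is not installed."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None


print(installed_version("text-quality") or "text-quality is not installed")
```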



Software Architecture

This diagram shows the class design of the text_quality package.

[Software architecture diagram]


If you want to contribute to the development of text_quality, have a look at the contribution guidelines.


Logic and implementation are based on Nautilus-OCR.

This package was created with Cookiecutter and the NLeSC/python-template.



Participating organisations

Netherlands eScience Center
KNAW Humanities Cluster
Huygens Instituut


Carsten Schnober
eScience Research Engineer
Netherlands eScience Center

Related projects


Leveraging AI for HTR post-correction
