htr-quality-classifier
A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).
Leveraging AI for HTR post-correction
The goal of the LAHTeR project was to enhance the usability of computer-generated transcriptions of early modern handwritten archival documents from the Dutch East India Company (VOC). Conducted in collaboration with GLOBALISE, a project dedicated to improving the research potential of VOC archives, LAHTeR aimed to make these materials more accessible and easier to use for researchers.
Our project’s objectives evolved from exploring post-correction methods to developing a model for document segmentation and classification. The shift followed from the first step of the project, the development of a transcription quality classifier. That classifier revealed that the majority of transcriptions were already of high quality, and that pages with lower transcription quality often contained non-textual elements, such as tables, which are inherently difficult for text-focused systems to process. We therefore redirected our efforts from post-correction to the more pressing need of segmenting and classifying documents.
Our transcription quality classifier is now in use beyond the project (e.g., by the National Archives of the Netherlands). This tool provides quick insights into transcription quality, helping users identify pages requiring focused attention. Additionally, the groundwork laid for a segmentation and classification model has far-reaching potential, particularly for large archival collections currently undergoing digitization. These models will help researchers navigate vast archival series more effectively.
Moving forward, we plan to build on the segmentation and classification models, incorporating visual components to improve their accuracy. We invite archival institutions, digital humanities projects, and researchers to follow our work in GLOBALISE and consider adopting the tools developed through LAHTeR. More information is available at the LAHTeR GitHub repository.
Bringing the history of early globalisation and colonialism to the fingertips of researchers and the wider public.
Recognizing Extracted Entities for the Historical Database Suriname Curacao