LAHTeR

Leveraging AI for HTR post-correction

Image by Dave Straatmeyer

The goal of the LAHTeR project was to enhance the usability of computer-generated transcriptions of early modern handwritten archival documents from the Dutch East India Company (VOC). Conducted in collaboration with GLOBALISE, a project dedicated to improving the research potential of VOC archives, LAHTeR aimed to make these materials more accessible and easier to use for researchers.

Our project’s objectives evolved from exploring post-correction methods to developing a model for document segmentation and classification. The shift followed from the first step in the project, the development of a transcription quality classifier. The classifier revealed that the majority of transcriptions were already of high quality, and that pages with lower transcription quality often contained non-textual elements, such as tables, which are inherently difficult for text-focused systems to process. We therefore redirected our efforts from post-correction to the more pressing need of segmenting and classifying documents.

Our transcription quality classifier is now in use beyond the project (e.g., by the National Archives of the Netherlands). This tool provides quick insights into transcription quality, helping users identify pages requiring focused attention. Additionally, the groundwork laid for a segmentation and classification model has far-reaching potential, particularly for large archival collections currently undergoing digitization. These models will help researchers navigate vast archival series more effectively.

Moving forward, we plan to build on the segmentation and classification models, incorporating visual components to improve their accuracy. We invite archival institutions, digital humanities projects, and researchers to follow our work in GLOBALISE and consider adopting the tools developed through LAHTeR. More information is available in the LAHTeR GitHub repository.

Participating organisations

Huygens Instituut
Netherlands eScience Center
Social Sciences & Humanities

Team

Lodewijk Petram
Lead Applicant
Royal Netherlands Academy of Arts & Sciences (KNAW)
Jisk Attema
Programme Manager
Netherlands eScience Center
Carsten Schnober
Lead RSE
Netherlands eScience Center

Related projects

GLOBALISE

Bringing the history of early globalisation and colonialism to the fingertips of researchers and the wider public.

Updated 2 months ago
In progress

REE-HDSC

Recognizing Extracted Entities for the Historical Database Suriname Curacao

Updated 6 months ago
Finished

Related software

htr-quality-classifier

A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).

Updated 12 months ago