TICCLAT

Text-induced corpus correction and lexical assessment tool

The Text-Induced Corpus Clean-up tool TICCL, an integral part of the CLARIN infrastructure, is globally unique in using corpus-derived word form statistics to attempt fully automatic post-correction of texts digitized by means of Optical Character Recognition (OCR).

The NWO ‘Groot’ project Nederlab will deliver by the end of 2017 a uniformly processed and linguistically enriched diachronic corpus of Dutch containing an estimated 5-6 billion word tokens. We aim to extend TICCL’s correction capabilities with classification facilities based on specific data collected from the full Nederlab corpus: word statistics, document and time references and linguistic annotations, i.e. Part-of-Speech and Named-Entity labels. These data will complement a solid, renewed basis composed of the available validated lexicons and name lists for Dutch.

In this way, TICCL as a post-correction tool will be transformed into TICCLAT, a lexical assessment tool capable of delivering not only correction candidates but also, for example, more accurately dated diachronic Dutch word forms and more securely classified person and place names. To achieve this at scale, the TICCLAT project will seek to merge TICCL’s anagram hashing with bit-vectorization techniques. TICCLAT’s capabilities will also be evaluated against human performance by an expert psycholinguist.
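As a rough illustration of the anagram hashing idea mentioned above, the Python sketch below assigns each word a numeric key that depends only on its bag of characters, so that anagrams share a key and a single character substitution shifts the key by a predictable amount. The key function (sum of code points raised to the fifth power) is the form usually described for TICCL and is treated here as an assumption; the function names, the toy lexicon and the lowercase alphabet are hypothetical and not taken from the TICCL code base.

```python
from collections import defaultdict
import string


def anagram_key(word: str) -> int:
    """Order-independent key: sum of each code point raised to the 5th power (assumed form)."""
    return sum(ord(ch) ** 5 for ch in word)


def build_index(lexicon):
    """Map each anagram key to the set of lexicon words that share it."""
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_key(word)].add(word)
    return index


def substitution_candidates(word, index, alphabet=string.ascii_lowercase):
    """Return lexicon words whose character bag differs from `word` by one substitution.

    Replacing character `old` with `new` changes the key by
    ord(new)**5 - ord(old)**5, so candidates are found by plain dictionary
    lookups, without comparing `word` to every lexicon entry.
    """
    key = anagram_key(word)
    found = set()
    for old in set(word):
        for new in alphabet:
            if new == old:
                continue
            target = key - ord(old) ** 5 + ord(new) ** 5
            found |= index.get(target, set())
    return found


# Toy lexicon (hypothetical); "hnis" stands in for an OCR misreading of "huis".
lexicon = ["huis", "muis", "boek"]
index = build_index(lexicon)
candidates = substitution_candidates("hnis", index)
```

Because the key ignores character order, the lookup also returns words that are anagrams of a true one-substitution variant. In practice the retrieved candidates would still need to be filtered, for example by edit distance and by corpus frequency.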

The data collected will be exportable for storage in a data repository, as RDF triples, for broad reuse. The project will greatly contribute to a more comprehensive overview of the lexicon of Dutch since its earliest days and of the person and place names that share its history. Its partners are the Dutch experts in Lexicology, Person Names and Toponyms.

Participating organisations

Social Sciences & Humanities
Netherlands eScience Center
Tilburg University

Team

Janneke van der Zwaan
eScience Research Engineer
Netherlands eScience Center
Adriënne Mendrik
eScience Coordinator
Netherlands eScience Center
Maarten van Meersbergen
eScience Research Engineer
Netherlands eScience Center
Martin Reynaert
Principal investigator
Tilburg University
Patrick Bos
eScience Research Engineer
Netherlands eScience Center
Pushpanjali Pawar
eScience Research Engineer
Netherlands eScience Center
Tom Klaver
eScience Research Engineer
Netherlands eScience Center

Related projects

Understanding visually grounded spoken language via multi-tasking

An alternative approach for intelligent systems to understand human speech

Updated 20 months ago
Finished

Bridging the gap

Digital humanities and the Arabic-Islamic corpus

Updated 24 months ago
Finished

NEWSGAC

Advancing media history by transparent automatic genre classification

Updated 20 months ago
Finished

GlamMap

Visual analytics for the world’s library data

Updated 20 months ago
Finished

Deep learning OCR post-correction

Evaluation and post-correction of OCR of digitised historical newspapers

Updated 20 months ago
Finished

Visualizing Uncertainty and Perspectives

Strengthening the methodology of digital humanities

Updated 19 months ago
Finished

Mining Shifting Concepts Through Time (ShiCo)

Word vector text mining change and continuity in conceptual history

Updated 24 months ago
Finished

Beyond the Book

Visualizing the level of international readability of works of fiction

Updated 19 months ago
Finished

Texcavator

Facilitating and supporting large-scale text mining in the field of digital humanities

Updated 24 months ago
Finished

SPuDisc

Searching public discourse

Updated 20 months ago
Finished