TICCLAT

Text-induced corpus correction and lexical assessment tool

The Text-Induced Corpus Clean-up tool TICCL, an integral part of the CLARIN infrastructure, is globally unique in using corpus-derived word form statistics to attempt fully automatic post-correction of texts digitized by means of Optical Character Recognition (OCR).

The NWO ‘Groot’ project Nederlab will deliver, by the end of 2017, a uniformly processed and linguistically enriched diachronic corpus of Dutch containing an estimated 5-6 billion word tokens. We aim to extend TICCL’s correction capabilities with classification facilities based on specific data collected from the full Nederlab corpus: word statistics, document and time references, and linguistic annotations, i.e. Part-of-Speech and Named-Entity labels. These data will complement a solid, renewed basis composed of the available validated lexicons and name lists for Dutch.

In this way, TICCL as a post-correction tool will be transformed into TICCLAT, a lexical assessment tool capable of delivering not only correction candidates, but also, for example, more accurately dated diachronic Dutch word forms and more securely classified person and place names. To achieve this at scale, the TICCLAT project will seek a successful merger of TICCL’s anagram hashing with bit-vectorization techniques. TICCLAT’s capabilities will also be evaluated in comparison to human performance by an expert psycholinguist.
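As a rough illustration of the anagram hashing idea mentioned above, the following Python sketch assigns every word a numeric key that depends only on which characters it contains, not on their order, so that variant lookup becomes integer arithmetic on keys. The exponent of 5, the use of Unicode code points, and the function names `anagram_key`, `build_index` and `substitution_candidates` are assumptions made for this example; it is not TICCL's actual code, and the bit-vectorization side is not sketched here.

```python
from collections import defaultdict

# Assumed exponent for this illustration; the real tool's choice may differ.
POWER = 5


def anagram_key(word: str) -> int:
    """Sum of per-character values; identical for all anagrams of a word."""
    return sum(ord(ch) ** POWER for ch in word.lower())


def build_index(lexicon):
    """Map each anagram key to the set of lexicon words sharing that key."""
    index = defaultdict(set)
    for entry in lexicon:
        index[anagram_key(entry)].add(entry)
    return index


def substitution_candidates(word, index, alphabet):
    """Lexicon words whose key equals the key of `word` with one character
    swapped for another from `alphabet`. The lookup uses key arithmetic
    only; no strings are compared."""
    base = anagram_key(word)
    found = set()
    for old in set(word.lower()):
        for new in alphabet:
            if new == old:
                continue
            key = base - ord(old) ** POWER + ord(new) ** POWER
            found |= index.get(key, set())
    found.discard(word)
    return found


# Hypothetical mini-lexicon with a modern and a historical Dutch spelling.
index = build_index(["huis", "huys", "boek"])
candidates = substitution_candidates("hnis", index, "abcdefghijklmnopqrstuvwxyz")
```

Because the key ignores character order, a hit only indicates that the character multisets match after one substitution; a real system would still need to rank or filter the retrieved candidates, for instance by frequency or edit distance.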

The data collected will be exportable for storage in a data repository, as RDF triples, for broad reuse. The project will greatly contribute to a more comprehensive overview of the lexicon of Dutch since its earliest days and of the person and place names that share its history. Its partners are the Dutch experts in Lexicology, Person Names and Toponyms.

Participating organisations

Social Sciences & Humanities

Finished

Team

Contact person

Patrick Bos

Related projects

Understanding visually grounded spoken language via multi-tasking

Bridging the gap

NEWSGAC

GlamMap

Deep learning OCR post-correction

Visualizing Uncertainty and Perspectives

Mining Shifting Concepts Through Time (ShiCo)

Beyond the Book

Texcavator

SPuDisc
