Morphological Parser for Inflectional Languages Using Deep Learning

Morphological encoding for texts in Syriac using machine learning

Linguistic corpora are usually parsed at the word level. That works well for languages such as English. However, for languages with a rich morphology, it is more rewarding to take morphemes, rather than words, as the basic units. Consider the Hebrew word wayyamlichehu (2 Samuel 2:9), which corresponds to five words in English: “and they made him king”. Using an extensively morphologically encoded Hebrew corpus (the complete Old Testament in the database of the ETCBC, the Eep Talstra Centre for Bible and Computer), we have trained ML models to perform the morphological encoding of texts in Syriac (a related Semitic language). Because the ETCBC morphological encoding is concisely structured as a string, we have applied various seq2seq models. Depending on the quality and amount of the input data, the accuracy of the predicted forms reached up to 97% in some experiments. We plan to use the ML model to accelerate our encoding of large Syriac corpora, starting with the Syriac Bible (Peshitta).
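Treating morphological encoding as a string-to-string task can be illustrated with a minimal sketch of the data preparation and scoring around a character-level seq2seq model. Everything in it is an assumption made for illustration: the helper names, the special tokens, the English toy pairs, and the "-" boundary marker are hypothetical and do not reproduce the ETCBC notation or the project's code. Scoring by exact match of whole forms is likewise only one plausible reading of "accuracy of the predicted forms".

```python
from typing import Dict, List, Sequence

# Hypothetical special tokens for padding and sequence boundaries.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(strings: Sequence[str]) -> Dict[str, int]:
    """Map the special tokens and every character seen in `strings` to an integer id."""
    symbols = [PAD, SOS, EOS] + sorted({ch for s in strings for ch in s})
    return {sym: i for i, sym in enumerate(symbols)}


def encode(s: str, vocab: Dict[str, int]) -> List[int]:
    """Turn a string into a list of character ids wrapped in start/end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in s] + [vocab[EOS]]


def exact_match_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted forms that are identical to the gold encoding."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy (surface form, encoded form) pairs; "-" is an illustrative morpheme
# boundary, NOT the ETCBC encoding.
pairs = [("walked", "walk-ed"), ("cats", "cat-s")]

# Separate vocabularies for the input (surface) and output (encoded) side.
source_vocab = build_vocab([src for src, _ in pairs])
target_vocab = build_vocab([tgt for _, tgt in pairs])

# Integer sequences like these are what a seq2seq model would consume.
encoded_pairs = [
    (encode(src, source_vocab), encode(tgt, target_vocab)) for src, tgt in pairs
]
```

A trained model would map each source sequence to a target sequence; decoding the output ids back to characters and passing the resulting strings to `exact_match_accuracy` gives a form-level score.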

Participating organisations

KU Leuven
Netherlands eScience Center
University of Notre Dame
Vrije Universiteit Amsterdam
Social Sciences & Humanities

Output

Team

W.T. van Peursen
C. Kingham
co-Applicant
University of Cambridge
C. J. Sikkel
co-Applicant
VU Amsterdam
Dafne van Kuppevelt
David Smiley
co-Applicant
University of Notre Dame
M. Coeckelbergs
M. Naaijer
co-Applicant
Affiliated researcher ETCBC
Jisk Attema
RSE, Programme Manager
Netherlands eScience Center

Related projects

Bridging the gap

Digital humanities and the Arabic-Islamic corpus

Updated 24 months ago
Finished

EviDENce

Ego Documents Events modelling – how individuals recall mass violence

Updated 20 months ago
Finished