Morphological Parser for Inflectional Languages Using Deep Learning

Morphological encoding for texts in Syriac using machine learning

Linguistic corpora are usually parsed at the word level. That works well for languages such as English. However, for languages with a rich morphology, it is more rewarding to take morphemes, rather than words, as the basic units. Compare the Hebrew word wayyamlichehu (2 Samuel 2:9), which corresponds to five words in English: “and they made him king”. Using a Hebrew corpus with extensive morphological encoding (the complete Old Testament in the database of the ETCBC, the Eep Talstra Centre for Bible and Computer), we have trained machine learning (ML) models to perform the morphological encoding of texts in Syriac (a related Semitic language). Since the ETCBC morphological encoding is concisely structured as a string, we have applied various sequence-to-sequence (seq2seq) models. Depending on the quality and amount of the input data, we have been able to raise the accuracy of the predicted forms to as much as 97% in some experiments. We plan to apply the ML model to accelerate our encoding of large Syriac corpora, starting with the Syriac Bible (Peshitta).
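
Because the encoding is a string, the task can be framed as mapping one character sequence (the surface form) to another (the encoded form). The following is a minimal sketch of that framing, not the models or data format used in this work: the hyphenated target strings are invented placeholders rather than real ETCBC encoding, the helper names are hypothetical, and it assumes accuracy is measured as exact match per predicted form.

```python
from typing import Dict, List

# Special symbols commonly used in character-level seq2seq setups (assumed here).
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(strings: List[str]) -> Dict[str, int]:
    """Map every character that occurs in the data to an integer id."""
    symbols = [PAD, SOS, EOS] + sorted({ch for s in strings for ch in s})
    return {sym: i for i, sym in enumerate(symbols)}


def encode(s: str, vocab: Dict[str, int]) -> List[int]:
    """Turn a string into ids, wrapped in start and end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in s] + [vocab[EOS]]


def exact_match_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of forms whose predicted encoding equals the gold encoding exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical (surface form, encoded form) pairs; the hyphen markup is a
# placeholder for illustration only, NOT the ETCBC notation.
pairs = [("wayyamlichehu", "wa-yyamlich-ehu"), ("malka", "malk-a")]

source_vocab = build_vocab([src for src, _ in pairs])
target_vocab = build_vocab([tgt for _, tgt in pairs])

# Integer sequences like these would be the inputs and targets of a seq2seq model.
source_ids = [encode(src, source_vocab) for src, _ in pairs]
target_ids = [encode(tgt, target_vocab) for _, tgt in pairs]
```

Scoring whole forms by exact match is a strict choice: a prediction with a single wrong character counts as an error, which may or may not correspond to the measure behind the figures reported above.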