nlppln

doi:10.5281/zenodo.1116323

Description

Quickly build text mining and/or NLP workflows in Python
Combine tools written in different programming languages

Digital Humanities research often involves Natural Language Processing (NLP), in which a body of natural language text, or corpus, is analyzed using software. Many software packages are available, but constructing new research analyses by combining (parts of) existing packages remains challenging. Individual packages are designed to do one task and to do it well; they are not primarily designed to interact with other, complementary packages. Another problem is that many tools are available for English, but far fewer for other languages.

nlppln (pronounced 'NLP pipeline') is an open-source Python package that helps address these problems by making it easy to package existing tools in a uniform way, as defined in the Common Workflow Language (CWL) standard for describing data analysis workflows. nlppln includes components for common NLP tasks, such as tokenization (multiple languages), lemmatization (for Dutch), and named entity recognition (for Dutch). These components are based on existing tools. Users can construct new analysis workflows by combining these ready-made components with tools of their own creation.
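
To give a rough idea of what combining uniformly packaged tools can look like, here is a minimal sketch that assembles a CWL-style workflow description as a plain Python dictionary, using only the standard library. It does not use nlppln's own API. The tool file names (tokenize.cwl, lemmatize.cwl), the parameter names (txt_files, in_files, tokens, lemmas), and the helper functions are hypothetical placeholders chosen for illustration, not actual nlppln components.

```python
import json


def make_step(tool, inputs, outputs):
    """Describe one workflow step: the tool to run, its inputs, and its outputs."""
    return {"run": tool, "in": inputs, "out": outputs}


def build_workflow():
    """Chain two hypothetical tool descriptions into one workflow description."""
    # Assumption: tokenize.cwl and lemmatize.cwl are placeholder tool
    # descriptions, not files shipped with nlppln.
    steps = {
        "tokenize": make_step(
            "tokenize.cwl", {"in_files": "txt_files"}, ["tokens"]
        ),
        # The second step consumes the output of the first step,
        # referenced as "<step name>/<output name>".
        "lemmatize": make_step(
            "lemmatize.cwl", {"in_files": "tokenize/tokens"}, ["lemmas"]
        ),
    }
    return {
        "cwlVersion": "v1.0",
        "class": "Workflow",
        "inputs": {"txt_files": "File[]"},
        "outputs": {
            "lemmas": {"type": "File[]", "outputSource": "lemmatize/lemmas"}
        },
        "steps": steps,
    }


workflow = build_workflow()
# Serialize the description so it can be stored alongside the results.
workflow_json = json.dumps(workflow, indent=2, sort_keys=True)
print(workflow_json)
```

Because every step is described in the same way (a tool, named inputs, named outputs), a tool written in any programming language can be slotted in, and the serialized description lists every step that was run.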

Besides improving interoperability, nlppln also keeps a formal record of all steps taken in a workflow. This makes the research more transparent and easier to reproduce.

Keywords

Text analysis & natural language processing

Programming languages