CollAIte

An Artificial Intelligence Approach to Comparing Text Versions

As happens in many research projects, we started the COLLaiTE project with a slightly different aim than the one we eventually pursued. This is unsurprising, considering the novelty of our original goal: we set out to apply machine learning algorithms to improve the comparison of literary text versions. To understand why, we start with a brief explanation of why one would want to compare the different versions of a literary text in the first place. Literary texts are dynamic entities: they go through different stages of development before their first publication and continue to change afterwards. This development is clearly visible in the texts of nineteenth- and twentieth-century authors, whose notes, drafts and typescripts contain traces of it in the form of deletions, additions and revisions. But medieval and classical texts, too, were subject to revision and changed over time, as are contemporary literary works that have been drafted on a computer.

By comparing the different text versions, scholars gain insight into the parts that have been heavily revised, deleted, or added. This helps them understand how a literary work is subject to cultural and societal influences, for instance through censorship or adaptation. The practice of text comparison (or “collation”) dates back to the ancient libraries of Alexandria and Pergamum, but it took flight with the advent of the computer. By carefully transcribing and annotating the text versions in a machine-readable language like XML, scholars can employ software algorithms to facilitate the collation process which, to be fair, is quite tiresome and error-prone. To date, a number of advanced text collation programs exist. They compare text versions at the word or character level and output a detailed overview of the textual variation. Typically, the input of these programs is plain text, which means the annotated XML files need to be converted to a format that contains less scholarly information.
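To make the process described above concrete, here is a minimal sketch of word-level collation using only the Python standard library. It is an illustration under stated assumptions, not the algorithm of any existing collation program: the two XML snippets and the function names `xml_to_tokens` and `collate` are hypothetical, and the alignment is done with the general-purpose `difflib.SequenceMatcher`. Note how the conversion to plain text discards the `<del>` annotation in the first version, which is exactly the loss of scholarly information mentioned above.

```python
import difflib
import xml.etree.ElementTree as ET


def xml_to_tokens(xml_string):
    """Drop all markup and return the remaining text as a list of word tokens."""
    root = ET.fromstring(xml_string)
    plain_text = "".join(root.itertext())
    return plain_text.split()


def collate(tokens_a, tokens_b):
    """Align two token lists and return (operation, words_a, words_b) tuples."""
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Hypothetical transcriptions of two versions of the same sentence.
version_a = "<p>The <del>black</del> cat sat on the mat</p>"
version_b = "<p>The cat slept on the mat</p>"

alignment = collate(xml_to_tokens(version_a), xml_to_tokens(version_b))
for operation, words_a, words_b in alignment:
    print(operation, words_a, words_b)
```

Each tuple reports whether a run of words is shared between the versions ("equal") or differs ("delete", "insert", "replace"). A list like this quickly becomes hard to read for longer texts with many versions.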

The COLLaiTE project set out to apply machine learning algorithms to further improve the text collation process, in particular by deploying pattern-finding algorithms to compare XML transcriptions of text versions. We expected that this would allow us to use the scholarly annotations in the XML to improve the collation output. However, a few months into the project we realized that in order to properly train the machine learning algorithms, we required a larger dataset, which, due to the nature of our field, did not exist. We therefore decided to focus on another major challenge of the research field: the visualization of collation output. As mentioned above, this output is usually very detailed, which means that it’s only understandable to a small group of text specialists. This is regrettable, because insight into the way a literary work changed over time can be of interest to a larger audience of both scholars and non-experts.

With the addition of a second Research Software Engineer who specializes in UX research and web design, the second half of the project focused on developing a visually engaging interface for the open-source collation software CollateX. The result, called “Collens”, was presented at the project’s Lorentz workshop and received positive feedback. Following the shift in focus, we had decided to center this Lorentz workshop on the visualization of text variation, too, and this proved to be a successful choice: it turns out that there is a large international community of scholars interested in the same topic. During the workshop, we laid the foundations of an international working group that aims to facilitate the interchange of collation software output and to work on a set of guidelines for the design of text variation visualizations. As for Collens, the next step is to apply for funding to develop the prototype into a sustainable, engaging research environment of its own.

Participating organisations

Social Sciences & Humanities

Testimonials

I think it’s safe to say that this is one of the prettiest visualizations the field has seen

– Elli Bleeker, Project Lead Applicant

Related software

Collens

Collens is a dynamic web-based tool designed for scholars to compare textual variants with annotations. Utilizing machine learning, it offers an efficient workflow for analyzing multiple versions of literary and scholarly documents.

Harmony

Making harmonisation simple. Social scientists often have to compare items from different questionnaires or datasets. Harmony is a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items quickly, even in different languages.

Contact person

Kody Moodley

Output

Computer programs: 1

Posters: 1

Workshops: 1
