Making harmonisation simple. Social scientists often have to compare items from different questionnaires or datasets. Harmony is a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items quickly, even in different languages.


What Harmony can do for you


Do you need to compare questionnaire items across studies? Do you want to find the best match for a set of items? Are there different versions of the same questionnaire floating around, and you want to check how compatible they are? Are the questionnaires you would like to compare written in different languages?

Here's a walkthrough video on how you can use Harmony online at harmonydata.ac.uk:

Harmonising questionnaires

The Harmony project uses natural language processing to help researchers make better use of existing data by supporting the harmonisation of measures and items used in different studies. Harmony is a collaboration between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. It is funded by Wellcome as part of the Wellcome Data Prize in Mental Health.

Harmony is a project in active development and you can contribute.

If you have found a bug or would like a new feature, you can raise an issue here for issues with Harmony's natural language understanding functionality, or alternatively here for issues with Harmony's user interface and graphics. You can also join our Discord server!

What does Harmony do?

  • Psychologists and social scientists often have to match items in different questionnaires, such as "I often feel anxious" and "Feeling nervous, anxious or afraid".
  • This is called harmonisation.
  • Harmonisation is a time-consuming and subjective process.
  • Going through long PDFs of questionnaires and putting the questions into Excel is no fun.
  • Enter Harmony, a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items, even in different languages.
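To make the matching idea above concrete, here is a minimal sketch that scores item pairs by word overlap. It is an illustration only and is not Harmony's method: Harmony relies on natural language processing models, whereas this toy uses simple word counts, and the helper names (`bag_of_words`, `cosine_similarity`, `best_match`) and the second candidate item are made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(item, candidates):
    """Return the candidate most similar to item, together with its score."""
    item_vec = bag_of_words(item)
    scored = [(cosine_similarity(item_vec, bag_of_words(c)), c) for c in candidates]
    score, match = max(scored)
    return match, score


# Illustrative candidate items; the second one is invented for this sketch.
candidates = [
    "Feeling nervous, anxious or afraid",
    "Trouble falling or staying asleep",
]
match, score = best_match("I often feel anxious", candidates)
print(match, round(score, 2))
```

Word overlap only finds the shared word "anxious" here and misses that "feel" and "Feeling" are related, which hints at why a tool built on language models is better suited to the task, especially across languages.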

Quick start with the code

Read our guide to contributing to Harmony here or read CONTRIBUTING.md.

You can run the walkthrough Python notebook in Google Colab with a single click:

You can also download an R markdown notebook to run in R Studio:

You can run the walkthrough R notebook in Google Colab with a single click:

The Harmony Project

Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at https://harmonydata.ac.uk/app and you can read our blog at https://harmonydata.ac.uk/blog/.

Who to contact?

You can contact the Harmony team at https://harmonydata.ac.uk/, or Thomas Wood at https://fastdatascience.com/.

🖥 Installation instructions (video)

Installing Harmony

🖱 Looking to try Harmony in the browser?

Visit: https://harmonydata.ac.uk/app/

You can also visit our blog at https://harmonydata.ac.uk/blog/

✅ You need Tika if you want to extract instruments from PDFs

Download and install Java if you don't have it already. Then download Apache Tika from https://tika.apache.org/download.html and run it on your computer:

java -jar tika-server-standard-2.3.0.jar


You need a Windows, Linux or Mac system with:

  • Python 3.8 or above
  • the requirements in requirements.txt
  • Java (if you want to extract items from PDFs)
  • Apache Tika (if you want to extract items from PDFs)
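The requirements above can be partly checked from Python before installing. The sketch below is a hypothetical helper (`check_prerequisites` is not part of Harmony) that uses only the standard library; it checks the interpreter version and whether a `java` executable is on the PATH, and it does not verify that Apache Tika is installed or running.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report which of the prerequisites listed above appear to be met."""
    return {
        "python": sys.version_info[:2] >= min_python,
        # Java is only needed for PDF extraction through Apache Tika.
        "java": shutil.which("java") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```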

🖥 Installing Harmony Python package

You can install from PyPI.

pip install harmonydata

Loading all models

Harmony uses spaCy to help with text extraction from PDFs. spaCy models can be downloaded with the following command in Python:

import harmony
harmony.download_models()

Matching example instruments

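# Pick two of the bundled example instruments, one English and one Portuguese.
instruments = [harmony.example_instruments["CES_D English"], harmony.example_instruments["GAD-7 Portuguese"]]
# Match the questions and unpack the results, including the similarity scores.
questions, similarity, query_similarity, new_vectors_dict = harmony.match_instruments(instruments)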
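Once you have similarity scores, a common next step is to list the strongest matches between instruments. The following is a sketch under stated assumptions rather than a description of Harmony's output format: it assumes a square matrix with one row and one column per question, represents each question as a hypothetical `(instrument_name, text)` tuple, and uses invented scores. Adapt the data access to the objects that `match_instruments` actually returns in your installed version.

```python
def top_cross_matches(questions, similarity, threshold=0.5):
    """List question pairs from different instruments scoring at least threshold.

    questions: list of (instrument_name, question_text) tuples (hypothetical layout).
    similarity: square list of lists, where similarity[i][j] scores questions i and j.
    """
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            same_instrument = questions[i][0] == questions[j][0]
            if not same_instrument and similarity[i][j] >= threshold:
                pairs.append((similarity[i][j], questions[i][1], questions[j][1]))
    # Highest-scoring pairs first.
    return sorted(pairs, reverse=True)


# Invented toy data for illustration only.
toy_questions = [
    ("A", "I often feel anxious"),
    ("A", "I sleep badly"),
    ("B", "Feeling nervous, anxious or afraid"),
    ("B", "Trouble falling asleep"),
]
toy_similarity = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.1, 0.7],
    [0.8, 0.1, 1.0, 0.2],
    [0.1, 0.7, 0.2, 1.0],
]
for score, left, right in top_cross_matches(toy_questions, toy_similarity):
    print(f"{score:.1f}  {left}  <->  {right}")
```

Pairs within the same instrument are skipped because harmonisation is about aligning items across questionnaires, and the threshold of 0.5 is an arbitrary value chosen for the toy data.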

How to load a PDF, Excel or Word into an instrument


Participating organisations

University of Ulster
University College London


In 2023, the Australian Data Archive (ADA) embarked on a project to harmonise a vast collection of survey questions, seeking a solution that could effectively identify and group similar items across different studies. Researchers at the ADA found Harmony, a data harmonisation tool powered by natural language processing (NLP), and recognised its potential to streamline this process. https://harmonydata.ac.uk/ada/
Australian Data Archive (ADA)

