About 80% of global data are unstructured and therefore largely unsuitable for direct analysis. This includes the large volumes of free-text medical data that have been recorded daily in electronic health records (EHRs) for many years, e.g., patient diagnoses, medical histories, complications and adverse events, and even treatment plans. Many researchers currently transcribe these data into their research files by hand, a time-consuming and error-prone process. The enormous potential of the information in these text data is thus underutilized.
We can make free text data FAIR (Findable, Accessible, Interoperable, Reusable) by structuring the data and making them retrievable. Once structured, the data become suitable for many research applications, such as recognizing side effects, automatically coding diagnoses, and predicting diseases based on the presence of symptoms. The benefit is not limited to analysis: structured text data also make finding and including patients for both prospective and retrospective studies much more efficient. Reusing data that have already been collected also means that more effort can go into retrospective research. Together, this makes research at university medical centers (UMCs) more efficient, cheaper and more agile.
Commercial software exists to automate the structuring of free text. Unfortunately, this software is costly, non-transparent and difficult to connect to existing research infrastructure. However, there are promising developments in the fields of text mining and natural language processing (NLP) that are going to be a major step forward in making medical text data FAIR.
Therefore, in 2020, UMC Utrecht started developing an open-source text mining platform based on CogStack. The platform's features create the right conditions for Open Science and contribute to making free medical text data FAIR:
- It is completely open source;
- It is modularly configurable to suit a specific use case;
- It accommodates the needs of different user groups;
- It aligns with data exchange standards such as FHIR;
- It takes data engineering and processing challenges out of the hands of the researcher;
- It offers the possibility of pseudonymizing text data, a prerequisite for using and sharing (text) data in research.
In collaboration with Radboudumc, UMCG, Erasmus MC and Amsterdam UMC, we want to develop the platform further so that it can be deployed at scale at other healthcare institutions. It is also an important step in the joint development of NLP methods for Dutch medical text, the 'engine' behind structuring free text data. This collaboration underlines the importance of good text mining tools for all UMCs.