NPLinker
Microbial natural products data mining by integrating genomics and metabolomics data
A community-supported workflow connecting microbial genes and organisms to their molecular products
Omics datasets have become a key resource for natural products discovery, enabling the systematic exploration of specialized metabolites, the refinement of knowledge on known natural products, and the identification of novel bioactive compounds or metabolic enzymes. Paired omics analyses combine complementary genomics (e.g., biosynthetic gene clusters [BGCs]) and metabolomics (e.g., mass spectra) datasets to elucidate gene-metabolite relationships, accelerating the discovery process. However, omics data structures, preprocessing pipelines, resources, and annotation tools are continuously being improved. For example, newer releases of MIBiG contain more validated BGCs and new annotation fields, and mass spectral libraries keep growing. In addition, newer versions of omics clustering tools produce different output file formats. Together with the constant expansion of available experimental datasets, this puts a strain on downstream frameworks that integrate the data and results.
Hence, the goal of this project was to build a community-supported framework that keeps pace with current developments and is easy to adapt to future changes, allowing researchers to perform combined genome-metabolome mining at a large scale, create effective visualizations linking unique chemistry to biosynthetic genes, and perform integrative evolutionary analyses to discover novel specialized metabolites and their biosynthetic machinery. The project was inspired by and built on the NPLinker framework introduced in 2021. During this project, the codebase was completely overhauled to produce a more flexible, modular, and extensible framework for linking BGCs to mass spectra. Through two workshops and monthly community meetings, the project created and fostered an integrative omics mining community in which several researchers have started developing and applying the NPLinker 2.0 framework. The workshop participants also learned relevant concepts and omics tools that are stepping stones for using NPLinker 2.0 effectively.
We made good progress during the project, but some goals changed over time. For example, the codebase overhaul took more time than anticipated, leaving no time to integrate evolutionary analyses and much less time to integrate novel scoring algorithms. Furthermore, during the project it became clear that two NPLinker modes were needed to increase its use throughout the community: the PoDP mode (making use of publicly available and registered paired omics datasets) and the local mode (making use of local data that is not published or openly available anywhere at the time of analysis). In the end, the project resulted in a solid base for paired omics mining based on correlative analysis, as well as a prototype web app to visualize NPLinker results. In particular, the well-tested NPLinker modules that handle the genomics and metabolomics data streams and combine them in one analysis framework can be reused, adapted, and built upon by the omics mining community and beyond.
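To make the idea of correlative linking concrete, the sketch below scores candidate links by how strongly a gene cluster family and a mass spectrum co-occur across strains. It is an illustration only and does not reproduce NPLinker's own scoring or API: the Jaccard index stands in as a simple co-occurrence measure, and the function names (`cooccurrence_score`, `rank_links`), identifiers, and toy strain data are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def cooccurrence_score(gcf_strains: Iterable[str], spectrum_strains: Iterable[str]) -> float:
    """Jaccard index of two strain sets (a stand-in score, not NPLinker's)."""
    gcf_set = set(gcf_strains)
    spectrum_set = set(spectrum_strains)
    union = gcf_set | spectrum_set
    if not union:
        # No strain carries either object, so there is no evidence for a link.
        return 0.0
    return len(gcf_set & spectrum_set) / len(union)


def rank_links(
    gcf_to_strains: Dict[str, Set[str]],
    spectrum_to_strains: Dict[str, Set[str]],
) -> List[Tuple[str, str, float]]:
    """Score every gene cluster family / spectrum pair, highest score first."""
    links = [
        (gcf_id, spectrum_id, cooccurrence_score(gcf_strains, spectrum_strains))
        for gcf_id, gcf_strains in gcf_to_strains.items()
        for spectrum_id, spectrum_strains in spectrum_to_strains.items()
    ]
    return sorted(links, key=lambda link: link[2], reverse=True)


# Hypothetical toy data: which strains contain each gene cluster family
# and in which strains each spectrum was observed.
gcf_to_strains = {
    "gcf_1": {"strain_A", "strain_B"},
    "gcf_2": {"strain_C"},
}
spectrum_to_strains = {
    "spec_1": {"strain_A", "strain_B"},
    "spec_2": {"strain_B", "strain_C"},
}

ranked = rank_links(gcf_to_strains, spectrum_to_strains)
for gcf_id, spectrum_id, score in ranked:
    print(f"{gcf_id} <-> {spectrum_id}: {score:.2f}")
```

In this toy example the pair that occurs in exactly the same strains ranks first. A real analysis would use a dedicated linking score and far larger strain collections, which is where the benchmarking of novel scores becomes relevant.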
The field of integrative omics mining has seen various tools for paired omics analysis that link the outputs of two (or more) omics disciplines to make new discoveries. We hope that this project contributes to this field by providing a solid base for further explorations, for example the development and benchmarking of novel linking scores. Anyone working with omics data in the life sciences is therefore a potential user of the results and can contribute to the codebase in several ways, e.g., by adding novel scoring algorithms, integrating additional annotation results, or further extending the web app. In addition, research software engineers working in the life sciences across academia and industry could make use of the renewed codebase, or parts of it. See https://nplinker.github.io/nplinker/latest/ for more information on the aspects mentioned above.
We anticipate that the entire codebase can be adopted by researchers who aim to link metabolite features to biosynthetic genes, and thus molecules to their producers. We also encourage the community to reuse the well-tested classes and modules that handle genomics and metabolomics data and to integrate them into their own pipelines. We plan to continue the NPLinker 2.0 development through various projects, including the MaGiC-MOLFUN Doctoral Training Network, whose members have been instrumental in contributing to the codebase and organizing the 2nd NPLinker eScience workshop. We also plan to apply NPLinker 2.0 in various natural products discovery projects to prioritize BGC-molecule and gene-metabolite feature links for further validation.
This is the NPLinker web application, developed with Plotly Dash. It enables interactive visualization of NPLinker predictions and is designed to be run locally or in a containerized environment.