Massive Biological Data Clustering, Reporting and Visualization Tools

Sequence validation in the DNA barcoding project

Image: Morchella elata, a species of fungus in the family Morchellaceae native to Europe (by Peter G. Werner)

Biological data are being generated at an increasingly fast pace, most visibly in DNA sequence databases. In the DNA barcoding project at the CBS-KNAW Fungal Biodiversity Centre in the Netherlands alone, researchers have generated around 200,000 sequences of two genes, ITS (internal transcribed spacer) and LSU (large subunit), in the last five years for the classification and identification of fungal species.

Before these sequences can be published, they have to be checked and validated. However, researchers found sequence validation to be the most severe bottleneck of the DNA barcoding project. Despite the sophistication of currently available search routines, the massive number of sequences makes the quick and correct identification of large groups of similar sequences practically impossible. The problem is even more evident for much larger databases such as GenBank, where approximately six million fungal DNA sequences are currently available for download. Clustering tools for automatic knowledge extraction are needed to enable the curation and validation of large-scale databases.

This project aims to build a validation tool for biological data that is efficient in terms of time and memory and that can handle large-scale datasets with high accuracy. With the eScience Research Engineers' expertise in Efficient Computing and Big Data Analytics, a complete software tool can be delivered to end users and researchers that handles large-scale datasets on a normal desktop computer or on cloud-based infrastructure.

This tool will not only be used to validate and curate massive numbers of DNA sequences, but can also be used in any field of data partitioning that relies on similarity functions, such as protein family detection or metagenomics.
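
To make the idea of partitioning data with a similarity function concrete, the following is a minimal sketch only, not the project's actual algorithm. It assumes a k-mer Jaccard similarity and a simple threshold-based single-linkage grouping; the function names (`kmers`, `jaccard`, `cluster`), the value of `k` and the threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)


def cluster(seqs, threshold=0.5, similarity=jaccard):
    """Group sequence indices: any pair scoring >= threshold is linked."""
    parent = list(range(len(seqs)))  # union-find forest over indices

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Illustrative all-pairs comparison (quadratic in the number of sequences).
    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    print(cluster(toy))
```

Because `cluster` accepts any `similarity` callable, the same grouping could in principle be applied to protein sequences or other data. The all-pairs loop is only practical for small inputs; handling hundreds of thousands of sequences, as described above, would require a far more efficient strategy than this sketch.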

Participating organisations

CBS-KNAW Fungal Biodiversity Centre
Netherlands eScience Center
Life Sciences

Team

Vincent Robert
Principal investigator
CBS-KNAW, Westerdijk Fungal Biodiversity Institute
Sonja Georgievska
eScience Research Engineer
Netherlands eScience Center
Lars Ridder
eScience Coordinator
Netherlands eScience Center
Duong Vu
Co-Applicant
Westerdijk Fungal Biodiversity Institute
Arnold Kuzniar
eScience Research Engineer
Netherlands eScience Center

Related projects

FAIR is as FAIR does

Integrating data publishing principles in scientific workflows

Updated 23 months ago
Finished

Data quality in a distributed learning environment

Vast amounts of data to improve cancer treatment decisions

Updated 26 months ago
Finished

Enhancing Protein-Drug Binding Prediction

Combining molecular simulation and eScience technologies

Updated 2 months ago
Finished

3D-e-Chem

Efficient exploitation of the massive amount of modern-day life science data

Updated 22 months ago
Finished

Chemical Analytics Platform

Managing and exploiting growing data resources in chemical design

Updated 22 months ago
Finished

ODEX4all

Open discovery and exchange for all

Updated 22 months ago
Finished

Related software

Dive

Interactively explore millions of 2D and 3D data points in your browser, without the need to install anything.

Updated 31 months ago

FastMLC

Detect clusters in, for example, large collections of DNA or protein sequences, and visualize the results in a web browser.

Updated 31 months ago