Massive Biological Data Clustering, Reporting and Visualization Tools

Biological data are being generated at an ever faster pace, a trend most visible in DNA sequence databases. In the CBS DNA barcoding project at the CBS-KNAW Fungal Biodiversity Centre in the Netherlands alone, researchers have generated around 200,000 sequences of two genes, ITS (internal transcribed spacer) and LSU (large subunit), in the last five years for the classification and identification of fungal species.

Before these sequences can be published, they have to be checked and validated. Researchers found, however, that sequence validation was the most severe bottleneck of the DNA barcoding project. Although currently available search routines are highly sophisticated, the sheer number of sequences makes the quick and correct identification of large groups of similar sequences practically impossible. The problem is even more evident for much larger databases such as GenBank, where approximately six million fungal DNA sequences are currently available for download. Clustering tools are needed that extract knowledge automatically and so enable the curation and validation of large-scale databases.

This project aims to build a validation tool for biological data that is efficient in both time and memory and that handles large-scale datasets with high accuracy. Drawing on the eScience Research Engineers’ expertise in Efficient Computing and Big Data Analytics, the project can deliver a complete software tool to end users and researchers that handles large-scale datasets on a normal desktop computer or on cloud-based infrastructure.

This tool will not only be used to validate and curate massive numbers of DNA sequences; it can also be applied in any field of data partitioning that relies on similarity functions, such as protein family detection or metagenomics.
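
To make the idea of partitioning data with a similarity function concrete, here is a minimal sketch in Python. It is not the project's algorithm: the function names, the toy similarity measure (the ratio from the standard library's difflib, standing in for a real alignment score) and the threshold value are all assumptions made for illustration. Items are grouped whenever a chain of pairwise similarities at or above the threshold connects them (single-linkage clustering with a union-find structure).

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; a hypothetical stand-in for an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_by_threshold(
    items: Sequence[str],
    sim: Callable[[str, str], float] = similarity,
    threshold: float = 0.8,  # assumed cut-off, chosen only for this example
) -> List[List[int]]:
    """Group item indices so that linked items have similarity >= threshold."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; merge the two groups when they are similar enough.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Collect the members of each group under its root.
    groups: dict = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


sequences = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
clusters = cluster_by_threshold(sequences)
print(clusters)
```

This sketch compares every pair of items, so its cost grows quadratically with the number of sequences. That is workable for a handful of strings but not for the hundreds of thousands or millions of sequences described above, which is why efficiency in time and memory is a central aim of the project.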

Participating organisations

Team

Contact person

Sonja Georgievska

eScience Research Engineer

Netherlands eScience Center

ORCID: 0000-0002-8094-4532

Related projects

FAIR is as FAIR does

Data quality in a distributed learning environment

Enhancing Protein-Drug Binding Prediction

3D-e-Chem

Chemical Analytics Platform

ODEX4all

Related software

Dive

FastMLC

Impact

Book section: 3

Conference papers: 4

Journal articles: 813

Thesis: 3

Other: 60

Output

Computer programs: 2

Journal articles: 2

Presentations: 2
