Transformer-based deep learning for next generation mass spectrometry-based phosphoproteomics

Cancer is associated with DNA alterations that cause uncontrolled cell growth. Identifying alterations in individual tumor genomes, and changes in their functionally relevant (phospho)protein complement, increases our understanding of how such changes drive disease. For the analysis of (phospho)proteomes, tandem mass spectrometry after protein digestion is the method of choice. A crucial component of the data analysis is the spectral library, which can be built from a large collection of previous experiments or predicted by a model trained on a large set of experimental data.

We aim to mine a vast set of phosphoproteomics data at the OncoProteomics Laboratory, Amsterdam UMC, to build prediction tools for spectral library creation. In that way, no precious instrument time has to be spent on project-specific spectral library generation, and novel peptides that have not been experimentally observed can still be detected.

We have converted the data into an AI-ready format to train three deep neural networks for the prediction of mass spectrometry signal (MS), retention time (RT), and collisional cross sections (CCS) from the amino acid sequences of peptides. The deep learning approach has demonstrated superior performance over traditional machine learning methods for this purpose. We have assessed the transformer architecture for RT prediction, where it demonstrates state-of-the-art performance.
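
As an illustration of what an AI-ready representation of a peptide might look like, the sketch below turns a (phospho)peptide sequence into a fixed-length list of integer tokens, the kind of input a sequence model such as a transformer typically consumes. It is a minimal sketch under stated assumptions and not the encoding used in aiproteomics: the `[ph]` modification notation, the vocabulary, the padding value, and the names `tokenize` and `encode` are all hypothetical.

```python
# Hypothetical vocabulary: the 20 standard amino acids plus phosphorylated
# serine, threonine and tyrosine. Index 0 is reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PHOSPHO_RESIDUES = ["S[ph]", "T[ph]", "Y[ph]"]
PAD_ID = 0
VOCAB = {
    token: index + 1
    for index, token in enumerate(list(AMINO_ACIDS) + PHOSPHO_RESIDUES)
}


def tokenize(peptide):
    """Split a peptide string into residue tokens.

    A residue directly followed by "[ph]" (an assumed notation) becomes one
    phosphorylated token, e.g. "AS[ph]K" -> ["A", "S[ph]", "K"].
    """
    tokens = []
    position = 0
    while position < len(peptide):
        residue = peptide[position]
        if peptide.startswith("[ph]", position + 1):
            token = residue + "[ph]"
            position += len(token)
        else:
            token = residue
            position += 1
        if token not in VOCAB:
            raise ValueError(f"Unsupported residue token: {token!r}")
        tokens.append(token)
    return tokens


def encode(peptide, max_len=30):
    """Map a peptide to integer ids, right-padded with PAD_ID to max_len."""
    ids = [VOCAB[token] for token in tokenize(peptide)]
    if len(ids) > max_len:
        raise ValueError(f"Peptide longer than max_len={max_len}")
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    print(encode("AS[ph]K", max_len=5))
```

Treating each modified residue as its own token lets a model learn a separate embedding for the phosphorylated and unmodified forms. Other designs, such as a separate modification channel, are equally plausible.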

We have developed a Python software package called aiproteomics. More detail is available at https://github.com/aiproteomics/aiproteomics.