BERTDom: Protein Domain Boundary Prediction Using BERT

Ahmad Haseeb; Maryam Bashir; Aamir Wali

doi:10.31577/cai_2023_3_667

Authors

Ahmad Haseeb FAST School of Computing, National University of Computer and Emerging Sciences, Lahore, Pakistan
Maryam Bashir FAST School of Computing, National University of Computer and Emerging Sciences, Lahore, Pakistan
Aamir Wali FAST School of Computing, National University of Computer and Emerging Sciences, Lahore, Pakistan

Keywords:

Protein, protein domain boundary, BERT, BiLSTM

Abstract

The domains of a protein provide insight into the functions that the protein can perform. Delineating the domains of a protein using high-throughput experimental methods is difficult and time-consuming. Template-free, sequence-based computational methods that rely mainly on machine learning techniques can be used instead. However, computational methods suffer from low accuracy and are limited in their ability to predict different types of multi-domain proteins. Biological language modeling and deep learning techniques can be useful in such situations. In this study, we propose BERTDom for segmenting protein sequences. BERTDom uses BERT for feature representation and a stacked bi-directional long short-term memory (BiLSTM) network for classification. We pre-train BERT from scratch on a corpus of protein sequences obtained from the UniProt knowledge base with reference clusters. For comparison, we also used two other deep learning architectures: LSTM and feed-forward neural networks. We also experimented with the protein-to-vector (Pro2Vec) feature representation, which uses word2vec to encode protein bio-words. For testing, three other benchmark datasets were used. The experimental results on the benchmark datasets show that BERTDom achieves the best F-score compared with other template-based and template-free protein domain boundary prediction methods. Employing deep learning architectures can significantly improve domain boundary prediction. Furthermore, BERT, which is used extensively in NLP for feature representation, has shown promising results when used for encoding bio-words. The code is available at https://github.com/maryam988/BERTDom-Code.
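To make the notion of "bio-words" and per-residue boundary labels concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that bio-words are overlapping k-mers of the amino-acid sequence and that residues within a small window of an annotated boundary are labeled positive. The function names, the toy sequence, the value k = 3, and the window size are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import List


def to_bio_words(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers ("bio-words").

    Assumption: the k-mer definition and k = 3 are illustrative only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def boundary_labels(length: int, boundaries: List[int], window: int = 2) -> List[int]:
    """Label each residue 1 if it lies within `window` positions of a
    domain boundary (0-based index), else 0.

    Assumption: the windowed labeling scheme is illustrative only.
    """
    labels = [0] * length
    for b in boundaries:
        start = max(0, b - window)
        stop = min(length, b + window + 1)
        for i in range(start, stop):
            labels[i] = 1
    return labels


# Hypothetical toy sequence with a single boundary at residue index 5.
sequence = "MKTAYIAKQR"
words = to_bio_words(sequence, k=3)
labels = boundary_labels(len(sequence), boundaries=[5], window=1)
print(words)
print(labels)
```

In a pipeline of the kind described above, token sequences like `words` would be encoded by a feature extractor such as BERT or word2vec, and label sequences like `labels` would serve as targets for the sequence classifier.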
