TY - JOUR
T1 - DNA-Prot
T2 - Identification of DNA binding proteins from protein sequence information using random forest
AU - Kumar, K. Krishna
AU - Pugalenthi, Ganesan
AU - Suganthan, P. N.
N1 - Funding Information:
G. P. and P. N. S. acknowledge the financial support offered by A*STAR (Agency for Science, Technology and Research). K. K. K. acknowledges Mr. Rajeev Gangal, CEO, Insilico division, Systems Biology India Pvt Ltd, Maharashtra, India, for providing various resources to accomplish this research work, and Prof. Thomas Martinetz and Dr. Stefen Moller, Institute for Neuro- and Bioinformatics, University of Luebeck, Germany, for their support and critical reading of the manuscript.
PY - 2009/6
Y1 - 2009/6
AB - DNA-binding proteins (DNABPs) are important for various cellular processes, such as transcriptional regulation, recombination, replication, repair, and DNA modification. So far, various bioinformatics and machine learning techniques have been applied to identify DNA-binding proteins from protein structure, but only a few methods are available for identifying them from protein sequence. In this work, we report a random forest method, DNA-Prot, to identify DNA-binding proteins from protein sequence. Training was performed on a dataset containing 146 DNA-binding proteins and 250 non-DNA-binding proteins. The algorithm was tested on a dataset containing 92 DNA-binding proteins and 100 non-DNA-binding proteins. We obtained 80.31% accuracy from training and 84.37% accuracy from testing. Benchmarking analysis on an independent dataset of 823 DNA-binding proteins and 823 non-DNA-binding proteins shows that our approach can distinguish DNA-binding proteins from non-DNA-binding proteins with more than 80% accuracy. We also compared our method with the DNAbinder method on the test dataset and two independent datasets. Comparable performance was observed from both methods on the test dataset. On the benchmark dataset containing 823 DNA-binding proteins and 823 non-DNA-binding proteins, DNA-Prot performed significantly better, with 81.83% accuracy, whereas DNAbinder achieved only 61.42% accuracy using amino acid composition and 63.5% using PSSM profiles. Similarly, DNA-Prot achieved better performance on the benchmark dataset containing 88 DNA-binding proteins and 233 non-DNA-binding proteins. These results show that DNA-Prot can be used effectively to identify DNA-binding proteins from sequence information. The dataset and standalone version of the DNA-Prot software can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.
UR - http://www.scopus.com/inward/record.url?scp=67249090413&partnerID=8YFLogxK
U2 - 10.1080/07391102.2009.10507281
DO - 10.1080/07391102.2009.10507281
M3 - Article
C2 - 19385697
AN - SCOPUS:67249090413
SN - 0739-1102
VL - 26
SP - 679
EP - 686
JO - Journal of Biomolecular Structure and Dynamics
JF - Journal of Biomolecular Structure and Dynamics
IS - 6
ER -