TY - GEN
T1 - NeuralTE
T2 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2024
AU - Hu, Kang
AU - Xu, Minghua
AU - Gao, Xin
AU - Wang, Jianxin
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s).
PY - 2024/12/16
Y1 - 2024/12/16
N2 - Transposable Elements (TEs), which make up a significant portion of the genomes in most eukaryotic organisms, can be classified into various superfamilies based on their sequence and structural characteristics. Accurate TE classification at the superfamily level can reveal their distribution and abundance across various genomes, providing deeper insights into species variation and evolution. Recent advancements in third-generation sequencing technologies have made a large number of genomes from non-model species available. However, existing TE classification methods suffer from several limitations, including the necessity to train multiple hierarchical classification models, the incapacity to perform classification at the superfamily level, and deficiencies in both accuracy and robustness. Therefore, there is an urgent need for an accurate TE classification method to improve genome annotation. In this study, we develop NeuralTE, a deep learning method designed to classify TEs at the superfamily level. To achieve accurate TE classification, we identify various structural features of TEs and use different combinations of k-mers for terminal repeats and internal sequences to uncover distinct patterns. Evaluation on all TEs from Repbase shows that NeuralTE outperforms existing machine learning and homology-based methods in classifying TEs. Testing on TEs from novel species highlights the superior performance of NeuralTE compared to existing methods. We also conduct TE annotation experiments on rice using different classification tools, and the results show that NeuralTE achieves annotations nearly identical to the gold standard, highlighting its robustness and accuracy in classifying TEs. NeuralTE is publicly available at https://github.com/CSU-KangHu/NeuralTE.
KW - genome annotation
KW - multi-feature fusion
KW - superfamily level
KW - Transposable Element
UR - http://www.scopus.com/inward/record.url?scp=85216418273&partnerID=8YFLogxK
U2 - 10.1145/3698587.3701346
DO - 10.1145/3698587.3701346
M3 - Conference contribution
AN - SCOPUS:85216418273
T3 - ACM-BCB 2024 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
BT - ACM-BCB 2024 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
Y2 - 22 November 2024 through 25 November 2024
ER -