Abstract
The rise of single-cell genomics is an attractive opportunity for data-hungry machine learning algorithms. The scBERT method, inspired by the success of BERT (‘bidirectional encoder representations from transformers’) in natural language processing, was recently introduced by Yang et al. as a data-driven tool to annotate cell types in single-cell genomics data. Analogous to contextual embedding in BERT, scBERT leverages pretraining and self-attention mechanisms to learn the ‘transcriptional grammar’ of cells. Here we investigate the reusability beyond the original datasets, assessing the generalizability of natural language techniques in single-cell genomics. The degree of imbalance in the cell-type distribution substantially influences the performance of scBERT. Anticipating an increased utilization of transformers, we highlight the necessity to consider data distribution carefully and introduce a subsampling technique to mitigate the influence of an imbalanced distribution. Our analysis serves as a stepping stone towards understanding and optimizing the use of transformers in single-cell genomics.
Original language | English (US) |
---|---|
Pages (from-to) | 1437-1446 |
Number of pages | 10 |
Journal | Nature Machine Intelligence |
Volume | 5 |
Issue number | 12 |
DOIs | |
State | Published - Dec 2023 |
ASJC Scopus subject areas
- Software
- Human-Computer Interaction
- Computer Vision and Pattern Recognition
- Computer Networks and Communications
- Artificial Intelligence