Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers

Sumeer Ahmad Khan, Alberto Maillo, Vincenzo Lagani, Robert Lehmann, Narsis A. Kiani, David Gomez-Cabrero, Jesper Tegner*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The rise of single-cell genomics is an attractive opportunity for data-hungry machine learning algorithms. The scBERT method, inspired by the success of BERT (‘bidirectional encoder representations from transformers’) in natural language processing, was recently introduced by Yang et al. as a data-driven tool to annotate cell types in single-cell genomics data. Analogous to contextual embedding in BERT, scBERT leverages pretraining and self-attention mechanisms to learn the ‘transcriptional grammar’ of cells. Here we investigate the reusability of scBERT beyond the original datasets, assessing the generalizability of natural language techniques in single-cell genomics. We find that the degree of imbalance in the cell-type distribution substantially influences the performance of scBERT. Anticipating increased use of transformers, we highlight the necessity of considering data distribution carefully and introduce a subsampling technique to mitigate the influence of an imbalanced distribution. Our analysis serves as a stepping stone towards understanding and optimizing the use of transformers in single-cell genomics.
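The abstract does not specify how the subsampling is performed; the paper should be consulted for the exact procedure. As a minimal illustrative sketch only, one common way to reduce cell-type imbalance is to randomly subsample each annotated cell type down to a per-type cap (here defaulting to the size of the rarest type). The function name and parameters below are hypothetical, not taken from the paper:

```python
import random
from collections import defaultdict

def subsample_balanced(labels, cap=None, seed=0):
    """Return cell indices giving a more balanced cell-type distribution.

    Each cell type is randomly subsampled to at most `cap` cells
    (default: the size of the rarest type). Illustrative sketch only;
    not the procedure described in the paper.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_type[lab].append(idx)
    if cap is None:
        cap = min(len(v) for v in by_type.values())
    keep = []
    for idxs in by_type.values():
        keep.extend(rng.sample(idxs, min(cap, len(idxs))))
    return sorted(keep)

# Toy example: a heavily imbalanced annotation vector.
labels = ["T"] * 90 + ["B"] * 8 + ["NK"] * 2
idx = subsample_balanced(labels)
# After subsampling, each cell type contributes at most 2 cells.
```

The returned indices can then be used to slice the expression matrix before fine-tuning, so that no single abundant cell type dominates the training signal.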

Original language: English (US)
Pages (from-to): 1437-1446
Number of pages: 10
Journal: Nature Machine Intelligence
Volume: 5
Issue number: 12
DOIs
State: Published - Dec 2023

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
  • Artificial Intelligence

