S-leaping: an efficient downsampling method for large high-throughput sequencing data

Hiroyuki Kuwahara, Xin Gao

Research output: Contribution to journal › Article › peer-review

Abstract

Motivation: Sequencing coverage is among the key determinants considered in the design of omics studies. To help estimate cost-effective sequencing coverage for a specific downstream analysis, downsampling, a technique that samples a subset of reads of a specified size, is routinely used. However, as sequencing data sets grow ever larger, downsampling becomes computationally challenging.

Results: Here, we developed an approximate downsampling method called s-leaping, designed to process large data efficiently and accurately. We compared the performance of s-leaping with state-of-the-art downsampling methods in a range of practical omics-study downsampling settings and found s-leaping to be up to 39% faster than the second-fastest method, with accuracy comparable to that of the exact downsampling methods. To apply s-leaping to FASTQ data, we developed a lightweight tool called fadso in C. Using whole genome sequencing data with 208 million reads, we compared fadso’s performance with that of a commonly used FASTQ tool with the same downsampling feature and found fadso to be up to 12% faster with 21% lower memory usage, suggesting that fadso could achieve up to 40% higher throughput in a parallel computing setting.
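To make the notion of downsampling concrete, the following is a minimal sketch of an exact, single-pass approach based on standard reservoir sampling (Algorithm R). It is not s-leaping and not the fadso implementation, neither of which is described in this abstract; the function names `read_fastq` and `reservoir_downsample` are hypothetical and chosen only for illustration. The sketch assumes well-formed FASTQ input with exactly four lines per record.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

# One FASTQ record: header, sequence, separator, quality (assumed 4 lines each).
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group an iterable of lines into 4-line FASTQ records.

    An incomplete trailing record (fewer than 4 lines) is silently dropped.
    """
    it = iter(lines)
    # zip over the same iterator four times consumes the lines in groups of four.
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def reservoir_downsample(records: Iterable[Record], k: int,
                         seed: Optional[int] = None) -> List[Record]:
    """Return k records chosen uniformly at random in one pass (Algorithm R).

    If the input holds fewer than k records, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Keep record i with probability k / (i + 1) by drawing a slot
            # uniformly from 0..i and replacing only if it falls inside.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

With a file on disk, usage might look like `reservoir_downsample(read_fastq(open("reads.fastq")), 1_000_000, seed=42)`, where the file name is a placeholder. This approach draws one random number per read beyond the first k, which illustrates why exact methods can become costly on inputs with hundreds of millions of reads.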
Original language: English (US)
Journal: Bioinformatics
State: Published - Jun 24 2023

ASJC Scopus subject areas

  • Biochemistry
  • Computational Theory and Mathematics
  • Computational Mathematics
  • Molecular Biology
  • Statistics and Probability
  • Computer Science Applications
