TY - JOUR
T1 - S-leaping: an efficient downsampling method for large high-throughput sequencing data
AU - Kuwahara, Hiroyuki
AU - Gao, Xin
N1 - KAUST Repository Item: Exported on 2023-07-13
Acknowledged KAUST grant number(s): FCC/1/1976-44-01, FCC/1/1976-45-01, REI/1/4940-01-01, REI/1/5202-01-01, RGC/3/4816-01-01, URF/1/4663-01-01
Acknowledgements: This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Research Administration (ORA) under Award Nos FCC/1/1976-44-01, FCC/1/1976-45-01, URF/1/4663-01-01, REI/1/5202-01-01, REI/1/4940-01-01, and RGC/3/4816-01-01.
PY - 2023/6/24
Y1 - 2023/6/24
N2 - Motivation: Sequencing coverage is among the key determinants considered in the design of omics studies. To help estimate cost-effective sequencing coverage for a specific downstream analysis, downsampling, a technique for sampling subsets of reads of a specified size, is routinely used. However, as sequencing datasets grow ever larger, downsampling becomes computationally challenging.
Results: Here, we developed an approximate downsampling method called s-leaping that was designed to process large data efficiently and accurately. We compared the performance of s-leaping with state-of-the-art downsampling methods in a range of practical omics-study downsampling settings and found s-leaping to be up to 39% faster than the second-fastest method, with accuracy comparable to that of the exact downsampling methods. To apply s-leaping to FASTQ data, we developed a lightweight tool in C called fadso. Using whole genome sequencing data with 208 million reads, we compared fadso’s performance with that of a commonly used FASTQ tool with the same downsampling feature and found fadso to be up to 12% faster with 21% lower memory usage, suggesting that fadso has up to 40% higher throughput in a parallel computing setting.
AB - Motivation: Sequencing coverage is among the key determinants considered in the design of omics studies. To help estimate cost-effective sequencing coverage for a specific downstream analysis, downsampling, a technique for sampling subsets of reads of a specified size, is routinely used. However, as sequencing datasets grow ever larger, downsampling becomes computationally challenging.
Results: Here, we developed an approximate downsampling method called s-leaping that was designed to process large data efficiently and accurately. We compared the performance of s-leaping with state-of-the-art downsampling methods in a range of practical omics-study downsampling settings and found s-leaping to be up to 39% faster than the second-fastest method, with accuracy comparable to that of the exact downsampling methods. To apply s-leaping to FASTQ data, we developed a lightweight tool in C called fadso. Using whole genome sequencing data with 208 million reads, we compared fadso’s performance with that of a commonly used FASTQ tool with the same downsampling feature and found fadso to be up to 12% faster with 21% lower memory usage, suggesting that fadso has up to 40% higher throughput in a parallel computing setting.
UR - http://hdl.handle.net/10754/692923
UR - https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad399/7206878
U2 - 10.1093/bioinformatics/btad399
DO - 10.1093/bioinformatics/btad399
M3 - Article
C2 - 37354496
SN - 1367-4803
JO - Bioinformatics
JF - Bioinformatics
ER -