Tools for Sorting BAM Files

Bioinformatics is a rapidly evolving field with a wide range of tools for processing and manipulating next-generation sequencing (NGS) data. The performance of NGS alignment tools continues to improve, to the point where sorting aligned data can now take longer than aligning it. SAMtools and sambamba are two commonly used tools for sorting binary alignment map (BAM) files.

 

SAMtools

SAMtools, originally developed as part of the 1000 Genomes Project, has long been the tool of choice for processing SAM, BAM, and more recently CRAM files. Its comprehensive set of utilities for manipulating alignments makes it an indispensable part of any bioinformatician's toolkit. However, SAMtools has only recently incorporated support for parallel processing, which is becoming increasingly important with the explosion of NGS data.

 

Sambamba

Sambamba is a powerful alternative to SAMtools, designed from the ground up to take full advantage of parallel processing. Sambamba not only mirrors most of the features of SAMtools but also introduces new features such as coverage analysis and powerful filtering. It is written in the D programming language, which is known for its C-like runtime performance and powerful parallel computing abstractions.

 

Comparative Analysis: SAMtools vs. Sambamba

Although both SAMtools and sambamba are designed for processing SAM/BAM files, key differences in their performance and functionality have become apparent, especially when it comes to sorting BAM files.

 

Single-threaded Performance

In single-threaded operations, sambamba outperforms SAMtools. In experiments using 1.8 GB BAM files, sambamba performs SAM to BAM conversion and sorting in 7m9.870s of real (wall-clock) time, while SAMtools takes 18m52.374s to complete the same task. The experiments also show that SAMtools consumes a significant amount of RAM in this process, suggesting that sambamba may be the tool of choice in memory-limited situations.
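As a rough illustration of the kind of sort being compared, the sketch below builds the command line for each tool without running it. The file names are hypothetical and this is not the exact benchmark command from the experiments; it assumes the documented thread options (`-@` for samtools, `-t` for sambamba) and the `-m` memory option, which samtools treats as a per-thread amount and sambamba as an approximate overall limit.

```python
from typing import List, Optional


def sort_command(tool: str, in_bam: str, out_bam: str,
                 threads: int = 1, mem: Optional[str] = None) -> List[str]:
    """Build an argument list for a coordinate sort with samtools or sambamba."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if tool == "samtools":
        # samtools sort: -@ is the thread-count option
        cmd = ["samtools", "sort", "-@", str(threads)]
    elif tool == "sambamba":
        # sambamba sort: -t is the thread-count option
        cmd = ["sambamba", "sort", "-t", str(threads)]
    else:
        raise ValueError(f"unknown tool: {tool}")
    if mem is not None:
        # -m: per-thread memory for samtools, approximate total limit for sambamba
        cmd += ["-m", mem]
    # Both tools take the output path via -o and the input BAM as a positional argument
    cmd += ["-o", out_bam, in_bam]
    return cmd


# Hypothetical file names, single-threaded as in the comparison above
single_samtools = sort_command("samtools", "input.bam", "sorted.bam")
single_sambamba = sort_command("sambamba", "input.bam", "sorted.bam")
```

The resulting lists can be passed to `subprocess.run` on a machine where the tools are installed.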

 

Multi-threaded Performance

Results differ when running with multiple threads. For example, in a SAM to BAM conversion and sorting test using 16 threads, SAMtools completed the task in 6m24.779s of real time, while sambamba took a little longer, at 7m9.870s. However, when overall efficiency and RAM usage are taken into account, sambamba still shows promising results.
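To compare the quoted timings directly, it helps to convert them to seconds. The following is a small sketch that parses strings in the `real` format printed by the shell `time` command and computes a ratio; the helper names are illustrative, and the input values are the ones quoted in this article.

```python
import re


def parse_time(text: str) -> float:
    """Convert a string such as '18m52.374s' or '37.755s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def speedup(slower: str, faster: str) -> float:
    """Return how many times faster the second timing is than the first."""
    return parse_time(slower) / parse_time(faster)


# Single-threaded: SAMtools vs. sambamba
single = speedup("18m52.374s", "7m9.870s")
# 16 threads: sambamba vs. SAMtools
multi = speedup("7m9.870s", "6m24.779s")
```

With these figures, sambamba is roughly 2.6 times faster in the single-threaded test, while SAMtools is about 1.1 times faster in the 16-thread test.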

 

Indexing Speed

A key aspect of processing BAM files is indexing, and sambamba demonstrates impressive speed in this regard. In a direct comparison, SAMtools took 37.755 seconds of real time to index BAM files, while sambamba took less than half that time, 15.180 seconds (about 40% of the SAMtools time).
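For reference, indexing with either tool is a single command. The sketch below, with a hypothetical file name, builds that command and runs it only if the tool is found on the PATH; it assumes the `index` subcommand of each tool with its usual thread option and is not the exact invocation used in the comparison.

```python
import shutil
import subprocess
from typing import List


def index_command(tool: str, bam: str, threads: int = 1) -> List[str]:
    """Build an argument list for indexing a coordinate-sorted BAM file."""
    if tool == "samtools":
        return ["samtools", "index", "-@", str(threads), bam]
    if tool == "sambamba":
        return ["sambamba", "index", "-t", str(threads), bam]
    raise ValueError(f"unknown tool: {tool}")


def run_if_available(cmd: List[str]) -> bool:
    """Run the command if its executable is installed; return False otherwise."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Hypothetical file name; nothing is executed until run_if_available is called
sambamba_index = index_command("sambamba", "sorted.bam", threads=4)
```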

 

Scaling with the Number of Cores

A noteworthy advantage of sambamba is how well it scales with the number of cores. As more cores are made available, sambamba can spread the computation across them efficiently, significantly reducing processing time. For example, by replacing certain Picard and SAMtools commands with sambamba, the bioinformatics processing time for a human cancer exome SNV calling pipeline was reduced from 2 hours to 30 minutes.
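One way to check scaling on your own data is to time the same command at several thread counts. The following is a minimal harness along those lines; the function names are illustrative, and `standin_command` is a placeholder that would be replaced by a function returning a real sort or index command for the given thread count.

```python
import subprocess
import sys
import time
from typing import Callable, Dict, Iterable, Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def scaling_profile(make_cmd: Callable[[int], Sequence[str]],
                    thread_counts: Iterable[int]) -> Dict[int, float]:
    """Time the command produced by make_cmd(n) for each thread count n."""
    return {n: time_command(make_cmd(n)) for n in thread_counts}


def speedups(profile: Dict[int, float]) -> Dict[int, float]:
    """Express each timing relative to the run with the fewest threads."""
    baseline = profile[min(profile)]
    return {n: baseline / seconds for n, seconds in profile.items()}


def standin_command(threads: int) -> Sequence[str]:
    """Placeholder that ignores the thread count and does no real work."""
    return [sys.executable, "-c", "pass"]
```

A single run per thread count is noisy, so repeating each measurement and accounting for disk caching would give more reliable numbers.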

 

Which Tool is Faster for Sorting BAM Files?

The speed of SAMtools and sambamba varies depending on the specific task, the number of threads used, and the type of BAM file. In single-threaded operations, sambamba usually performs faster, and even in multi-threaded operations, it shows promising results, especially when memory is the limiting factor.

 

However, the "best" tool for the job depends on the situation and the requirements of the task at hand. In some cases SAMtools may outperform sambamba, especially in multi-threaded operations. Conversely, for tasks such as indexing, sambamba is often the faster choice.

 

Reference:

  1. Tarasov A, Vilella AJ, Cuppen E, et al. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015 Jun 15;31(12):2032-4.