Genome Differential Compressor

Compression and decompression speedThe compression speed of GDC varies depending on data, but for the mentioned data sets is from 95 to 200MB/s at a computer equipped with 6-core Intel i7 4930K 3.4GHz processor. The decompression speed is dependent on the disk speed and is up to 1000 MB/s on the mentioned system.GDC

GDC—What is it?

Genome Differential Compressor is a utility designed for compression of genome collections from the same species. The amount of such collections can be huge, e.g., a few (or tens) of gigabytes, so a need for a robust data compression tool is clear. Universal compression programs like gzip or bzip2 might be used for this purpose, but it is obvious that a specialized tool can work much better, since a universal compressor does not use the properties of such data sets, e.g., long approximate repetitions at long distances.

GDC 2

The architecture of GDC 2

GDC 2 is designed as a C++ application. The key features of the software are:

  • compression of collections of genomes in FASTA format,
  • decompression of the whole collection,
  • decompression of only a single genome without decompressing the complete collection,

How good is GDC?

Compression factor

In terms of compression factor (the ability to reduce the file size), GDC is usually much better than universal compressors and other specialized genome compressors. Its compression factor for some test data sets are:

Compression and decompression speed

The compression speed of GDC varies depending on data, but for the mentioned data sets is from 95 to 200MB/s at a computer equipped with 6-core Intel i7 4930K 3.4GHz processor. The decompression speed is dependent on the disk speed and is up to 1000 MB/s on the mentioned system.

 

Raony Guimaraes