Introduction
There are a lot of benchmarks of compression utilities for text, binary and other data. But I was curious which compression method is best for big chemical databases stored in mol/sdf/mol2 files. Those are plain text files, but they have a quite specific and well-defined structure. Below are the test results.
Methods
All tests were conducted on Debian GNU/Linux (Linux debian 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux) with 2 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (quad core).
The following software versions were used:
- RAR 4.00 beta 3 Copyright (c) 1993-2010 Alexander Roshal 17 Dec 2010
- lzma 4.32.0beta3 Copyright (C) 2006 Ville Koskinen
- ARJ32 v 3.10, Copyright (c) 1998-2004, ARJ Software Russia. [08 Mar 2011]
- Zip 3.0 (July 5th 2008)
- 7-Zip 9.20, p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,16 CPUs)
- bzip2, a block-sorting file compressor. Version 1.0.5, 10-Dec-2007
- gzip 1.4
All tools were run with default settings:
rar a zinc.rar zinc.mol2
lzma zinc.mol2
arj a zinc.arj zinc.mol2
zip zinc.zip zinc.mol2
7z a zinc.7z zinc.mol2
bzip2 zinc.mol2
gzip zinc.mol2
The execution time was measured this way:
/usr/bin/time -o lzma.log lzma zinc.mol2 1> /dev/null 2> /dev/null
Two chemical databases were used for the tests:
- ChemBridge express pick SDF: 1340727 kb
- ZINC fragments usual MOL2: 97658 kb
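To repeat this kind of measurement for several tools, the timing step can be wrapped in a small helper. The sketch below is only an illustration: the function name `bench` is hypothetical, and it swaps the /usr/bin/time call described above for the shell's own `date +%s`, which gives whole-second resolution but does not depend on GNU time being installed.

```shell
# bench LABEL COMMAND [ARGS...]
# Hypothetical helper, not the script used for the results in this post.
# Runs COMMAND with stdout and stderr discarded and prints "LABEL SECONDS",
# where SECONDS is the elapsed wall-clock time in whole seconds.
# Assumption: date +%s replaces /usr/bin/time to keep the sketch portable.
bench() {
    label=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo "$label $((end - start))"
}
```

A call such as `bench lzma lzma zinc.mol2` prints the label followed by the elapsed seconds. Note that gzip, bzip2 and lzma replace the input file by default, so each run needs a fresh copy of the database.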
Results
[Charts: compression ratio; compression speed; speed to ratio (I don't know if this last one makes sense ;)]
method | ChemBridge express pick (SDF), kb/s | ChemBridge express pick (SDF), ratio | ZINC fragments (MOL2), kb/s | ZINC fragments (MOL2), ratio |
---|---|---|---|---|
bz2 | 2608.42 | 5.42% | 4650.38 | 12.21% |
lzma | 447.95 | 5.55% | 434.04 | 11.64% |
7z | 2343.93 | 6.67% | 864.23 | 12.98% |
rar | 5002.71 | 8.08% | 2959.33 | 16.74% |
arj | 11970.78 | 10.51% | 6103.63 | 22.34% |
gz | 26288.76 | 10.72% | 9765.80 | 22.08% |
zip | 23115.98 | 10.72% | 9765.80 | 22.08% |
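For anyone who wants to recompute the table columns, here is a minimal sketch. The ratio values match compressed size divided by original size, using the raw sizes given in this post. The kb/s column appears to be the original size divided by the elapsed seconds, which is an assumption on my part. The function names `ratio` and `speed` are hypothetical, and the 10 s figure is inferred from the table (97658 kb at 9765.80 kb/s), not a measured time reported here.

```shell
# ratio COMPRESSED_KB ORIGINAL_KB
# Prints the compressed size as a percentage of the original size.
ratio() {
    LC_ALL=C awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", c / o * 100 }'
}

# speed ORIGINAL_KB SECONDS
# Prints throughput in kb/s (assumed definition: original size / elapsed time).
speed() {
    LC_ALL=C awk -v o="$1" -v t="$2" 'BEGIN { printf "%.2f\n", o / t }'
}

ratio 72663 1340727   # bz2 on ChemBridge, prints 5.42
speed 97658 10        # ZINC file with an inferred 10 s run, prints 9765.80
```

`LC_ALL=C` forces a decimal point in the output regardless of the system locale.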
Raw data (size in kb):
chembridge express pick (SDF):
- 72663 bz2
- 74456 lzma
- 89441 7z
- 108303 rar
- 140892 arj
- 143695 gz
- 143695 zip
- 1340727 sdf
zinc fragments usual (mol2):
- 11923 bz2
- 12673 7z
- 16346 rar
- 21560 gz
- 21560 zip
- 21821 arj
- 97658 mol2
Conclusion
The best compression ratios come from lzma and bz2: lzma wins on ZINC fragments (11.64% versus 12.21%), while bz2 is marginally ahead on ChemBridge (5.42% versus 5.55%). The compression speed of lzma is disappointing, however, and bz2 is several times faster. I was surprised that 7z had a slightly worse compression ratio than bz2, as 7z is said to be the best among modern compression utilities. In the speed-to-ratio comparison the winner is gzip.
So my recommendations (formed for myself, but feel free to apply them too :) are:
- If you need good compression with reasonable speed, use bzip2
- If you need fast compression with reasonable compression ratio, use gzip