Introduction
There are plenty of benchmarks of compression utilities for text, binary and other data. But I was curious which compression method works best for big chemical databases stored in mol/sdf/mol2 files. Those are plain text files, but they have a quite specific and well-defined structure. The test results are below.
Methods
All tests were conducted on Debian GNU/Linux (Linux debian 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux) with 2x Intel(R) (quad core) Xeon(R) CPU E5620 @ 2.40GHz.
The following software versions were used:
RAR 4.00 beta 3 Copyright (c) 1993-2010 Alexander Roshal 17 Dec 2010
lzma 4.32.0beta3 Copyright (C) 2006 Ville Koskinen
ARJ32 v 3.10, Copyright (c) 1998-2004, ARJ Software Russia. [08 Mar 2011]
Zip 3.0 (July 5th 2008)
7-Zip 9.20, p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,16 CPUs)
bzip2, a block-sorting file compressor. Version 1.0.5, 10-Dec-2007
gzip 1.4
All were run with default settings:
rar a zinc.rar zinc.mol2
lzma zinc.mol2
arj a zinc.arj zinc.mol2
zip zinc.zip zinc.mol2
7z a zinc.7z zinc.mol2
bzip2 zinc.mol2
gzip zinc.mol2
The execution time was measured this way:
/usr/bin/time -o lzma.log lzma zinc.mol2 1> /dev/null 2> /dev/null
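To get from such a measurement to a kb/s figure, the elapsed time has to be divided into the input size. Here is a minimal sketch of that step, not the exact commands used for these tests. It assumes GNU time is installed at /usr/bin/time (its `-f '%e'` format prints wall-clock seconds) and that speed means input size divided by elapsed time; the function names `elapsed_seconds` and `kb_per_s` are made up for illustration.

```shell
# elapsed_seconds LOG CMD [ARGS...]
# Run CMD under GNU time and print the wall-clock seconds it took.
# Assumption: /usr/bin/time is GNU time (-f and -o are GNU options).
elapsed_seconds() {
    log=$1
    shift
    /usr/bin/time -f '%e' -o "$log" "$@" > /dev/null 2> /dev/null
    cat "$log"
}

# kb_per_s SIZE_KB SECONDS
# Print the throughput in kb/s with two decimals (assumed definition:
# uncompressed input size divided by elapsed time).
kb_per_s() {
    LC_ALL=C awk -v size="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) { print "n/a"; exit }
        printf "%.2f\n", size / secs
    }'
}
```

For example, `kb_per_s 97658 "$(elapsed_seconds bzip2.log bzip2 -k zinc.mol2)"` would time bzip2 on the ZINC file (`-k` keeps the input) and print its speed. With a fixed time, `kb_per_s 97658 21` prints 4650.38.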
Two chemical databases were used for the tests:
- ChemBridge express pick SDF: 1340727 kb
- ZINC fragments usual MOL2: 97658 kb
Results
[Chart: compression ratio]
[Chart: compression speed]
[Chart: speed to ratio (I don't know if this makes sense ;)]
method | ChemBridge SDF kb/s | ChemBridge SDF ratio | ZINC MOL2 kb/s | ZINC MOL2 ratio |
bz2 | 2608.42 | 5.42% | 4650.38 | 12.21% |
lzma | 447.95 | 5.55% | 434.04 | 11.64% |
7z | 2343.93 | 6.67% | 864.23 | 12.98% |
rar | 5002.71 | 8.08% | 2959.33 | 16.74% |
arj | 11970.78 | 10.51% | 6103.63 | 22.34% |
gz | 26288.76 | 10.72% | 9765.80 | 22.08% |
zip | 23115.98 | 10.72% | 9765.80 | 22.08% |
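The post does not say how "speed to ratio" was computed, so the following is only a guess at one plausible definition: kb/s divided by the ratio in percent, where a higher score means more speed per point of compressed size. The function name `speed_to_ratio` and the input layout (one "method kb/s ratio" line per tool, decimal points, no percent sign) are assumptions for this sketch.

```shell
# speed_to_ratio
# Read "method kb/s ratio" lines on stdin and print "method score" lines,
# best score first. Assumed definition: score = kb/s / ratio (in percent).
speed_to_ratio() {
    LC_ALL=C awk '$3 > 0 { printf "%s %.2f\n", $1, $2 / $3 }' |
        LC_ALL=C sort -k2,2nr
}
```

Fed with the ChemBridge columns of the table, gz comes out on top and lzma last under this definition, which agrees with the conclusion that gzip wins the speed/compression comparison.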
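What is compression ratio? Here it is the compressed size as a percentage of the original size, so a lower value means better compression.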
Raw data (size in kb):
chembridge express pick (SDF):
72663 bz2
74456 lzma
89441 7z
108303 rar
140892 arj
143695 gz
143695 zip
1340727 sdf
zinc fragments usual (mol2):
11923 bz2
12673 7z
16346 rar
21560 gz
21560 zip
21821 arj
97658 mol2
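The ratio column can be reproduced from these raw sizes. The sketch below assumes that ratio means compressed size divided by original size, expressed in percent, which is what the numbers suggest; `ratio_pct` is a hypothetical helper name.

```shell
# ratio_pct COMPRESSED_KB ORIGINAL_KB
# Print the compressed size as a percentage of the original size.
ratio_pct() {
    LC_ALL=C awk -v c="$1" -v o="$2" 'BEGIN {
        if (o <= 0) { print "n/a"; exit }
        printf "%.2f%%\n", 100 * c / o
    }'
}
```

For instance, `ratio_pct 72663 1340727` prints 5.42% (bz2 on ChemBridge) and `ratio_pct 21560 97658` prints 22.08% (gz on ZINC).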
Conclusion
The winner of the compression ratio competition is lzma on the ZINC set (11.64% vs. 12.21% for bz2), although on ChemBridge bz2 was slightly ahead (5.42% vs. 5.55%); either way, lzma's compression speed is disappointing. Second place goes to bz2, which compresses much faster. I was surprised that 7z had a slightly worse compression ratio than bz2, as 7z is said to be the best among modern compression utilities. In the speed/compression comparison the winner is gzip.
So, my recommendations (formed for myself, but feel free to apply them too :) are:
- If you need good compression with reasonable speed, use bzip2
- If you need fast compression with reasonable compression ratio, use gzip