Thursday, 29 September 2011

How to compress chemical mol2/sdf files?


There are a lot of benchmarks of compression utilities for text, binary and other data. But I was curious which compression method is best for big chemical databases stored in mol/sdf/mol2 files. Those are plain text files, but they have a quite specific and well-defined structure. Below are the test results.


All tests were conducted on Debian GNU/Linux (Linux debian 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux) with 2xIntel(R) (quad core) Xeon(R) CPU E5620 @ 2.40GHz.
The following versions of software were used:
RAR 4.00 beta 3   Copyright (c) 1993-2010 Alexander Roshal   17 Dec 2010
lzma 4.32.0beta3 Copyright (C) 2006 Ville Koskinen
ARJ32 v 3.10, Copyright (c) 1998-2004, ARJ Software Russia. [08 Mar 2011]
Zip 3.0 (July 5th 2008)
7-Zip 9.20, p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,16 CPUs)
bzip2, a block-sorting file compressor.  Version 1.0.5, 10-Dec-2007
gzip 1.4
All were run with default settings:
rar a zinc.rar zinc.mol2
lzma zinc.mol2 
arj a zinc.arj zinc.mol2
zip zinc.zip zinc.mol2
7z a zinc.7z zinc.mol2
bzip2 zinc.mol2
gzip zinc.mol2
The execution time was measured this way:
/usr/bin/time -o lzma.log lzma zinc.mol2 1> /dev/null 2> /dev/null
Two chemical databases were used for the tests:
  • ChemBridge express pick SDF: 1340727 kb
  • ZINC fragments usual MOL2: 97658 kb
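The measurement procedure above could be wrapped in a small helper like the one below. This is only a sketch, not the exact script I used: the function name bench is made up, it uses the bash time keyword instead of /usr/bin/time, and it assumes the tool accepts -c to write to standard output (true for gzip, bzip2 and lzma; rar, arj, zip and 7z need their own command lines).

```shell
#!/bin/bash
# Hypothetical helper: compress INPUT with TOOL (which must support -c),
# then print "tool<TAB>elapsed seconds<TAB>compressed size in kB".
bench() {
    local tool=$1 input=$2
    local out elapsed size_kb
    out=$(mktemp)
    # TIMEFORMAT='%R' makes the bash `time` keyword print only the real
    # (wall-clock) time; it is written to stderr, which we capture here.
    elapsed=$( TIMEFORMAT='%R'; { time "$tool" -c "$input" > "$out" 2>/dev/null ; } 2>&1 )
    # Round the byte count up to whole kB.
    size_kb=$(( ($(wc -c < "$out") + 1023) / 1024 ))
    rm -f "$out"
    printf '%s\t%s\t%s\n' "$tool" "$elapsed" "$size_kb"
}

# Example use (assumes zinc.mol2 is in the current directory):
#   for t in gzip bzip2 lzma; do bench "$t" zinc.mol2; done
```

The input file is left untouched because the compressed data goes to a temporary file, so the same file can be reused for every tool.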


[Charts: compression ratio, compression speed, and speed to ratio (I don't know if this makes sense ;) for chembridge express pick (SDF) and zinc frags (MOL2)]

Raw data (size in kb):

chembridge express pick (SDF)
72663   bz2
74456   lzma
89441   7z
108303  rar
140892  arj
143695  gz
143695  zip
1340727 sdf
zinc fragments usual (mol2):
11923   bz2
12673   7z
16346   rar
21560   mol2.gz
21560   zip
21821   arj
97658   mol2
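
To turn the raw sizes above into ratios, something like the following sketch can be used. It assumes the compression ratio is defined as uncompressed size divided by compressed size (other definitions exist), and the function name ratio is only illustrative.

```shell
#!/bin/bash
# Assumed definition: ratio = uncompressed size / compressed size.
ratio() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.2f\n", orig / comp }'
}

# Sizes in kB, copied from the ChemBridge raw data above.
orig=1340727
while read -r size fmt; do
    printf '%-5s %s\n' "$fmt" "$(ratio "$orig" "$size")"
done <<'EOF'
72663 bz2
74456 lzma
89441 7z
108303 rar
140892 arj
143695 gz
143695 zip
EOF
```

With this definition a higher number means better compression; for the ZINC file, replace orig with 97658 and the list with the mol2 sizes.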


The winner in the compression ratio competition is lzma; however, its compression speed is disappointing. Second place goes to bz2, with better compression speed. I was surprised that 7z had a slightly worse compression ratio than bz2, as the former is said to be the best among modern compression utilities. In the speed-to-ratio comparison the winner is gzip.
So, my recommendations (formed for myself, but feel free to apply them too :) are:
  • If you need good compression with reasonable speed, use bzip2
  • If you need fast compression with reasonable compression ratio, use gzip

Monday, 26 September 2011

How to count the number of molecules in a chemical file?

How to count the number of molecules in a chemical file (mol2, sdf, smi or mol) in bash?


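#!/bin/bash
echo "Count number of molecules in a file"

# Print usage and stop when no file name is given.
[[ -n "$1" ]] || { echo; echo "Usage:";
echo "$0 chemical_database.ext";
exit 1 ; }

# The file type is taken from the extension (text after the last dot).
ext="${1##*.}"

echo -n "Number of molecules in $1 file: "

case $ext in
"sdf" )
        # every SDF record ends with a line containing only $$$$
        grep -c '^\$\$\$\$' "$1" ;;
"smi" )
        # one SMILES string per line
        wc -l < "$1" ;;
"mol2" )
        # every mol2 record starts with this tag
        grep -c '@<TRIPOS>MOLECULE' "$1" ;;
"mol" )
        grep -c "M  END" "$1" ;;
* )
        echo "Extension not supported :(" ;;
esac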

EDIT 2016.10: there is a much easier way to do it:
obabel "filename.sdf" -onul --count 2>&1 | grep "molecules converted" | cut -d" " -f 1

tar.7z to store permissions

As we can conclude from some benchmarks available on the internet (e.g. here) or from my internal tests, the 7z compressor wins among others, in compression ratio as well as (de)compression time. One of the drawbacks of 7z is that it doesn't store Linux permissions.
So the simplest solution for this problem is to:
  • use tar to create an uncompressed archive (with proper permissions etc. archived), then
  • use 7z to compress the tar file
So, to do that, just enter the following commands:
tar -cf out.tar /path/to/archive
7z a out.tar.7z out.tar
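
The intermediate out.tar file can be avoided by piping tar straight into 7z, and the archive can be unpacked the same way in reverse. Below is a minimal sketch, assuming the 7z binary from p7zip with its -si (read from stdin) and -so (write to stdout) switches; the function names pack_tar7z and unpack_tar7z are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# pack_tar7z <directory> <archive.tar.7z>
# tar writes the archive (with permissions) to stdout, 7z reads it from stdin.
pack_tar7z() {
    tar -cf - "$1" | 7z a -si "$2" > /dev/null
}

# unpack_tar7z <archive.tar.7z> <destination-dir>
# 7z writes the stored tar stream to stdout, tar restores files with -p
# so the archived permissions are kept.
unpack_tar7z() {
    mkdir -p "$2" && 7z x -so "$1" | tar -xpf - -C "$2"
}
```

For example, pack_tar7z /path/to/archive out.tar.7z followed by unpack_tar7z out.tar.7z restored should give back the files with their original modes.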