Simple Chemoinformatics (and more): 2011

Friday, 4 November 2011

OpenBabel compiling error

From time to time, a strange error occurs during OpenBabel compilation:

cc1plus: warning: command line option '-Wstrict-prototypes' is valid for Ada/C/ObjC but not for C++ [enabled by default]
In file included from /root/OPENBABEL/openbabel/scripts/python/openbabel-python.cpp:3358:0:
/root/OPENBABEL/openbabel/scripts/python/../../include/openbabel/math/align.h:26:22: fatal error: Eigen/Core: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
make[2]: *** [scripts/CMakeFiles/_openbabel] Error 1
make[1]: *** [scripts/CMakeFiles/_openbabel.dir/all] Error 2
make: *** [all] Error 2

One might assume that something is wrong in the include/openbabel/math/align.h file. More precisely, the compiler cannot find the Eigen/Core header under the path given there. To work around this, open include/openbabel/math/align.h in the source directory and change the line:

#include <Eigen/Core>

to:

#include <your_path_to_Eigen_Core>

For example:

#include </root/OPENBABEL/eigen/eigen-eigen-2.0.16/Eigen/Core>
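An alternative that leaves the source file untouched is to tell the compiler where Eigen lives. GCC adds the directories listed in the CPLUS_INCLUDE_PATH environment variable to its search path for C++ includes. The sketch below is only an illustration: find_eigen_root is a hypothetical helper (not part of OpenBabel or Eigen) that looks under a given directory for Eigen/Core and prints the matching include root. Whether the Python bindings build honours the variable depends on how it invokes the compiler, so treat this as something to try, not a guaranteed fix.

```shell
#!/bin/bash
# Hypothetical helper: print the first directory under "$1" that contains Eigen/Core.
find_eigen_root() {
    local search_root core
    search_root=$1
    # -path matches against the whole path; head keeps only the first hit.
    core=$(find "$search_root" -type f -path '*/Eigen/Core' 2>/dev/null | head -n 1)
    [ -n "$core" ] || return 1
    # Strip the trailing /Eigen/Core to get the directory the compiler should search.
    echo "${core%/Eigen/Core}"
}

# Usage sketch (the search root is an assumption; use the place where Eigen was unpacked):
# eigen_root=$(find_eigen_root /root/OPENBABEL/eigen) &&
#     export CPLUS_INCLUDE_PATH="$eigen_root${CPLUS_INCLUDE_PATH:+:$CPLUS_INCLUDE_PATH}"
```

With the variable exported, #include <Eigen/Core> can stay as it is in align.h.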

good luck!

Thursday, 29 September 2011

How to compress chemical mol2/sdf files?

Introduction

There are a lot of benchmarks of compression utilities for text, binary and other data. But I was curious which compression method is best for big chemical databases stored in mol/sdf/mol2 files. Those are plain text files, but they have a quite specific and well-defined structure. Below are the test results.

Methods

All tests were conducted on Debian GNU/Linux (Linux debian 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux) with 2x Intel(R) Xeon(R) E5620 (quad-core) CPUs @ 2.40GHz.
The following software versions were used:

RAR 4.00 beta 3   Copyright (c) 1993-2010 Alexander Roshal   17 Dec 2010
lzma 4.32.0beta3 Copyright (C) 2006 Ville Koskinen
ARJ32 v 3.10, Copyright (c) 1998-2004, ARJ Software Russia. [08 Mar 2011]
Zip 3.0 (July 5th 2008)
7-Zip 9.20, p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,16 CPUs)
bzip2, a block-sorting file compressor.  Version 1.0.5, 10-Dec-2007
gzip 1.4

All were run with default settings:

rar a zinc.rar zinc.mol2
lzma zinc.mol2 
arj a zinc.arj zinc.mol2
zip zinc.zip zinc.mol2
7z a zinc.7z zinc.mol2
bzip2 zinc.mol2
gzip zinc.mol2

The execution time was measured this way:

/usr/bin/time -o lzma.log lzma zinc.mol2 1> /dev/null 2> /dev/null

Two chemical databases were used for the tests:

ChemBridge express pick SDF: 1340727 kb
ZINC fragments usual MOL2: 97658 kb

Results

[Chart: compression ratio]

[Chart: compression speed]

[Chart: speed to ratio (I don't know if this makes sense ;)]

	ChemBridge express pick (SDF)		ZINC fragments usual (MOL2)
method	speed (kb/s)	ratio	speed (kb/s)	ratio
bz2	2608.42	5.42%	4650.38	12.21%
lzma	447.95	5.55%	434.04	11.64%
7z	2343.93	6.67%	864.23	12.98%
rar	5002.71	8.08%	2959.33	16.74%
arj	11970.78	10.51%	6103.63	22.34%
gz	26288.76	10.72%	9765.80	22.08%
zip	23115.98	10.72%	9765.80	22.08%

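What is compression ratio? Here it is the compressed size divided by the original size, expressed as a percentage, so lower is better.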

Raw data (size in kb):

ChemBridge express pick (SDF):

72663   bz2
74456   lzma
89441   7z
108303  rar
140892  arj
143695  gz
143695  zip
1340727 sdf

ZINC fragments usual (MOL2):

11923   bz2
12673   7z
16346   rar
21560   gz
21560   zip
21821   arj
97658   mol2
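The ratio column of the table can be reproduced from these raw sizes by dividing the compressed size by the original size. A minimal sketch, in which ratio_percent is a hypothetical helper name and awk does the floating-point arithmetic:

```shell
#!/bin/bash
# Hypothetical helper: compressed size / original size, as a percentage with two decimals.
ratio_percent() {
    # $1 = compressed size, $2 = original size (same unit, e.g. kb)
    # LC_ALL=C keeps the decimal point independent of the system locale.
    LC_ALL=C awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", 100 * c / o }'
}

ratio_percent 72663 1340727   # bz2 on the ChemBridge SDF, prints 5.42
ratio_percent 11923 97658     # bz2 on the ZINC MOL2, prints 12.21
```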

Conclusion

In the compression ratio competition lzma and bz2 are very close: lzma won on the ZINC MOL2 set (11.64% vs 12.21%) and bz2 on the ChemBridge SDF set (5.42% vs 5.55%). However, the compression speed of lzma is disappointing, while bz2 is several times faster. I was surprised that 7z had a slightly worse compression ratio than bz2, as 7z is said to be the best among modern compression utilities. In the speed/compression comparison the winner is gzip.
So, my recommendations (formed for myself, but feel free to apply them too :) are:

If you need good compression with reasonable speed, use bzip2
If you need fast compression with reasonable compression ratio, use gzip

Monday, 26 September 2011

How to count the number of molecules in a chemical file?

How to count the number of molecules in a chemical file (mol2, sdf, smi or mol) in bash?

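#!/bin/bash
# Count the number of molecules in a chemical file (sdf, smi, mol2 or mol).

echo
echo "Count number of molecules in a file"
echo

# Require a file name as the first argument.
[[ -n "$1" ]] || { echo "Usage:"; echo "$0 chemical_database.ext"; exit 1; }

# Take the text after the last dot as the extension.
ext=${1##*.}

echo -n "Number of molecules in $1 file: "

case $ext in
"sdf" )
        # Every SDF record ends with a "$$$$" line.
        grep -c '^\$\$\$\$' "$1"
        ;;
"smi" )
        # One SMILES string per line.
        wc -l < "$1"
        ;;
"mol2" )
        # Every MOL2 record starts with the @<TRIPOS>MOLECULE tag.
        grep -c '@<TRIPOS>MOLECULE' "$1"
        ;;
"mol" )
        # A MOL block ends with "M  END".
        grep -c "M  END" "$1"
        ;;
* )
        echo "Extension not supported :("
        ;;
esac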

EDIT 2016.10: there is a much easier way to do it:

obabel "filename.sdf" -onul --count 2>&1 | grep "molecules converted" | cut -d" " -f 1
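If obabel is not at hand, grep-based counting can be checked on a tiny hand-made file before trusting it on a big database. The sketch below assumes the usual format-level delimiters: a "$$$$" line closing each SDF record and an @<TRIPOS>MOLECULE tag opening each MOL2 record. count_records is a hypothetical helper name, and the sample file is dummy text, not valid chemistry.

```shell
#!/bin/bash
# Hypothetical helper: count records by their format-level delimiters.
count_records() {
    # $1 = file, $2 = format (sdf or mol2)
    case $2 in
        sdf)  grep -c '^\$\$\$\$' "$1" ;;           # "$$$$" line closes each SDF record
        mol2) grep -c '^@<TRIPOS>MOLECULE' "$1" ;;  # tag opens each MOL2 record
        *)    echo "unsupported format: $2" >&2; return 1 ;;
    esac
}

# Build a two-record dummy file; only the delimiters matter here.
sample=$(mktemp)
printf 'mol A\n$$$$\nmol B\n$$$$\n' > "$sample"
count_records "$sample" sdf   # prints 2
rm -f "$sample"
```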

tar.7z to store permissions

As we can conclude from some benchmarks available on the internet, or from my own tests, the 7z compressor beats the others in compression ratio as well as in (de)compression time. One of the drawbacks of 7z is that it doesn't store Linux permissions.
So the simplest solution to this problem is to:

use tar to create an uncompressed archive (with proper permissions etc. preserved), then
use 7z to compress the tar file

To do that, just enter the following commands:

tar -cf out.tar /path/to/archive
7z a out.tar.7z out.tar
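The intermediate out.tar file can be avoided by piping tar straight into 7z, which reads standard input with -si and writes to standard output with -so (the p7zip manual page shows this pattern). A sketch, where pack_tar7z and unpack_tar7z are hypothetical wrapper names:

```shell
#!/bin/bash
# Hypothetical wrappers around the tar | 7z pipeline.
pack_tar7z() {
    # $1 = output archive (e.g. out.tar.7z), $2 = directory to archive
    tar -cf - "$2" | 7z a -si "$1" > /dev/null
}

unpack_tar7z() {
    # $1 = archive, $2 = existing directory to extract into
    # -p asks tar to restore the stored permissions on extraction.
    7z x -so "$1" | tar -xpf - -C "$2"
}

# Usage sketch:
# pack_tar7z out.tar.7z /path/to/archive
# mkdir restored && unpack_tar7z out.tar.7z restored
```

Permissions survive because they are recorded in the tar stream; 7z only compresses that stream.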
