
Showing posts with the label linux. Show all posts

Thursday, July 10, 2014

How to split any chemical file into single files named after the compounds' titles

Here is a simple Python script for splitting (almost) any chemical multi-structure file (mol2, sdf, smi, ...) into single-molecule files, with the compounds' names as the new file names. It uses Pybel, the Open Babel Python interface.

Usage:
split_mol.py -i big.mol2

The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import getopt
from random import randint

import pybel  # Open Babel Python bindings


### Options

errMsg = "Invalid operation specified:\t%s\nAllowed options are:\n\t-i\tinput file\n"

options, arguments = getopt.getopt(sys.argv[1:], 'i:')
inputFile = ''

for opt in options:
    if opt[0] == '-i':
        inputFile = opt[1]

if not os.path.exists(inputFile):
    print(errMsg % "No such file.\n")
    sys.exit(-1)

### Output names and dirs

basename, ext = os.path.splitext(os.path.basename(inputFile))
ext = ext[1:]  # extension without the leading dot

outDir = basename + ".splitted"
i = 0

if not os.path.exists(outDir):
    os.makedirs(outDir)

### Splitting

for mol in pybel.readfile(ext, inputFile):
    if mol.title == "":
        # molecules without a title get a sequential name
        i += 1
        title = "%s_%09d" % (basename, i)
    else:
        title = mol.title

    print("%30s" % title, end="")

    outFile = os.path.join(outDir, "%s.%s" % (title, ext))

    if os.path.exists(outFile):  # the desired file already exists
        # attach a _dup suffix with a 3-digit random number
        outFile = os.path.join(outDir, "%s_dup%03d.%s" % (title, randint(0, 999), ext))
        print(" duplicated: written to file:", outFile)
    else:
        print(" ok")

    mol.write(ext, outFile)
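One caveat: mol.title can contain characters that are unsafe in file names (slashes, spaces, parentheses). A minimal sanitizing helper could be applied before building the output path; the function name sanitize_title is my own, not part of Pybel:

```python
import re

def sanitize_title(title, maxlen=100):
    """Replace characters that are unsafe in file names with underscores."""
    safe = re.sub(r'[^A-Za-z0-9._-]', '_', title)
    return safe[:maxlen] or "untitled"

print(sanitize_title("Caffeine (1,3,7-trimethylxanthine)"))
# -> Caffeine__1_3_7-trimethylxanthine_
```

Truncating to maxlen also guards against absurdly long titles that some databases put into the molecule name field.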

Tuesday, June 10, 2014

Notes on Open Babel compiling and local install...

Installing Open Babel locally is sometimes not as straightforward as it could be. Here are some notes on compiling it on a system without root permissions.

Compiling OB from sources


Fetching/updating the sources from the git repository. Note the optional patching of the EEM charges model with eem_fix.patch.

#!/bin/bash

# for the first time use:
#mkdir -p openbabel
# cd openbabel
#git clone git://github.com/openbabel/openbabel.git

cd openbabel
git pull
cd ..

# optional for patching EEM
# echo patching....
# patch openbabel/src/charges/eem.cpp < eem_fix.patch

if [ -d build ]; then
    echo "********* Please delete build dir"
    exit
fi

mkdir build
cd build
cmake ../openbabel -DPYTHON_BINDINGS=ON -DRUN_SWIG=ON \
    -DPYTHON_INCLUDE_DIR=/usr/include/python2.6 \
    -DPYTHON_LIBRARY=/usr/lib/python2.6/config/libpython2.6.so \
    -DCMAKE_INSTALL_PREFIX=~/bin/openbabel-install

make -j14

make test

After a successful compilation you should have a complete Open Babel tree under ~/bin/openbabel-install:

~/bin/openbabel-install/
   |-bin
   |-include
   |---inchi
   |---openbabel-2.0
   |-----openbabel
   |-------json
   |-------math
   |-------stereo
   |-lib
   |---cmake
   |-----openbabel2
   |---openbabel
   |-----2.3.90
   |---pkgconfig
   |---python2.6
   |-----site-packages
   |-share
   |---man
   |-----man1
   |---openbabel
   |-----2.3.90


Compiling align-it


Fetch the newest version of Silicos align-it and extract it to a directory of your choice (e.g. align-it-1.0.4). Then you can use the following script to set up the environment variables and compile the program:

#!/bin/bash

export BABEL_LIBDIR=/home/fstefaniak/bin/openbabel-install/lib/ 
export BABEL_INCLUDEDIR=/home/fstefaniak/bin/openbabel-install/include/openbabel-2.0/openbabel/
export BABEL_DATADIR=/home/fstefaniak/bin/openbabel-install/share/openbabel/2.3.90/

cd align-it-1.0.4

mkdir build
cd build
cmake ..
make


To be continued


Some more notes


Update 2017/12/01


While compiling an older version of Open Babel (2.3.1), I got the error:

Scanning dependencies of target openbabel
[  0%] Building CXX object src/CMakeFiles/openbabel.dir/alias.o
In file included from /home/fstefaniak/sources/openbabel/openbabel-2.3.1/include/openbabel/alias.h:19:0,
                 from /home/fstefaniak/sources/openbabel/openbabel-2.3.1/src/alias.cpp:17:
/home/fstefaniak/sources/openbabel/openbabel-2.3.1/include/openbabel/shared_ptr.h:27:14: error: ‘std::tr1’ has not been declared
   using std::tr1::shared_ptr;
              ^
src/CMakeFiles/openbabel.dir/build.make:62: recipe for target 'src/CMakeFiles/openbabel.dir/alias.o' failed
make[2]: *** [src/CMakeFiles/openbabel.dir/alias.o] Error 1
CMakeFiles/Makefile2:1232: recipe for target 'src/CMakeFiles/openbabel.dir/all' failed
make[1]: *** [src/CMakeFiles/openbabel.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2

Thanks to https://diane-be-unique.com/2016/10/19/openbabel-error-in-installing/, the solution is to add the following in openbabel-2.3.1/include/openbabel/shared_ptr.h:

#include <tr1/memory>

Friday, May 30, 2014

encfs under KDE

Mounting and unmounting encfs is easy on the command line, but I was looking for a way to do it in KDE during system startup/shutdown. Here are two scripts which use KDialog to ask for the password and display notices.


encfsmount.sh
#!/bin/bash

encrypted=~/.encr
unencrypted=~/myplace

encfs --extpass="kdialog --password 'encfs password'" $encrypted $unencrypted

if  mount|grep $unencrypted > /dev/null 2>&1; then
    kdialog --title "Mounted :)" --passivepopup "$unencrypted mounted successfully" 10
else
    kdialog --error "$unencrypted not mounted :("
fi


encfsumount.sh
#!/bin/bash

unencrypted=~/myplace

fusermount -u $unencrypted

if  mount|grep $unencrypted > /dev/null 2>&1; then
    kdialog --error "$unencrypted NOT unmounted :("
else
    kdialog --title "Unmounted :)" --passivepopup "$unencrypted unmounted successfully" 10

fi


Add those two scripts to autostart in KDE (or to your launcher/desktop icons if you prefer to mount/unmount it manually):


It will ask for the password when mounting:


When successful, the notification in the corner should appear:

Tuesday, December 17, 2013

Using sequential symmetric gpg encryption with different ciphers.

This method is suited to short messages (it stores intermediate data in shell variables and produces plain-text output), but it can easily be modified to encrypt larger files (using temporary files instead of variables).

(1) Encryption


First, define which ciphers you want to use, and in which order. For more information about the available ciphers, type:
gpg --version
and jump to the section "ciphers" or "symmetric":

Symmetric: IDEA, 3DES, CAST5, BLOWFISH, AES, AES192, AES256,
           TWOFISH, CAMELLIA128, CAMELLIA192, CAMELLIA256

Enter them in the config section of our "encrypt-multiple.sh" script:

#!/bin/bash

algos="TWOFISH AES256 CAMELLIA256 BLOWFISH CAST5" # list of ciphers to use

# -----------------------------------------------------#

# clearing variables
pass=""
pass2=""

# entering passwords
echo -n "Password: "
read -s pass
echo
echo -n "Re-enter password: "
read -s pass2
echo

# do the passwords match?
if [ "$pass" == "$pass2" ]; then
    echo "Passwords match. Encrypting."
    echo

input=`cat "$1"`

for algo in $algos
do
    ((i++))
    echo "*** ($i) $algo"
    input=`echo "$input" | gpg --no-tty --batch -a --symmetric --cipher-algo "$algo" --passphrase "$pass" -o-`
done

echo "$input" > "$1".asc.$i
echo "Encrypted message saved to $1.asc.$i"

# clearing passwords and inputs
input=""
pass=""
pass2=""

else
    echo "Passwords don't match"
fi


So now, if you want to encrypt a message in file.txt, just run:

encrypt-multiple.sh file.txt

After entering the passphrase (twice) you will get the encrypted file "file.txt.asc.n", where n is the number of ciphers used (n will be needed during decryption).


(2) Decryption


To decrypt the above message we just need to enter a valid password. We don't need the names and order of the ciphers used, as gpg detects them automagically. The number of passes, n, is "encoded" in the file extension.

#!/bin/bash
pass=""

# entering passwords
echo -n "Password: "
read -s pass
echo
input=`cat "$1"`

# the list of ciphers is not necessary, as gpg detects them;
# the number of passes is read from the file extension
passes="${1##*.}"
echo "Encrypted $passes times. Decrypting..."

for i in `seq 1 $passes`
do
    echo "*** $i"
    input=`echo "$input" | gpg --no-tty --batch -d --passphrase "$pass" -o-`
done

echo "Decrypted message:"
echo "---------------------------------------"
echo "$input"

# clearing passwords and inputs
input=""
pass=""
pass2=""
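As mentioned in the introduction, larger files are better chained through temporary files than shell variables. A minimal Python sketch of the same pass-chaining idea; note that a trivial reversible base64 transform stands in here for the actual gpg call, which is omitted:

```python
import base64
import os
import tempfile

def chain_passes(in_path, transform, n_passes):
    """Feed the output of each pass into the next one via temporary files."""
    src = in_path
    for _ in range(n_passes):
        fd, dst = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            out.write(transform(inp.read()))
        if src != in_path:
            os.remove(src)  # drop the intermediate file
        src = dst
    return src  # path to the final result

# demo: 3 encoding passes, then 3 decoding passes to recover the original
fd, start = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"secret message")
enc = chain_passes(start, base64.b64encode, 3)
dec = chain_passes(enc, base64.b64decode, 3)
print(open(dec, "rb").read())  # -> b'secret message'
```

In a real adaptation each transform call would be a subprocess invocation of gpg reading from src and writing to dst.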


(3) Output file sizes.


The output file size increases as more ciphers are used. Here is an example of file sizes (uncompressed and compressed with bzip2). The ciphers used are:
TWOFISH AES256 CAMELLIA256 BLOWFISH CAST5 TWOFISH AES256 CAMELLIA256 BLOWFISH CAST5.


More reading about ciphers and symmetric encryption: GPG Encryption Guide - Part 4 (Symmetric Encryption).

(4) Bonus


If you want to try decoding, here is 5-fold encrypted text (n=5). The password is chemoinformatics.

Friday, May 17, 2013

(simple) webcam image analysis

Here I present a simple analysis of webcam images and their variation over time.

Intro


I have a webcam connected to my Raspberry Pi server. Photos are taken every five minutes with fswebcam. Such a collection of photos can be used to create e.g. a time-lapse video (here is an example of a one-day video taken from my window). But you can also analyse the image composition and its variation over time.

Howto


Gathering data


When we have the photos taken during a day gathered in one directory, we can use a simple script to analyse the subsequent frames. ImageMagick is used to shrink each photo to 1x1 px and read its HSB and RGB values.

#!/bin/bash

OUT=webcam.dat

echo "#i        h       s       v       r       g       b" > $OUT

for image in *.jpg
do

((i++))

hsb=`convert "$image" -colorspace hsb  -resize 1x1  txt:- | sed 1d  | tr '()%,:' ' '`

h=`echo $hsb | awk '{print $8}'`
s=`echo $hsb | awk '{print $9}'`
v=`echo $hsb | awk '{print $10}'`

rgb=`convert "$image" -colorspace rgb  -resize 1x1  txt:- | sed 1d  | tr '()%,:' ' '`

r=`echo $rgb | awk '{print $3}'`
g=`echo $rgb | awk '{print $4}'`
b=`echo $rgb | awk '{print $5}'`

echo "$i        $h      $s      $v      $r      $g      $b" >> $OUT

done
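The HSB numbers that ImageMagick reports can be cross-checked in Python with the standard-library colorsys module (ImageMagick's "brightness" corresponds to the V of HSV; colorsys works on a 0-1 scale rather than percent). The sample colour below is just an illustration:

```python
import colorsys

# average colour of a frame, normalized to 0-1 (e.g. a dusk-ish orange)
r, g, b = 0.8, 0.4, 0.1

h, s, v = colorsys.rgb_to_hsv(r, g, b)
print("hue %.0f%%  saturation %.0f%%  brightness %.0f%%"
      % (h * 100, s * 100, v * 100))
```

This is handy as a sanity check that the awk fields in the script above are picking out the right columns.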

Data visualisation


Then we use a gnuplot script to get pretty plots:

set pointsize 2
# Line style for axes
set style line 80 lt rgb "#808080"
# Line style for grid
set style line 81 lt 0  # dashed
set style line 81 lt rgb "#808080"  # grey
set grid back linestyle 81
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror
set xlabel "frame number"
set terminal png nocrop enhanced font "Gill Sans,8" size 700,500

plik='webcam.dat'

set ylabel "HSB [%]"

set output "out/hsv-csplines.png"
plot plik using 1:2 smooth csplines title "Hue" lt rgb "red" lw 1,\
plik using 1:3 smooth csplines title "Saturation" lt rgb "green" lw 1,\
plik using 1:4 smooth csplines title "Brightness" lt rgb "blue" lw 1

set output "out/hsv-bezier.png"
plot plik using 1:2 smooth bezier title "Hue" lt rgb "red" lw 1,\
plik using 1:3 smooth bezier title "Saturation" lt rgb "green" lw 1,\
plik using 1:4 smooth bezier title "Brightness" lt rgb "blue" lw 1

set ylabel "RGB"

set output "out/rgb-csplines.png"
plot plik using 1:5 smooth csplines title "Red" lt rgb "red" lw 1,\
plik using 1:6 smooth csplines title "Green" lt rgb "green" lw 1,\
plik using 1:7 smooth csplines title "Blue" lt rgb "blue" lw 1

set output "out/rgb-bezier.png"
plot plik using 1:5 smooth bezier title "Red" lt rgb "red" lw 1,\
plik using 1:6 smooth bezier title "Green" lt rgb "green" lw 1,\
plik using 1:7 smooth bezier title "Blue" lt rgb "blue" lw 1


Output: plots


HSB, with different curve-smoothing methods

 

RGB, with different curve-smoothing methods

 

Live monitoring


To get the plots almost live, we can use rrdtool to gather, store and plot those data. First, create rrd database:


#!/bin/bash
rrdtool create hsb-rgb.rrd \
--step 300 \
DS:h:GAUGE:600:0:100 \
DS:s:GAUGE:600:0:100 \
DS:j:GAUGE:600:0:100 \
DS:r:GAUGE:600:0:255 \
DS:g:GAUGE:600:0:255 \
DS:b:GAUGE:600:0:255 \
RRA:AVERAGE:0.5:1:288                   \
RRA:AVERAGE:0.5:6:336                   \
RRA:AVERAGE:0.5:24:732                  \
RRA:AVERAGE:0.5:144:14600       \
RRA:MIN:0.5:1:288                       \
RRA:MIN:0.5:6:336                       \
RRA:MIN:0.5:24:732                      \
RRA:MIN:0.5:144:14600   \
RRA:MAX:0.5:1:288                       \
RRA:MAX:0.5:6:336                       \
RRA:MAX:0.5:24:732                      \
RRA:MAX:0.5:144:14600



Then add the following lines to your webcam script (launched from cron every 5 minutes); $OUT_FILE is a variable pointing to the photo just taken.

hsb=`convert "$OUT_FILE" -colorspace hsb  -resize 1x1  txt:- | sed 1d  | tr '()%,:' ' '`

h=`echo $hsb | awk '{print $8}'`
s=`echo $hsb | awk '{print $9}'`
v=`echo $hsb | awk '{print $10}'`

rgb=`convert "$OUT_FILE" -colorspace rgb  -resize 1x1  txt:- | sed 1d  | tr '()%,:' ' '`

r=`echo $rgb | awk '{print $3}'`
g=`echo $rgb | awk '{print $4}'`
b=`echo $rgb | awk '{print $5}'`

# update rrd database:
rrdtool update hsb-rgb.rrd "N:$h:$s:$v:$r:$g:$b"


Finally, after some data has been collected, we can plot a nice graph (actually, we can re-plot it every five minutes, as new data arrives):

#!/bin/bash

BAZA=hsb-rgb.rrd

rrdtool graph hsb-24h.png  \
        --imgformat PNG \
        --start "now-24h"\
        --title="HSB [%]"       \
        --height 240    \
        --width 420    \
        --lazy \
        --alt-autoscale \
        --alt-y-grid \
        DEF:h=$BAZA:h:AVERAGE \
        DEF:s=$BAZA:s:AVERAGE \
        DEF:j=$BAZA:j:AVERAGE \
        LINE2:h#FF0000:"Hue"\
        LINE2:s#00FF00:"Saturation"\
        LINE2:j#0000FF:"Brightness"
(Result: 24-hour HSB and RGB graphs, HSB-24h.png and RGB-24h.png.)


Conclusions

I don't know if this makes any sense, but it looks cool :)

Saturday, April 20, 2013

Bash script to generate knock sequences

Here is a simple bash script to generate random knock/knockd sequences with the desired length, port range and protocol types:

#!/bin/bash

# ---- config.
knocks_number=5
knocks_types=('udp' 'tcp');

port_min=1000
port_max=2000

# ---- config end

for i in `seq 1 $knocks_number`
do
    port=`shuf -i $port_min-$port_max -n 1`
    proto=${knocks_types[ $(( RANDOM % ${#knocks_types[@]} )) ]}
    sequence="$sequence$port:$proto,"
done

echo
echo "# A sequence for knockd.conf:"
echo "sequence = $sequence" | sed 's/,$//g'
echo "# A sequence to use with knock"
echo "knock \$ADDRESS $sequence" | sed 's/,/ /g'



The output may look like:

# A sequence for knockd.conf:
sequence = 1460:udp,1344:tcp,1997:tcp,1712:udp,1304:udp
# A sequence to use with knock
knock $ADDRESS 1460:udp 1344:tcp 1997:tcp 1712:udp 1304:udp
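The same generator can be sketched in a few lines of Python, using only the standard-library random module; the config values mirror the bash script above:

```python
import random

knocks_number = 5
knock_types = ("udp", "tcp")
port_min, port_max = 1000, 2000

# one random port:protocol pair per knock
sequence = ["%d:%s" % (random.randint(port_min, port_max),
                       random.choice(knock_types))
            for _ in range(knocks_number)]

print("# A sequence for knockd.conf:")
print("sequence = " + ",".join(sequence))
print("# A sequence to use with knock")
print("knock $ADDRESS " + " ".join(sequence))
```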

Monday, April 15, 2013

Control your computer with jabber/google talk and PHP

(I doubt this will be very useful, but maybe you will find it interesting.)

Intro

When your Linux box is hidden behind NAT/firewall with no ssh/www/... access, you can use the Jabber protocol to control your computer (or for other purposes).

(hint: when you have access to an SSH server, you can use SSH tunneling to connect to your machine)


Prerequisites

You will need:

  • two Jabber accounts (Gmail accounts work): one used only on the remote computer, the second can be your regular Gmail account. Add each account to the other's contact list.
  • php-cli installed on the Linux box you want to connect to
  • XMPPHP (download and unpack) with the bug fix described here applied to XMLStream.php.
  • optionally: a dedicated account on the server for this purpose (e.g. adduser jabber). 

Script

Here is a simple script which connects to your Jabber account (probably Gmail/Google Talk), listens for your commands, executes them and returns the output to you:


To run this script in the background and periodically check that it is running (and that the client is connected to the Jabber server), you will also need a wrapper:


To execute above wrapper periodically, just add a line to your crontab:

*/5 * * * * /home/jabber/bin/jabber/wrapper.sh 1> /dev/null 2> /dev/null

Here is a screenshot from a session with my Raspberry Pi server:



Friday, April 5, 2013

Simple SSH tunnel with auto resuming

Intro 

There are a lot of tutorials about SSH tunnels, and a number about keeping/resuming/checking SSH tunnels. For me, all those solutions had a big drawback: either they didn't work when the system was rebooted, or they required root privileges (or both, or they simply didn't work for me).

Here is a simple description of how to set up such a tunnel with fairly smart resuming.

Idea


Nothing new here. We want to be able to SSH from compA to compB, but compB is behind a firewall/NAT, so it is not possible in the normal way (i.e. ssh compB, the red arrow). Thus we have to set up a reverse SSH tunnel. Still nothing new:

On compB:
ssh -R 19999:localhost:22 username@compA

Then on compA we can ssh to compB this way:
ssh localhost -p 19999 -v -l username 
(where 19999 is an arbitrary port; any other high port number can be used)

But what if the SSH connection from compB to compA is lost? We will not be able to connect from compA to compB. Thus, we have to make sure that this connection (B to A) is resumed after a lost connection or a system reboot.

How to

On compB:
  1. Enable ssh login from compB to compA without a password (there are a lot of howtos: google it)
  2. Create auto connection script
    #!/bin/bash
    
    REMOTEUSER=username
    REMOTEHOST=compA   #compA IP
    
    SSH_REMOTEPORT=22
    SSH_LOCALPORT=19999
    
    COMMAND="$SSH_LOCALPORT:localhost:$SSH_REMOTEPORT $REMOTEUSER@$REMOTEHOST"
    
    a=`ps -fe | grep "$COMMAND" | grep -v grep`
    if [ ! "$a" ]; then
        echo "No connection"
        ssh -o TCPKeepAlive=yes -fN -R $COMMAND
        date >> tunnel.log
    else
        echo "Connected"
    fi
  3. Add it to crontab, eg:
    */5 * * * * ssh_check.sh 1> /dev/null 2> /dev/null
That's it. Now you should be able to connect from compA to compB. Even when compB is restarted or the connection is lost, the tunnel will be resumed as soon as the cron job is triggered.



Friday, September 28, 2012

Open Babel with progress bar

I've tried to monitor the progress of an Open Babel conversion from sdf (or mol/mol2, etc.) to smi. First I used bar for this; it worked fine, but slightly differently than I expected:
obabel in.sdf -osmi | bar -of out.smi
Output was:
44.9KB at   11.2KB/s  elapsed:   0:00:04
only the conversion rate and the time elapsed... Then I found the pv program, which can be configured to show a regular progress bar with some additional information (ETA, conversion rate, etc.):
obabel in.sdf -osmi | pv -lperat -s `count_mol.sh in.sdf` > out.smi
(this command uses the count_mol.sh script I mentioned in one of my previous posts). The output from this command looks similar to:
0:00:01 [ 267/s] [ 267/s] [====>                   ]  9% ETA 0:00:09
and it does everything I need (i.e. progress and ETA).

Friday, November 4, 2011

OpenBabel compiling error

From time to time, during Open Babel compilation, a strange error occurs:
cc1plus: warning: command line option '-Wstrict-prototypes'
 is valid for Ada/C/ObjC but not for C++ [enabled by default]
In file included from /root/OPENBABEL/openbabel/scripts/python/
openbabel-python.cpp:3358:0:
/root/OPENBABEL/openbabel/scripts/python/../../
include/openbabel/math/align.h:26:22: fatal error: Eigen/Core: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
make[2]: *** [scripts/CMakeFiles/_openbabel] Error 1
make[1]: *** [scripts/CMakeFiles/_openbabel.dir/all] Error 2
make: *** [all] Error 2
And one may assume that something is wrong in the include/openbabel/math/align.h file. To be precise, the path to the Eigen/Core headers is wrong. To fix this problem, open include/openbabel/math/align.h in the source directory and change the line:
#include <Eigen/Core>
to
#include <your_path_to_Eigen_Core>
eg:
#include </root/OPENBABEL/eigen/eigen-eigen-2.0.16/Eigen/Core>
good luck!

Thursday, September 29, 2011

How to compress chemical mol2/sdf files?

Introduction

There are a lot of benchmarks of compression utilities on text/binary/other data. But I was curious which compression method is best for big chemical databases stored in mol/sdf/mol2 files. These are plain-text files, but with a quite specific and well-defined structure. Below are the test results.

Methods

All tests were conducted on Debian GNU/Linux (Linux debian 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux) with 2x Intel(R) (quad core) Xeon(R) CPU E5620 @ 2.40GHz.
The following versions of software were used:
RAR 4.00 beta 3   Copyright (c) 1993-2010 Alexander Roshal   17 Dec 2010
lzma 4.32.0beta3 Copyright (C) 2006 Ville Koskinen
ARJ32 v 3.10, Copyright (c) 1998-2004, ARJ Software Russia. [08 Mar 2011]
Zip 3.0 (July 5th 2008)
7-Zip 9.20, p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,16 CPUs)
bzip2, a block-sorting file compressor.  Version 1.0.5, 10-Dec-2007
gzip 1.4
All were run with default settings:
rar a zinc.rar zinc.mol2
lzma zinc.mol2 
arj a zinc.arj zinc.mol2
zip zinc.zip zinc.mol2
7z a zinc.7z zinc.mol2
bzip2 zinc.mol2
gzip zinc.mol2
The execution time was measured this way:
/usr/bin/time -o lzma.log lzma zinc.mol2 1> /dev/null 2> /dev/null
Two chemical databases were used for the tests:
  • ChemBridge express pick SDF: 1340727 kb
  • ZINC fragments usual MOL2: 97658 kb

Results

Compression ratio:
Compression speed:
Speed to ratio (I don't know if this makes sense ;)
            chembridge express pick (SDF)    zinc frags (MOL2)
method      kb/s        ratio                kb/s       ratio
bz2         2608.42     5.42%                4650.38    12.21%
lzma        447.95      5.55%                434.04     11.64%
7z          2343.93     6.67%                864.23     12.98%
rar         5002.71     8.08%                2959.33    16.74%
arj         11970.78    10.51%               6103.63    22.34%
gz          26288.76    10.72%               9765.80    22.08%
zip         23115.98    10.72%               9765.80    22.08%
What is Compression ratio?

Raw data (size in kb):

chembridge express pick (SDF)
72663   bz2
74456   lzma
89441   7z
108303  rar
140892  arj
143695  gz
143695  zip
1340727 sdf
zinc fragments usual (mol2):
11923   bz2
12673   7z
16346   rar
21560   mol2.gz
21560   zip
21821   arj
97658   mol2
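The percentages in the table above are simply compressed size divided by original size; a quick check against the raw numbers:

```python
def ratio(compressed_kb, original_kb):
    """Compression ratio as a percentage of the original size."""
    return round(100.0 * compressed_kb / original_kb, 2)

print(ratio(72663, 1340727))   # bz2 on the SDF file  -> 5.42
print(ratio(11923, 97658))     # bz2 on the MOL2 file -> 12.21
```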

Conclusion

In the compression ratio competition the winners are bzip2 and lzma (bzip2 was slightly better on the SDF file, lzma on the MOL2 file), but lzma's compression speed is disappointing, while bzip2 compresses much faster. I was surprised that 7z had a slightly worse compression ratio than bz2, as the former is said to be the best among modern compression utilities. In the speed-to-ratio comparison the winner is gzip.
So, my recommendations (formed for myself, but feel free to apply them too :) are:
  • If you need good compression with reasonable speed, use bzip2
  • If you need fast compression with reasonable compression ratio, use gzip

Monday, September 26, 2011

How to count the number of molecules in a chemical file?

How to count the number of molecules in a chemical file (mol2, sdf, smi or mol) in bash?


#!/bin/bash

echo
echo "Count number of molecules in a file"
echo

[[ -n "$1" ]] || { echo; echo "Usage:"; 
echo "$0 chemical_database.ext";
exit 0 ; }

ext=${1##*.}

echo -n "Number of molecules in $1 file: "

case $ext in
"sdf" )
        # every record in an SDF ends with a $$$$ line
        grep -c '^\$\$\$\$' "$1"
        ;;
"smi" )
        # one SMILES per line
        wc -l < "$1"
        ;;
"mol2" )
        grep -c '@<TRIPOS>MOLECULE' "$1"
        ;;
"mol" )
        grep -c "M  END" "$1"
        ;;
* )
        echo "Extension not supported :("
        ;;
esac


EDIT 2016.10: there is a much easier way to do it:
obabel "filename.sdf" -onul --count 2>&1 | grep "molecules converted" | cut -d" " -f 1
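For completeness, the same counting logic can be sketched in Python; note that the standard mol2 record tag is @<TRIPOS>MOLECULE and every SDF record ends with a $$$$ line:

```python
def count_molecules(path, ext):
    """Count molecules by the record marker typical for each format."""
    markers = {
        "sdf": "$$$$",                # record terminator
        "mol2": "@<TRIPOS>MOLECULE",  # record header
        "mol": "M  END",
    }
    with open(path) as f:
        if ext == "smi":
            # one SMILES per (non-empty) line
            return sum(1 for line in f if line.strip())
        marker = markers[ext]
        return sum(1 for line in f if line.startswith(marker))

# usage: count_molecules("zinc.mol2", "mol2")
```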

tar.7z to store permissions

As we can conclude from some benchmarks available on the internet (e.g. here) or from my internal tests, the 7z compressor beats the others in compression ratio as well as in (de)compression time. One of the drawbacks of 7z is that it doesn't store Linux permissions.
So the simplest solution to this problem is to:
  • use tar to create an uncompressed archive (with permissions, ownership etc. preserved), then
  • use 7z to compress the tar file
So, to do that, just enter the following commands:
tar -cf out.tar /path/to/archive
7z a out.tar.7z out.tar
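The tar step can also be done from Python with the standard-library tarfile module, which records the Unix mode bits in the archive; the file names below are just an illustration:

```python
import os
import tarfile

# create a sample file with non-default permissions
with open("demo.sh", "w") as f:
    f.write("#!/bin/bash\necho hello\n")
os.chmod("demo.sh", 0o755)

# tar stores the mode; 7z compressing the tar keeps it intact
with tarfile.open("out.tar", "w") as tar:
    tar.add("demo.sh")

# read the archive back and check the stored permissions
with tarfile.open("out.tar") as tar:
    info = tar.getmember("demo.sh")
    print(oct(info.mode & 0o777))  # -> 0o755
```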