Tuesday, 21 October 2014

How many PAINS are there in commercial libraries?

Introduction

PAINS have been a hot topic recently. Some people estimate that these compounds make up 5-12% of all commercial libraries [1]. Here I present the results of assessing the percentage of PAINS in various popular commercial libraries as well as in the ZINC all-now database.

The libraries were filtered with SMARTS patterns prepared by Rajarshi Guha [2] and provided with the filter-it software [3]. For comparison, I have also included the ZINC database (a filtered and curated collection of ligands from commercial libraries), ChEMBL (v. 19; small molecules from the scientific literature) and SureChEMBL (structures from patents). There is also a StructuralAlerts filter delivered by silicos-it and based on [4].


Results


How many PAINS are there?

To sum up, it's not so bad: the maximum percentage of PAINS is below 3% and the average is 1.65% (1.82% for typical libraries, i.e. averaged over the five vendor collections only).






database              FilterFamily A   FilterFamily B   FilterFamily C   Total PAINS (A+B+C)   StructuralAlerts
chembl_19             1.56%            0.62%            0.21%            2.39%                 47.03%
Enamine Advanced      0.30%            0.10%            0.08%            0.48%                 21.17%
Enamine HTS           0.65%            0.10%            0.13%            0.89%                 26.75%
LifeChemicals stock   1.70%            0.15%            0.11%            1.95%                 25.11%
Maybridge Screening   1.56%            0.76%            0.62%            2.94%                 48.62%
SureChEMBL            0.02%            0.00%            0.00%            0.02%                 0.65%
Zelinsky HTS          1.66%            0.86%            0.32%            2.83%                 47.36%
ZINC All_now          1.20%            0.31%            0.21%            1.73%                 29.53%
Average - %PAINS      1.08%            0.36%            0.21%            1.65%                 30.78%
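
The averages quoted in the summary can be re-derived from the Total PAINS column. The short sketch below does this with the values copied from the table; treating the five vendor collections as the "typical" libraries is my reading of the numbers, not something the data states explicitly.

```python
# Total PAINS (A+B+C) in percent, copied from the table above.
total_pains = {
    "chembl_19": 2.39,
    "Enamine Advanced": 0.48,
    "Enamine HTS": 0.89,
    "LifeChemicals stock": 1.95,
    "Maybridge Screening": 2.94,
    "SureChEMBL": 0.02,
    "Zelinsky HTS": 2.83,
    "ZINC All_now": 1.73,
}

# Assumption: "typical" libraries are the five vendor screening collections.
vendor_libraries = [
    "Enamine Advanced",
    "Enamine HTS",
    "LifeChemicals stock",
    "Maybridge Screening",
    "Zelinsky HTS",
]


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


mean_all = mean(total_pains.values())
mean_vendor = mean(total_pains[name] for name in vendor_libraries)

print("all databases:    %.2f%%" % mean_all)     # 1.65%
print("vendor libraries: %.2f%%" % mean_vendor)  # 1.82%
```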



What are those PAINS?


Here are the top 20 alerts (across all screened libraries) and their share of all alerts:


Here are the SMARTS of the top PAINS pollutants (do you recognize your hits here? ;)):





rule                     Count   Percent
azo_A(324)               63934   15.65
ene_rhod_A(235)          61727   15.11
anil_di_alk_D(198)       44103   10.79
anil_di_alk_C(246)       43936   10.75
imine_one_A(321)         28330    6.93
ene_five_het_G(10)       25333    6.20
anil_di_alk_B(251)       21285    5.21
ene_five_het_B(90)       14669    3.59
imine_one_isatin(189)    13342    3.27
ene_five_hetA1(201A)     13136    3.21
thio_ketone(43)           8192    2.00
anil_alk_ene(51)          7063    1.73
ene_one_hal(17)           4949    1.21
thiophene_amino_Aa(45)    4933    1.21
ene_five_het_C(85)        4762    1.17
ene_one_ene_A(57)         4447    1.09
imine_one_fives(89)       4323    1.06
amino_acridine_A(46)      4096    1.00
ene_five_het_D(46)        3870    0.95
keto_keto_beta_A(68)      3790    0.93
rhod_sat_A(33)            2207    0.54
ene_cyano_A(19)           2104    0.51
ene_five_one_A(55)        1699    0.42
het_thio_66_one(8)        1633    0.40
imidazole_A(19)           1395    0.34
diazox_sulfon_A(36)       1391    0.34
quinone_B(5)              1266    0.31
keto_phenone_A(11)        1262    0.31
acyl_het_A(9)             1245    0.30
thiaz_ene_D(8)            1219    0.30
keto_keto_gamma(5)        1089    0.27
anil_di_alk_F(14)          969    0.24
styrene_A(13)              967    0.24
imine_imine_A(9)           893    0.22
cyano_cyano_A(23)          766    0.19
keto_keto_beta_B(12)       644    0.16
het_6666_A(2)              572    0.14
steroid_A(2)               433    0.11
imine_one_sixes(27)        344    0.08
ene_five_het_E(44)         263    0.06
keto_phenone_B(1)          241    0.06
het_65_C(6)                216    0.05
styrene_B(8)               186    0.05
het_5_A(7)                 133    0.03
imine_one_fives_B(9)       128    0.03
het_thio_5_imine_A(1)      114    0.03
ene_misc_A(5)              108    0.03
het_pyridiniums_B(2)        91    0.02
cyano_cyano_B(3)            86    0.02
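
A tally like this is simple to produce once every alert raised is recorded together with the name of the rule that matched. The sketch below assumes such a list of rule names already exists (the `hits` list is made-up toy data, not the real screening output, and `alert_table` is a hypothetical helper) and computes each rule's count and its share of all alerts.

```python
from collections import Counter


def alert_table(hits):
    """Return (rule, count, percent_of_all_alerts) rows, most frequent first."""
    counts = Counter(hits)
    total = sum(counts.values())
    return [
        (rule, count, round(100.0 * count / total, 2))
        for rule, count in counts.most_common()
    ]


# Toy data for illustration only: one entry per alert raised.
hits = ["azo_A(324)"] * 5 + ["ene_rhod_A(235)"] * 3 + ["quinone_B(5)"] * 2

for rule, count, percent in alert_table(hits):
    print("%-20s %6d %6.2f" % (rule, count, percent))
```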



References


[1] http://cen.acs.org/articles/92/i35/Getting-Rid-Painful-Compounds.html, http://pipeline.corante.com/archives/2014/09/26/pains_go_mainstream.php
[2] http://blog.rguha.net/?p=850
[3] Unfortunately, no longer on the web
[4] Brenk et al. (2008) ChemMedChem 3, 435-444

Thursday, 10 July 2014

How to split any chemical file to single files with file names = compounds' titles

Here is a simple Python script for splitting (almost) any multi-structure chemical file (mol2, sdf, smi...) into single-structure files, using the compound names as the new file names. It uses pybel, the Open Babel Python interface.

Usage:
split_mol.py -i big.mol2

The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import getopt
import pybel
from random import randint


### Options

errMsg = "Invalid operation specified:\t%s\nAllowed options are:\n\t-i\tinput file\n"

options, arguments = getopt.getopt( sys.argv[1:], 'i:' )
inputFile = ''

for opt in options:
    if opt[0] == '-i':
        inputFile = opt[1]

if not os.path.exists(inputFile):
    print(errMsg % "No such file.\n")
    sys.exit(-1)

### output names and dirs

basename = os.path.splitext(os.path.basename(inputFile))[0]
ext = os.path.splitext(os.path.basename(inputFile))[1][1:]

outDir = basename + ".splitted"
i = 0

if not os.path.exists(outDir):
    os.makedirs(outDir)

### Splitting

for mol in pybel.readfile(ext, inputFile):
    if mol.title == "":
        # untitled molecule: build a name from the input file name and a counter
        i += 1
        title = "%s_%09d" % (basename, i)
    else:
        title = mol.title

    outFile = os.path.join(outDir, "%s.%s" % (title, ext))

    if os.path.exists(outFile):  # a file with this title already exists
        # attach a _dup suffix with a 3-digit random number
        outFile = os.path.join(outDir, "%s_dup%03d.%s" % (title, randint(0, 999), ext))
        status = "duplicated: written to file: " + outFile
    else:
        status = "ok"

    # a single-argument print() works in both Python 2 and Python 3
    print("%30s %s" % (title, status))

    mol.write(ext, outFile)
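
Compound titles are not always valid file names: they can contain path separators or spaces, and the same title can occur many times. One possible way to handle this, sketched below with hypothetical helper names (`safe_filename` and `unique_path` are not part of the script above or of pybel), is to replace problematic characters and to number duplicates sequentially instead of drawing a random suffix.

```python
import os
import re


def safe_filename(title):
    """Replace characters that are awkward in file names with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("._")
    return cleaned or "unnamed"


def unique_path(out_dir, title, ext, exists=os.path.exists):
    """Return a path in out_dir that does not exist yet.

    Duplicates get a sequential _dupNNN suffix, so no name is reused.
    The exists argument is only there to make the function easy to test.
    """
    stem = safe_filename(title)
    path = os.path.join(out_dir, "%s.%s" % (stem, ext))
    n = 0
    while exists(path):
        n += 1
        path = os.path.join(out_dir, "%s_dup%03d.%s" % (stem, n, ext))
    return path
```

In the splitting loop this would be used as `mol.write(ext, unique_path(outDir, title, ext))`.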

Tuesday, 10 June 2014

Notes on Open Babel compiling and local install...

Installing Open Babel locally is sometimes not as straightforward as it could be. Here are some notes on compiling it on a system without root permissions.

Compiling OB from sources


Fetching / updating the sources from the git repository. Note the (optional) patching of the EEM charge model with eem_fix.patch.

#!/bin/bash

# for the first time use:
#mkdir -p openbabel
# cd openbabel
#git clone git://github.com/openbabel/openbabel.git

cd openbabel
git pull
cd ..

# optional for patching EEM
# echo patching....
# patch openbabel/src/charges/eem.cpp < eem_fix.patch

if [ -d build ]; then
    echo "********* Please delete build dir"
    exit 1
fi

mkdir build
cd build
cmake ../openbabel -DPYTHON_BINDINGS=ON -DRUN_SWIG=ON \
    -DPYTHON_INCLUDE_DIR=/usr/include/python2.6 \
    -DPYTHON_LIBRARY=/usr/lib/python2.6/config/libpython2.6.so \
    -DCMAKE_INSTALL_PREFIX=~/bin/openbabel-install

make -j14

make test

# copy the build into CMAKE_INSTALL_PREFIX (~/bin/openbabel-install)
make install

After a successful compilation and installation you should have a complete Open Babel tree under ~/bin/openbabel-install:

~/bin/openbabel-install/
   |-bin
   |-include
   |---inchi
   |---openbabel-2.0
   |-----openbabel
   |-------json
   |-------math
   |-------stereo
   |-lib
   |---cmake
   |-----openbabel2
   |---openbabel
   |-----2.3.90
   |---pkgconfig
   |---python2.6
   |-----site-packages
   |-share
   |---man
   |-----man1
   |---openbabel
   |-----2.3.90
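
To use a local installation like this from Python or from other programs, the environment usually has to point at it. The sketch below only assembles the relevant paths from the directory tree shown above. The function name is hypothetical, and the choice of `lib/openbabel/<version>` for `BABEL_LIBDIR` (the plugin directory) is my assumption about where Open Babel looks for its format plugins at run time, so check it against your own build.

```python
import os


def local_openbabel_env(prefix, ob_version="2.3.90", py_version="2.6"):
    """Build environment variables for an Open Babel install under prefix."""
    prefix = os.path.expanduser(prefix)
    lib = os.path.join(prefix, "lib")
    return {
        "PATH": os.path.join(prefix, "bin"),
        "LD_LIBRARY_PATH": lib,
        "PYTHONPATH": os.path.join(lib, "python" + py_version, "site-packages"),
        # Assumption: format plugins live in lib/openbabel/<version>.
        "BABEL_LIBDIR": os.path.join(lib, "openbabel", ob_version),
        "BABEL_DATADIR": os.path.join(prefix, "share", "openbabel", ob_version),
    }


for name, value in sorted(local_openbabel_env("~/bin/openbabel-install").items()):
    if name.startswith("BABEL_"):
        print("export %s=%s" % (name, value))
    else:
        # PATH-like variables: prepend instead of overwriting.
        print("export %s=%s:$%s" % (name, value, name))
```

The printed lines can be pasted into a shell start-up file such as ~/.bashrc.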


Compiling align-it


Fetch the newest version of silicos-it align-it and extract it to a directory of your choice (e.g. align-it-1.0.4). Then you can use the following script to set up the environment variables and compile the program:

#!/bin/bash

export BABEL_LIBDIR=/home/fstefaniak/bin/openbabel-install/lib/ 
export BABEL_INCLUDEDIR=/home/fstefaniak/bin/openbabel-install/include/openbabel-2.0/openbabel/
export BABEL_DATADIR=/home/fstefaniak/bin/openbabel-install/share/openbabel/2.3.90/

cd align-it-1.0.4

mkdir build
cd build
cmake ..
make


To be continued


:)

Friday, 30 May 2014

encfs under KDE

Mounting and unmounting encfs is easy on the command line, but I was looking for a way to do it in KDE during system startup/shutdown. Here are two scripts which use KDialog to ask for the password and to display notices.


encfsmount.sh
#!/bin/bash

encrypted=~/.encr
unencrypted=~/myplace

encfs --extpass="kdialog --password 'encfs password'" "$encrypted" "$unencrypted"

# check whether the mount point now shows up in the mount table
if mount | grep -q "$unencrypted"; then
    kdialog --title "Mounted :)" --passivepopup "$unencrypted mounted successfully" 10
else
    kdialog --error "$unencrypted not mounted :("
fi


encfsumount.sh
#!/bin/bash

unencrypted=~/myplace

fusermount -u "$unencrypted"

# the mount point should no longer show up in the mount table
if mount | grep -q "$unencrypted"; then
    kdialog --error "$unencrypted NOT unmounted :("
else
    kdialog --title "Unmounted :)" --passivepopup "$unencrypted unmounted successfully" 10
fi
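
Note that piping `mount` into `grep` matches any line that merely contains the path, so a mount at `~/myplace2` would also count as a hit. A stricter check compares the mount-point field exactly. The sketch below does that in Python against text in the `/proc/mounts` format (Linux-specific; the function name and the sample lines are illustrative only).

```python
import os


def is_mounted(path, mounts_text):
    """True if path is listed as a mount point in /proc/mounts-style text."""
    target = os.path.normpath(path)
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 2 is the mount point; spaces in it are escaped as \040.
        if len(fields) >= 2 and fields[1].replace("\\040", " ") == target:
            return True
    return False


# Made-up sample in /proc/mounts format.
sample = (
    "encfs /home/user/myplace fuse.encfs rw,nosuid,nodev 0 0\n"
    "tmpfs /home/user/myplace2 tmpfs rw 0 0\n"
)
print(is_mounted("/home/user/myplace", sample))  # True
print(is_mounted("/home/user/my", sample))       # False
```

On a real system the second argument would be the contents of /proc/mounts, e.g. `open("/proc/mounts").read()`.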


Add these two scripts to autostart in KDE (or to your launcher/desktop icons if you prefer to mount/unmount manually):


It will ask for the password when mounting:


When successful, a notification should appear in the corner:

Thursday, 20 March 2014

Resetting usb devices/hubs from command line

To reset a "hung" USB device (i.e. to simulate unplugging and replugging it) I use a nice program called usbreset. It can be downloaded from GitHub (and I suppose from many other places). After compiling (gcc usbreset.c -o usbreset) it is ready to use.

For mass-resetting USB devices I use this small script:

#!/bin/bash

# every USB device appears as a character device under /dev/bus/usb
for d in $(find /dev/bus/usb/ -type c)
do
    echo "$d"
    /usr/local/bin/usbreset "$d"
done
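
The same device discovery can be done without `find`. As a rough Python equivalent, the sketch below walks a directory tree and keeps only character devices, which is what `-type c` selects. Calling usbreset on each result is deliberately left out so that the example has no side effects; the function name is my own.

```python
import os
import stat


def find_char_devices(root):
    """Return sorted paths of all character devices below root."""
    devices = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # entry vanished or is unreadable
            if stat.S_ISCHR(mode):
                devices.append(path)
    return sorted(devices)


for device in find_char_devices("/dev/bus/usb"):
    print(device)
```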


Monday, 3 March 2014

Display full dataset name in Weka Experiment Environment

When analyzing results in the Weka Experiment Environment, a common problem is that the default width of the dataset name column is 20. Thus, on the results page you can't see the full name of the dataset:


Fortunately, you can change it quickly: just click Output format: Select -> ResultMatrixPlainText -> row name width and change it to the desired value (e.g. 120):

Tuesday, 17 December 2013

Using sequential symmetric gpg encryption with different ciphers.

This method is good for encrypting short messages (since it stores the data in shell variables and generates plain-text output), but it can easily be modified to encrypt larger files (using temporary files instead of variables).

(1) Encryption


First, define which ciphers you want to use and in which order. For more information about the available ciphers, type:
gpg --version
and jump to the section "ciphers" or "symmetric" ("Symetryczne" in the Polish-locale output below):

Symetryczne: IDEA, 3DES, CAST5, BLOWFISH, AES, AES192, AES256,
             TWOFISH, CAMELLIA128, CAMELLIA192, CAMELLIA256

Enter them in the config section of our "encrypt-multiple.sh" script:

#!/bin/bash

algos="TWOFISH AES256 CAMELLIA256 BLOWFISH CAST5" # list of ciphers to use

# -----------------------------------------------------#

# clearing variables
pass=""
pass2=""

# entering passwords
echo -n "Password: "
read -s pass
echo
echo -n "Re-enter password: "
read -s pass2
echo

# do the passwords match?
if [ "$pass" == "$pass2" ]; then
    echo "Passwords match. Encrypting."
    echo

    input=`cat "$1"`
    i=0 # number of passes done so far

    # each pass encrypts the armored output of the previous one
    for algo in $algos
    do
        ((i++))
        echo "*** ($i) $algo"
        input=`echo "$input" | gpg --no-tty --batch -a --symmetric --cipher-algo "$algo" --passphrase "$pass" -o-`
    done

    echo "$input" > "$1".asc.$i
    echo "Encrypted message saved to $1.asc.$i"

    # clearing passwords and inputs
    input=""
    pass=""
    pass2=""
else
    echo "Passwords don't match"
fi


So now, if you want to encrypt the message in file.txt, just run:

encrypt-multiple.sh file.txt

After entering the passphrase (twice) you will get the encrypted file "file.txt.asc.n", where n is the number of ciphers used (n will be needed during decryption).
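
The core of the script is the loop in which the armored output of one pass becomes the input of the next. The sketch below shows the same chaining in Python. It reuses only the gpg options that appear in the script above; the function names and the injectable `run` parameter are my own additions, so that the chaining can be tried out without calling gpg. As in the shell script, the passphrase is passed on the command line, where other local users may be able to see it in the process list.

```python
import subprocess


def gpg_symmetric_cmd(algo, passphrase):
    """gpg command line for one pass, mirroring the options of the script."""
    return ["gpg", "--no-tty", "--batch", "-a", "--symmetric",
            "--cipher-algo", algo, "--passphrase", passphrase, "-o", "-"]


def run_gpg(cmd, text):
    """Feed text to the command on stdin and return its stdout."""
    result = subprocess.run(cmd, input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout


def encrypt_layers(text, algos, passphrase, run=run_gpg):
    """Encrypt text once per cipher; each pass wraps the previous output."""
    for algo in algos:
        text = run(gpg_symmetric_cmd(algo, passphrase), text)
    return text
```

A call such as `encrypt_layers(message, "TWOFISH AES256".split(), password)` would then correspond to two passes of the shell loop. Newer GnuPG 2.x releases may additionally need `--pinentry-mode loopback` before they accept `--passphrase` in batch mode.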


(2) Decryption


To decrypt the above message we just need to enter the valid password. We don't need the names and order of the ciphers used, as gpg detects them automatically. The number of passes n (ciphers used) is "encoded" in the file extension.

#!/bin/bash
pass=""

# entering passwords
echo -n "Password: "
read -s pass
echo
input=`cat "$1"`

# the list of ciphers is not necessary as gpg detects them;
# the number of passes is read from the file extension
algos="${1##*.}"
echo "Encrypted $algos times. Decrypting..."

for i in `seq 1 $algos`
do
    echo "*** $i"
    input=`echo "$input" | gpg --no-tty --batch -d --passphrase "$pass" -o-`
done

echo "Decrypted message:"
echo "---------------------------------------"
echo "$input"

# clearing passwords and inputs
input=""
pass=""
pass2=""
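
The expression `${1##*.}` strips everything up to the last dot, leaving the pass count. For comparison, here is a small Python sketch of the same parsing, with an explicit check that the extension really is a positive number (the function name is hypothetical):

```python
def passes_from_filename(filename):
    """Return the number of encryption passes encoded as the last extension."""
    _, dot, suffix = filename.rpartition(".")
    if not dot or not suffix.isdecimal() or int(suffix) < 1:
        raise ValueError("no pass count in file name: %r" % filename)
    return int(suffix)


print(passes_from_filename("file.txt.asc.5"))  # 5
```

Failing early on a name such as file.txt avoids running the decryption loop with a meaningless count.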


(3) Output file sizes.


Output file sizes increase as more ciphers are used. Here is an example of file sizes (uncompressed and compressed with bzip2). The ciphers used are:
TWOFISH AES256 CAMELLIA256 BLOWFISH CAST5 TWOFISH AES256 CAMELLIA256 BLOWFISH CAST5.


More reading about ciphers and symmetric encryption: GPG Encryption Guide - Part 4 (Symmetric Encryption).

(4) Bonus


If you want to try decoding, here is a 5-fold encrypted text (n=5). The password is chemoinformatics.