Thursday, 10 July 2014

How to split a multi-structure chemical file into single files named after the compounds' titles

Here is a simple Python script for splitting (almost) any multi-structure chemical file (mol2, sdf, smi...) into single-structure files, using each compound's name as the new file name. It uses pybel, the Python interface to Open Babel.
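To show the idea for one concrete format, here is a minimal sketch that splits SDF text without Open Babel. It relies on two properties of the SDF format: every record ends with a line containing only `$$$$`, and the first line of a record holds the molecule title. The function name `split_sdf` and its `fallback` parameter are hypothetical and exist only for this illustration; pybel remains the better choice because it handles many more formats.

```python
def split_sdf(text, fallback="mol"):
    """Split SDF text into a list of (title, record_text) pairs.

    Assumption: every record is terminated by a line containing only '$$$$'.
    Lines after the last terminator (an incomplete record) are dropped.
    """
    records = []
    current = []
    untitled = 0
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            # The first line of an SDF record is the molecule title
            title = current[0].strip()
            if not title:
                # Untitled records get a sequential fallback name
                untitled += 1
                title = "%s_%09d" % (fallback, untitled)
            records.append((title, "\n".join(current) + "\n"))
            current = []
    return records
```

Each returned pair can then be written to its own file, with the title as the file name.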

Usage:
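split_mol.py -i big.mol2

The single-structure files are written to the directory big.splitted, which is created in the current working directory. Compounds without a title are named after the input file plus a running number, e.g. big_000000001.mol2.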

The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import getopt
from random import randint

try:
    from openbabel import pybel  # Open Babel 3.x
except ImportError:
    import pybel  # Open Babel 2.x


### Options

usage = "Allowed options are:\n\t-i\tinput file\n"

try:
    options, arguments = getopt.getopt(sys.argv[1:], 'i:')
except getopt.GetoptError as err:
    # Unknown option or missing argument
    print("Invalid option specified:\t%s\n%s" % (err, usage))
    sys.exit(2)

inputFile = ''
for opt, value in options:
    if opt == '-i':
        inputFile = value

if not os.path.isfile(inputFile):
    print("No such file:\t'%s'\n%s" % (inputFile, usage))
    sys.exit(1)

### output names and dirs

basename = os.path.splitext(os.path.basename(inputFile))[0]
ext = os.path.splitext(os.path.basename(inputFile))[1][1:]

outDir = basename + ".splitted"
i = 0

if not os.path.exists(outDir):
    os.makedirs(outDir)

### Splitting

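for mol in pybel.readfile(ext, inputFile):
    # Untitled molecules get a sequential name based on the input file name
    title = mol.title.strip()
    if title == "":
        i += 1
        title = "%s_%09d" % (basename, i)
    # A path separator inside the title would point outside outDir
    title = title.replace(os.sep, "_")

    outFile = os.path.join(outDir, "%s.%s" % (title, ext))
    status = "ok"

    # If the name is taken, attach a _dup suffix with a random 3-digit number
    # and retry until the name is free
    while os.path.exists(outFile):
        outFile = os.path.join(outDir, "%s_dup%03d.%s" % (title, randint(0, 999), ext))
        status = "duplicated: written to file: " + outFile

    print("%30s %s" % (title, status))
    mol.write(ext, outFile)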
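
The random suffix keeps the script short, but a duplicated compound can end up with a different file name on every run. A deterministic alternative is sketched here: a counter is increased until an unused name is found. `unique_path` is a hypothetical helper, not part of the script above, and it assumes that titles are already safe to use as file names.

```python
import os


def unique_path(directory, title, ext):
    """Return a path inside `directory` that does not exist yet.

    The first choice is '<title>.<ext>'; if that is taken, the suffixes
    _dup001, _dup002, ... are tried in order.
    """
    candidate = os.path.join(directory, "%s.%s" % (title, ext))
    n = 0
    while os.path.exists(candidate):
        n += 1
        candidate = os.path.join(directory, "%s_dup%03d.%s" % (title, n, ext))
    return candidate
```

In the loop, the output name would then be obtained with unique_path(outDir, title, ext) right before the molecule is written.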