
Thursday, 10 July 2014

How to split any chemical file into single files named after the compounds' titles

Here is a simple Python script that splits (almost) any multi-structure chemical file (mol2, sdf, smi, ...) into single-structure files, using each compound's title as the new file name. It uses pybel, the Open Babel Python interface.

Usage:
split_mol.py -i big.mol2
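
To make the expected result of that command concrete, here is a minimal sketch of the naming logic the script applies to its input file: the output directory and the chemical format are both derived from the input file name. The helper name `output_layout` is hypothetical and only used for illustration; it is not part of the script.

```python
import os


def output_layout(input_file):
    """Return (output directory, format extension) for an input file.

    Hypothetical helper: it mirrors the basename/extension handling
    of the script, nothing more.
    """
    basename, ext = os.path.splitext(os.path.basename(input_file))
    # The leading dot is dropped because pybel expects "mol2", not ".mol2"
    return basename + ".splitted", ext[1:]


print(output_layout("big.mol2"))
```

So `big.mol2` should end up as a `big.splitted` directory in the current working directory, filled with one `.mol2` file per compound.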

The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import getopt
try:
    from openbabel import pybel  # Open Babel 3.x
except ImportError:
    import pybel  # Open Babel 2.x
from random import randint


### Options

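errMsg = "Invalid operation specified:\t%s\nAllowed options are:\n\t-i\tinput file\n"

# Parse the command line; an unknown option gets the same usage message
try:
    options, arguments = getopt.getopt(sys.argv[1:], 'i:')
except getopt.GetoptError as err:
    print(errMsg % err)
    sys.exit(1)

inputFile = ''
for opt, value in options:
    if opt == '-i':
        inputFile = value

# Stop early if the input file is missing or was not given
if not os.path.isfile(inputFile):
    print(errMsg % "No such file.\n")
    sys.exit(1)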

### output names and dirs

basename = os.path.splitext(os.path.basename(inputFile))[0]
ext = os.path.splitext(os.path.basename(inputFile))[1][1:]

outDir = basename + ".splitted"
i = 0

if not os.path.exists(outDir):
    os.makedirs(outDir)

### Splitting

for mol in pybel.readfile(ext, inputFile):
    # Untitled molecules get a numbered name based on the input file name
    if mol.title == "":
        i += 1
        title = "%s_%09d" % (basename, i)
    else:
        title = mol.title

    outFile = "%s/%s.%s" % (outDir, title, ext)
    status = "ok"

    # If the file already exists, add a _dup suffix with a 3-digit random
    # number and retry until the name is unused
    while os.path.exists(outFile):
        outFile = "%s/%s_dup%03d.%s" % (outDir, title, randint(0, 999), ext)
        status = "duplicated: written to file: " + outFile

    # A single print() call with one argument works in Python 2 and 3
    print("%30s  %s" % (title, status))

    mol.write(ext, outFile)
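
One caveat: the script uses each compound title verbatim as a file name, so a title containing a slash or other special characters may produce an invalid path. Below is a minimal sketch of one way to guard against that, together with a deterministic alternative to the random `_dup` suffix. Both `safe_title` and `unique_path` are hypothetical helpers, not part of the script or of pybel, and the set of allowed characters is an assumption you may want to adjust.

```python
import os
import re


def safe_title(title, fallback="unnamed"):
    """Replace runs of characters outside [A-Za-z0-9._-] with one underscore.

    Hypothetical helper; the allowed character set is an assumption.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip()).strip("._")
    return cleaned or fallback


def unique_path(out_dir, title, ext):
    """Return a path inside out_dir that does not exist yet.

    Hypothetical helper: duplicates get a counted suffix
    (_dup001, _dup002, ...) instead of a random one.
    """
    path = os.path.join(out_dir, "%s.%s" % (title, ext))
    n = 0
    while os.path.exists(path):
        n += 1
        path = os.path.join(out_dir, "%s_dup%03d.%s" % (title, n, ext))
    return path


print(safe_title("aspirin / salt"))
```

In the splitting loop, this could be used as `outFile = unique_path(outDir, safe_title(title), ext)` before calling `mol.write(ext, outFile)`.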