#1
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    45
    Rep Power
    5

    text to cluster format conversion


    Hi,

    I have asked this question once before, but I still could not get through this particular problem. I would appreciate any help with it. I have also attached a pictorial representation of the problem for better understanding.

    My tab-separated input file (input.txt) looks like this (the first line is the header):

    Code:
    Query_Id	SubjectIid	% Identity	Alignment_length	Mismatches	Gap_openings	Q._start	Q._end	S._start	S._end	E-value	Bit_score	% Coverage
    or1|NP_ER56.2	or2|315163404	100	590	0	0	1	590	1	590	0	1209.5	100
    or1|NP_ER56.2	or3|715578848	100	674	0	0	1	674	1	674	0	1297.3	100
    or1|NP_ER56.2	or2|987578843	100	649	0	0	1	649	1	649	0	1290	100
    or3|315578895	or1|DJKKWEYIQ	100	517	0	0	1	517	1	517	0	1040.8	100
    or2|NPTE78916	or2|995163276	100	561	0	0	1	561	1	561	0	1109	100
    I want to calculate clusters from the first two columns using the following rules (please see the picture):

    i) Each cluster is built by taking one of the repeated "orX|some_number" values from column 1 and adding the corresponding "orY|some_number" values from column 2, i.e. those in the same rows where that "orX|some_number" value occurs.
    ii) An orX| prefix cannot be taken twice for the same cluster; i.e., if I match 'or1|123' from the first column, I can't take 'or1|345' from the second column for that cluster. Similarly, once an or2| value has been taken from the second column, no other or2| value can be taken from that column.
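
The two rules above can be sketched as a small function. This is only one possible reading of them, not a definitive implementation: it assumes one cluster per distinct column 1 value, in order of first appearance, and it assumes that the first column 2 value seen for a given prefix is the one that is kept. The name `build_clusters` and the `pairs` list are made up for illustration; the IDs are the first two columns of the sample rows.

```python
def build_clusters(pairs):
    """Group (query_id, subject_id) pairs into clusters.

    Assumed reading of the rules: one cluster per distinct query ID, in
    order of first appearance. A subject ID joins a cluster only if its
    prefix (the text before '|') is not already used in that cluster,
    and the query ID's own prefix counts as used.
    """
    clusters = {}  # query_id -> members, query first (dict order: Python 3.7+)
    used = {}      # query_id -> set of prefixes already in that cluster
    for query, subject in pairs:
        if query not in clusters:
            clusters[query] = [query]
            used[query] = {query.split('|')[0]}
        prefix = subject.split('|')[0]
        if prefix not in used[query]:
            used[query].add(prefix)
            clusters[query].append(subject)
    return list(clusters.values())


# First two columns of the sample rows, in file order.
pairs = [
    ('or1|NP_ER56.2', 'or2|315163404'),
    ('or1|NP_ER56.2', 'or3|715578848'),
    ('or1|NP_ER56.2', 'or2|987578843'),
    ('or3|315578895', 'or1|DJKKWEYIQ'),
    ('or2|NPTE78916', 'or2|995163276'),
]

for number, members in enumerate(build_clusters(pairs), start=1):
    print('Cluster{}:\t{}'.format(number, ' '.join(members)))
```

Because rows are processed in file order, the clusters come out in the order the column 1 values first appear. If a different column 2 value should win when two share a prefix (for example the best Bit_score), the rows would need to be ordered accordingly before calling the function.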

    Untitled.jpg
    or
    pic url: https://sites.google.com/site/iicbbi...attredirects=0

    For my input file (input.txt), the resulting cluster file (output.txt) would look like:

    Code:
    Cluster1:<tab> or1|NP_ER56.2 or2|315163404 or3|715578848
    Cluster2:<tab> or3|315578895 or1|DJKKWEYIQ 
    Cluster3:<tab> or2|NPTE78916
    Thanks for your consideration
    Last edited by utpalmtbi; August 9th, 2016 at 02:33 PM. Reason: url add
  2. #2
  3. Contributing User
    Devshed God 1st Plane (5500 - 5999 posts)

    Join Date
    Aug 2011
    Posts
    5,893
    Rep Power
    509
    Code:
    '''
        reads from input.txt
        writes to stdout
    '''
    
    import sys
    w = sys.stdout.write
    
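    class FunkyWunky:
        # Wraps an ID such as 'or1|NP_ER56.2'. Hashing and equality use only
        # the prefix before '|', so a set keeps at most one member per prefix.
    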
        def __init__(self, s):
            self.s = s
            self.key, self.value = s.split('|')
    
        def __hash__(self):
            return hash(self.key)
    
        def __eq__(self, other):
            assert isinstance(other, self.__class__)
            return self.key == other.key
    
        def __str__(self):
            return self.s
    
    vital_data = []
    
    # load the file
    with open('input.txt', 'rt') as inf:
        for line in inf:
            if line.startswith('or'):
                vital_data.append(line.split('\t')[:2])
    
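    # sort the data so that rows sharing a first-column ID are adjacent
    # (clusters are therefore written in sorted order, not file order)
    vital_data.sort()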
    
    
    # arrange into clusters and display
    cluster = 0
    i = j = 0
    while i < len(vital_data):
        j = i
        cluster += 1
        head = FunkyWunky(vital_data[i][0])
        tail = {head}
        while (j < len(vital_data)) and (vital_data[i][0] == vital_data[j][0]):
            tail.add(FunkyWunky(vital_data[j][1]))
            j += 1
        w('Cluster{}:\t{}'.format(cluster, head))
        tail.remove(head)
        for member in sorted(tail, key = str):
            w('\t{}'.format(member))
        w('\n')
        i = j

    Comments on this post

    • utpalmtbi agrees
    [code]Code tags[/code] are essential for python code and Makefiles!
