#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    22
    Rep Power
    0

    Help needed in random sampling


    Hi all,

    I have an input file (input.txt) in tab separated format.
    A small part of the input file is as follows:

    Gr1 Gr2 Gr3 Gr4 ..............
    row1 1 1 1 0 ..............
    row2 0 1 0 1 ...............
    row3 1 1 0 1 ..............
    ...................
    .................


    From this input, I need to take a random sampling of all possible group combinations and print the results to an output file.
    For the input above, the output file should be as follows
    (the three columns are: number of groups used in the analysis, group names, and number shared between these groups;
    a row counts as shared if it has a 1 in every group being compared).

    output:

    2 Gr1-Gr2 2
    2 Gr1-Gr3 1
    2 Gr1-Gr4 1
    2 Gr2-Gr3 1
    2 Gr2-Gr4 2
    2 Gr3-Gr4 0
    3 Gr1-Gr2-Gr3 1
    3 Gr1-Gr2-Gr4 1
    3 Gr1-Gr3-Gr4 0
    3 Gr2-Gr3-Gr4 0
    4 Gr1-Gr2-Gr3-Gr4 0

    Any idea how to do this? Thanks in advance.
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,703
    Rep Power
    480
    I cannot tell what constitutes a group in your input.
    Where did the 2 in your output come from?

    A Python programmer can transpose rows into columns using zip, and use itertools.combinations to find all combinations of groups of a specific length. You could use repeated calls for random numbers to reduce the dimensionality until you have a single random sample, or you could combine the lists and use a single random selection. Put all this within a loop over the number of groups from 2 to n.
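    A minimal sketch of that outline, assuming the small table from post #1 held in memory. The names rows, all_combos and sample_combos are made up for illustration, and this takes the "combine the lists and use a single random selection" route:

```python
import itertools
import random

# Hypothetical in-memory copy of the table: header row first, then the data rows.
rows = [
    ["name", "Gr1", "Gr2", "Gr3", "Gr4"],
    ["row1", "1", "1", "1", "0"],
    ["row2", "0", "1", "0", "1"],
    ["row3", "1", "1", "0", "1"],
]

# zip(*rows) transposes rows into columns; [1:] drops the column of row names.
columns = list(zip(*rows))[1:]
group_names = [col[0] for col in columns]

# Every combination of 2..n group names, combined into one list.
all_combos = [
    combo
    for n in range(2, len(group_names) + 1)
    for combo in itertools.combinations(group_names, n)
]


def sample_combos(combos, k, seed=None):
    """Return k distinct combinations chosen at random (k is capped at len(combos))."""
    rng = random.Random(seed)
    return rng.sample(combos, min(k, len(combos)))


if __name__ == "__main__":
    for combo in sample_combos(all_combos, 3):
        print(len(combo), "-".join(combo))
```

    Building the full list first is only practical for a handful of groups; with many groups the list of combinations gets too large to hold in memory.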
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    22
    Rep Power
    0
    The 2 on the left (first line) is the number of groups (Gr1-Gr2) used in the analysis on that line, and the 2 on the right is the number shared between the two groups (Gr1 and Gr2 are both 1 in row1 and row3).
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,703
    Rep Power
    480
    I think I now understand. Random sampling is irrelevant (at least for this part of the algorithm). Sorry, this code is dense.
    Code:
    import pprint
    import itertools
    
    data = '''
        name	Gr1	Gr2	Gr3	Gr4
        row1	1	1	1	0
        row2	0	1	0	1
        row3	1	1	0	1
    '''
    ROWS = [r.strip() for r in data.strip().split('\n')]
    print('The data:')
    pprint.pprint(ROWS)
    # Transposition makes the problem easier for me to visualize, and zip
    # eliminates data irregularities by returning a "rectangular" list of tuples.
    # Dictionaries with keys might be more straightforward.
    print('\nTransposed')
    GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
    del GROUPS[0]                              # drop the column of row names
    pprint.pprint(GROUPS)
    for n in range(2,len(GROUPS)+1):
        # count of items all 1, considering all combinations of n groups
        for Gs in itertools.combinations(GROUPS,n):
            same = sum(all('1' == G[i] for G in Gs) for i in range(len(Gs[0]))) # Note that I've used string '1'
            print('{} {} {}'.format(n,'-'.join(G[0] for G in Gs),same))
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    22
    Rep Power
    0

    Won't work on my input file


    Thank you b49P23TIvg for your response. Your code works fine for the table above, but when I try it on my own data file, it doesn't work. My input file (new.txt) is at

    https://sites.google.com/site/iicbbioinformatics/share..

    Thank you again for your consideration.
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,703
    Rep Power
    480
    1) open the file in text mode.
    2) You're asking for 3e13 lines of output.
    Code:
    $ /usr/local/j64-801/bin/jconsole 
       2^45
    3.51844e13
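    The same arithmetic as a rough Python check. The figure of 45 groups is an assumption read off the 2^45 above, count_combinations is a made-up helper, and math.comb needs Python 3.8 or later:

```python
import math


def count_combinations(n_groups, max_size=None):
    """Number of group combinations of size 2..max_size (default: all sizes)."""
    if max_size is None:
        max_size = n_groups
    return sum(math.comb(n_groups, k) for k in range(2, max_size + 1))


if __name__ == "__main__":
    print(count_combinations(4))      # the small example table: 11
    print(count_combinations(45))     # every size: 2**45 - 46
    print(count_combinations(45, 3))  # pairs and triples only: 15180
```

    Capping the combination size is what brings the output down to something printable.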
    Still, the task might be possible. Eliminate combinations of groups known to have no matches, and reduce the indices that need to be tested by taking the set intersection of the indexes whose data values are nonzero. The latter idea sounds like a better algorithm than the rather straightforward one I've implemented.
    Code:
    import pprint
    import itertools
    
    FILENAME = '/tmp/new.txt'
    
    DATA = open(FILENAME,'rt').read()  # open the file in text mode
    ROWS = [R.strip() for R in DATA.split('\n') if R.strip()]  # skip blank lines, e.g. after a trailing newline
    GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
    del GROUPS[0]                      # drop the column of row names
    for n in range(2,len(GROUPS)+1):
        # count of items all 1, considering all combinations of n groups
        for Gs in itertools.combinations(GROUPS,n):
            same = sum(all('1' == G[i] for G in Gs) for i in range(len(Gs[0]))) # Note that I've used string '1'
            print('{} {} {}'.format(n,'-'.join(G[0] for G in Gs),same))
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,703
    Rep Power
    480

    Sets are faster.


    Set intersection runs 5 times faster on combinations of length 2 and 3. If you want to consider combinations of up to 4 groups, change len(GROUPS)+1 in
    for n in range(2,len(GROUPS)+1)
    to 5 (range stops before its upper bound).
    Code:
    import itertools
    from functools import reduce
    
    FILENAME = '/tmp/new.txt'
    
    DATA = open(FILENAME,'rt').read()  # open the file in text mode
    ROWS = [R.strip() for R in DATA.split('\n') if R.strip()]  # skip blank lines, e.g. after a trailing newline
    GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
    del GROUPS[0]
    
    if True:                                  # new algorithm 5 times faster
        INDEXES = {G[0]:{I for (I,FIELD,) in enumerate(G[1:]) if int(FIELD)} for G in GROUPS}
        KEYS = list(INDEXES.keys())
        for n in range(2,len(GROUPS)+1):
            for Ks in itertools.combinations(KEYS,n):
                KEY = '{} {}'.format(n,'-'.join(Ks))
                INDEXES_INTERSECTION = reduce(set.intersection,(INDEXES[K] for K in Ks))
                # INDEXES[KEY] = INDEXES_INTERSECTION  # store some of these for reuse
                print('{} {}'.format(KEY,len(INDEXES_INTERSECTION)))
    else:                                     # original SLOW algorithm.
        # GROUPS[0] (the row names) was already deleted above
        for n in range(2,len(GROUPS)+1):
            for Gs in itertools.combinations(GROUPS,n):
                same = sum(all('1' == G[i] for G in Gs) for i in range(len(Gs[0]))) # Note that I've used string '1'
                print('{} {} {}'.format(n,'-'.join(G[0] for G in Gs),same))
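    The "eliminate combinations known to have no matches" and "store some of these for reuse" ideas could be sketched roughly as below. This is an untuned illustration, not the code that was timed above. It assumes a dict shaped like INDEXES (group name to set of row indexes holding a 1); shared_counts is a made-up name. Note one behavior change: combinations with a count of 0 are skipped rather than printed.

```python
def shared_counts(indexes, max_size=None):
    """Yield (size, names, count) for every combination whose count is > 0.

    indexes maps group name -> set of row indexes whose value is 1.
    """
    names = list(indexes)
    if max_size is None:
        max_size = len(names)
    # Each entry: (position of the last group used, group names, shared rows).
    level = [(i, (name,), indexes[name])
             for i, name in enumerate(names) if indexes[name]]
    for size in range(2, max_size + 1):
        next_level = []
        for last, combo, rows in level:
            # Only extend with later groups so each combination appears once.
            for j in range(last + 1, len(names)):
                shared = rows & indexes[names[j]]  # reuse the stored intersection
                if shared:  # an empty intersection rules out every superset too
                    next_level.append((j, combo + (names[j],), shared))
        for _, combo, rows in next_level:
            yield size, combo, len(rows)
        level = next_level
        if not level:
            break


if __name__ == "__main__":
    # Index sets for the small table from post #1 (row1 -> 0, row2 -> 1, row3 -> 2).
    example = {"Gr1": {0, 2}, "Gr2": {0, 1, 2}, "Gr3": {0}, "Gr4": {1, 2}}
    for size, combo, count in shared_counts(example):
        print("{} {} {}".format(size, "-".join(combo), count))
```

    How much this saves depends on the data: if most groups share at least one row, few combinations get pruned and the output is still huge.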
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    22
    Rep Power
    0

    error


    Sorry, but I still can't get it to work. It gives the following error:

    C:\Users\Utpal\Desktop\qw\test>python 1.py >> out.txt
    Traceback (most recent call last):
      File "1.py", line 6, in <module>
        DATA = open(FILENAME,'rt').read() # open the file in text mode
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/new.txt'
  16. #9
  17. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,703
    Rep Power
    480
    Look for the FILENAME variable near the top of the code, and use the name of your file.
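    One possible way to set it, sketched under the assumption that new.txt sits in the folder you run python from. find_input is a made-up helper, not part of the code above:

```python
from pathlib import Path


def find_input(name, folder=None):
    """Return the path to `name` inside `folder` (default: current directory)."""
    folder = Path(folder) if folder is not None else Path.cwd()
    path = folder / name
    if not path.is_file():
        raise FileNotFoundError("input file not found: {}".format(path))
    return path


if __name__ == "__main__":
    # Assumes new.txt is in the directory the script is started from.
    FILENAME = find_input("new.txt")
    print(FILENAME)
```

    Writing the full path directly as a string works as well; on Windows a raw string (r'...') avoids trouble with backslashes.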
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    22
    Rep Power
    0

    thanks


    OK, got it. It takes 30 min :-) but thank you very much.
