February 11th, 2013, 02:08 PM

Help needed with random sampling
Hi all,
I have an input file (input.txt) in tab separated format.
A small part of the input file is as follows:
Gr1 Gr2 Gr3 Gr4 ..............
row1 1 1 1 0 ..............
row2 0 1 0 1 ...............
row3 1 1 0 1 ..............
...................
.................
From this input, a random sampling of all possible group combinations needs to be taken and printed to an output file.
From the above input, the output file should be as follows
(the three columns are: the number of groups used in the analysis, the group names, and the number shared between those groups;
a row counts as shared when a 1 is present in all of the groups being measured).
output:
2 Gr1Gr2 2
2 Gr1Gr3 1
2 Gr1Gr4 1
2 Gr2Gr3 1
2 Gr2Gr4 2
2 Gr3Gr4 0
3 Gr1Gr2Gr3 1
3 Gr1Gr2Gr4 1
3 Gr1Gr3Gr4 0
3 Gr2Gr3Gr4 0
4 Gr1Gr2Gr3Gr4 0
Any idea how to do this? Thanks in advance.
February 11th, 2013, 03:21 PM

I cannot tell what constitutes a group in your input.
Where did the 2 in your output come from?
A Python programmer can transpose rows to columns using zip, and can use itertools.combinations to find all combinations of groups of a specific length. You could make repeated calls for random numbers to reduce the dimensionality until you have a single random sample, or you could combine the lists and make a single random selection. Put all this within a loop over the number of groups, from 2 to n.
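A minimal sketch of those pieces, using made-up row data in the shape of your table (the names and values here are copied from your example, not from a real file):

```python
import itertools
import random

rows = [
    ('row1', 1, 1, 1, 0),
    ('row2', 0, 1, 0, 1),
    ('row3', 1, 1, 0, 1),
]
names = ('Gr1', 'Gr2', 'Gr3', 'Gr4')

# zip(*...) transposes: one tuple per group (column) instead of one per row
columns = list(zip(*(r[1:] for r in rows)))
print(columns[0])  # (1, 0, 1) -- the Gr1 column

# all combinations of exactly 2 groups
pairs = list(itertools.combinations(names, 2))
print(pairs[0])  # ('Gr1', 'Gr2')

# pool of combinations of every length from 2 to n, then one random pick
pool = [c for n in range(2, len(names) + 1)
        for c in itertools.combinations(names, n)]
print(len(pool))  # 11 = C(4,2) + C(4,3) + C(4,4)
print(random.choice(pool))
```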
[code]Code tags[/code] are essential for python code and Makefiles!
February 11th, 2013, 03:30 PM

The 2 on the left of the first line is the number of groups (Gr1Gr2) used in the analysis on that line, and the 2 on the right is the number shared between the two groups (Gr1 and Gr2 are both 1 in row1 and row3).
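So the shared count for a combination is just the number of rows in which every listed group has a 1. A tiny check against the table in the first post:

```python
# Gr1 and Gr2 column values for row1..row3, copied from the example table
gr1 = [1, 0, 1]
gr2 = [1, 1, 1]

# count the rows where both groups have a 1
shared = sum(a == 1 and b == 1 for a, b in zip(gr1, gr2))
print(shared)  # 2: row1 and row3 have a 1 in both groups
```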
February 11th, 2013, 04:43 PM

I think I now understand. Random sampling is irrelevant (at least for this part of the algorithm). Sorry, this code is dense.
Code:
import pprint
import itertools
data = '''
name Gr1 Gr2 Gr3 Gr4
row1 1 1 1 0
row2 0 1 0 1
row3 1 1 0 1
'''
ROWS = [r.strip() for r in data.strip().split('\n')]
print('The data:')
pprint.pprint(ROWS)  # yes, I could delete row 0
print('\nTransposed')
# Transposition makes the problem easier for me to visualize, and zip
# eliminates data irregularities by returning a "rectangular" list of lists.
# Now that I've justified zip, you're right, it's irrelevant.
# Dictionaries with keys might be more straightforward.
GROUPS = list(zip(*(r.split() for r in ROWS)))  # split on whitespace; the real file is tab separated
del GROUPS[0]  # drop the column of row names
pprint.pprint(GROUPS)
for n in range(2, len(GROUPS) + 1):
    # print('count of items all 1 considering all combinations of {} groups'.format(n))
    for Gs in itertools.combinations(GROUPS, n):
        same = sum(all('1' == G[i] for G in Gs) for i in range(len(Gs[0])))  # note that I've used string '1'
        print('{} {} {}'.format(n, ''.join(G[0] for G in Gs), same))
February 12th, 2013, 04:27 AM

won't work on my input file
Thank you b49P23TIvg for your response. Your code works fine for the table above, but when I try to use it on my own data file it doesn't work. My input file (new.txt) is at
https://sites.google.com/site/iicbbioinformatics/share
Thank you again for your consideration.
February 12th, 2013, 11:06 AM

1) Open the file in text mode.
2) You're asking for roughly 3e13 lines of output.
Code:
$ /usr/local/j64801/bin/jconsole
2^45
3.51844e13
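The same count in Python (45 is the assumed number of group columns in new.txt, to match the 2^45 above): the combinations of 2 or more groups are everything in the power set except the empty set and the 45 singletons.

```python
import math

n_groups = 45  # assumed column count of new.txt
total = sum(math.comb(n_groups, k) for k in range(2, n_groups + 1))
print(total)  # 35184372088786
print(total == 2**n_groups - n_groups - 1)  # True
```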
Still, the task might be possible. Eliminate combinations of groups known to have no matches, and reduce the indices that need to be tested by taking the set intersection of the indexes whose corresponding data values are not 0. That latter idea sounds like a better algorithm than the rather straightforward one I've implemented.
Code:
import pprint
import itertools
FILENAME = '/tmp/new.txt'
DATA = open(FILENAME, 'rt').read()  # open the file in text mode
ROWS = [R.strip() for R in DATA.strip().split('\n')]  # strip() discards blank leading/trailing lines
GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
del GROUPS[0]  # drop the column of row names
for n in range(2, len(GROUPS) + 1):
    # print('count of items all 1 considering all combinations of {} groups'.format(n))
    for Gs in itertools.combinations(GROUPS, n):
        same = sum(all('1' == G[i] for G in Gs) for i in range(len(Gs[0])))  # note that I've used string '1'
        print('{} {} {}'.format(n, ''.join(G[0] for G in Gs), same))
February 12th, 2013, 12:06 PM

Sets are faster.
Set intersection runs 5 times faster using combinations of length 2 and 3. If you want to consider combinations of at most 4 groups, change len(GROUPS)+1 in
for n in range(2, len(GROUPS)+1)
to 5 (range excludes its upper bound).
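A quick check of that stop value: range excludes its upper bound, so a stop of 5 covers combinations of 2, 3, and 4 groups.

```python
# range(start, stop) yields start, start+1, ..., stop-1
print(list(range(2, 5)))  # [2, 3, 4]
```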
Code:
import itertools
from functools import reduce
FILENAME = '/tmp/new.txt'
DATA = open(FILENAME, 'rt').read()  # open the file in text mode
ROWS = [R.strip() for R in DATA.strip().split('\n')]
GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
del GROUPS[0]  # drop the column of row names
if True:  # new algorithm, 5 times faster
    # for each group, the set of row indexes where the value is 1
    INDEXES = {G[0]: {I for (I, FIELD) in enumerate(G[1:]) if int(FIELD)} for G in GROUPS}
    KEYS = list(INDEXES.keys())
    for n in range(2, len(GROUPS) + 1):
        for Ks in itertools.combinations(KEYS, n):
            KEY = '{} {}'.format(n, ''.join(Ks))
            INDEXES_INTERSECTION = reduce(set.intersection, (INDEXES[K] for K in Ks))
            # INDEXES[KEY] = INDEXES_INTERSECTION  # store some of these for reuse
            print('{} {}'.format(KEY, len(INDEXES_INTERSECTION)))
else:  # original SLOW algorithm
    for n in range(2, len(GROUPS) + 1):
        for Gs in itertools.combinations(GROUPS, n):
            same = sum(all('1' == G[i] for G in Gs) for i in range(len(Gs[0])))  # note that I've used string '1'
            print('{} {} {}'.format(n, ''.join(G[0] for G in Gs), same))
February 12th, 2013, 12:51 PM

error
Sorry, but I still can't get it; it yields the following error:
C:\Users\Utpal\Desktop\qw\test>python 1.py >> out.txt
Traceback (most recent call last):
File "1.py", line 6, in <module>
DATA = open(FILENAME,'rt').read() # open the file in text mode
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/new.txt'
February 12th, 2013, 12:57 PM

Look for the FILENAME variable near the top of the code, and use the name of your file.
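For example, on Windows a raw string keeps the backslashes in the path literal (this exact path is a guess based on the prompt in your traceback; adjust it to wherever new.txt actually lives):

```python
# raw string: without the r prefix, sequences like \U and \n would be escapes
FILENAME = r'C:\Users\Utpal\Desktop\qw\test\new.txt'
print(FILENAME)
# then the script's existing line works unchanged:
# DATA = open(FILENAME, 'rt').read()
```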
February 12th, 2013, 01:55 PM

thanks
OK, got it. It takes 30 minutes :) but thank you very much!