### Thread: Help needed in random sampling

1. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Feb 2013
Posts
23
Rep Power
0

#### Help needed in random sampling

Hi all,

I have an input file (input.txt) in tab separated format.
A small part of the input file is as follows:

Gr1 Gr2 Gr3 Gr4 ..............
row1 1 1 1 0 ..............
row2 0 1 0 1 ...............
row3 1 1 0 1 ..............
...................
.................

From this input, every possible combination of groups needs to be taken (random sampling) and printed to an output file.
For the input above, the output file should be as follows.
The three columns are: number of groups used in the analysis, group names, and number of rows shared between these groups.
A row counts as shared if 1 is present in all the groups being measured.

output:

2 Gr1-Gr2 2
2 Gr1-Gr3 1
2 Gr1-Gr4 1
2 Gr2-Gr3 1
2 Gr2-Gr4 2
2 Gr3-Gr4 0
3 Gr1-Gr2-Gr3 1
3 Gr1-Gr2-Gr4 1
3 Gr1-Gr3-Gr4 0
3 Gr2-Gr3-Gr4 0
4 Gr1-Gr2-Gr3-Gr4 0

Any idea how to do this? Thanks in advance.
2. I cannot tell what constitutes a group in your input.
Where did the 2 in your output come from?

A Python programmer can transpose rows into columns using zip, and use itertools.combinations to find all combinations of groups of a specific length. You could use repeated calls for random numbers to reduce the dimensionality until you have a single random sample, or you could combine the lists and use a single random selection. Put all this within a loop over the number of groups, from 2 to n.
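
As a small illustration of the two building blocks named above, the sketch below transposes a few rows with zip and enumerates group combinations with itertools.combinations. The sample rows are copied from the question; the variable names are only illustrative.

```python
import itertools

# Sample rows from the question, already split into fields.
rows = [
    ["name", "Gr1", "Gr2", "Gr3", "Gr4"],
    ["row1", "1", "1", "1", "0"],
    ["row2", "0", "1", "0", "1"],
    ["row3", "1", "1", "0", "1"],
]

# zip(*rows) turns rows into columns; the first column holds the row labels.
columns = list(zip(*rows))
groups = columns[1:]
print(groups[0])  # ('Gr1', '1', '0', '1')

# All combinations of group names, from pairs up to all groups at once.
names = [g[0] for g in groups]
combos = [c for n in range(2, len(names) + 1)
          for c in itertools.combinations(names, n)]
print(len(combos))  # 6 pairs + 4 triples + 1 quadruple = 11
```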
3. The 2 on the left (first line) is the number of groups used in the analysis on that line (Gr1-Gr2). The 2 on the right is the number of rows shared between the two groups (Gr1 and Gr2 are both 1 in row1 and in row3).
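
To make the counting rule concrete, here is a minimal worked example for the Gr1-Gr2 line, using the values from the sample table. The variable names are illustrative only.

```python
# Columns for Gr1 and Gr2 from the sample table (rows row1, row2, row3).
gr1 = [1, 0, 1]
gr2 = [1, 1, 1]

# A row is shared when every group in the combination has a 1 in it.
shared_rows = [i for i, values in enumerate(zip(gr1, gr2), start=1) if all(values)]
print(shared_rows)       # [1, 3] -> row1 and row3
print(len(shared_rows))  # 2
```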
4. I think I now understand. Random sampling is irrelevant (at least for this part of the algorithm). Sorry, this code is dense.
Code:
```
import pprint
import itertools

data = '''
name	Gr1	Gr2	Gr3	Gr4
row1	1	1	1	0
row2	0	1	0	1
row3	1	1	0	1
'''
ROWS = [r.strip() for r in data.strip().split('\n')]
print('The data:')
pprint.pprint(ROWS)                        # Yes, I could delete row 0
print('\nTransposed')
# Transposition makes the problem easier for me to visualize, and zip
# eliminates data irregularities by returning a "rectangular" list of tuples.
# Now that I've justified zip, you're right, it's irrelevant.
# Dictionaries with keys might be more straightforward.
GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
del GROUPS[0]  # drop the column of row labels so 'name' is not treated as a group
pprint.pprint(GROUPS)
for n in range(2, len(GROUPS) + 1):
    # count of rows that are all 1 across each combination of n groups
    for Gs in itertools.combinations(GROUPS, n):
        # Note that I've used string '1'; index 0 of each group is its name
        same = sum(all('1' == G[i] for G in Gs) for i in range(1, len(Gs[0])))
        print('{} {} {}'.format(n, '-'.join(G[0] for G in Gs), same))
```
5. #### Won't work on my input file

Thank you b49P23TIvg for your response. Your code works fine for the table above, but when I try it on my own data file, it doesn't work. My input file is (new.txt) and is in

Thank you again for your consideration.
6. 1) open the file in text mode.
2) You're asking for about 3.5e13 lines of output.
Code:
```
$ /usr/local/j64-801/bin/jconsole
2^45
3.51844e13
```
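
The same estimate can be checked in Python. This sketch assumes the input has 45 groups, which is only inferred from the 2^45 figure above, and it needs Python 3.8 or later for math.comb.

```python
import math

n_groups = 45  # assumption: inferred from the 2^45 figure above

# One output line per combination of 2..n_groups groups.
lines = sum(math.comb(n_groups, k) for k in range(2, n_groups + 1))
print(lines)  # 35184372088786

# All subsets (2**n) minus the empty set and the n single-group subsets.
assert lines == 2**n_groups - n_groups - 1
```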
Still, the task might be possible. Eliminate combinations of groups known to have no matches, and reduce the indices that need to be tested by intersecting, for each group, the set of row indexes that do not hold a 0. The latter idea sounds like a better algorithm than the rather straightforward one I've implemented.
Code:
```
import itertools

FILENAME = '/tmp/new.txt'

with open(FILENAME, 'rt') as f:  # open the file in text mode
    DATA = f.read()
# Skip blank lines: an empty row would make zip truncate everything to one column
ROWS = [R.strip() for R in DATA.split('\n') if R.strip()]
GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
del GROUPS[0]  # drop the column of row labels
for n in range(2, len(GROUPS) + 1):
    # count of rows that are all 1 across each combination of n groups
    for Gs in itertools.combinations(GROUPS, n):
        # Note that I've used string '1'; index 0 of each group is its name
        same = sum(all('1' == G[i] for G in Gs) for i in range(1, len(Gs[0])))
        print('{} {} {}'.format(n, '-'.join(G[0] for G in Gs), same))
```
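
The pruning idea mentioned in this post (skip combinations of groups already known to share nothing) could be sketched as a level-by-level search: each combination with a non-empty intersection is extended by one later group, so a combination whose prefix already shares no rows is never generated. This is one possible reading of that suggestion, not code from this thread; `shared_counts` and the sample `indexes` are hypothetical names, and zero-count combinations are simply omitted from the output.

```python
def shared_counts(indexes):
    """Yield (names, count) for every combination of 2+ groups that shares a row.

    indexes maps each group name to the set of row indexes where it has a 1.
    """
    names = list(indexes)
    # Each frontier entry: (tuple of positions into names, intersection so far).
    frontier = [((i,), indexes[name]) for i, name in enumerate(names)]
    while frontier:
        next_frontier = []
        for positions, rows in frontier:
            # Extend only with later groups so each combination appears once.
            for j in range(positions[-1] + 1, len(names)):
                common = rows & indexes[names[j]]
                if common:  # an empty intersection can never become non-empty
                    combo = positions + (j,)
                    yield tuple(names[p] for p in combo), len(common)
                    next_frontier.append((combo, common))
        frontier = next_frontier


# Row indexes (0-based) holding a 1 in the sample table from the question.
indexes = {"Gr1": {0, 2}, "Gr2": {0, 1, 2}, "Gr3": {0}, "Gr4": {1, 2}}
for combo, count in shared_counts(indexes):
    print(len(combo), "-".join(combo), count)
```

How much this saves depends entirely on the data: if most groups share at least one row, few branches are cut and the number of combinations stays close to the full 2^n.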
7. #### Sets are faster.

Set intersection runs 5 times faster when using combinations of length 2 and 3. To restrict the run to those lengths, change len(GROUPS)+1 in
for n in range(2,len(GROUPS)+1)
to 4 (range excludes its upper bound, so use 5 to include combinations of 4 groups).
Code:
```
import itertools
from functools import reduce

FILENAME = '/tmp/new.txt'

with open(FILENAME, 'rt') as f:           # open the file in text mode
    DATA = f.read()
# Skip blank lines: an empty row would make zip truncate everything to one column
ROWS = [R.strip() for R in DATA.split('\n') if R.strip()]
GROUPS = list(zip(*(r.split('\t') for r in ROWS)))
del GROUPS[0]                             # drop the column of row labels

if True:                                  # new algorithm 5 times faster
    # Map each group name to the set of row indexes holding a nonzero value
    INDEXES = {G[0]: {I for (I, FIELD) in enumerate(G[1:]) if int(FIELD)} for G in GROUPS}
    KEYS = list(INDEXES.keys())
    for n in range(2, len(GROUPS) + 1):
        for Ks in itertools.combinations(KEYS, n):
            KEY = '{} {}'.format(n, '-'.join(Ks))
            INDEXES_INTERSECTION = reduce(set.intersection, (INDEXES[K] for K in Ks))
            # INDEXES[KEY] = INDEXES_INTERSECTION  # store some of these for reuse
            print('{} {}'.format(KEY, len(INDEXES_INTERSECTION)))
else:                                     # original SLOW algorithm.
    # The label column was already deleted above; deleting again would drop the first group
    for n in range(2, len(GROUPS) + 1):
        for Gs in itertools.combinations(GROUPS, n):
            # Note that I've used string '1'; index 0 of each group is its name
            same = sum(all('1' == G[i] for G in Gs) for i in range(1, len(Gs[0])))
            print('{} {} {}'.format(n, '-'.join(G[0] for G in Gs), same))
```
8. #### Error

Sorry, but I still can't get it to work. It gives the following error:

C:\Users\Utpal\Desktop\qw\test>python 1.py >> out.txt
Traceback (most recent call last):
File "1.py", line 6, in <module>
DATA = open(FILENAME,'rt').read() # open the file in text mode
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/new.txt'
9. Look for the FILENAME variable near the top of the code, and use the name of your file.
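
One way to avoid editing the script for each file is to take the path from the command line. This is a small sketch, not part of the code posted above; `read_groups` and `main` are hypothetical helper names, and the tab-separated layout with a header line and a leading row-label column is assumed from the sample in the question.

```python
import sys


def read_groups(path):
    """Return the columns of a tab-separated file as tuples, minus the row labels."""
    with open(path, "rt") as f:  # text mode
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    columns = list(zip(*rows))
    return columns[1:]  # columns[0] holds the row labels


def main(argv):
    if len(argv) != 2:
        print("usage: python 1.py INPUT_FILE")
        return
    for group in read_groups(argv[1]):
        print(group[0], len(group) - 1)  # group name and number of data rows


if __name__ == "__main__":
    main(sys.argv)
```

It would then be run as, for example, python 1.py new.txt from the directory that holds the data file.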
10. #### Thanks

OK, got it. It takes 30 min :-) but thank you very much.