#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0

    Help in matrix file processing and random sampling


    Hi all,
    I am a newbie in Python, currently working on processing a matrix file. I am stuck and desperately need help.
    I noticed that a somewhat similar problem has already been mentioned in the forum (in the post: Help needed in random sampling). Although my matrix (and problem) is a bit different, I present it in the same 1 and 0 (presence/absence) notation as the previous problem, in the hope that a modification of the previous program might do the trick.
    OK, now to the problem.

    I have a large input matrix file like this:

    Group1 Group2 Group3 Group4 .............
    First 1 0 1 1 .............
    Second 0 0 1 1 .............
    Third 1 0 1 0 .............
    Forth 1 1 0 1 .............
    .............
    ........
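
    A minimal sketch of reading this layout, assuming the file is whitespace-separated, the header line holds only the group names, and every data line starts with a row label. The function name read_matrix is hypothetical, and the sample text stands in for the real file:

```python
import io

def read_matrix(handle):
    """Parse a whitespace-separated presence/absence matrix.

    Returns (names, columns): the group names in file order, and a dict
    mapping each group name to its list of 0/1 values, one per data row.
    """
    names = handle.readline().split()
    columns = {name: [] for name in names}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # fields[0] is the row label (First, Second, ...); the rest are 0/1 values
        values = fields[1:]
        if len(values) != len(names):
            raise ValueError("row %r has %d values, expected %d"
                             % (fields[0], len(values), len(names)))
        for name, value in zip(names, values):
            columns[name].append(int(value))
    return names, columns

# Stand-in for open("matrix.txt"); the file name would be whatever you use.
sample = io.StringIO(
    "Group1 Group2 Group3 Group4\n"
    "First 1 0 1 1\n"
    "Second 0 0 1 1\n"
    "Third 1 0 1 0\n"
    "Fourth 1 1 0 1\n"
)
names, columns = read_matrix(sample)
print(columns["Group1"])  # [1, 0, 1, 1]
```

    Storing the data column by column keeps the later per-combination counting simple, since each combination only needs a few columns.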

    I want to combine the columns (with their values) in every possible combination. Since that would produce terabytes of combinations, I want to restrict the output to 500 random combinations for each combination size.
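
    One way to pick up to 500 distinct combinations without enumerating all of them is to draw random index sets and discard repeats. This is only a sketch: the name sample_combinations and its parameters are made up for illustration, and math.comb needs Python 3.8 or newer.

```python
import math
import random

def sample_combinations(names, size, limit=500, seed=None):
    """Return up to `limit` distinct combinations of `size` names, chosen at random."""
    rng = random.Random(seed)
    # Never ask for more combinations than actually exist.
    target = min(limit, math.comb(len(names), size))
    chosen = set()
    while len(chosen) < target:
        # Sorting makes (0, 2) and (2, 0) the same combination.
        indices = tuple(sorted(rng.sample(range(len(names)), size)))
        chosen.add(indices)
    return [tuple(names[i] for i in indices) for indices in sorted(chosen)]

groups = ["Group1", "Group2", "Group3", "Group4"]
pairs = sample_combinations(groups, 2)
print(len(pairs))  # 6, because only 6 pairs exist for 4 groups
```

    With many groups the chance of drawing a repeat is small, so the loop finishes quickly; with few groups the cap on target keeps it from running forever.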

    After combining, I want the shared_count, non_shared_count and total_count for each combination (explained below):
    .................................................................................................... .................................................................
    For example, to start, the combinations of size 2 (that is, 2 groups combined) will be:

    Group1-Group2 Group1-Group3 Group1-Group4 Group2-Group3 ..... (up to 500 random combinations)

    and the shared, non_shared and total counts for each combination are calculated as follows:

    Group1-Group2         Group1-Group3         .... (up to 500 random combinations)
    1-0                   1-1                   ....
    0-0                   0-1                   ....
    1-0                   1-1                   ....
    1-1                   1-0                   ....
    -------------         -------------
    shared_count=1        shared_count=2        ....

    (shared_count is the number of 1-1 rows in each combination)

    non_shared_count=2    non_shared_count=2    ....

    (non_shared_count is the number of 1-0 or 0-1 rows)

    total_count=3         total_count=4         ....

    (total_count = shared_count + non_shared_count)

    (note that 0-0 rows are not counted)

    Thus, for each combination size (2 here), three output files will be generated.

    Output file 1 (shared_count_2.txt) contains the shared_count results of the 500 combinations, e.g.:

    (shared_count ) Group1-Group2 1
    (shared_count ) Group1-Group3 2
    ......

    Output file 2 (non_shared_count_2.txt) contains the non_shared_count results of the 500 combinations, e.g.:

    (non_shared_count ) Group1-Group2 2
    (non_shared_count ) Group1-Group3 2
    ......

    Output file 3 (total_count_2.txt) contains the total_count results of the 500 combinations, e.g.:

    (total_count ) Group1-Group2 3
    (total_count ) Group1-Group3 4
    ......

    .................................................................................................... ...................................................................
    With the same input file, the combinations of size 3 (that is, 3 groups combined) are handled the same way. Here 1-1-1 rows are counted as shared; rows such as 1-0-0, 0-0-1, 0-1-0, etc. are counted as non_shared; 0-0-0 rows are not counted.

    Output file 1 (shared_count_3.txt) contains the shared_count results of the 500 combinations, e.g.:

    (shared_count ) Group1-Group2-Group3 0
    (shared_count ) Group1-Group3-Group4 1
    ......

    Output file 2 (non_shared_count_3.txt) contains the non_shared_count results of the 500 combinations, e.g.:
    (non_shared_count ) Group1-Group2-Group3 4
    (non_shared_count ) Group1-Group3-Group4 3
    ......

    Output file 3 (total_count_3.txt) contains the total_count results of the 500 combinations, e.g.:

    (total_count ) Group1-Group2-Group3 4
    (total_count ) Group1-Group3-Group4 4
    ......
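
    The counting rule for any combination size can be sketched as below. It assumes that a row counts as shared only when every group in the combination has a 1, and as non_shared when at least one but not all of them has a 1 (so 1-1-0 is treated as non_shared, which matches the example numbers above). The function name combination_counts is only for illustration.

```python
def combination_counts(columns):
    """columns: a list of equal-length 0/1 lists, one per group in the combination.

    Returns (shared_count, non_shared_count, total_count).
    """
    size = len(columns)
    shared = 0
    non_shared = 0
    for row in zip(*columns):      # walk the combination row by row
        ones = sum(row)
        if ones == size:           # e.g. 1-1 or 1-1-1
            shared += 1
        elif ones > 0:             # e.g. 1-0, 0-1-0, 1-1-0
            non_shared += 1
        # rows of all zeros (0-0, 0-0-0) are not counted
    return shared, non_shared, shared + non_shared

# Columns of the example matrix in this post
group1 = [1, 0, 1, 1]
group2 = [0, 0, 0, 1]
group3 = [1, 1, 1, 0]
group4 = [1, 1, 0, 1]

print(combination_counts([group1, group2]))          # (1, 2, 3)
print(combination_counts([group1, group3]))          # (2, 2, 4)
print(combination_counts([group1, group2, group3]))  # (0, 4, 4)
print(combination_counts([group1, group3, group4]))  # (1, 3, 4)
```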

    .................................................................................................... ...............................................................
    and so on for combinations of size 4, 5, ...
    .................................................................................................... ...............................................................

    Thus, for each combination size there will be three output files. For (say) up to 50 combination sizes, there will be 50x3=150 output files.
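
    A rough sketch of producing the three files per combination size, using the file names and line format from the examples above. For simplicity it takes the first 500 combinations in itertools.combinations order rather than a random selection, which is an assumption on my part. The function name write_count_files and the temporary output directory are also just for illustration.

```python
import itertools
import os
import tempfile
from contextlib import ExitStack

KINDS = ("shared_count", "non_shared_count", "total_count")

def write_count_files(columns, size, out_dir, limit=500):
    """Write shared_count_N.txt, non_shared_count_N.txt and total_count_N.txt for N = size."""
    with ExitStack() as stack:
        # One open file per kind of count; ExitStack closes them all at the end.
        out = {
            kind: stack.enter_context(
                open(os.path.join(out_dir, "%s_%d.txt" % (kind, size)), "w"))
            for kind in KINDS
        }
        combos = itertools.islice(itertools.combinations(columns, size), limit)
        for combo in combos:
            rows = list(zip(*(columns[name] for name in combo)))
            shared = sum(1 for row in rows if all(row))
            non_shared = sum(1 for row in rows if any(row) and not all(row))
            counts = {
                "shared_count": shared,
                "non_shared_count": non_shared,
                "total_count": shared + non_shared,
            }
            label = "-".join(combo)
            for kind in KINDS:
                out[kind].write("(%s ) %s %d\n" % (kind, label, counts[kind]))

# Example matrix from this post, stored column by column
columns = {
    "Group1": [1, 0, 1, 1],
    "Group2": [0, 0, 0, 1],
    "Group3": [1, 1, 1, 0],
    "Group4": [1, 1, 0, 1],
}
out_dir = tempfile.mkdtemp()
for size in (2, 3):  # for the real data this would run up to the largest size wanted
    write_count_files(columns, size, out_dir)
print(sorted(os.listdir(out_dir)))
```

    Each combination is counted once and written to all three files in the same pass, so the matrix columns are only walked once per combination.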

    Any kind of help in solving this problem is highly appreciated. Thank you for your consideration.
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,711
    Rep Power
    480
    What has randomization got to do with this problem?
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0

    right


    Err, my bad. Actually, combinations of columns are what is needed here (up to 500), and you are right, randomization is not the proper term.
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,711
    Rep Power
    480
    Your "(count of 1-0/0-1 sharing)" is an exclusive or (XOR) of the two columns:

    (1 ^ 0) == (0 ^ 1) == 1

    so for a pair of columns the non_shared count is

    numpy.logical_xor((1,0,1,0),(1,1,0,0)).sum()

    which in J is the generalized dot product
    (+/ . (2b0110 b.)) /|:#:i.4
    sum DotProduct xor


    Taking columns as row vectors,
    shared_count = numpy.dot(A,B)


    for example
    numpy.dot((1,0,1,0),(1,1,0,0))
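
    The xor and dot expressions above cover pairs of columns. For combinations of three or more columns, one possible numpy formulation (a sketch, assuming shared means all ones in a row and non_shared means at least one but not all) uses all and any along the row axis. The name counts_for is hypothetical.

```python
import numpy as np

def counts_for(matrix, column_indices):
    """Return (shared, non_shared, total) for the chosen columns of a 0/1 matrix."""
    sub = matrix[:, column_indices].astype(bool)
    all_present = sub.all(axis=1)   # rows like 1-1 or 1-1-1
    any_present = sub.any(axis=1)   # rows with at least one 1
    shared = int(all_present.sum())
    non_shared = int((any_present & ~all_present).sum())
    return shared, non_shared, shared + non_shared

# Rows First..Fourth, columns Group1..Group4 from the first post
matrix = np.array([
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
])

print(counts_for(matrix, [0, 1]))     # (1, 2, 3)  Group1-Group2
print(counts_for(matrix, [0, 1, 2]))  # (0, 4, 4)  Group1-Group2-Group3
```

    For two columns this gives the same numbers as numpy.dot (shared) and numpy.logical_xor(...).sum() (non_shared).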
    Last edited by b49P23TIvg; February 21st, 2013 at 09:29 AM.
  8. #5
  9. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    20
    Rep Power
    0
    ???
