#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    19
    Rep Power
    0

    Need help with perl program.


    I have a file which contains this data:

    chr1 10 12
    chr1 10 15
    chr1 10 16


    Output to be generated:

    chr1 10 12 3
    chr1 12 15 2
    chr1 15 16 1

    That means it calculates how many times each stretch of coverage appears (is overlapped) in the file. The stretch 10-12 is covered 3 times (by 10-12, 10-15 and 10-16). Similarly, since 12-15 is overlapped by 10-15 and 10-16, the count displayed should be 2.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    776
    Rep Power
    495
    Is your file big and are your ranges large?

    I am asking because the simple solution that comes to my mind consists in listing all integers within the ranges, counting their occurrences and, at the end, summarizing the results. This is OK if the ranges are small to medium size, but not for very large ranges.
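The counting idea described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the sub name `coverage_by_position` is made up here, ranges are assumed to be half-open (`start` included, `end` excluded), which is what makes the sample output in post #1 come out, and a hash is used for the counts so that the size of the coordinates does not matter.

```perl
use strict;
use warnings;

# Count how many ranges cover each integer position, then merge runs of
# consecutive positions that share the same count into [start, end, count].
# Assumption: ranges are half-open, i.e. [start, end).
sub coverage_by_position {
    my @ranges = @_;                 # list of [start, end] pairs
    my %count;                       # position => number of covering ranges
    for my $r (@ranges) {
        my ($start, $end) = @$r;
        $count{$_}++ for $start .. $end - 1;
    }

    my @segments;
    my ($seg_start, $prev, $depth);
    for my $pos (sort { $a <=> $b } keys %count) {
        if (defined $prev && $pos == $prev + 1 && $count{$pos} == $depth) {
            $prev = $pos;            # same count, adjacent: extend the segment
            next;
        }
        # Close the previous segment (if any) and start a new one.
        push @segments, [$seg_start, $prev + 1, $depth] if defined $prev;
        ($seg_start, $prev, $depth) = ($pos, $pos, $count{$pos});
    }
    push @segments, [$seg_start, $prev + 1, $depth] if defined $prev;
    return @segments;
}

# The sample data from post #1.
print join(' ', 'chr1', @$_), "\n"
    for coverage_by_position([10, 12], [10, 15], [10, 16]);
```

Memory and run time grow with the total number of positions covered, which is why this only suits small to medium ranges.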
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    19
    Rep Power
    0
    Originally Posted by Laurent_R
    Is your file big and are your ranges large?

    I am asking because the simple solution that comes to my mind consists in listing all integers within the ranges, counting their occurrences and, at the end, summarizing the results. This is OK if the ranges are small to medium size, but not for very large ranges.
    The range is not that big. You can have a look at it.
    For example:

    241525932 241526132 (range is between 1000-2000)


    I have created a hash with the chr as key and the start and end of each range as value. I am trying to figure out a way to get the overlap.

    Please feel free to ask for a more detailed explanation.
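For readers following along, a hash keyed on the chromosome name might be built along these lines. The poster's own code is not shown in the thread, so this is a guess at the general shape: `parse_ranges` is a hypothetical helper name, and the input is assumed to be whitespace-separated lines whose first three columns are chromosome, start and end.

```perl
use strict;
use warnings;

# Group ranges by chromosome: chr => [ [start, end], ... ].
# parse_ranges is a hypothetical helper; it takes lines that were already
# read from the file and ignores any columns after the third.
sub parse_ranges {
    my @lines = @_;
    my %ranges_by_chr;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;          # skip blank lines and comments
        my ($chr, $start, $end) = split ' ', $line;
        next unless defined $end;                # skip lines with < 3 columns
        push @{ $ranges_by_chr{$chr} }, [$start, $end];
    }
    return \%ranges_by_chr;
}

my $ranges = parse_ranges("chr1 10 12\n", "chr1 10 15\n", "chr1 10 16\n");
printf "%s: %d ranges\n", $_, scalar @{ $ranges->{$_} } for sort keys %$ranges;
```

Storing an array of `[start, end]` pairs per chromosome, rather than a single pair, keeps every range when the same chromosome appears on several lines.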
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    776
    Rep Power
    495
    OK, very interesting information, good that I asked. Given the size of your numbers, the first solution I was thinking of is gone, since an array indexed on your numbers is more or less excluded (it would probably blow up your memory). You might not realize, but when you give an example, it has to be somewhat realistic. You gave an example where the numbers are in the 10-20 range, and your actual numbers are in the 100 million range. The initial solution I was thinking of is thus not possible, not because of the range size (very manageable), but because your numbers themselves are very large.

    Having said that, while an array is probably no longer possible with such large numbers (or, at best, very inefficient), we can still use a hash. It is perhaps slightly less practical in this context, but still quite workable in principle.

    Can you provide a realistic example of your data before I come up with another solution that might also not be workable with real data?
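One hash-based possibility that does not depend on how large the coordinates are is to record only the range boundaries and sweep over them in order. This is a different technique from the per-position counting mentioned earlier in the thread, and it is offered only as a sketch: the sub name `coverage_by_boundaries` is invented here, and ranges are again assumed to be half-open (`start` included, `end` excluded).

```perl
use strict;
use warnings;

# Boundary sweep: +1 where a range starts, -1 where it ends, then walk the
# sorted boundaries while keeping a running depth.
# Assumption: ranges are half-open, i.e. [start, end).
sub coverage_by_boundaries {
    my @ranges = @_;                 # list of [start, end] pairs
    my %delta;                       # coordinate => net change in depth
    for my $r (@ranges) {
        my ($start, $end) = @$r;
        $delta{$start}++;
        $delta{$end}--;
    }

    my @segments;
    my ($depth, $prev) = (0, undef);
    for my $pos (sort { $a <=> $b } keys %delta) {
        next if $delta{$pos} == 0;   # depth unchanged here: no new segment
        push @segments, [$prev, $pos, $depth] if defined $prev && $depth > 0;
        $depth += $delta{$pos};
        $prev = $pos;
    }
    return @segments;
}

# The sample data from post #1, plus coordinates of the size given in post #3.
print join(' ', 'chr1', @$_), "\n"
    for coverage_by_boundaries([10, 12], [10, 15], [10, 16]);
print join(' ', 'chr1', @$_), "\n"
    for coverage_by_boundaries([241525932, 241526132]);
```

The hash holds at most two entries per range, so the cost depends on the number of ranges rather than on their length or position. Ranges would need to be processed one chromosome at a time.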
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    19
    Rep Power
    0
    Originally Posted by Laurent_R
    OK, very interesting information, good that I asked. Given the size of your numbers, the first solution I was thinking of is gone, since an array indexed on your numbers is more or less excluded (it would probably blow up your memory). You might not realize, but when you give an example, it has to be somewhat realistic. You gave an example where the numbers are in the 10-20 range, and your actual numbers are in the 100 million range. The initial solution I was thinking of is thus not possible, not because of the range size (very manageable), but because your numbers themselves are very large.

    Having said that, while an array is probably no longer possible with such large numbers (or, at best, very inefficient), we can still use a hash. It is perhaps slightly less practical in this context, but still quite workable in principle.

    Can you provide a realistic example of your data before I come up with another solution that might also not be workable with real data?

    Here is a link to an example of the kind of data I am dealing with.

    www.genome.ucsc.edu/cgi-bin/hgTables?hgsid=351909771&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_knownGene&hgta_ctDesc=table+browser+query+on+knownGene&hgta_ctVis=pack&hgta_ctUrl=&fbQual=upstreamAll&fbUpBases=200&fbExonBases=0&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED

    We are concerned with the first three columns.
