#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2006
    Posts
    19
    Rep Power
    0

    Value from one file, range from another - determine if value is in range


    Hi there,

    Ok so I have 2 files, one contains a list of values in the form:

    1 23
    1 567
    1 345670
    2 45
    etc...

    The second contains ranges in the the following form:

    1 12 35
    1 435 675
    2 23 67
    etc...


    If the value from file1 falls within the range from file2 AND the number in the first column matches in both files (i.e. 23 falls within range 12-35 and the 1 matches), then this is noted.

    So I've written code that works using nested foreach loops but it's not very quick. My smaller set of data has approx 200k values in file1 and 100k ranges in file2. The next data set will contain approximately 17 million values in file1. That will take days to run using my current code.

    I was just wondering if anyone had some ideas on how to speed it?

    Thanks in advance!
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,947
    Rep Power
    1225
    Please post your code.

    My assumption is that you're storing both files in separate arrays and you're looping over over one or both of those arrays multiple time.

    Using a hash instead of arrays will speed things up.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    10
    Rep Power
    0
    Originally Posted by jim_lad
    Hi there,

    Ok so I have 2 files, one contains a list of values in the form:

    1 23
    1 567
    1 345670
    2 45
    etc...

    The second contains ranges in the the following form:

    1 12 35
    1 435 675
    2 23 67
    etc...


    If the value from file1 falls within the range from file2 AND the number in the first column matches in both files (i.e. 23 falls within range 12-35 and the 1 matches), then this is noted.

    Here is a solution using Set::IntSpan which is optimized for ranges of numbers. (I used Inline::Files also but that was just for this presentation - to create the 2 sample files you described).
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Set::IntSpan;
    use Inline::Files;
    
    my %span;
    
    while (<DATA1>) {
    	my ($id, $lo, $hi) = split;
    	if (exists $span{$id}) {
    		$span{$id} += "$lo-$hi";
    	}
    	else {
    		$span{$id} = Set::IntSpan->new("$lo-$hi");	
    	}
    }
    
    while (<DATA2>) {
    	my ($id, $val) = split;
    	if ($span{$id}->member($val)) {
    		print "$val contained in set $id\n";	
    	}
    	else {
    		print "$val not contained in set $id\n";		
    	}
    }
    
    __DATA1__
    1 12 35
    1 435 675
    2 23 67
    __DATA2__
    1 23
    1 567
    1 345670
    2 45
    The output produced was:
    Code:
    23 contained in set 1
    567 contained in set 1
    345670 not contained in set 1
    45 contained in set 2
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    How large can your ranges be?
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2006
    Posts
    19
    Rep Power
    0
    Originally Posted by FishMonger
    Please post your code.

    My assumption is that you're storing both files in separate arrays and you're looping over over one or both of those arrays multiple time.

    Using a hash instead of arrays will speed things up.
    Firstly, sorry for the delay in responding - been busy with other things.

    Your assumption is correct regarding my current code, I didn't post it directly because I have simplified certain aspects of the problem to make it more easily explainable. Looking at my post again though, I agree that I should have posted some pseudo code at least. So here goes...

    Read file1 and file2 into arrays.

    Code:
    foreach(@values){
    #get first column number
    #get second column number
    
    foreach(@ranges){ #get first column number #get start of range #get end of range
    if(first column number from values = first column value from ranges AND second column number from values is between ranges numbers){ #Print out something }
    else{ print out something different }
    }
    }

    So anyway, I thought about using a hash, but I couldn't decide what key to use that would actually make it quicker. I'm pretty rubbish at using hashes though to be fair and would appreciate any tips.
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2006
    Posts
    19
    Rep Power
    0
    Hi Chris,

    That's a nice module, but it doesn't actually solve the problem I've put forward. But I'll take a look at it and see if it can be adapted - thanks.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,947
    Rep Power
    1225
    Originally Posted by jim_lad
    Hi Chris,

    That's a nice module, but it doesn't actually solve the problem I've put forward. But I'll take a look at it and see if it can be adapted - thanks.
    How does Chris's solution fail to solve the problem? As far as I can see it does exactly what you said you needed. The Set::IntSpan module is new to me and seems to be a good solution.

    Here's another approach without that module (but I'll borrow the use of Inline::Files).

    Code:
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Inline::Files;
    
    my %range;
    
    while (<DATA1>) {
        chomp;
        my ($key, @range) = split;
        push @{ $range{$key} }, \@range;
    }
    
    LINE: while (<DATA2>) {
        chomp;
        my ($key, $value) = split;
        next LINE unless exists $range{$key};
        
        foreach my $range (@{ $range{$key} }) {
            if ($value >= $range->[0] and $value <= $range->[1]) {
                print "$value is in set $key between range $range->[0] .. $range->[1]\n";
                next LINE;
            }
        }
    }
    
    
    __DATA1__
    1 12 35
    1 435 675
    2 23 67
    __DATA2__
    1 23
    1 567
    1 345670
    2 45
    Output of jim_lad.pl
    23 is in set 1 between range 12 .. 35
    567 is in set 1 between range 435 .. 675
    45 is in set 2 between range 23 .. 67
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2006
    Posts
    19
    Rep Power
    0
    Hi there,

    Apologies, I just scan read Chris' solution - it does work, it was just the phrasing of the print out statement led me to think it was comparing something else.

    Thanks for the hash version, it's similar to what I'd been trying except I'd read in the file1 to the first hash, your way round makes much more sense - thanks!

IMN logo majestic logo threadwatch logo seochat tools logo