#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2005
    Posts
    55
    Rep Power
    30

    Longest possible match without testing them all?


    I am trying to get the longest match from a regex with numerous wildcards that would match if a character in the string was accidentally omitted or repeated. I got something to "work" but I can't help thinking that there has to be a more efficient way.

    Code:
    $item = "the nation of Ferm wants me to check this information";
    $matcher = "f{0,2}e{0,2}r{0,2}m{0,2}a{0,2}t{0,2}i{0,2}o{0,2}n{0,2}";
    
    # matcher is "fermation" with each letter being,
    # for lack of better word, 'optional' and/or repeated.
    
                    while ($item =~ /($matcher)/gi){ my $i = $1;
                        if ($test < length $i){
                           $test = length $i;
                           $biggest = $i;
                           }
                        }
    right now, $biggest would be "rmation" ...I would actually like it to be "frmation" but if that makes this even less efficient, I would be happy with what it's giving.

    Efficiency is important because $item is generally quite a bit longer than this example, and this snippet is part of a bigger loop. Also, there might be an instance where the longest match occurs early in the string and it seems like a waste to keep testing after finding it.

    So, any improvement to the way the wild cards are constructed, the lengths are tested, or a better way to call the match in the first place, maybe a tag I never learned about, would be greatly appreciated.

    Thanks in advance.
    If you can read this, I'm probably very confused and disorientated.
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,054
    Rep Power
    9398
    1. How can you get "frmation" with that string?
    2. You can't possibly know whether you have the longest match until you find all the matches. Think about it. Computers aren't psychic yet.

    You'll have to get all the matches and scan for the longest one. And you can write a nicer looking expression as
    Code:
    (?=.)f?f?e?e?r?r?m?m?a?a?t?t?i?i?o?o?n?n?
    Note the assertion: so it can't match lots of empty strings.

    $item can be as long as you want - it's the size of $matcher that really matters. (Yes, both matter, but $matcher's matters more... Woah, tongue-twister.)

    Comments on this post

    • Arty Ziff agrees
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2005
    Posts
    55
    Rep Power
    30
    Originally Posted by requinix
    1. How can you get "frmation" with that string?
    Oh, I'm sure it could be done, somehow, but definitely not in one expression or while loop.
    Originally Posted by requinix
    2. You can't possibly know whether you have the longest match until you find all the matches. Think about it. Computers aren't psychic yet.
    Yeah, I know, but I was wondering if there was a flag like g,i,m,s,etc that would just, you know, do it, or maybe some other odd trick involving .*, a bracket, and some duct tape.

    Originally Posted by requinix
    You'll have to get all the matches and scan for the longest one. And you can write a nicer looking expression as
    Code:
    (?=.)f?f?e?e?r?r?m?m?a?a?t?t?i?i?o?o?n?n?
    Note the assertion: so it can't match lots of empty strings.
    Cool, I was getting a lot of those. That should speed things up a bit.

    Originally Posted by requinix
    $item can be as long as you want - it's the size of $matcher that really matters. (Yes, both matter, but $matcher's matters more... Woah, tongue-twister.)
    the limit I've already imposed was 32 characters for the $matcher phrase before wildcards... i think... I'll have to look that up. Thanks!!
    Last edited by flhu; May 29th, 2009 at 08:27 AM.
    If you can read this, I'm probably very confused and disorientated.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2007
    Posts
    765
    Rep Power
    929
    Tried benchmarking this. I wonder if study() would perform better on larger strings?

    Code:
                 Rate  question   studied  original    tricky lookahead
    question   9483/s        --      -26%      -26%      -56%      -72%
    studied   12800/s       35%        --       -0%      -41%      -62%
    original  12800/s       35%       -0%        --      -41%      -62%
    tricky    21692/s      129%       69%       69%        --      -36%
    lookahead 33698/s      255%      163%      163%       55%        --
    Code:
    use Benchmark qw( :all );
    
    $item = qr"the nation of Ferm wants me to check this information";
    $original = qr"f{0,2}e{0,2}r{0,2}m{0,2}a{0,2}t{0,2}i{0,2}o{0,2}n{0,2}";
    $question = qr"(?=.)f?f?e?e?r?r?m?m?a?a?t?t?i?i?o?o?n?n?";
    $lookahead = qr"(?=[fermation])f{0,2}e{0,2}r{0,2}m{0,2}a{0,2}t{0,2}i{0,2}o{0,2}n{0,2}";
    
    
    cmpthese( 40_000, {
        'original' => sub { original( ) },
        'question' => sub { question(  ) },
        'lookahead' => sub { lookahead(  ) },
        'tricky' => sub { tricky() },
        'studied' => sub { studied(  ) },
    }
    );
    
    sub original {
        my $biggest = 0;
        while ($item =~ /($original)/gi)
        {
            my $i = $1;
            if ($test < length $i){
               $test = length $i;
               $biggest = $i;
            }
        }
    }
    
    sub question {
        my $biggest = 0;
        while ($item =~ /($question)/gi)
        {
            my $i = $1;
            if ($test < length $i){
               $test = length $i;
               $biggest = $i;
            }
        }
    }
    
    sub lookahead {
        my $biggest = 0;
        while ($item =~ /($lookahead)/gi)
        {
            my $i = $1;
            if ($test < length $i){
               $test = length $i;
               $biggest = $i;
            }
        }
    }
    
    sub tricky {
        my $biggest = 1;
        while ($item =~ /(?=[fermation]{$biggest})($exp)/gi)
        {
            my $i = $1;
            if ($test < length $i){
               $test = length $i;
               $biggest = $i;
            }
        }
    }
    
    sub studied {
        my $biggest = 1;
        study $item;
        while ($item =~ /($original)/gi)
        {
            my $i = $1;
            if ($test < length $i){
               $test = length $i;
               $biggest = $i;
            }
        }
    }
    sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);
  8. #5
  9. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,054
    Rep Power
    9398
    Perl's study() is a way of compiling a regex. Doing so gives a huge advantage when the pattern will be used many times.

    And just so you know (I'm sure you do already) the lookahead in your $lookahead only checks that the next character is in that set. So it doesn't ensure (a) that they all are those characters, (b) that there's only 0, 1, or 2 of each character, and (c) that the characters come in the right order.
    Similar issues with the tricky one.

    Since tricky and lookahead are solutions to a different problem I guess your original expression is the best.

    (Seems the a?a?b?b? style is the worst. Didn't expect that.)
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2005
    Posts
    55
    Rep Power
    30
    Cool. Thanks for the benchmarks. After I posted, I also tried to remove all the words in $item that couldn't possibly match before hand... $item =~ s/\b[^$match ]+\b//gi before placing wildcards... and got pretty much the same results as the lookahead. I also tried study at it appeared there was zero sum gain.

    Thinking about the matches I'm getting from this, if the defect was in the middle of the word, it would be so much less in length than if it was at the beginning or the end. And I really would need to have all the matching characters to make the length variable actually mean anything.

    After I tried a bunch of other solutions I'm leaning towards going back to the one I had before which was to replace the 1st character with a wildcard .{0,2}, test, reset the variable, replace the second character, test, reset, replace, etc... ...Which doesn't get the over-all length of matching characters and looks ugly like a processing hog should, but at least matches one character omitted or repeated or a repeated typo, and for some reason didn't take that much longer.

    Thanks!
    If you can read this, I'm probably very confused and disorientated.

IMN logo majestic logo threadwatch logo seochat tools logo