Regex Programming
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming Languages - MoreRegex Programming

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old May 28th, 2009, 05:25 PM
flhu flhu is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Mar 2005
Posts: 55 flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 1 Day 12 m 16 sec
Reputation Power: 29
Longest possible match without testing them all?

I am trying to get the longest match from a regex with numerous wildcards that would match if a character in the string was accidentally omitted or repeated. I got something to "work" but I can't help thinking that there has to be a more efficient way.

Code:
$item = "the nation of Ferm wants me to check this information";
$matcher = "f{0,2}e{0,2}r{0,2}m{0,2}a{0,2}t{0,2}i{0,2}o{0,2}n{0,2}";

# matcher is "fermation" with each letter being,
# for lack of better word, 'optional' and/or repeated.

                while ($item =~ /($matcher)/gi){ my $i = $1;
                    if ($test < length $i){
                       $test = length $i;
                       $biggest = $i;
                       }
                    }


right now, $biggest would be "rmation" ...I would actually like it to be "frmation" but if that makes this even less efficient, I would be happy with what it's giving.

Efficiency is important because $item is generally quite a bit longer than this example, and this snippet is part of a bigger loop. Also, there might be an instance where the longest match occurs early in the string and it seems like a waste to keep testing after finding it.

So, any improvement to the way the wild cards are constructed, the lengths are tested, or a better way to call the match in the first place, maybe a tag I never learned about, would be greatly appreciated.

Thanks in advance.
__________________
If you can read this, I'm probably very confused and disorientated.

Reply With Quote
  #2  
Old May 28th, 2009, 07:38 PM
requinix's Avatar
requinix requinix is offline
Still alive
Dev Shed God 16th Plane (12500 - 12999 posts)
 
Join Date: Mar 2007
Location: Washington, USA
Posts: 12,872 requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)  Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1
Time spent in forums: 5 Months 1 Week 5 Days 6 h 51 m 41 sec
Reputation Power: 8977
Send a message via AIM to requinix Send a message via MSN to requinix Send a message via Yahoo to requinix Send a message via Google Talk to requinix
1. How can you get "frmation" with that string?
2. You can't possibly know whether you have the longest match until you find all the matches. Think about it. Computers aren't psychic yet.

You'll have to get all the matches and scan for the longest one. And you can write a nicer looking expression as
Code:
(?=.)f?f?e?e?r?r?m?m?a?a?t?t?i?i?o?o?n?n?

Note the assertion: so it can't match lots of empty strings.

$item can be as long as you want - it's the size of $matcher that really matters. (Yes, both matter, but $matcher's matters more... Woah, tongue-twister.)

Reply With Quote
  #3  
Old May 28th, 2009, 07:57 PM
flhu flhu is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Mar 2005
Posts: 55 flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 1 Day 12 m 16 sec
Reputation Power: 29
Quote:
Originally Posted by requinix
1. How can you get "frmation" with that string?

Oh, I'm sure it could be done, somehow, but definitely not in one expression or while loop.
Quote:
Originally Posted by requinix
2. You can't possibly know whether you have the longest match until you find all the matches. Think about it. Computers aren't psychic yet.

Yeah, I know, but I was wondering if there was a flag like g,i,m,s,etc that would just, you know, do it, or maybe some other odd trick involving .*, a bracket, and some duct tape.

Quote:
Originally Posted by requinix
You'll have to get all the matches and scan for the longest one. And you can write a nicer looking expression as
Code:
(?=.)f?f?e?e?r?r?m?m?a?a?t?t?i?i?o?o?n?n?

Note the assertion: so it can't match lots of empty strings.

Cool, I was getting a lot of those. That should speed things up a bit.

Quote:
Originally Posted by requinix
$item can be as long as you want - it's the size of $matcher that really matters. (Yes, both matter, but $matcher's matters more... Woah, tongue-twister.)

the limit I've already imposed was 32 characters for the $matcher phrase before wildcards... i think... I'll have to look that up. Thanks!!

Last edited by flhu : May 29th, 2009 at 08:27 AM.

Reply With Quote
  #4  
Old May 29th, 2009, 09:08 PM
OmegaZero OmegaZero is offline
Contributing User
Dev Shed Novice (500 - 999 posts)
 
Join Date: May 2007
Posts: 738 OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level)OmegaZero User rank is General (90000 - 100000 Reputation Level) 
Time spent in forums: 3 Weeks 5 Days 6 h 3 m
Reputation Power: 928
Tried benchmarking this. I wonder if study() would perform better on larger strings?

Code:
             Rate  question   studied  original    tricky lookahead
question   9483/s        --      -26%      -26%      -56%      -72%
studied   12800/s       35%        --       -0%      -41%      -62%
original  12800/s       35%       -0%        --      -41%      -62%
tricky    21692/s      129%       69%       69%        --      -36%
lookahead 33698/s      255%      163%      163%       55%        --


Code:
use Benchmark qw( :all );

$item = qr"the nation of Ferm wants me to check this information";
$original = qr"f{0,2}e{0,2}r{0,2}m{0,2}a{0,2}t{0,2}i{0,2}o{0,2}n{0,2}";
$question = qr"(?=.)f?f?e?e?r?r?m?m?a?a?t?t?i?i?o?o?n?n?";
$lookahead = qr"(?=[fermation])f{0,2}e{0,2}r{0,2}m{0,2}a{0,2}t{0,2}i{0,2}o{0,2}n{0,2}";


cmpthese( 40_000, {
    'original' => sub { original( ) },
    'question' => sub { question(  ) },
    'lookahead' => sub { lookahead(  ) },
    'tricky' => sub { tricky() },
    'studied' => sub { studied(  ) },
}
);

sub original {
    my $biggest = 0;
    while ($item =~ /($original)/gi)
    {
        my $i = $1;
        if ($test < length $i){
           $test = length $i;
           $biggest = $i;
        }
    }
}

sub question {
    my $biggest = 0;
    while ($item =~ /($question)/gi)
    {
        my $i = $1;
        if ($test < length $i){
           $test = length $i;
           $biggest = $i;
        }
    }
}

sub lookahead {
    my $biggest = 0;
    while ($item =~ /($lookahead)/gi)
    {
        my $i = $1;
        if ($test < length $i){
           $test = length $i;
           $biggest = $i;
        }
    }
}

sub tricky {
    my $biggest = 1;
    while ($item =~ /(?=[fermation]{$biggest})($exp)/gi)
    {
        my $i = $1;
        if ($test < length $i){
           $test = length $i;
           $biggest = $i;
        }
    }
}

sub studied {
    my $biggest = 1;
    study $item;
    while ($item =~ /($original)/gi)
    {
        my $i = $1;
        if ($test < length $i){
           $test = length $i;
           $biggest = $i;
        }
    }
}
__________________
sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);

Reply With Quote
  #5  
Old May 30th, 2009, 01:38 AM
requinix's Avatar
requinix requinix is offline
Still alive
Dev Shed God 16th Plane (12500 - 12999 posts)
 
Join Date: Mar 2007
Location: Washington, USA
Posts: 12,872 requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)  Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1
Time spent in forums: 5 Months 1 Week 5 Days 6 h 51 m 41 sec
Reputation Power: 8977
Send a message via AIM to requinix Send a message via MSN to requinix Send a message via Yahoo to requinix Send a message via Google Talk to requinix
Perl's study() is a way of compiling a regex. Doing so gives a huge advantage when the pattern will be used many times.

And just so you know (I'm sure you do already) the lookahead in your $lookahead only checks that the next character is in that set. So it doesn't ensure (a) that they all are those characters, (b) that there's only 0, 1, or 2 of each character, and (c) that the characters come in the right order.
Similar issues with the tricky one.

Since tricky and lookahead are solutions to a different problem I guess your original expression is the best.

(Seems the a?a?b?b? style is the worst. Didn't expect that.)

Reply With Quote
  #6  
Old June 1st, 2009, 08:58 AM
flhu flhu is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Mar 2005
Posts: 55 flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level)flhu User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 1 Day 12 m 16 sec
Reputation Power: 29
Cool. Thanks for the benchmarks. After I posted, I also tried to remove all the words in $item that couldn't possibly match before hand... $item =~ s/\b[^$match ]+\b//gi before placing wildcards... and got pretty much the same results as the lookahead. I also tried study at it appeared there was zero sum gain.

Thinking about the matches I'm getting from this, if the defect was in the middle of the word, it would be so much less in length than if it was at the beginning or the end. And I really would need to have all the matching characters to make the length variable actually mean anything.

After I tried a bunch of other solutions I'm leaning towards going back to the one I had before which was to replace the 1st character with a wildcard .{0,2}, test, reset the variable, replace the second character, test, reset, replace, etc... ...Which doesn't get the over-all length of matching characters and looks ugly like a processing hog should, but at least matches one character omitted or repeated or a repeated typo, and for some reason didn't take that much longer.

Thanks!

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreRegex Programming > Longest possible match without testing them all?

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap