#1

    Need Help optimizing regex


    Hello,
    I am writing a kind of glossary script: the kind that links words from a list and shows a popup with an explanation.

    The important part is the regex that finds the keywords in the text.
    Example (just using bold tags to show the parsed output):

    PHP Code:
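    $word = 'test';
    $textstring = 'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.';

    // Bold tags stand in for the real glossary link markup.
    $newword = '<b>' . $word . '</b>';

    // The trailing lookahead skips keywords that sit inside a tag (before a closing ">").
    $regex = '/\b(?!<.*?)' . trim($word) . '(?![^<>]*?>)\b/siU';

    echo preg_replace($regex, $newword, $textstring);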
    This works fine, and the output is:

    this is a <b>test</b> <a class="test" href="www.test.com">some <b>test</b> examples</a> and another <b>test</b>.

    Up to here everything is OK.
    --------------

    Now what I am trying to do is exclude links completely: not only the inside of the anchor tag itself, but also the text between the opening and closing tags (the innerHTML of the anchor).
    To explain, I want the output to be:

    this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
    (the text "some test examples" is not parsed.)

    I have tried nearly everything, for example:
    $regex = '/\b(?!<a.*?)|(?!<.*?)'.trim($word).'(?![^<>]*?>)|(?![^<>]*?</a>)\b/siU';
    but obviously it doesn't work.
    Does anybody have any ideas to point me in the right direction?

    Help would be much appreciated.

    Luc
    Last edited by Luciano; August 24th, 2012 at 06:35 AM.
#2
    /*
    I generally try to avoid using regex to parse HTML, as there are so many different variables to consider. Could you use the DOMDocument class to help isolate the text nodes?
    */
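
A minimal sketch of the DOMDocument idea suggested above, assuming the post body is an HTML fragment and the keyword is plain text. The function name `highlightOutsideLinks` and the wrapper id `glossary-root` are made up for illustration, and the bold tags stand in for the real glossary markup, as in the original example. Character-encoding handling is left out, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: wrap $word in <b> tags, but only in text nodes
// that are not inside an <a> element.
function highlightOutsideLinks(string $html, string $word): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    // Wrap the fragment in a container so it can be pulled out again later.
    $doc->loadHTML('<div id="glossary-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';

    // Text nodes under the container that have no <a> ancestor.
    $query = '//div[@id="glossary-root"]//text()[not(ancestor::a)]';
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        // Odd-indexed parts are the captured keyword, even-indexed parts the text around it.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $bold = $doc->createElement('b');
                $bold->appendChild($doc->createTextNode($part));
                $fragment->appendChild($bold);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightOutsideLinks(
    'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.',
    'test'
), "\n";
```

Because the keyword is only searched for inside text nodes, attribute values such as class="test" are never touched, and the `not(ancestor::a)` filter leaves the link text alone.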
#3
    Hi benno,
    thank you for your reply.

    In this case it has to be done this way, because it is integrated into forum software (SMF 2.02). Any other solution would require modifications that cannot be installed and uninstalled automatically.
    $textstring is the message (post).
    As it is a live, working forum, I want to change as little as possible.
    I wrote the existing regex and it works fine. The only problem is when the text of a link contains a keyword. This only happens when a user posts a BBCode text link such as [ url=http:mylink] keyword [/url], which is why I want to exclude the keyword from parsing there.

    That is why I do it this way.
    It is actually pretty fast, and 10 to 15 glossary terms per post work fine.

    Luc

    PS: The parsing could of course be done by replacing all links in the message with a coded string and restoring them after the parsing, but that would use many more resources.
    Last edited by Luciano; August 24th, 2012 at 07:49 AM.
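
For comparison, the placeholder approach mentioned in the PS could look roughly like the sketch below. The function name `highlightWithPlaceholders` and the NUL-delimited token format are assumptions made for illustration, not part of SMF or of the original script, and no claim is made here about how its cost compares with a single regex.

```php
<?php
// Hypothetical helper: mask links, replace keywords, then restore the links.
function highlightWithPlaceholders(string $text, string $word): string
{
    $links = [];

    // Step 1: swap every <a ...>...</a> for a token that cannot occur in normal text.
    $masked = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>~is',
        function (array $m) use (&$links): string {
            $token = "\x00" . count($links) . "\x00"; // assumed token format
            $links[$token] = $m[0];
            return $token;
        },
        $text
    );

    // Step 2: replace the keyword, still skipping matches inside any remaining tag.
    $masked = preg_replace(
        '/\b' . preg_quote($word, '/') . '\b(?![^<>]*>)/i',
        '<b>$0</b>',
        $masked
    );

    // Step 3: put the original links back.
    return strtr($masked, $links);
}

echo highlightWithPlaceholders(
    'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.',
    'test'
), "\n";
// this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
```

One caveat of this token format: the tokens contain digits, so a purely numeric glossary term could match inside a token. A real implementation would need a token that no keyword can match.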
#4
    Try this regex:
    Code:
    \b(test)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
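
For reference, this is roughly how the pattern above could be dropped into the original example. It is a sketch that assumes `/` delimiters, the `i` flag, and `preg_quote()` around the keyword, none of which are stated in the post. The `U` flag from the original script is left out because it swaps greedy and lazy quantifiers.

```php
<?php
$word = 'test';
$textstring = 'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.';

$regex = '/\b(' . preg_quote(trim($word), '/') . ')\b'
       . '(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)' // not followed by a closing </a...> or </h...> tag
       . '(?![^<>]*>)/i';                        // not inside a tag

$result = preg_replace($regex, '<b>$1</b>', $textstring);
echo $result, "\n";
// this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
```

Without the `s` flag the dot does not match line breaks, so link text that spans several lines would need that flag added.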
#5
    WoW!!!
    Thank you a bunch...
    That works exactly as expected!
    (But I must admit I don't understand every part of it:
    why the h in <\/?[ha], for example?)
    I tried it without the h, and it also works.

    But on my live board I wouldn't dare remove the h until I know exactly what I am doing.

    Thank you again for helping so fast and in such an extraordinary way!

    Luc
#6
    I can't remember why I had 'h' in the list; it must have been something in my data that needed it at the time I created the regex. The regex description is below:
    Code:
    ## Ignore if search item found in HTML tag
    
    \b(test)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
    
    Assert position at a word boundary
    Match the regular expression below and capture its match into backreference number 1
       Match the characters 'test' literally
    Assert position at a word boundary
    Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)
       Match the regular expression below: (?:(?!<\/?[ha].*?>).)*
          Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
          Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!<\/?[ha].*?>)
             Match the character '<' literally
             Match the character '/' literally
                Between zero and one times, as many times as possible, giving back as needed (greedy)
             Match a single character present in the list 'ha'
             Match any single character that is not a line break character
                Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
             Match the character '>' literally
          Match any single character that is not a line break character
       Match the character '<' literally
       Match the character '/' literally
       Match a single character present in the list 'ha'
       Match any single character that is not a line break character
          Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
       Match the character '>' literally
    Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?![^<>]*>)
       Match a single character NOT present in the list '<>'
          Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       Match the character '>' literally
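
One way to see what the 'h' in [ha] changes is to run both variants over text that contains a heading. This is only a sketch with made-up sample input and a hypothetical helper `highlight()`; it does not establish why the 'h' was originally there. With [ha], a closing tag such as </h2> ahead of the keyword also blocks the match, so keywords inside headings are skipped as well as keywords inside links.

```php
<?php
// Hypothetical helper: $tagChars is the list of first letters of the tag
// names whose contents should be left alone ('ha' in the posted regex).
function highlight(string $text, string $word, string $tagChars): string
{
    $class = '[' . $tagChars . ']';
    $regex = '/\b(' . preg_quote($word, '/') . ')\b'
           . '(?!(?:(?!<\/?' . $class . '.*?>).)*<\/' . $class . '.*?>)'
           . '(?![^<>]*>)/i';
    return preg_replace($regex, '<b>$1</b>', $text);
}

$sample = 'a test <h2>test heading</h2> <a href="#">test link</a> final test';

echo highlight($sample, 'test', 'ha'), "\n";
// a <b>test</b> <h2>test heading</h2> <a href="#">test link</a> final <b>test</b>

echo highlight($sample, 'test', 'a'), "\n";
// a <b>test</b> <h2><b>test</b> heading</h2> <a href="#">test link</a> final <b>test</b>
```

Note that the class matches on the first letter of the tag name only, so [ha] also covers tags such as <abbr> or <address>, not just <a> and the heading tags.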
