#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2008
    Posts
    11
    Rep Power
    0

    Regex to clean up some markup


    Hiya,

    I've got a bit of a problem and am hoping one of you regex wizards can help.

    I've got a script that highlights search terms in text with <mark></mark> tags.

    The problem is that when the search is for multiple words, we end up with nested mark tags... so...

    <mark>search</mark> is fine...

    <mark><mark>search</mark> <mark>phrase</mark></mark> isn't...

    neither is:

    <mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>

    This is because the script that marks the phrases matches both the phrase and each word.

    Is there a simple preg_replace that will match the inner nested tags and not the outermost ones?

    I had a good go at it but ended up being able to remove the outer ones and not the inners!

    I could do it by generating the literal strings to replace by counting the words, but surely there's a more elegant solution?

    Any help much appreciated.

    Cheers


    P.S.

    I know regex and markup shouldn't really be used together, but this markup is simple and 100% always the same pattern as it's generated by script, never by human hand.
  2. #2
  3. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    29
    Rep Power
    0
    You should use an HTML parser, not a regex. In general, it's not possible to parse nested tags with regular expressions: arbitrarily nested tags form a context-free language, while regular expressions in the strict sense can describe only regular languages. There are some tricks, such as recursive patterns, but generally regex is the wrong tool for this job.

    The best solution is to modify the original script, so that it will not generate the nested tags. Another idea is to parse the page using DOMDocument::loadHTML or html5lib, recursively look for the nested tags and replace the inner ones.
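
    The first suggestion above, changing the highlighter so it never emits nested tags, could look something like the sketch below. It assumes the highlighting can be done in one pass over plain text; `highlight_terms` and its parameters are hypothetical names, not part of the original script. Sorting the terms longest-first lets a phrase win over the words it contains, so each stretch of text is wrapped at most once.

    ```php
    <?php
    // Hypothetical helper: wrap each search term in <mark> tags in a single pass,
    // so a phrase and the words inside it never produce nested tags.
    function highlight_terms($text, array $terms)
    {
        // Drop empty terms; with nothing to match, return the text unchanged.
        $terms = array_values(array_filter($terms, 'strlen'));
        if (count($terms) === 0) {
            return $text;
        }

        // Longest terms first, so "search phrase" is tried before "search".
        usort($terms, function ($a, $b) {
            return strlen($b) - strlen($a);
        });

        // Escape each term for use inside a pattern delimited by "~".
        $alternatives = array_map(function ($term) {
            return preg_quote($term, '~');
        }, $terms);

        // One alternation, one replacement pass: matches cannot overlap or nest.
        $pattern = '~\b(?:' . implode('|', $alternatives) . ')\b~i';
        return preg_replace($pattern, '<mark>$0</mark>', $text);
    }

    // Prints: A <mark>search phrase</mark>, then <mark>search</mark> again.
    echo highlight_terms('A search phrase, then search again.', array('search', 'phrase', 'search phrase'));
    ```

    The `\b` word boundaries assume terms that begin and end with word characters; terms with leading or trailing punctuation would need a different boundary check.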
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2008
    Posts
    11
    Rep Power
    0
    I get ya, I know.

    The thing is that, really, I don't view it as markup until it's displayed. Up until then it's just a string. A string that's generated by a recursive script.

    I would have thought a regex pattern replace would be a better solution than recursively building literal strings and replacing those. Which is what I'll do if I can't work out a regex for it.

    You're right. In an ideal world the script wouldn't produce markup like that, but it does... and does whilst doing a number of other things which it does rather well. The alternative there is to take the part of the script out of there and write another specifically for the markup.

    I think both of these alternatives will end up making the whole thing take longer to execute.
  6. #4
  7. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    To offer a different perspective and potential solution:

    Is there a simple preg_replace that will match the inner nested tags and not the outermost ones?
    I'm a big fan of recursive regex, and, as abareplace pointed out, at first this sounds like a recursive regex problem. The issue is that a recursive regex for nested expressions will return an overall match, but you cannot grab each innermost match separately, because the regex engine won't create a variable number of capture groups on the fly.
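
    To make the "overall match" point concrete: for the cleanup asked about in the first post, the overall match may be all that is needed, since the goal is to keep one outer pair and drop the tags inside it. Below is a minimal sketch using PCRE recursion (`(?R)`). The function name `collapse_nested_marks` is made up for illustration, and the sketch assumes the tags are always exactly `<mark>` and `</mark>` and are properly balanced, as described in the first post.

    ```php
    <?php
    // Hypothetical helper: collapse each balanced group of nested <mark> tags
    // into a single outer pair.
    function collapse_nested_marks($html)
    {
        // <mark>, then any mix of text that contains no mark tag or a nested
        // balanced group ((?R) recurses into the whole pattern), then </mark>.
        // The possessive quantifiers (++ and *+) avoid needless backtracking.
        $pattern = '~<mark>(?:(?:(?!</?mark>).)++|(?R))*+</mark>~s';

        return preg_replace_callback($pattern, function ($m) {
            // $m[0] is one whole outermost group: strip every mark tag inside
            // it, then put a single pair back around the remaining text.
            $inner = str_replace(array('<mark>', '</mark>'), '', $m[0]);
            return '<mark>' . $inner . '</mark>';
        }, $html);
    }
    ```

    With the two-word example from the first post, `collapse_nested_marks('<mark><mark>search</mark> <mark>phrase</mark></mark>')` should come back as `<mark>search phrase</mark>`, while an already flat `<mark>search</mark>` is returned unchanged.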

    However, if I have understood the problem, there is a simple solution with lookaheads. Here is a PHP example that replaces the innermost text with its uppercased version. If you use a different language, you should be able to adapt the code as long as the regex flavor supports lookarounds.

    Input:

    <mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>


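    Code:
    <?php
    // Sample input: several layers of <mark> wrapped around the individual words.
    $string = '<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>';
    // Match only innermost pairs: the lookahead stops the dot from stepping onto any opening or closing mark tag.
    $regex = '~<mark>((?:(?!</?mark>).)*)</mark>~s';
    // Uppercase the captured text so the matched spans are easy to see.
    $string = preg_replace_callback($regex, function ($m) {
        return '<mark>' . strtoupper($m[1]) . '</mark>';
    }, $string);
    echo htmlentities($string);
    ?>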
    Output:
    <mark><mark><mark><mark>THIS</mark> <mark>IS</mark> <mark>THE</mark> <mark>PROBLEM</mark></mark></mark></mark>

    Let me know if I've understood the problem and if you have any questions!
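
    If the goal is the cleanup from the first post rather than a change to the innermost text, another option is to skip nesting-aware patterns and simply count depth while visiting each tag. This is only a sketch, assuming the string contains nothing but plain `<mark>` and `</mark>` tags around text; `flatten_marks` is a hypothetical name.

    ```php
    <?php
    // Hypothetical helper: keep only the outermost <mark>...</mark> pair of
    // each nested group by tracking how many mark elements are open.
    function flatten_marks($html)
    {
        $depth = 0; // number of <mark> elements currently open

        return preg_replace_callback('~</?mark>~', function ($m) use (&$depth) {
            if ($m[0] === '<mark>') {
                // Keep the tag only when it opens a new outermost element.
                $keep = ($depth === 0);
                $depth++;
                return $keep ? $m[0] : '';
            }
            if ($depth === 0) {
                // Stray closing tag with nothing open: leave it untouched.
                return $m[0];
            }
            $depth--;
            // Keep the closing tag only when it closes the outermost element.
            return $depth === 0 ? $m[0] : '';
        }, $html);
    }
    ```

    The callback runs once per tag, left to right, so the work is a single pass over the string. Unbalanced input is not repaired; the counter just follows whatever tags it sees.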

  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    1
    Rep Power
    0
    Originally Posted by ragax

    Let me know if I've understood the problem and if you have any questions!

    Thanks, this was helpful. Your script works great -
    Live PHP Version
  10. #6
  11. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,127
    Rep Power
    4304
    Originally Posted by php6
    Thanks, this was helpful. Your script works great -
    Live PHP Version
    Welcome to DevShed Forums, php6.

    Your link was stripped out, so here it is so people can see it: http://init.me/191605/regex-to-clean-up-some-markup

    New users are restricted from posting URLs until they have made 5 posts. You may need to get around this by leaving out the "http://" and putting a space before each ".". Yes this rule is annoying, but the administrators say it's necessary for limiting spam.

    Comments on this post

    • requinix agrees : lol "administrators say"
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Location
    London
    Posts
    40
    Rep Power
    15
    Stop, NOW!

    Please read the obligatory:

    Parsing HTML the cthulhu way

    and this beautiful Stack Overflow post

    Comments on this post

    • Kravvitz agrees
