PHP - Regex to clean up some markup
February 28th, 2012, 04:43 AM
Hiya,
I've got a bit of a problem and am hoping one of you regex wizards can help.
I've got a script that highlights search terms in text with <mark></mark> tags.
The problem is that when the search is for multiple words, we end up with nested mark tags... so...
<mark>search</mark> is fine...
<mark><mark>search</mark> <mark>phrase</mark></mark> isn't...
neither is:
<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>
This is because the script that marks the phrases matches both the phrase and each word.
Is there a simple preg_replace that will match the inner nested tags and not the outermost ones?
I had a good go at it but ended up being able to remove the outer ones and not the inner ones!
I could do it by generating the literal strings to replace by counting the words, but surely there's a more elegant solution?
Any help much appreciated.
Cheers
P.S.
I know regex and markup shouldn't really be used together, but this markup is simple and 100% always the same pattern as it's generated by script, never by human hand.

February 28th, 2012, 11:37 PM
You should use an HTML parser, not a regex. In general, it's not possible to parse arbitrarily nested tags with classical regular expressions: what you describe is a context-free language, and regular expressions in the formal sense recognize only regular languages. There are some tricks, such as PCRE recursive patterns, but generally regex is the wrong tool for this job.
The best solution is to modify the original script so that it does not generate the nested tags. Another idea is to parse the page using DOMDocument::loadHTML or html5lib, recursively look for the nested tags, and replace the inner ones.
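For reference, the recursive-pattern trick mentioned in this reply could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name collapseMarks is made up, and the pattern assumes the only tags in the text are balanced <mark> and </mark> tags with no other "<" characters inside them (as the original poster describes). It matches each outermost element in one go and then removes the mark tags inside the match.

```php
<?php
// Hypothetical helper (not from the thread): collapse each outermost
// <mark>...</mark>, including anything nested inside it, into one pair.
function collapseMarks(string $text): string
{
    // (?R) recurses into the whole pattern, so a single match spans one
    // complete, balanced outermost element. [^<]++ consumes plain text.
    $outermost = '~<mark>(?:[^<]++|(?R))*+</mark>~';

    $result = preg_replace_callback($outermost, function (array $m) {
        // Strip every mark tag found inside the match, then re-wrap once.
        $inner = str_replace(['<mark>', '</mark>'], '', $m[0]);
        return '<mark>' . $inner . '</mark>';
    }, $text);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo collapseMarks('<mark><mark>search</mark> <mark>phrase</mark></mark>'), "\n";
// prints: <mark>search phrase</mark>
```

If the text can contain other tags or stray "<" characters, this pattern will simply fail to match the affected elements, which is one reason the parser-based route is the safer general answer.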

March 1st, 2012, 06:56 AM
I get ya, I know.
The thing is that really, I don't view it as markup until it's displayed. Up until then it's just a string. A string that's generated by a recursive script.
I would have thought a regex pattern replace would be a better solution than recursively building literal strings and replacing those. Which is what I'll do if I can't work out a regex for it.
You're right. In an ideal world the script wouldn't produce markup like that, but it does... and it does so whilst doing a number of other things which it does rather well. The alternative there is to take that part out of the script and write another one specifically for the markup.
Both of these alternatives I think will end up with the whole thing taking longer to execute.
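On the idea of producing the markup without nesting in the first place, here is a rough sketch of what a single-pass highlighter might look like. It is not the poster's script: the function name highlightTerms and its parameters are hypothetical, it assumes the input is plain text with no markup in it yet, and it needs PHP 7.4 or later for the arrow functions. Because all terms go into one alternation, each stretch of text is consumed by at most one match, so a <mark> cannot end up inside another <mark>.

```php
<?php
// Hypothetical highlighter (an assumption, not the original script):
// wraps every search term in <mark> tags in one pass over the text.
function highlightTerms(string $text, array $terms): string
{
    // Ignore empty terms; with nothing to search for, return the text as is.
    $terms = array_filter($terms, fn($t) => $t !== '');
    if (!$terms) {
        return $text;
    }

    // Try longer terms first so the full phrase wins over its single words.
    usort($terms, fn($a, $b) => strlen($b) <=> strlen($a));

    // Quote each term so it is matched literally, then build one alternation.
    $alternation = implode('|', array_map(fn($t) => preg_quote($t, '~'), $terms));

    // Case-insensitive replace; $0 is the text that actually matched.
    $result = preg_replace('~' . $alternation . '~i', '<mark>$0</mark>', $text);

    // preg_replace returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo highlightTerms(
    'A search phrase and another search.',
    ['search', 'phrase', 'search phrase']
), "\n";
// prints: A <mark>search phrase</mark> and another <mark>search</mark>.
```

Whether this is faster or slower than post-processing the nested output depends on the rest of the script, so it would be worth timing both before deciding.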

March 1st, 2012, 04:40 PM
To offer a different perspective and potential solution:
Quote: Is there a simple preg_replace that will match the inner nested tags and not the outermost ones?
I'm a big fan of recursive regex, and, as abareplace pointed out, at first this sounds like a recursive regex problem. The issue is that a recursive pattern for nested expressions returns one overall match; it cannot hand you each inner match separately, because the regex engine won't create new capture groups on the fly for each level of nesting.
However, if I have understood the problem, there is a simple solution with lookaheads. Here is a PHP example that replaces the text inside each innermost tag with its upper-case version. If you use a different language, you should be able to adapt the code as long as the regex flavor supports lookarounds.
Input:
<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>
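Code:
<?php
// Input with nested <mark> tags, as produced by the highlighting script.
$string = '<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>';
// Match innermost pairs only: the negative lookahead refuses any character
// that would start another opening or closing mark tag.
$regex = '~<mark>((?:(?!</?mark>).)*)</mark>~';
// Replace each innermost match with an upper-cased copy of its text.
$string = preg_replace_callback($regex, function ($m) {
    return '<mark>' . strtoupper($m[1]) . '</mark>';
}, $string);
// Escape the result so the tags stay visible when viewed in a browser.
echo htmlentities($string);
?>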
Output:
<mark><mark><mark><mark>THIS</mark> <mark>IS</mark> <mark>THE</mark> <mark>PROBLEM</mark></mark></mark></mark>
Let me know if I've understood the problem and if you have any questions!
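The example above rewrites the innermost text but leaves the nesting in place. For the original question (drop the inner tags, keep the outermost pair), one possible sketch is to walk over the mark tags and track how deeply nested each one is. The helper name flattenMarks is made up for illustration, and the code assumes the tags are written exactly as <mark> and </mark> and are balanced, as described in the first post.

```php
<?php
// Hypothetical helper (not from the thread): keep only the outermost
// <mark>...</mark> pair and drop every mark tag nested inside it.
function flattenMarks(string $html): string
{
    $depth = 0; // number of <mark> elements currently open

    $result = preg_replace_callback('~</?mark>~', function (array $m) use (&$depth) {
        if ($m[0] === '<mark>') {
            // Emit the opening tag only if we are not already inside a mark.
            return $depth++ === 0 ? '<mark>' : '';
        }
        if ($depth === 0) {
            return ''; // stray closing tag with nothing open: drop it
        }
        // Emit the closing tag only when it closes the outermost mark.
        return --$depth === 0 ? '</mark>' : '';
    }, $html);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo flattenMarks('<mark><mark>search</mark> <mark>phrase</mark></mark>'), "\n";
// prints: <mark>search phrase</mark>
```

The pattern itself only finds the tags; the counter in the callback does the work a plain regex cannot, which is remembering how many marks are open at each point.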


April 5th, 2012, 09:05 PM
Quote: Originally Posted by ragax (input, code, and output as in the post above)
Thanks, this was helpful. Your script works great -
Live PHP Version
