Regex Programming
 
  #1  
February 28th, 2012, 04:43 AM
inogen
PHP - Regex to clean up some markup

Hiya,

I've got a bit of a problem and am hoping one of you regex wizards can help.

I've got a script that highlights search terms in text with <mark></mark> tags.

The problem is that when the search is for multiple words, we end up with nested mark tags... so...

<mark>search</mark> is fine...

<mark><mark>search</mark> <mark>phrase</mark></mark> isn't...

Neither is:

<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>

This is because the script that marks the phrases matches both the phrase and each word.

Is there a simple preg_replace that will match the inner nested tags and not the outermost ones?

I had a good go at it but ended up being able to remove the outer ones and not the inners!

I could do it by generating the literal strings to replace by counting the words, but surely there's a more elegant solution?

Any help much appreciated.

Cheers


P.S.

I know regex and markup shouldn't really be used together, but this markup is simple and 100% always the same pattern as it's generated by script, never by human hand.

  #2  
February 28th, 2012, 11:37 PM
abareplace
You should use an HTML parser, not a regex. In general, nested tags cannot be parsed with regular expressions: what you describe is a context-free language, whereas regular expressions in the formal sense recognize only regular languages. There are tricks such as recursive patterns, but in general regex is the wrong tool for this job.
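
For anyone curious what the "recursive patterns" trick looks like, here is a minimal sketch, assuming PCRE's `(?R)` construct (which PHP's preg functions support) and markup that is always balanced, as the original poster describes. The function name `flatten_marks_recursive` is made up for illustration and is not from this thread. It matches each outermost `<mark>...</mark>` group and strips the tags found inside it.

```php
<?php
// Hypothetical helper (not from the thread): collapse nested <mark> tags
// into a single outer pair using a PCRE recursive pattern.
function flatten_marks_recursive(string $text): string
{
    // <mark>, then any mix of ordinary characters or nested balanced
    // <mark>...</mark> groups (matched by recursing with (?R)), then </mark>.
    $pattern = '~<mark>((?:(?!</?mark>).|(?R))*)</mark>~s';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // $m[1] is everything inside the outermost pair; drop inner tags.
            $inner = str_replace(['<mark>', '</mark>'], '', $m[1]);
            return '<mark>' . $inner . '</mark>';
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error
    // (for example when the backtrack limit is hit); fall back to the input.
    return $result === null ? $text : $result;
}

echo flatten_marks_recursive('<mark><mark>search</mark> <mark>phrase</mark></mark>'), "\n";
// prints: <mark>search phrase</mark>
```

This only behaves sensibly when every opening tag has a matching close; on unbalanced input the outer tags may simply fail to match and be left untouched.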

The best solution is to modify the original script so that it does not generate the nested tags. Another option is to parse the page with DOMDocument::loadHTML or html5lib, recursively look for the nested tags, and replace the inner ones.
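
A rough sketch of the DOMDocument route, assuming the DOM extension is enabled and the highlighted text is an HTML fragment rather than a full page. The function name `flatten_marks_dom`, the wrapper `div` with id `wrap`, and the XPath query are illustrative choices, not something from this thread. It unwraps every `<mark>` that sits inside another `<mark>`.

```php
<?php
// Hypothetical helper: unwrap every <mark> that has a <mark> ancestor.
function flatten_marks_dom(string $fragment): string
{
    $doc = new DOMDocument();

    // libxml2's HTML parser may warn about the HTML5 <mark> element;
    // collect those warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common hint to treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Take a snapshot of the nested marks first, because the tree is modified below.
    $nested = iterator_to_array($xpath->query('//mark[ancestor::mark]'));
    foreach ($nested as $mark) {
        // Move the children up in front of the nested <mark>, then remove it.
        while ($mark->firstChild) {
            $mark->parentNode->insertBefore($mark->firstChild, $mark);
        }
        $mark->parentNode->removeChild($mark);
    }

    // Serialize only the contents of the wrapper, not the implied <html><body>.
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo flatten_marks_dom('<mark><mark>search</mark> <mark>phrase</mark></mark>'), "\n";
```

Note that serializing through a parser can normalize other markup in the fragment (entities, attribute quoting), so the output is not guaranteed to be byte-identical outside the `<mark>` tags.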

  #3  
March 1st, 2012, 06:56 AM
inogen
I get ya, I know.

The thing is, I don't really view it as markup until it's displayed. Up until then it's just a string. A string that's generated by a recursive script.

I would have thought a regex pattern replace would be a better solution than recursively building literal strings and replacing those. Which is what I'll do if I can't work out a regex for it.

You're right. In an ideal world the script wouldn't produce markup like that, but it does... and it does so whilst doing a number of other things which it does rather well. The alternative there is to take that part of the script out and write another one specifically for the markup.

Both of these alternatives I think will end up with the whole thing taking longer to execute.
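
If the string really is always well-formed, one cheap way to treat it as "just a string" is a single pass over the tags with a depth counter, keeping a tag only when it opens or closes the outermost level. This is a sketch under that assumption, not something proposed in this thread; the name `flatten_marks_by_depth` is invented for illustration. It makes one linear scan, so it should not add much to the execution time.

```php
<?php
// Hypothetical helper: keep only the outermost <mark>...</mark> pair
// by tracking nesting depth while scanning the tags once.
function flatten_marks_by_depth(string $text): string
{
    $depth = 0;

    $result = preg_replace_callback(
        '~</?mark>~',
        function (array $m) use (&$depth): string {
            if ($m[0] === '<mark>') {
                // Emit the opening tag only when entering the outermost level.
                return $depth++ === 0 ? '<mark>' : '';
            }
            if ($depth === 0) {
                // Stray closing tag: leave it unchanged (an arbitrary choice).
                return $m[0];
            }
            // Emit the closing tag only when leaving the outermost level.
            return --$depth === 0 ? '</mark>' : '';
        },
        $text
    );

    // Fall back to the input if PCRE reports an error.
    return $result ?? $text;
}

echo flatten_marks_by_depth('<mark><mark>search</mark> <mark>phrase</mark></mark>'), "\n";
// prints: <mark>search phrase</mark>
```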

  #4  
March 1st, 2012, 04:40 PM
ragax
To offer a different perspective and potential solution:

Quote:
Is there a simple preg_replace that will match the inner nested tags and not the outermost ones?


I'm a big fan of recursive regex, and, as abareplace pointed out, at first this sounds like a recursive regex problem. The issue is that a recursive regex for nested expressions will return an overall match, but you cannot grab the innermost match, because the regex engine won't let you create capture groups on the fly.

However, if I have understood the problem, there is a simple solution with lookaheads. Here is a PHP example that replaces the inner text with its uppercased version. If you use a different language, you should be able to adapt the code as long as the regex flavor supports lookarounds.

Input:

<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>


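Code:
<?php
// Nested <mark> input from the original question.
$string = '<mark><mark><mark><mark>this</mark> <mark>is</mark> <mark>the</mark> <mark>problem</mark></mark></mark></mark>';

// Match innermost pairs only: the body may not contain another mark tag.
$regex = ',<mark>((?:(?!</?mark>).)*)</mark>,';

// Uppercase the text of each innermost match (closures need PHP 5.3+).
$string = preg_replace_callback($regex, function ($m) {
    return '<mark>' . strtoupper($m[1]) . '</mark>';
}, $string);

echo htmlentities($string);
?>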


Output:
<mark><mark><mark><mark>THIS</mark> <mark>IS</mark> <mark>THE</mark> <mark>PROBLEM</mark></mark></mark></mark>

Let me know if I've understood the problem and if you have any questions!

  #5  
April 5th, 2012, 09:05 PM
php6
Quote:
Originally Posted by ragax

Let me know if I've understood the problem and if you have any questions!



Thanks, this was helpful. Your script works great -
Live PHP Version

  #6  
April 5th, 2012, 11:31 PM
Kravvitz
Quote:
Originally Posted by php6
Thanks, this was helpful. Your script works great -
Live PHP Version

Welcome to DevShed Forums, php6.

Your link was stripped out, so here it is so people can see it: http://init.me/191605/regex-to-clean-up-some-markup

New users are restricted from posting URLs until they have made 5 posts. You may need to get around this by leaving out the "http://" and putting a space before each ".". Yes this rule is annoying, but the administrators say it's necessary for limiting spam.
  #7  
April 11th, 2012, 11:37 AM
Darknite
Stop, NOW!

Please read the obligatory:

Parsing HTML the cthulhu way

and this beautiful stack overflow post
