Stripping Only Certain HTML Tags (and contents)

November 22nd, 2008, 07:36 PM
Swimming in a fish bowl....
Join Date: Jun 2008
Location: Texas, Y'all!
Stripping Only Certain HTML Tags (and contents)
OK, I've been trying every way I can think of to do this, and nothing is working right with PHP.
I basically want to strip a set of HTML tags from a string, removing the content between those tags as well, while allowing for case differences and spaces in the tags (such as < img src...>).
strip_tags removes everything except for a whitelist. This is the opposite of what I need. I only want to strip a certain set of tags:
a, img, script, meta, etc.
Things like this don't work (obviously):
preg_replace('@<\s*(a|img|script|meta)\b.*?>.*?</\1>@si', '', $htmlstring);
Any help? Please!
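For reference, here is a small sketch of the two behaviours described in this post: strip_tags() keeping only a whitelist, and the pattern above skipping tags that have no closing tag. The sample string is made up for illustration and is not from the thread.

```php
<?php
// Hypothetical sample input, not taken from the thread.
$sample = '<p>Keep <a href="x.html">link</a> and <img src="a.jpg" /> here</p>';

// strip_tags() takes a whitelist: every tag NOT listed is removed,
// but the text inside the removed tags is kept.
$whitelisted = strip_tags($sample, '<p>');
echo $whitelisted, "\n";

// The pattern from the post requires a closing tag, so <a>...</a> is removed
// together with its text, while <img> has no </img> and is left untouched.
$regexed = preg_replace('@<\s*(a|img|script|meta)\b.*?>.*?</\1>@si', '', $sample);
echo $regexed, "\n";
```

So the whitelist approach leaves the anchor text behind, and the regex approach leaves the img tag behind, which matches the two complaints in the post.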

November 22nd, 2008, 07:52 PM
Contributing User
Join Date: Jul 2001
Location: England
Posts: 967
Time spent in forums: 20 h 32 m 5 sec
Reputation Power: 12
PHP Code:
$htmlstring = 'Hi there, < a href="http://www.google.com">What</a> do you want? < img src="someimage.jpg" alt="whatever" />
<script type="text/javascript">Whatever</script>
<meta name="keywords" />';
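// \b after the tag name keeps "a" from also matching longer names such as <abbr>.
// \3 is the text between the opening and closing tags, so that text is kept.
$htmlstring = preg_replace('!<\s*(a|img|script|meta)\b.*?>((.*?)</\1>)?!is', '\3', $htmlstring);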
echo $htmlstring;
Something like that?
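To see what each capture group in the pattern from this post holds, here is a small sketch using preg_match() on a made-up string. The \b after the tag name is included so that "a" does not also match tags such as <abbr>. Group 1 is the tag name, group 2 is the content plus the closing tag, and group 3 is the content only, which is why a replacement of '\3' keeps the text between the tags.

```php
<?php
$pattern = '!<\s*(a|img|script|meta)\b.*?>((.*?)</\1>)?!is';

// Hypothetical input for illustration.
preg_match($pattern, 'Go <a href="x.html">home</a> now', $m);

echo $m[1], "\n"; // tag name
echo $m[2], "\n"; // content plus closing tag
echo $m[3], "\n"; // content only
```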

November 22nd, 2008, 08:07 PM
Swimming in a fish bowl....
Wow! Thanks for the response. That's much closer than I have gotten.
Two issues I see, though.
It's not removing the content between the tags. For example, the anchor text remains when the A tag is stripped.
It doesn't seem to be working on IMG tags, maybe because they don't have closing tags? I can remove the IMGs separately if that is the reason.
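One possible way to handle tags without closing tags separately, as suggested at the end of this post, is a two-pass approach. This is only a sketch: the function name strip_paired_and_void is hypothetical, and the split into paired tags (a, script) and void tags (img, meta) is an assumption for illustration.

```php
<?php
// Hypothetical helper: the tag lists below are assumptions for illustration.
function strip_paired_and_void($html)
{
    // Pass 1: paired tags, removed together with everything up to the closing tag.
    $html = preg_replace('!<\s*(a|script)\b[^>]*>.*?<\s*/\s*\1\s*>!is', '', $html);
    // Pass 2: void tags such as <img ...> or <meta ... />, which never have a closing tag.
    $html = preg_replace('!<\s*(img|meta)\b[^>]*>!i', '', $html);
    return $html;
}

$out = strip_paired_and_void('A<a href="x">gone</a>B< IMG src="a.jpg" />C');
echo $out, "\n";
```

Note that [^>]* stops at the first ">" it sees, so an attribute value that itself contains ">" would cut the match short.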

November 22nd, 2008, 08:52 PM
Contributing User
Content between the tags... Not sure what you mean there.... As in something like below would get totally stripped out?
<badstuff>You want this removed?</badstuff>

November 22nd, 2008, 08:54 PM
Swimming in a fish bowl....
Yes! Exactly like that. 

November 22nd, 2008, 09:14 PM
Contributing User
Just remove the \3 from the second argument to preg_replace, so you're left with an empty pair of single quotes ('').
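A quick side-by-side sketch of the difference, using a made-up input string: with '\3' as the replacement the text between the tags is put back, and with an empty string it is removed along with the tags. The \b in the pattern is there so that "a" does not also match tags such as <abbr>.

```php
<?php
// Hypothetical input for illustration.
$html = 'Hi <a href="x.html">What</a> there <script>var a = 1;</script>!';
$pattern = '!<\s*(a|img|script|meta)\b.*?>((.*?)</\1>)?!is';

// With '\3' the text between the tags is put back.
$kept = preg_replace($pattern, '\3', $html);
// With an empty string the tags and their contents are both removed.
$removed = preg_replace($pattern, '', $html);

echo $kept, "\n";
echo $removed, "\n";
```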

November 22nd, 2008, 10:06 PM
Swimming in a fish bowl....
Thanks! That part I initially figured out, but there's something funky going on when I apply it to a large chunk of HTML.
When I pass a simple "<a href='blah.php'>anchor text</a>" it works fine. But when I pass in an entire page of code, it leaves the anchor text behind. How odd is that? I'll look into it further.
One thing I forgot to mention: how would I make this case-insensitive, since there is no pregi_replace?
Thanks much for your help!
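When a pattern behaves differently on a whole page than on a short string, one thing worth ruling out is a PCRE failure, since preg_replace() returns NULL on error. The following is a general debugging sketch, not the cause that was eventually found in this thread, and the function name strip_tags_checked is hypothetical.

```php
<?php
// Hypothetical wrapper: returns the stripped string, or throws if PCRE reports an error.
function strip_tags_checked($html)
{
    $pattern = '!<\s*(a|img|script|meta)\b.*?>((.*?)</\1>)?!is';
    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns NULL when matching fails, for example when
    // pcre.backtrack_limit is exceeded on a very large input.
    if ($result === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result;
}

echo strip_tags_checked('<a href="blah.php">anchor text</a> stays'), "\n";
```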

November 22nd, 2008, 10:23 PM
Contributing User
It's already case-insensitive: the 'i' modifier at the end of the expression in the first argument to preg_replace() takes care of that.
Please post the 'code' you're having problems with, since otherwise, it's like peeing in the dark.
Goodnight. 
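A minimal sketch of what the 'i' modifier changes, using a simplified pattern that only handles anchor tags and a made-up input string:

```php
<?php
// Hypothetical input for illustration.
$html = '<A HREF="x.html">Shouting</A> and <a href="y.html">quiet</a>';

// Without the i modifier only the lowercase tag matches.
$caseSensitive = preg_replace('!<a\b.*?>.*?</a>!s', '', $html);
// With the i modifier both the uppercase and the lowercase tag match.
$caseInsensitive = preg_replace('!<a\b.*?>.*?</a>!is', '', $html);

echo $caseSensitive, "\n";
echo $caseInsensitive, "\n";
```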

November 22nd, 2008, 11:05 PM
Swimming in a fish bowl....
Sorry, I was busy wipe'n up the floor in the bathroom...hehe
All I'm doing to test is pasting in the source from this page:
http://developer.yahoo.com/yui/calendar/

November 23rd, 2008, 08:39 PM
Swimming in a fish bowl....
Figured out the problem.
I was using htmlspecialchars_decode instead of html_entity_decode.
So, I wasn't decoding all the encoded chars after the post/get. Duh......
Thanks liljim!
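For anyone landing here later, a small sketch of how the two decoding functions differ. The input string is hypothetical, and UTF-8 output is an assumption: htmlspecialchars_decode() only handles the handful of special characters, while html_entity_decode() also converts other named entities.

```php
<?php
// Hypothetical form input containing encoded markup and other named entities.
$posted = '&lt;a href=&quot;x.html&quot;&gt;caf&eacute;&lt;/a&gt; &copy;';

// Decodes only the special characters: &amp; &quot; &#039; &lt; &gt;
$special = htmlspecialchars_decode($posted, ENT_QUOTES);
// Decodes all named entities, including &eacute; and &copy;
$all = html_entity_decode($posted, ENT_QUOTES, 'UTF-8');

echo $special, "\n";
echo $all, "\n";
```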