October 13th, 2008, 02:58 AM
Quoted String regex
I want to match and pick up quoted strings from HTML text (but not the ones inside the HTML tags).
( '([^\\']|\\.)*' | "([^\\"]|\\.)*" ) <- does the job of selecting quoted strings, first part for single-quoted and second half for double quoted.
But it also picks up the attribute values inside the HTML tags.
eg. <p class="strong"> Here its mostly sunny. But it is raining outside.</p>
<span id="new" class="strong"> What you see now is "Some quoted text". This is 'single-quoted text'.</span>
The regex will also match "strong" and "new", which I don't want. Any ideas on how to modify the regex?
October 13th, 2008, 04:29 AM
Just strip out the tags beforehand. PHP has a strip_tags function for that exact purpose.
If your language doesn't have something similar, replace /<[^>]*>/ with nothing.
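A minimal sketch of that strip-then-match idea, written in Python purely for illustration (the same patterns should carry over to Perl or PHP). The function name quoted_strings and the variable sample are made up for this example; the tag pattern is the one from the post above and the quoted-string pattern is the original poster's, minus the spacing.

```python
import re

# Tag pattern from the reply above: "<", anything that is not ">", then ">".
TAG = re.compile(r"<[^>]*>")

# The original poster's pattern with the spaces removed: a single-quoted or
# double-quoted string, allowing backslash escapes inside.
QUOTED = re.compile(r"'(?:[^\\']|\\.)*'" + "|" + r'"(?:[^\\"]|\\.)*"')


def quoted_strings(markup):
    """Hypothetical helper: drop the tags, then collect quoted strings."""
    text = TAG.sub("", markup)    # attribute values vanish with their tags
    return QUOTED.findall(text)   # no capture groups, so whole matches come back


sample = (
    '<p class="strong"> Here its mostly sunny. But it is raining outside.</p>\n'
    '<span id="new" class="strong"> What you see now is "Some quoted text". '
    "This is 'single-quoted text'.</span>"
)

print(quoted_strings(sample))
# ['"Some quoted text"', "'single-quoted text'"]
```

Note that an apostrophe in ordinary text (as in "don't") would still open a single-quoted match, so real-world input probably needs extra care.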
October 13th, 2008, 05:37 AM
Also don't forget the HTML character entities for quotes too - &quot; is for double quotes. I don't remember singles off the top of my head.
Last edited by ishnid; October 13th, 2008 at 07:15 AM.
October 13th, 2008, 05:46 AM
&#39;, and &apos; at times as well (certain browsers don't like &apos;, though).
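One way to deal with the entities, sketched in Python for illustration: strip the tags first, then decode the entities, then run the quoted-string match. html.unescape is the standard-library decoder; the function name quoted_strings_with_entities is made up for this example, and whether entity-encoded quotes should count as quotes at all is an assumption that depends on the data.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")
QUOTED = re.compile(r"'(?:[^\\']|\\.)*'" + "|" + r'"(?:[^\\"]|\\.)*"')


def quoted_strings_with_entities(markup):
    """Hypothetical helper: strip tags, decode entities, then match quotes."""
    # Strip tags BEFORE decoding, so that text like &lt;b&gt; is not
    # turned into a tag and removed by mistake.
    text = TAG.sub("", markup)
    # Turns &quot;, &#39; and &apos; into literal quote characters.
    text = html.unescape(text)
    return QUOTED.findall(text)


print(quoted_strings_with_entities(
    '<p title="x">He said &quot;hi&quot; and &#39;bye&#39;</p>'
))
# ['"hi"', "'bye'"]
```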
October 13th, 2008, 11:13 AM
Thank you all for the replies.
@ requinix : My language is Perl. And sorry, I don't get what you are trying to say.
I want to be able to match quoted strings other than the ones in the HTML tags. Even if I strip off the tags, the attribute values will match the regex.
@ishnid and ryon420: Yeah I will keep the html entities in mind.
October 13th, 2008, 11:29 AM
If you strip out the tags, the attribute values won't be there anymore, so they can't possibly match.
October 13th, 2008, 12:33 PM
Oh, that's right. Got it. Don't know what I was thinking earlier. I ain't a morning person, you can tell.
October 14th, 2008, 08:23 AM
$str = 'eg. <p class="strong"> Here its mostly sunny. But it is raining outside.</p>
<span id="new" class="strong"> What you see now is "Some quoted text". This is \'single-quoted text\'.</span>';
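// The lookahead skips quotes that sit inside a tag; \1 forces the closing
// quote to be the same character as the opening one.
preg_match_all ( "/(?![^<]+>)([\"'])(.+)\\1/U", $str, $out );
print_r ( $out ); // $out[2] holds the text between the quotes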
Or, if you also don't want to match inside <a>...</a> links, then it would be:
preg_match_all ( "/(?!(?:[^<]+>|[^>]+<\/a>))([\"'])(.+)\\1/U", $str, $out );
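For comparison, a rough Python sketch of the same lookahead idea, which keeps the tags in place instead of stripping them. Python's re module has no /U flag, so a lazy quantifier is used instead, and a backreference makes the closing quote match the opening one. The names OUTSIDE_TAGS and quoted_outside_tags are made up for this example.

```python
import re

# (?![^<]+>) rejects a quote that is followed by a ">" with no "<" before it,
# i.e. a quote that appears to sit inside a tag.
# (["']) captures the opening quote, (.+?) lazily captures the content,
# and \1 requires the same quote character to close it.
OUTSIDE_TAGS = re.compile(r"""(?![^<]+>)(["'])(.+?)\1""")


def quoted_outside_tags(markup):
    """Hypothetical helper: return the text between quotes outside tags."""
    return [m.group(2) for m in OUTSIDE_TAGS.finditer(markup)]


sample = (
    '<p class="strong"> Here its mostly sunny. But it is raining outside.</p>\n'
    '<span id="new" class="strong"> What you see now is "Some quoted text". '
    "This is 'single-quoted text'.</span>"
)

print(quoted_outside_tags(sample))
# ['Some quoted text', 'single-quoted text']
```

The lookahead is only a heuristic: a literal ">" in the text between a quote and the next tag would cause that quote to be skipped, so stripping the tags first is probably the safer route.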