Thread: Regex challenge

    #1
    Here is the task:

    Remove <prefix> and <suffix> tags when found embedded in an HTML tag:

    EXAMPLE TEXT:
    Code:
    <p>I am a highly motivated and <prefix>sparkly<suffix> programmer with <span class="<prefix>sparkly<suffix>">A LOT</span> of <prefix>sparkly<suffix> experience</p>
    The regex should strip out the prefix and suffix ONLY when they are found inside an HTML tag that has not yet been closed by its >, so in this case only the ones inside the span's class attribute. The regex should convert the above to:

    Code:
    <p>I am a highly motivated and <prefix>sparkly<suffix> programmer with <span class="sparkly">A LOT</span> of <prefix>sparkly<suffix> experience</p>

    The <prefix> and <suffix> strings should be configurable, e.g. in ColdFusion:

    <cfparam name="attributes.prefix" default="<span class=""highlight"">">
    <cfparam name="attributes.suffix" default="</span>">

    Could anyone help me with this? Any help would be appreciated.

    Thanks

    Martin
    #2
    Odds are it'll be easier to just fix the way those were added to the string in the first place. How does that happen? Regular expression or simple find/replace?

    Also: what programming language are you using?
    #3
    I think this can get added by users of our site (when adding resumes) and it is not easily controllable. The task at hand is to only allow the selected <prefix> / <suffix> if it appears within normal text and not within an opening HTML tag's attributes, as in the <span> example. The programming language I'm using is ColdFusion, which I believe uses a Perl-compatible regex engine.

    I know how to find all instances of the <prefix>/<suffix>, but not how to match only the ones within HTML tags and leave the ones in plain text alone. This is where I am stuck.

    The regex for finding all occurrences is (<prefix>|<suffix>), but my task is to narrow this down to only the ones which occur within HTML tags.

    Any idea how I could achieve this?

    Thanks
    #4
    Wait. You're saying people add them in themselves? Manually? It's just that it looks a lot like a search-and-replace did it. Something not smart enough to ignore HTML tags.

    Initial version. Fairly dumb.
    Code:
    <prefix>.*?<suffix>(?=[^<>]*>)
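
A rough Python sketch of this lookahead idea (Python rather than ColdFusion, purely for illustration; the function name strip_markers_in_tags is hypothetical and not from the thread). It assumes the markers are the literal strings <prefix> and <suffix> from the example. Two things differ from the bare pattern above: the wrapped word is captured so it survives the replacement, and the wrapped text is not allowed to run across another marker, which a plain .*? can do after an earlier attempt fails its lookahead.

```python
import re

def strip_markers_in_tags(html, prefix="<prefix>", suffix="<suffix>"):
    """Remove prefix/suffix pairs that sit inside an HTML tag, keeping the wrapped text."""
    p, s = re.escape(prefix), re.escape(suffix)
    # Wrapped text: any run of characters that does not start another marker.
    wrapped = rf"((?:(?!{p}|{s}).)*)"
    # Lookahead: after the suffix, a '>' is reached before any other '<' or '>',
    # which is taken to mean the pair is sitting inside a tag.
    pattern = re.compile(rf"{p}{wrapped}{s}(?=[^<>]*>)", re.DOTALL)
    return pattern.sub(r"\1", html)

example = (
    '<p>I am a highly motivated and <prefix>sparkly<suffix> programmer with '
    '<span class="<prefix>sparkly<suffix>">A LOT</span> of '
    '<prefix>sparkly<suffix> experience</p>'
)
cleaned = strip_markers_in_tags(example)
print(cleaned)
```

On the example text only the pair inside the span's class attribute should be removed; the two pairs in plain text stay. A literal > appearing in ordinary text would confuse the lookahead, so this is a starting point rather than a robust HTML-aware solution.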
    #5
    After talking with the developer who wrote this, it appears he is trying to highlight any words that match the search terms.

    I'm not sure whether you are familiar with ColdFusion, but I've attached the highlight function code to show what he is trying to do.

    Thanks

    Martin

    Code:
    <cffunction name="highlighter" access="public" returntype="string" output="true" hint="TODO: needs refactored into smaller methods . Lifted from the udf stringFunctions - Returns string with words wrapped in html for highlighting">
    	<cfargument name="stringToCheck" type="string" required="yes" />
    	<cfargument name="wordsToHighlight" type="string" required="yes" />
    	<cfargument name="isStringHtml" type="boolean" required="false" default="0" />
    	<cfargument name="delimiterToUse" type="string" required="false" default="#variables.delimiterDefault#" />
    	<cfargument name="prefixHtml" type="string" required="false" default="#variables.prefixDefault#" />
    	<cfargument name="suffixHtml" type="string" required="false" default="#variables.suffixDefault#" />
    	<cfargument name="searchType" type="string" required="false" default="allSearch" />
    
    	<cfscript>
    		var words = arguments.wordsToHighlight;
    		var theString = arguments.stringToCheck;
    		var isHTML = arguments.isStringHtml;
    		var delimiter = arguments.delimiterToUse;
    		var prefix = arguments.prefixHtml;
    		var suffix = arguments.suffixHtml;
    		var originalSearchWords = ''; //this is to cater for the boolean search on wildcard, does not apply to the exact match
    		var sResetPhrase = '';
    		var sSearchType = arguments.searchType;
    		var ihigh = 1;
    		var xHigh = 1;
    		var inCount = 1;
    		var posCheck = 0;
    		var thisPhrase = "";
    
    		if (NOT Len(trim(delimiter))) {
    			delimiter = variables.delimiterDefault;
    		}
    
    		if (NOT Len(trim(prefix))) {
    			prefix = variables.prefixDefault;
    		}
    
    		if (NOT Len(trim(suffix))) {
    			suffix = variables.suffixDefault;
    		}
    
    		if (words neq '') {
    			//Remove special characters
    			words=ReplaceNoCase(words,'%20',' ','all');
    			words=ReplaceNoCase(words,'/','¬¬','all');
    			//words=ReplaceNoCase(words,'-','¬¬','all'); //<!--- BUG-20489 - Haran, Mar 2009 - replacing dashes with spaces screws highlighting up when a dash is passed --->
    
    			theString = ReReplace(theString, "[[:cntrl:]]", " ", "ALL");	// BUG-20302 - Haran, Jan 2009 - replace with space rather than empty string to preserve word boundaries
    			theString = ReReplace(theString, "[\s]+", " ", "ALL");
    
    			theString = ReplaceNoCase(theString,'span lang=IT','span','all'); // remove this as it messes IT searches
    			theString = ReplaceNoCase(theString,'spanlang=IT','span','all'); // remove this as it messes IT searches
    			theString = ReplaceNoCase(theString,'spanlang=FR','span','all'); // remove this as it messes Fr searches
    			theString = ReplaceNoCase(theString,'span lang=FR','span','all'); // remove this as it messes Fr searches
    			theString = ReplaceNoCase(theString,'spanlang=EN','span','all'); // remove this as it messes En searches
    			theString = ReplaceNoCase(theString,'span lang=EN','span','all'); // remove this as it messes En searches
    			theString = ReplaceNoCase(theString,'spanlang=DE','span','all'); // remove this as it messes De searches
    			theString = ReplaceNoCase(theString,'span lang=DE','span','all'); // remove this as it messes De searches
    
    			//Remove these words IF its not an exact search
    			if (sSearchType neq "exactSearch") {
    
    				for (xHigh=1;xHigh LTE ListLen(variables.noHighlightList);xHigh=xHigh+1) {
    					for (inCount=1; inCount LT 50; inCount=inCount+1) {
    						posCheck = listFindNoCase(words,listGetAt(variables.noHighlightList,xHigh),delimiter);
    						if (posCheck NEQ 0) words = listDeleteAt(words,posCheck,delimiter);
    						else break;
    					}
    				}
    			}
    
    			// Add asterisk to the end of words for boolean search (if there is no wild card suffix)	// BUG-20489 - Haran, Mar 2009, superceding: BUG-20349 - Haran, Feb 2009
    			if (sSearchType eq "booleanSearch") {
    				words = doAddAsteriskToEndOfWordsForBooleanSearchIfThereIsNoWildcardSuffix(words);
    			}
    
    			//this is the original search keyword for booleanSearch below
    			originalSearchWords = words;
    
    			//replace all non alphanumeric characters with a space (note: doesn't replace the default delimiter(¬), space( ), c++ (\+) and c# (\##))
    			words = doReplaceNonAlphanumericCharactersWithASpace(words);
    
    			// find any remaining instances of '&' and duplicate the keyword with the & encoded to &amp;
    			for (ihigh=1;ihigh LTE ListLen(words, delimiter);ihigh=ihigh+1) {
    				thisPhrase = ListGetAt(words, ihigh, delimiter);
    				if (find("&",thisPhrase) neq 0 and find("&", thisPhrase) neq findNoCase("&amp;", thisPhrase)) {
    					words = words & delimiter & replace(thisPhrase,'&','&amp;');
    				}
    			}
    
    			words = doSortByLength(wordlistToCheck=words, delimiterToUse=delimiter);	// BUG-20349 - Haran, Feb 2009 - put keywords in order of length (longest first) so that embedded highlighting doesn't break (when one keyword/phrase is also part of a longer keyword/phrase, eg: xxx and "yyy xxx")
    
    			//Surround the keywords found with a highlight span
    			for (ihigh=1;ihigh LTE ListLen(words, delimiter);ihigh=ihigh+1) {
    
    				thisPhrase = ListGetAt(words, ihigh, delimiter);
    
    				if(find(" ", thisPhrase))	{
    					theString = reReplaceNoCase(theString, "(#replace(trim(thisPhrase), " ", "[^a-zA-Z0-9]*", "ALL")#)", "#prefix#\1#suffix#", "ALL");	// BUG-20349 - Haran, Feb 2009 - trim() added here because the highlighted area was stretching to the space before quoted phrase
    				}
    				else if (findnocase("C\+", thisPhrase) OR findnocase("C\##", thisPhrase)) {
    					theString = reReplaceNoCase(theString, "([^<"":]*?)(#thisPhrase#)([^:""=>]*?)", "\1#prefix#\2#suffix#\3", "ALL"); //Remove the word boundaries for c# and c++
    				}
    				else if (isHTML eq 0) {
    
    					if(ReFind("[\*'\?]", originalSearchWords))	{	// if * or ? or ' is found in originalSearchWords then we need to replace these with the corresponding regex expression
    
    						sResetPhrase = thisPhrase;
    						sResetPhrase = doReplaceAsteriskWithRegex(searchElement=sResetPhrase);
    						sResetPhrase = doReplaceApostropheWithRegex(searchElement=sResetPhrase);
    						sResetPhrase = doReplaceQuestionMarkWithRegex(searchElement=sResetPhrase);
    
    						theString = reReplaceNoCase(theString, "(\b#doEscapeRegexSpecialChars(sResetPhrase)#\b)", "#prefix#\1#suffix#", "ALL"); // this highlights the whole word matched, e.g. Boolean Search on Account*, will mean Accounting/Accountancy/Account being highlighted in the result
    
    						theString = reReplaceNoCase(theString, "(<[^>]*)#prefix#(.*?)#suffix#([^>]*>)", "\1\2\3", "ALL");	// ITEM-1351 - Haran, May 19th 2010 - remove highlighting again if inside tags
    
    					}
    					else {	// do an exact match
    						theString = reReplaceNoCase(theString, "(\b#thisPhrase#\b)+", "#prefix#\1#suffix#", "ALL"); //This highlights an exact match, e.g. Boolean search on Account will not include Accounting, Accounts etc
    					}
    
    				}
    				else {
    					theString = reReplaceNoCase(theString, "([^<"":])(\b#thisPhrase#\b)([^:""=>]*?)", "\1#prefix#\2#suffix#\3", "ALL"); //Match only those within HTML
    				}
    
    			}
    		}
    
    		return theString;
    	</cfscript>
    </cffunction>
    #6
    Code:
    theString = reReplaceNoCase(theString, "(\b#doEscapeRegexSpecialChars(sResetPhrase)#\b)", "#prefix#\1#suffix#", "ALL");
    I'm guessing it's that line.
    Try replacing
    Code:
    (([^<]*?|<[^>]+>)*)\b(phrase)\b
    with \1prefix\3suffix.
    Last edited by requinix; October 8th, 2010 at 11:07 PM. Reason: fix bug in regex
    #7
    Originally Posted by requinix
    Code:
    theString = reReplaceNoCase(theString, "(\b#doEscapeRegexSpecialChars(sResetPhrase)#\b)", "#prefix#\1#suffix#", "ALL");
    I'm guessing it's that line.
    Try replacing
    Code:
    (([^<]+|<[^>]+>)*)\b(phrase)\b
    with \1prefix\3suffix.
    Can you explain? I can't seem to find the string you suggested replacing within the code.

    Could you update the code and post it back, maybe?

    Thanks
    #8
    I don't know any CF so, while I could probably get a lot right, asking me to explain anything CF-related isn't a good idea.

    Quick comment #1: regex is wrong. I'll edit my post, or you can just see the code below.
    Quick comment #2: if the search phrase begins with "<" then it won't be highlighted. This can be fixed if you think it'll be an issue.

    Here's what I think you would use:
    Code:
    theString = reReplaceNoCase(theString, "(([^<]*?|<[^>]+>)*)\b(#doEscapeRegexSpecialChars(sResetPhrase)#)\b", "\1#prefix#\3#suffix#", "ALL");
    reReplaceNoCase does a case-insensitive, regular expression search-and-replace. On theString, it finds that one pattern (the phrase gets run through that doEscapeRegexSpecialChars so someone can't mess up the regex) and replaces it with the prefix/suffix stuff.
    In the expression, each () pair translates into \# (# is a number). \0 is the entire matched string, \1 is the first () pair going left-to-right, \2 is the second, and so on.

    \1prefix\3suffix would be
    - What was found with (([^<]*?|<[^>]+>)*) - which is any leading text, including all HTML tags up to that point
    - #prefix#, which is "<prefix>" in your case
    - The phrase (=the entirety of the 3rd matching pair)
    - #suffix#, which is "<suffix>" in your case
    By virtue of the regex, \3 (the phrase) can't be found inside any HTML tag. Why? In \1 it looks for a < and, if found, consumes everything up to the next >. If the phrase occurs in there it gets ignored. Only after the tag has been consumed will it resume looking for the phrase.
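
The same goal, never highlighting inside a tag, can be sketched in Python (again not ColdFusion; highlight_outside_tags is a hypothetical name). Rather than the nested-group pattern described above, this sketch swaps in a simpler alternation: match either a whole tag or the phrase, and add the prefix and suffix only when the phrase branch matched. It assumes attribute values never contain a literal >.

```python
import re

def highlight_outside_tags(html, phrase, prefix="<prefix>", suffix="<suffix>"):
    """Wrap whole-word occurrences of phrase in prefix/suffix, skipping text inside tags."""
    # Group 1 consumes a complete tag; group 2 is the phrase found in plain text.
    pattern = re.compile(rf"(<[^>]+>)|\b({re.escape(phrase)})\b", re.IGNORECASE)

    def wrap(match):
        if match.group(1) is not None:
            return match.group(1)  # a tag: put it back untouched
        return prefix + match.group(2) + suffix

    return pattern.sub(wrap, html)

source = (
    '<p>A sparkly programmer with <span class="sparkly">A LOT</span> '
    'of Sparkly experience</p>'
)
result = highlight_outside_tags(source, "sparkly")
print(result)
```

Because every tag is consumed by the first branch before the phrase branch gets a chance, the occurrence inside class="sparkly" is left alone, while both plain-text occurrences are wrapped regardless of case.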
