Regex Programming
 
#1
October 7th, 2009, 07:55 AM
fatmonk
Match a string except in markup

This surely MUST have been asked before, but I can't find it anywhere...

I want to match a string except when that string forms part of the markup...

e.g.

I want to match the string "for" in the following (pseudo-code only):

<form etc etc id="color">
This is for testing. Red is a color.
</form>

Only the 'for' in 'This is for testing' should be matched.

Also, if I was searching for 'color' only the second 'color' should be matched, not the color in the <form> tag.

I'm hitting a complete mental block with this one.

-FM

#2
October 7th, 2009, 09:13 AM
fatmonk
Okay, now I'm talking to myself!

I think I have made some progress using a negative lookahead, but I'm not convinced this is quite right so am throwing this approach open for criticism and finger pointing.

The regexp I have is thus:
Code:
for(?!.*?>)


The text to try it on is thus:
Code:
<form etc etc color=red>
this is for testing. this is a color.
</form>


The second regexp is thus:
Code:
color(?!.*?>)


Both of these seem to work, but as I say I may be opening myself up to problems here.

Anyone care to point out situations where this may not work?

Or anyone care to let me know if this looks like a safe way of searching for a string that is not contained within the actual markup tags of a document?

Thanks,

FM

[Edit: added the non-greedy ? to the negative lookahead]

Last edited by fatmonk : October 7th, 2009 at 09:22 AM.

#3
October 7th, 2009, 11:02 AM
fatmonk
I found an example of it not working, so hopefully someone can now give me some pointers to where I am going wrong and how this can be corrected.

Using the same regexp above on the following text:

Code:
<form sction="http://someURL">I am 
all for regular expressions, but they are a nightmare to use</form><br/> 
This for, for example gets found but the previous one does not!


The second and third 'for's are found, but the first isn't.

If you add a "<br />" tag to the end of the last line, giving:

Code:
<form sction="http://someURL">I am 
all for regular expressions, but they are a nightmare to use</form><br/> 
This for, for example gets found but the previous one does not!<br />


then none of the 'for's are found. This points to the </form> and <br/> on the second line being the problem - or at least the closing > on each tag.

Sure enough, removing those > characters (invalid HTML obviously, but just for testing) means that the first 'for' is found.

Grrr...

So, I guess negative lookaheads are probably the right approach, I just ain't getting it right...

Any help appreciated.

Ta,

FM
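
The behaviour described in this post can be reproduced with a short JavaScript sketch. It assumes the sample text is held in a string called `sample` and uses a helper called `matchLines`; both names are made up for illustration. The point it shows is that `.` does not match a newline unless the `s` flag is set, so the negative lookahead rejects a 'for' whenever any > appears later on the same line, whether or not the match sits inside a tag.

```javascript
// Sample text from the post above: three lines joined by newlines.
const sample = [
  '<form sction="http://someURL">I am ',
  'all for regular expressions, but they are a nightmare to use</form><br/> ',
  'This for, for example gets found but the previous one does not!'
].join('\n');

// The negative-lookahead attempt from the thread, with the g flag added
// so that every match is reported.
const attempt = /for(?!.*?>)/g;

// Return the 1-based line number of each match.
function matchLines(text, regex) {
  const lines = [];
  for (const m of text.matchAll(regex)) {
    lines.push(text.slice(0, m.index).split('\n').length);
  }
  return lines;
}

const found = matchLines(sample, attempt);
console.log(found); // both matches are on line 3; the 'for' on line 2 is missed

// Adding a tag to the end of the last line hides those two matches as well.
const withBr = sample + '<br />';
console.log(matchLines(withBr, attempt)); // empty array
```

The lookahead only asks "is there a > somewhere ahead on this line?", which is a different question from "am I inside a tag?".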

#4
October 7th, 2009, 12:06 PM
ishnid
My normal advice on this sort of question would be not to use regular expressions at all. Most (all?) programming languages will have parsers available for parsing HTML that will be more robust than anything you're likely to be able to come up with yourself.

Let the parser take care of separating tags from content, and then you're just matching within the part of the page you're interested in.

#5
October 8th, 2009, 03:55 AM
fatmonk
Thanks ishnid, unfortunately in this case that's not really an option.

I need to achieve this in a single (short) line of JavaScript and the HTML has to remain intact (so I can't just strip the HTML before doing the match).

As you'll see from the above, I seem to be getting quite close, but I think I need to add an extra lookahead (or lookbehind) so that I effectively ignore > characters if they follow a < character (if you see what I mean).

Trying to put the whole thing into a plain English logic statement is even a bit of a struggle at the moment, which is probably why I'm failing to build the correct regexp. I was kind of hoping that someone had done it before and that there might be a cookbook recipe for doing exactly this.

Let me have a go at plain English logic:

"Match the string 'for' in the text
if there is no > character between the match and the next < character or the end of the text".

That would seem to cover it to me, but I'm not 100% convinced.

It could also be written as:

"Match the string 'for' in the text
if the next character matching < or > is <".

Maybe that is easier to write as a regexp, but as I've said I'm blanking on how to do it (I only ever seem to have to resort to regexps every 6 months or so, so I get very rusty).

The regexp I've got so far is:
Code:
/for(?!.*?>)/

which, I believe, translates into plain English logic as:

"Match the string 'for' in the text
if it is NOT followed by a > character after any number of other characters".

-FM
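
The second plain-English rule in the post above ("if the next character matching < or > is <") can also be checked directly, without a regexp. The following is only a sketch of that rule, not a solution proposed in the thread. The function names `isOutsideTag` and `findOutsideTags` are made up for illustration, and the sketch shares the rule's assumption that < and > appear only as tag delimiters.

```javascript
// True when the first '<' or '>' at or after position `from` is '<',
// or when neither character occurs again before the end of the text.
function isOutsideTag(text, from) {
  for (let i = from; i < text.length; i++) {
    if (text[i] === '<') return true;
    if (text[i] === '>') return false;
  }
  return true;
}

// Return the start index of every occurrence of `word` that passes the rule.
function findOutsideTags(text, word) {
  const hits = [];
  if (word === '') return hits; // nothing sensible to search for
  let pos = text.indexOf(word);
  while (pos !== -1) {
    if (isOutsideTag(text, pos + word.length)) hits.push(pos);
    pos = text.indexOf(word, pos + 1);
  }
  return hits;
}

// Sample text from the first post in the thread.
const sample = '<form etc etc id="color">\nThis is for testing. Red is a color.\n</form>';

console.log(findOutsideTags(sample, 'for'));   // one index: the 'for' in "This is for testing"
console.log(findOutsideTags(sample, 'color')); // one index: the 'color' in "Red is a color."
```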

#6
October 8th, 2009, 10:51 PM
Kravvitz
Why use a complex regexp when you can use the DOM to loop through the elements and thus ignore them in the regexp comparison?
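
As a rough illustration of the DOM approach suggested here, the sketch below walks a node tree and applies the regexp only to text nodes, so tag names and attributes are never examined. It relies only on the standard DOM properties `nodeType`, `nodeValue`, and `childNodes`. The function name `countInText` and the small mock tree used to exercise it outside a browser are assumptions made for illustration; in a page, a real element such as `document.body` would be passed in instead.

```javascript
const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE in the DOM

// Count matches of `regex` (which must carry the g flag) in text nodes only.
function countInText(node, regex) {
  if (node.nodeType === TEXT_NODE) {
    const matches = node.nodeValue.match(regex);
    return matches ? matches.length : 0;
  }
  let total = 0;
  const children = node.childNodes || [];
  for (let i = 0; i < children.length; i++) {
    total += countInText(children[i], regex);
  }
  return total;
}

// Mock of <form id="color">This is for testing. Red is a color.</form>
// built from plain objects so that the sketch also runs outside a browser.
const mockForm = {
  nodeType: 1, // element node
  nodeName: 'FORM',
  childNodes: [
    { nodeType: TEXT_NODE, nodeValue: 'This is for testing. Red is a color.' }
  ]
};

console.log(countInText(mockForm, /for/g));   // 1
console.log(countInText(mockForm, /color/g)); // 1
```

The HTML itself is left untouched; only the text content is read.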

#7
October 9th, 2009, 02:36 AM
prometheuzz
Quote:
Originally Posted by fatmonk
Thanks ishnid, unfortunately in this case that's not really an option.

I need to achieve this in a single (short) line of JavaScript and the HTML has to remain intact ...


You can parse the (x)html without altering it of course.
I must say that I agree with the other members: this sounds like a task for a true parser, not regex. That said, the following regex might suit your needs:

php Code:
$regex = '/color(?=[^<>]*(<|$))/';


This matches the string 'color' only when, after skipping zero or more characters other than '<' and '>', the next thing found is either the character '<' or the end of the string.

The dissected regex:

php Code:
color      // match 'color'
(?=        // start positive look-ahead
  [^<>]*   //   zero or more characters other than '<' and '>'
  (        //   start group 1
    <      //     match '<'
    |      //     OR
    $      //     the end of the string
  )        //   end group 1
)          // end positive look-ahead
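
Since the original question asked for JavaScript, the same pattern can be tried there as well. The sketch below applies it to sample texts based on those earlier in the thread. The helpers `outsideTags` and `count` are hypothetical additions for illustration and not part of the posted solution. The last call shows a limit of the approach: a literal > in the text content hides any match before it, because the lookahead cannot tell that character apart from the end of a tag.

```javascript
// Build /word(?=[^<>]*(<|$))/g for a literal search word.
// Regex metacharacters in the word are escaped first.
function outsideTags(word) {
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(escaped + '(?=[^<>]*(<|$))', 'g');
}

// Count the occurrences of `word` that the pattern accepts.
function count(text, word) {
  const matches = text.match(outsideTags(word));
  return matches ? matches.length : 0;
}

const first = '<form etc etc id="color">\nThis is for testing. Red is a color.\n</form>';
const second = [
  '<form sction="http://someURL">I am ',
  'all for regular expressions, but they are a nightmare to use</form><br/> ',
  'This for, for example gets found but the previous one does not!<br />'
].join('\n');

console.log(count(first, 'for'));   // 1
console.log(count(first, 'color')); // 1
console.log(count(second, 'for'));  // 3

// Limitation: a '>' in the text content hides the 'for' before it.
console.log(count('<p>ask for x > 1</p>', 'for')); // 0
```

Unlike `.`, the negated class `[^<>]` matches newlines, so the lookahead is not confined to the current line.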

#8
October 9th, 2009, 11:30 AM
fatmonk
I was beginning to think that this forum was all about discouraging people from using regular expressions for a while there...

BUT prometheuzz has got it!

That does the job nicely. I see I was doing the wrong kind of lookahead - I knew what I needed to do but just couldn't get my head around how to structure the regexp.

Ta v much,

FM

#9
October 9th, 2009, 11:56 AM
ishnid
Quote:
Originally Posted by fatmonk
I was beginning to think that this forum was all about discouraging people from using regular expressions for a while there...

Part of the skill of using regular expressions is knowing when there are more appropriate/reliable alternatives.

#10
October 9th, 2009, 12:44 PM
prometheuzz
Quote:
Originally Posted by fatmonk
I was beginning to think that this forum was all about discouraging people from using regular expressions for a while there...
...


Don't go acting like a smarty-pants now.

Like I said: the people who said that regex might not be the right tool for the job are quite right. I clearly stated that I agreed with them.

By posting a remark like that, you insinuate that their contributions are not of value in this thread or that they're wrong. Obviously, this is NOT the case. Perhaps you didn't mean to sound this way, but that is how it appears to me.

#11
October 12th, 2009, 05:10 AM
fatmonk
Look, I appreciate your help on this, but there's no need to be touchy about it!

When I was searching for a way to do this, all I kept finding were responses discouraging the use of regular expressions. That's why I made the comment - not a 'smarty-pants' comment at all, simply an observation.

As I believe I mentioned, the problem I was trying to solve didn't allow the use of a full blown script and a simple regular expression was almost doing the job.

As your solution proved, it was a trivial matter (at least for someone with your obvious regular expression experience) to modify the expression to do what I needed. There was no need, in this case, to resort to creating a full script to achieve the required end result.

While I bow to your clear superiority in the regular expression arena, looking back over this forum there seems to be a lot of hostility from a number of people directed to people who come here looking for help with regular expressions.

I certainly didn't intend to offend anyone. However, if my comments prompt people to think before they are so dismissive of others who are seeking their help, then that's a step in the right direction in my opinion. I've used devshed for years on and off (both to get help, and where I can to give a little assistance as well) and find it an invaluable resource. However, the hostility that some people might find here is sure to put them off, to the detriment of such a good resource.

Maybe regular expressions aren't the best solution to a lot of problems, but I for one have learned a bit more about how to use them from your help here. So even if it's not the best way to achieve something in general, surely gaining a bit of knowledge in the process is better than just being sent packing with a comment such as 'don't use reg exps, use the parser'. I think 'help AND guidance' would be a good phrase to employ here!

In the case of my problem, it was solved perfectly with the regular expression format you proposed, and as the use of a full script wasn't appropriate in this instance I thank you for that.

-FM

Last edited by fatmonk : October 12th, 2009 at 05:14 AM.
