#1
  1. No Profile Picture
    The Monk that is Fat.
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Location
    UK
    Posts
    107
    Rep Power
    14

    Match a string except in markup


    This surely MUST have been asked before, but I can't find it anywhere...

    I want to match a string except when that string forms part of the markup...

    e.g.

    I want to match the string "for" in the following (pseudo-code only):

    <form etc etc id="color">
    This is for testing. Red is a color.
    </form>

    Only the 'for' in 'This is for testing' should be matched.

    Also, if I was searching for 'color' only the second 'color' should be matched, not the color in the <form> tag.

    I'm hitting a complete mental block with this one.

    -FM
  2. #2
  3. No Profile Picture
    The Monk that is Fat.
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Location
    UK
    Posts
    107
    Rep Power
    14
    Okay, now I'm talking to myself!

    I think I have made some progress using a negative lookahead, but I'm not convinced this is quite right so am throwing this approach open for criticism and finger pointing.

    the regexp I have is thus:
    Code:
    for(?!.*?>)
    the text to try it on is thus:
    Code:
    <form etc etc color=red>
    this is for testing. this is a color.
    </form>
    the second regexp is thus:
    Code:
    color(?!.*?>)
    Both of these seem to work, but as I say I may be opening myself up to problems here.

    Anyone care to point out situations where this may not work?

    Or anyone care to let me know if this looks like a safe way of searching for a string that is not contained within the actual markup tags of a document?

    Thanks,

    FM

    [Edit: added the non-greedy ? to the negative lookahead]
    Last edited by fatmonk; October 7th, 2009 at 09:22 AM.
  4. #3
  5. No Profile Picture
    The Monk that is Fat.
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Location
    UK
    Posts
    107
    Rep Power
    14
    I found an example of it not working, so hopefully someone can now give me some pointers to where I am going wrong and how this can be corrected.

    Using the same regexp above on the following text:

    Code:
    <form action="http://someURL">I am 
    all for regular expressions, but they are a nightmare to use</form><br/> 
    This for, for example gets found but the previous one does not!
    The second and third 'for's are found, but the first isn't.

    If you add a "<br />" tag to the end of the last line, giving:

    Code:
    <form action="http://someURL">I am 
    all for regular expressions, but they are a nightmare to use</form><br/> 
    This for, for example gets found but the previous one does not!<br />
    then none of the 'for's are found. This points to the </form> and <br/> on the second line being the problem - or at least the closing > on each tag.

    Sure enough, removing those > characters (invalid HTML obviously, but just for testing) means that the first 'for' is found.

    Grrr...

    So, I guess negative lookaheads are probably the right approach, I just ain't getting it right...

    Any help appreciated.

    Ta,

    FM
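
    The behaviour described in this post can be reproduced in JavaScript. This is a minimal sketch, assuming the sample text is joined with newlines; the `matchedLines` helper and the `action` spelling of the attribute are made up for illustration. In JavaScript `.` does not match line terminators unless the `s` flag is set, so the lookahead only inspects the rest of the current line, and any `>` later on that line blocks the match.

```javascript
// Sample text modelled on the post above; the lines are joined with "\n".
const text = [
  '<form action="http://someURL">I am',
  'all for regular expressions, but they are a nightmare to use</form><br/>',
  'This for, for example gets found but the previous one does not!',
].join('\n');

// The attempted pattern, with the g flag added so every match is reported.
// It rejects "for" whenever a ">" appears later on the SAME line.
const attempt = /for(?!.*?>)/g;

// Hypothetical helper: 0-based line index of every match.
function matchedLines(regex, input) {
  const lines = [];
  for (const m of input.matchAll(regex)) {
    lines.push(input.slice(0, m.index).split('\n').length - 1);
  }
  return lines;
}

// Both matches are on the last line (index 2); the "for" on the
// middle line is skipped because "</form><br/>" follows it.
console.log(matchedLines(attempt, text));

// With "<br />" appended to the last line, nothing matches at all.
console.log(matchedLines(attempt, text + '<br />'));
```

    So the negative lookahead is really testing "no > anywhere on the rest of this line", which is a different condition from "not inside a tag".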
  6. #4
  7. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,897
    Rep Power
    3886
    My normal advice on this sort of question would be not to use regular expressions at all. Most (all?) programming languages will have parsers available for parsing HTML that will be more robust than anything you're likely to be able to come up with yourself.

    Let the parser take care of separating tags from content, and then you're just matching within the part of the page you're interested in.
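
    As a rough illustration of separating tags from content first, here is a sketch in JavaScript. It does not use a real HTML parser: a simple `split` on anything that looks like a tag stands in for one, so treat it as an assumption-laden approximation rather than a robust solution. The `findInText` name is invented for this example.

```javascript
// Naive stand-in for a parser: split the markup into tag and text segments.
// Because the pattern has a capture group, split() keeps the tags in the
// result, so even-numbered entries are text and odd-numbered entries are tags.
function findInText(html, needle) {
  if (!needle) return [];
  const segments = html.split(/(<[^>]*>)/);
  const hits = [];
  let offset = 0; // start of the current segment in the original string
  segments.forEach((segment, i) => {
    if (i % 2 === 0) {
      // Text segment: record the position of every occurrence.
      let at = segment.indexOf(needle);
      while (at !== -1) {
        hits.push(offset + at);
        at = segment.indexOf(needle, at + needle.length);
      }
    }
    offset += segment.length;
  });
  return hits;
}

const html = '<form id="color">This is for testing. Red is a color.</form>';
console.log(findInText(html, 'for'));   // offsets of "for" outside tags
console.log(findInText(html, 'color')); // offsets of "color" outside tags
```

    The original string is never modified; the function only reports offsets into it, so the HTML stays intact.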

    Comments on this post

    • prometheuzz agrees : Exactly. This is the work for a parser, not regular expressions!
  8. #5
  9. No Profile Picture
    The Monk that is Fat.
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Location
    UK
    Posts
    107
    Rep Power
    14
    Thanks ishnid, unfortunately in this case that's not really an option.

    I need to achieve this in a single (short) line of JavaScript and the HTML has to remain intact (so I can't just strip the HTML before doing the match).

    As you'll see from the above, I seem to be getting quite close, but I think I need to add an extra lookahead (or lookbehind) so that I effectively ignore > characters if they follow a < character (if you see what I mean).

    Trying to put the whole thing into a plain english logic statement is even a bit of a struggle at the moment, which is probably why I'm failing to build the correct regexp. I was kind of hoping that someone had done it before and that there might be a cookbook recipe for doing exactly this.

    Let me have a go at plain english logic:

    "Match the string 'for' in the text
    if there is no > character between the match and the next < character or the end of the text".

    That would seem to cover it to me, but I'm not 100% convinced.

    It could also be written as:

    "Match the string 'for' in the text
    if the next character matching < or > is <".

    Maybe that is easier to write as a regexp, but as I've said I'm blanking on how to do it (I only ever seem to have to resort to regexps every 6 months or so, so I get very rusty).

    The regexp I've got so far
    Code:
    /for(?!.*?>)/
    , I believe translates into plain english logic as:

    "Match the string 'for' in the text
    if it is NOT followed by a > character after any number of other characters".

    -FM
  10. #6
  11. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,122
    Rep Power
    4258
    Why use a complex regexp when you can use the DOM to loop through the elements and thus ignore them in the regexp comparison?
  12. #7
  13. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by fatmonk
    Thanks ishnid, unfortunately in this case that's not really an option.

    I need to achieve this in a single (short) line of JavaScript and the HTML has to remain intact ...
    You can parse the (x)html without altering it of course.
    I must say that I agree with the other members: this sounds like a task for a true parser, not regex. That said, the following regex might suit your needs:

    php Code:
    $regex = '/color(?=[^<>]*(<|$))/';


    which matches the string 'color' only when, after zero or more characters other than '<' and '>', the next thing found is either the character '<' or the end of the string.

    The dissected regex:

    php Code:
    color      // match 'color'
    (?=        // start positive look-ahead
      [^<>]*   //   zero or more characters other than '<' and '>'
      (        //   start group 1
        <      //     match '<'
        |      //     OR
        $      //     the end of the string
      )        //   end group 1
    )          // end positive look-ahead
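
    Since the original question asked for a single line of JavaScript, the same pattern can be sketched as a JavaScript regex literal. The `g` flag, the `<b>` wrapping, the `action` attribute spelling and the variable names are assumptions added for illustration. The negated class `[^<>]` matches newlines too, so the lookahead is not limited to the current line. A stray `<` or `>` inside an attribute value or in the text itself can still fool it.

```javascript
// JavaScript form of the pattern above, applied to 'for'.
// The g flag is an addition so that every occurrence is handled.
const outsideTags = /for(?=[^<>]*(<|$))/g;

// Sample text modelled on the earlier posts, joined with "\n".
const html = [
  '<form action="http://someURL">I am',
  'all for regular expressions, but they are a nightmare to use</form><br/>',
  'This for, for example gets found but the previous one does not!<br />',
].join('\n');

// Wrap each match in <b>...</b>; "$&" inserts the matched text.
// The "for" inside <form ...> and </form> is left alone.
const highlighted = html.replace(outsideTags, '<b>$&</b>');
console.log(highlighted);
```

    The one-liner is the `replace` call; everything else is test scaffolding.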
  14. #8
  15. No Profile Picture
    The Monk that is Fat.
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Location
    UK
    Posts
    107
    Rep Power
    14
    I was beginning to think that this forum was all about discouraging people from using regular expressions for a while there...

    BUT prometheuzz has got it!

    That does the job nicely. I see I was doing the wrong kind of lookahead - I knew what I needed to do but just couldn't get my head around how to structure the regexp.

    Ta v much,

    FM
  16. #9
  17. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,897
    Rep Power
    3886
    Originally Posted by fatmonk
    I was beginning to think that this forum was all about discouraging people from using regular expressions for a while there...
    Part of the skill of using regular expressions is knowing when there are more appropriate/reliable alternatives.
  18. #10
  19. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by fatmonk
    I was beginning to think that this forum was all about discouraging people from using regular expressions for a while there...
    ...
    Don't go acting like a smarty-pants now.

    Like I said: the people who said that regex might not be the right tool for the job are quite right. I clearly stated that I agreed with them.

    By posting a remark like that, you insinuate that their contributions are not of value in this thread or that they're wrong. Obviously, this is NOT the case. Perhaps you didn't mean to sound this way, but that is how it appears to me.
  20. #11
  21. No Profile Picture
    The Monk that is Fat.
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Location
    UK
    Posts
    107
    Rep Power
    14
    Look, I appreciate your help on this, but there's no need to be touchy about it!

    When I was searching for a way to do this all I kept finding were responses discouraging the use of regular expressions. That's why I made the comment - not a 'smarty-pants' comment at all, simply an observation.

    As I believe I mentioned, the problem I was trying to solve didn't allow the use of a full blown script and a simple regular expression was almost doing the job.

    As your solution proved, it was a trivial matter (at least for someone with your obvious regular expression experience) to modify the expression to do what I needed. There was no need, in this case, to resort to creating a full script to achieve the required end result.

    While I bow to your clear superiority in the regular expression arena, looking back over this forum there seems to be a lot of hostility from a number of people directed to people who come here looking for help with regular expressions.

    I certainly didn't intend to offend anyone. However, if my comments prompt people to think before they are so dismissive of others who are seeking their help then that's a step in the right direction in my opinion. I've used devshed for years on and off (both to get help, and where I can to give a little assistance as well) and find it an invaluable resource. However, the hostility that some people might find here is sure to put them off, to the detriment of such a good resource.

    Maybe regular expressions aren't the best solution to a lot of problems, but I for one have learned a bit more about how to use them from your help here. So even if it's not the best way to achieve something in general, surely gaining a bit of knowledge in the process is better than just being sent packing with a comment such as 'don't use reg exps, use the parser'. I think 'help AND guidance' would be a good phrase to employ here!

    In the case of my problem, it was solved perfectly with the regular expression format you proposed, and as the use of a full script wasn't appropriate in this instance I thank you for that.

    -FM

    Comments on this post

    • prometheuzz disagrees : No one was hostile. And the comment does seem like a snide remark towards people who tried to help you.
    Last edited by fatmonk; October 12th, 2009 at 05:14 AM.
