#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2005
    Posts
    199
    Rep Power
    43

    Does not contain the string...


    Hey,

    I am trying to write a regex that finds all the links that do contain images.

    Finding all links is relatively easy:
    "(<a .*</a>)"

    But I can't seem to negate a string.

    Desired output:
    <a href bla bla><img></a> SUCCESS
    <a href bla bla>some other stuff <img> and more stuff</a> SUCCESS
    <a href bla bla><other><tags><and><img></a> SUCCESS
    <a href bla bla><other><tags></a> FAILURE
    <a href bla bla>text only link</a> FAILURE
    <a href bla bla><whatever><tags></a> FAILURE
    <a href bla bla></a><img><a href bla bla></a> FAILURE Most important case

    My gut feeling is that I need to search for:
    <a ###<img ###</a>
    where ### matches anything not containing "</a>"

    The problem is, no matter how I play with it, I can't figure out the ###
    Last edited by videoediting; October 9th, 2008 at 01:59 PM.
  2. #2
  3. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,294
    Rep Power
    9400
    Are you using a preg_* function like ryon mentioned? Because using .*? as you tried originally would fix your problem.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2005
    Posts
    199
    Rep Power
    43

    Originally Posted by requinix
    Are you using a preg_* function like ryon mentioned? Because using .*? as you tried originally would fix your problem.
    Yes, I did try ryon's method. The problem I ran into is the last case, <a></a><img><a></a>: because it cannot find an img inside the first link, the regex does anything it can to force a fit. Thus, it takes the opening <a of the first link and the closing </a> of the second link to make a match. Thanks anyway!

    I did eventually find a solution for the ###. Anyone else interested can use ((?!foobar).)* where foobar is the string that must not appear (</a> in my case)!
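
    For anyone who wants to try this against the test cases from the first post, here is a minimal sketch. It is written in Python rather than PHP, on the assumption that the lookahead syntax behaves the same way in both engines. The names NOT_CLOSE, LINK_WITH_IMG and links_with_images are made up for illustration, and the DOTALL flag is my own addition so that links spanning several lines still match.

```python
import re

# "###" from the first post: any run of characters that never starts "</a>".
# This is the ((?!foobar).)* idea with foobar = </a>, non-capturing here.
NOT_CLOSE = r"(?:(?!</a>).)*"

# <a ###<img ###</a>
LINK_WITH_IMG = re.compile(
    r"<a " + NOT_CLOSE + r"<img" + NOT_CLOSE + r"</a>",
    re.DOTALL,  # assumption: let "." cross newlines
)

def links_with_images(html):
    """Return every <a ...>...</a> span that has an <img before its own </a>."""
    return [m.group(0) for m in LINK_WITH_IMG.finditer(html)]

# The "most important case": the <img> sits between two links, so nothing matches.
print(links_with_images("<a href bla bla></a><img><a href bla bla></a>"))  # []
print(links_with_images("<a href bla bla><img></a>"))
```

    Because the lookahead is checked before every single character, the match can never run past the first </a>, which is what stops it from borrowing the closing tag of the next link.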
  6. #4
  7. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,898
    Rep Power
    3887
    To be honest, I'd never write a regexp to parse HTML. From requinix's post, it appears you're using PHP. I'm no PHP programmer, but I'd advise using a proper tag-aware HTML parser for this sort of thing. It tends to make things an awful lot easier.
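
    As a rough illustration of the parser route, here is a sketch using the HTMLParser class from Python's standard library. It is in Python only because that is easy to run standalone; a PHP application would need an equivalent parser. The class and function names are hypothetical, and the sketch assumes links are not nested inside each other.

```python
from html.parser import HTMLParser

class LinkImageFinder(HTMLParser):
    """Records, for each closed <a> element, whether it contained an <img>."""

    def __init__(self):
        super().__init__()
        self.in_link = False   # currently between <a ...> and </a>
        self.has_img = False   # an <img> was seen inside the current link
        self.results = []      # one boolean per link, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.has_img = False
        elif tag == "img" and self.in_link:
            self.has_img = True

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.results.append(self.has_img)
            self.in_link = False

def image_flags_per_link(html):
    finder = LinkImageFinder()
    finder.feed(html)
    finder.close()
    return finder.results

# An <img> between two links belongs to neither of them.
print(image_flags_per_link('<a href="x"></a><img src="i.png"><a href="y"></a>'))  # [False, False]
```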
  8. #5
  9. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,294
    Rep Power
    9400
    Originally Posted by videoediting
    Yes, I did try ryon's method. The problem I ran into is the last case, <a></a><img><a></a>: because it cannot find an img inside the first link, the regex does anything it can to force a fit. Thus, it takes the opening <a of the first link and the closing </a> of the second link to make a match.
    Where in your regular expression did you mention an <img>? I don't see it.
    <a .*?</a> means "<a " then as few characters as possible until the next "</a>". Nothing more.
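
    A quick way to see the difference is to run the greedy and lazy patterns over the same string. The sketch below uses Python's re module as a stand-in for preg_*, and the sample string is made up. The third pattern is an assumption about what happens once an <img is added to the lazy version: lazy matching prefers the shortest match, but it will still run past a </a> if that is the only way to find an <img.

```python
import re

sample = "<a href=1>one</a><img><a href=2>two</a>"

# Greedy: ".*" runs to the end of the string, then backs up to the LAST "</a>".
greedy = re.findall(r"<a .*</a>", sample)

# Lazy: ".*?" stops at the FIRST "</a>" after each "<a ".
lazy = re.findall(r"<a .*?</a>", sample)

# Lazy with an <img> requirement: the first ".*?" keeps growing until it
# finds "<img", even though that means crossing the first link's "</a>".
lazy_img = re.findall(r"<a .*?<img.*?</a>", sample)

print(greedy)    # one match covering the whole sample
print(lazy)      # ['<a href=1>one</a>', '<a href=2>two</a>']
print(lazy_img)  # one match covering the whole sample
```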
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2005
    Posts
    199
    Rep Power
    43

    Originally Posted by requinix
    Where in your regular expression did you mention an <img>? I don't see it.
    <a .*?</a> means "<a " then as few characters as possible until the next "</a>". Nothing more.
    Yes, in the thread Ryon responded to, I did not mention the <img> tag because I didn't think it was relevant at the time. That's why I started a new thread: to avoid confusion. I am providing extremely simplistic test cases so that you guys don't have to painstakingly dig through 200-character regular expressions. In reality, these are parts of much larger expressions.

    If you read the initial post in this thread, you will see my references to the img tag.

    With regard to ishnid's point, I agree 100%. In this case, however, where there are only two or three large regular expressions in the entire web application, it's probably not worth introducing a new dependency or dealing with additional code. But it's a good idea, and I appreciate the intent!
