#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    5
    Rep Power
    0

    Need help with HTML tags Regex


    OK, so I have the following regex for identifying HTML tags:
    [<a-z-A-Z-0-9-!@#$%^*&()"-:>]+

    However, this highlights everything :/ Please assist ASAP. Thanks!
  2. #2
  3. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    So :%! and 123 are valid HTML tags? That would surprise me ...

    Your regex just consists of a repeated character class, so any (non-empty) combination of those characters is considered valid. That obviously makes no sense. HTML tags usually look like this:
    Code:
    <h1>
    </p>
    <input type="text" name="password" />
    But maybe you mean something different?
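
To see why the original pattern highlights everything, here is a minimal sketch using Python's `re` module (Python is an assumption; the thread does not say which language or editor is in use, and other engines may read the odd ranges inside the class slightly differently). The names `ORIGINAL` and `highlighted` are made up for illustration.

```python
import re

# The pattern from the question, copied verbatim. It is a single character
# class followed by +, so it accepts any run of the listed characters.
ORIGINAL = re.compile(r'[<a-z-A-Z-0-9-!@#$%^*&()"-:>]+')


def highlighted(text):
    """Return every substring the original pattern would highlight."""
    return ORIGINAL.findall(text)


print(highlighted("Hello <b>world</b>"))
print(bool(ORIGINAL.fullmatch("123")), bool(ORIGINAL.fullmatch(":%!")))
```

Plain words such as `Hello` match just as well as `<b>world</b>`, and both `123` and `:%!` are accepted in full, because nothing in the pattern requires a leading `<` or a trailing `>`.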

    Note that processing HTML with regexes is a really, really bad idea 99% of the time -- although many people seem to love it. Contrary to popular belief, regexes are not an all-powerful parsing tool. They are in fact very limited and can only parse subsets of HTML. So whenever you find yourself trying to parse HTML with regexes, step back and consider using a different approach. Every mainstream language has specialized HTML parsers for exactly that purpose.
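
As one possible illustration of the parser route, here is a small sketch built on Python's standard-library `html.parser` module. The choice of Python is an assumption, and `TagCollector` and `collect_tags` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start and end tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))


def collect_tags(html):
    """Feed a document to the parser and return the recorded tag events."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.events


for event in collect_tags('<h1>Login</h1><input type="text" name="password">'):
    print(event)
```

The parser hands back tag names and attributes already separated, so there is no need to work out where a tag ends by hand.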
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    5
    Rep Power
    0

    How would I modify this?


    I am trying to identify all HTML tags that have characters, numbers, or symbols between them. How would I do that?
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    As Jacques said, it is almost always a bad idea to use regexes to try to parse HTML (or XML, for that matter).

    If you really want to go this way (which could possibly be tolerated for extremely simple operations), you could try something like this:

    Code:
    <[^>]+>
    which means an opening <, followed by one or more characters that are not >, followed by a closing >.

    This is simplistic, but at least it will not consider this:

    Code:
    <center><b><font face="Verdana">Foo Bar </font></b></center>
    as one single long tag stretching from the opening < at the beginning of the line to the closing > at the end of it; instead, it will more or less match the tags individually.
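
A quick way to check this is to run both patterns over the sample line. The sketch below assumes Python's `re` module; the greedy pattern `<.+>` is not from the thread and is included only as the contrasting "one long tag" case. `LINE`, `GREEDY`, `NEGATED` and `tags` are illustrative names.

```python
import re

LINE = '<center><b><font face="Verdana">Foo Bar </font></b></center>'

# Greedy dot: runs from the first < to the last > on the line.
GREEDY = re.compile(r"<.+>")
# Negated class: stops at the first > after each <.
NEGATED = re.compile(r"<[^>]+>")


def tags(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


print(tags(GREEDY, LINE))   # one match covering the whole line
print(tags(NEGATED, LINE))  # six separate tags
```

With the negated class the result is `<center>`, `<b>`, `<font face="Verdana">`, `</font>`, `</b>` and `</center>` as separate matches.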

    However, this will break, for example, if the input is processed line by line and a tag spans more than one line, if an attribute value contains a >, or in many other circumstances. In brief, don't do that except possibly as a one-shot script for extremely simple substitutions.
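
A few of those failure cases can be reproduced with a short sketch, again assuming Python's `re` module. The helper names and the sample inputs are made up for illustration and do not come from the thread.

```python
import re

TAG = re.compile(r"<[^>]+>")


def find_tags(text):
    """Apply the pattern to the whole text at once."""
    return TAG.findall(text)


def find_tags_per_line(text):
    """Apply the pattern one line at a time, as a line-based tool would."""
    found = []
    for line in text.splitlines():
        found.extend(TAG.findall(line))
    return found


# A > inside a quoted attribute value cuts the tag short.
print(find_tags('<a title="a > b">link</a>'))

# Ordinary text containing < and > is reported as a tag.
print(find_tags("1 < 2 and 3 > 2"))

# A tag split across two lines is missed entirely when read line by line.
print(find_tags_per_line('<input type="text"\n       name="password" />'))
```

The first call returns `<a title="a >` instead of the full tag, the second returns `< 2 and 3 >`, which is not a tag at all, and the third returns an empty list.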
