#1
  1. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2004
    Posts
    512
    Rep Power
    50

    Strip HTML except <br>


    I have been working on this regex so that it strips out any HTML except the <br>.

    I have it working, but I am "sort of" cheating it because I can't figure out how to get it working 100%. 100% to me is having "br" side by side and it working.

    I hope somebody can give me a suggestion. I'm just not so good at regexes.

    Code:
    s/<[^!r>]*>//g;
    The only reason this works for me is that <br> has an "r" in it. I have tried all sorts of groupings, !(br)(!br), on and on and on. I really thought (br) would work, but when I put "br" together, the regex treats it as "b" or "r", not "b" followed by "r".

    If I have "br" together then the regex also lets <b> and </b> pass.

    I think I am close... just no cigar.

    Thanks.

    Wes
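The behavior described in the question follows from how a bracketed character class works: it matches one character from a set, so putting "br" inside it excludes "b" and "r" individually rather than the string "br". A minimal sketch of that, using a made-up sample string (the variable names and input are illustrative only, not from the original script):

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only to illustrate the behavior.
my $html = '<p>one<br>two <b>bold</b> <i>it</i> <strong>x</strong></p>';

# Original attempt: the class excludes the single characters '!', 'r' and '>',
# so any tag containing an "r" survives, not just <br>.
(my $by_r = $html) =~ s/<[^!r>]*>//g;

# Adding 'b' to the class excludes 'b' as a single character too;
# it does not mean "the string br", so <b> and </b> now survive as well.
(my $by_br = $html) =~ s/<[^!br>]*>//g;

print "$by_r\n";
print "$by_br\n";
```

In this sketch the first substitution leaves `<strong>` in place along with `<br>`, and the second additionally leaves `<b>` and `</b>`, which matches what the question reports.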
  2. #2
  3. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,274
    Rep Power
    0
    You are going to hear this over and over, so I'll just get it out of the way:

    Using regular expressions on HTML will not work 100% of the time, because HTML is not a regular language. For best results you should use an HTML parser.

    Still, I don't want to lecture anyone; especially not knowing the context.

    If you wanted to remove a tag, the general idea might be:

    Code:
    s/<[^>]+?>/ /g;
    Which is to say, from an opening angle bracket, match it and anything that isn't a closing bracket up to and including the first closing bracket we reach (non-greedy match).

    So how to not match 'br' tags? Use a negative look-ahead to make sure that is not the type we have.

    Code:
    s/<(?!br ?\/?>)[^>]+?>/ /g;
    The options in the look ahead are to account for all forms:
    <br> <br/> <br />
    edit: made slight improvement in regex
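As a rough check of the look-ahead version, here is a small sketch that runs the same substitution over a made-up string containing all three `<br>` forms plus a few other tags. The sample input and variable names are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical sample input covering the three <br> forms plus other tags.
my $html = '<p>one<br>two<br/>three<br />four <b>bold</b> <break>x</p>';

# The substitution from the post: after "<", refuse to match when what follows
# is "br", an optional space, an optional slash and ">". Every other tag is
# replaced with a single space.
(my $clean = $html) =~ s/<(?!br ?\/?>)[^>]+?>/ /g;

print "$clean\n";
```

Note that the look-ahead is case-sensitive and expects nothing but an optional space and slash after "br", so an uppercase `<BR>` or a `<br>` carrying attributes would still be removed by this pattern.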

    Comments on this post

    • Will-O-The-Wisp agrees
    • additude agrees : Appreciate!
    Last edited by keath; April 20th, 2015 at 12:12 AM.
  4. #3
  5. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2004
    Posts
    512
    Rep Power
    50
    Thanks again, keath, and thanks for the explanation. It makes better sense to me now, and using HTML::Parser is good advice. It is one of the modules this script loads. Sometimes a regex just seems like a quick and easy fix.

    Appreciate.
  6. #4
  7. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2004
    Posts
    512
    Rep Power
    50
    Closed
