#1
    Registered User
    Devshed Newbie (0 - 499 posts)
    Join Date: Dec 2006
    Posts: 18
    Rep Power: 0

    Preg_replace issue with <blockquote>s!


    Been scratching my head for the last 3 hours now... given up!

    This is the string:
    Code:
    <blockquote><p><a href="#comment-45" rel="nofollow">Quote</a> from <a href="http://www.somedomain.com" title="MSolution" rel="nofollow">MSolution</a>:<br />
    <blockquote><a href="#comment-44" rel="nofollow">Quote</a> from <a href="http://www.somedomain.com" title="MSolution" rel="nofollow"> MSolution</a>: this is me quoting admin</p></blockquote>
    <p>this is me quoting me</p></blockquote>
    <p>this is what i think about 45</p>
    <blockquote><p><a href="#comment-3" rel="nofollow">Quote</a> from <a href="http://www.somedomain.com" title="MSolution" rel="nofollow">MSolution</a>:<br />
    <blockquote><a href="#comment-3" rel="nofollow">Quote</a> from <a href="http://www.dir.vc" title="admin" rel="nofollow">admin</a>: this is first spam</p></blockquote>
    <p>this is me quoting admin</p></blockquote>
    <p>this is what i think about 44</p>
    and this is what i want out of it:

    Code:
    this is what i think about 45
    this is what i think about 44
    PHP Code:
    $str = preg_replace("/<blockquote(.*?)>((.|\n)*?)(<\/blockquote>)/i", "", $str);
    $str = preg_replace("/<\/?p(.*?)>/is", "", $str);

    please help!

    M.
#2
    Transforming Moderator
    Devshed Supreme Being (6500+ posts)
    Join Date: Mar 2007
    Location: Washington, USA
    Posts: 14,245
    Rep Power: 9400
    Moved from PHP; thread title edited to be more specific.


    First step is to remove the <blockquote>s. You can't just remove everything between an opening and a closing tag: with nested or multiple quotes the match can end at the wrong closing tag, so you either leave part of a quote behind or remove stuff you wanted to keep.
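
    A minimal sketch of that pitfall, assuming made-up, simplified input (the $nested and $siblings strings are hypothetical, not the data from the first post). A lazy .*? stops at the first closing tag and leaves half of a nested quote behind; a greedy .* runs to the last closing tag and swallows the reply sitting between two quotes:

```php
<?php
// Hypothetical inputs, simplified from the kind of markup in the question.
$nested   = '<blockquote>outer <blockquote>inner</blockquote> tail</blockquote><p>reply</p>';
$siblings = '<blockquote>a</blockquote><p>keep me</p><blockquote>b</blockquote><p>reply</p>';

// Lazy: the match ends at the FIRST </blockquote>, which closes the inner quote.
$lazy = preg_replace('#<blockquote[^>]*>.*?</blockquote>#is', '', $nested);

// Greedy: the match ends at the LAST </blockquote> in the whole string.
$greedy = preg_replace('#<blockquote[^>]*>.*</blockquote>#is', '', $siblings);

var_dump($lazy);   // part of the outer quote (" tail</blockquote>") is left behind
var_dump($greedy); // "<p>keep me</p>" is gone along with both quotes
```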

    Recursion to the rescue. Basically,
    1. Find a <blockquote>
    2. Look for a bunch of stuff, one or more of the following:
    2a. Text that isn't an HTML tag
    2b. An HTML tag that isn't an opening or closing <blockquote>
    2c. Or a nested quote, by trying to match the entire regex again right there
    3. Find a closing </blockquote>
    Code:
    #<blockquote[^>]*>([^<]+|<(?!/?blockquote)[^>]*>|(?R))+</blockquote>#i
    Replacing every match of that with an empty string removes the quotes, so long as the tags are paired. If the opening and closing tags don't match up, the unmatched quotes stay in the string...
    Which is one reason you next grab the stuff between the <p>s, not just try to remove the remaining HTML.
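
    Put together, the two steps might look like the sketch below. It assumes a shortened copy of the sample string from the first post (attributes trimmed, same nesting). The variable names and the <p> pattern in step 2 are assumptions for illustration, since only the blockquote expression is given above.

```php
<?php
// Shortened copy of the sample markup: same nesting, fewer attributes.
$str = <<<'HTML'
<blockquote><p><a href="#comment-45">Quote</a> from MSolution:<br />
<blockquote><a href="#comment-44">Quote</a> from MSolution: this is me quoting admin</p></blockquote>
<p>this is me quoting me</p></blockquote>
<p>this is what i think about 45</p>
<blockquote><p><a href="#comment-3">Quote</a> from MSolution:<br />
<blockquote><a href="#comment-3">Quote</a> from admin: this is first spam</p></blockquote>
<p>this is me quoting admin</p></blockquote>
<p>this is what i think about 44</p>
HTML;

// Step 1: remove every properly paired quote, including nested ones.
$quote = '#<blockquote[^>]*>([^<]+|<(?!/?blockquote)[^>]*>|(?R))+</blockquote>#i';
$noQuotes = preg_replace($quote, '', $str);

// Step 2: grab what sits between the remaining <p>...</p> pairs
// (assumed pattern; \b keeps it from matching tags such as <pre>).
preg_match_all('#<p\b[^>]*>(.*?)</p>#is', (string) $noQuotes, $m);
$paragraphs = $m[1];

echo implode("\n", $paragraphs), "\n";
```

    For this input the script prints the two lines asked for in the first post:

    this is what i think about 45
    this is what i think about 44

    preg_replace() returns null when the match fails with an error, which can happen on long or badly unbalanced input where the nested + quantifiers backtrack heavily, so real code should check for that instead of casting.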
#3
    Registered User
    Devshed Newbie (0 - 499 posts)
    Join Date: Dec 2006
    Posts: 18
    Rep Power: 0
    Thanks a lot, worked like a charm!

    OK,
    1. I understood 2a and got my head around 2c, but 2b went over my head!

    So which part of it is the "NOT" directive there?

    Code:
    <(?!/?blockquote)[^>]*>
    A ? means the thing before it is optional, right? So why is it the first thing inside the ()? There's nothing before it.

    2. Regex is not my favourite part, but I'm learning here...

    a. What is the difference between [^>] and [^<]?


    b. Some people use quotes ("), some use a pipe (|), some use backticks, and some, like you, use a hash (#) to quote the regex. Is there any difference?

    Thanks in advance
    M.
#4
    Transforming Moderator
    Devshed Supreme Being (6500+ posts)
    Join Date: Mar 2007
    Location: Washington, USA
    Posts: 14,245
    Rep Power: 9400
    I'll try answering all that with a more-or-less full breakdown of the expression.
    Code:
    #<blockquote[^>]*>([^<]+|<(?!/?blockquote)[^>]*>|(?R))+</blockquote>#i
    • #...# is the delimiter. It can be almost any character that isn't a letter, digit, backslash, or whitespace, but if you want to use that symbol inside the expression it has to be escaped. / ! # ~ are most common. The quotes around the whole thing are just PHP string quotes, not part of the regex.
    • [^...] is a negated character set: it matches any one character except those listed. [^>] is "not a greater-than" and [^<] is "not a less-than".
    • (?!...) is a negative lookahead, and that's the "NOT" part. For the matching to continue there must not be a "..." starting at the next character. The ? here isn't the "optional" quantifier; a "(?" is the beginning of something special: (?=...) is a positive lookahead, (?i) enables case-insensitive mode, (?R) means recursion...
    • (?R) is for recursion: try to match the whole expression again, from its beginning, at that point in the string.
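
    Each of those pieces can be tried on its own. A small sketch, assuming throwaway test strings and variable names that are not from the thread:

```php
<?php
// Delimiters: the same expression written three ways. Only the delimiter
// character itself needs escaping inside the pattern.
$slash = preg_match('/<\/p>/', '</p>');  // "/" delimits, so the "/" in </p> is escaped
$hash  = preg_match('#</p>#', '</p>');   // "#" delimits, "/" needs no escape
$tilde = preg_match('~</p>~', '</p>');

// Negated sets: [^>]* runs up to the next ">", [^<]+ runs up to the next "<".
preg_match('#<[^>]*>#', '<a href="x">link</a>', $tag);  // $tag[0] is '<a href="x">'
preg_match('#[^<]+#', 'plain text<br />', $text);       // $text[0] is 'plain text'

// Negative lookahead: any tag, as long as it is not <blockquote> or </blockquote>.
$notQuoteTag = '#<(?!/?blockquote)[^>]*>#i';
$p      = preg_match($notQuoteTag, '</p>');           // 1: an ordinary tag matches
$openQ  = preg_match($notQuoteTag, '<blockquote>');   // 0: rejected by the lookahead
$closeQ = preg_match($notQuoteTag, '</BLOCKQUOTE>');  // 0: the i flag applies inside it too

// Recursion: (?R) lets the whole expression match a quote nested inside a quote.
$quote = '#<blockquote[^>]*>([^<]+|<(?!/?blockquote)[^>]*>|(?R))+</blockquote>#i';
preg_match($quote, 'x<blockquote>a<blockquote>b</blockquote>c</blockquote>y', $m);
// $m[0] is '<blockquote>a<blockquote>b</blockquote>c</blockquote>'
```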


    Check out the resources sticky.
