#1
  1. Amateur Webdev'er
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2003
    Posts
    141
    Rep Power
    11

    Replacing double quotes with expression


    I have an HTML document that contains a lot of double quotes ( " ) within the text that I would like to replace with &quot;, but I do not want to affect the quotes inside HTML tags, as in <tag attribute="value">.

    My WYSIWYG editor allows regular expression replacement with POSIX or Perl type expressions. I figured the Dev Shed guys would know, as the cryptic-ness of regex strings has always eluded my neanderthal-level brain.

    Thanks in advance.
    Last edited by xpatriot; January 9th, 2012 at 07:50 AM.
  2. #2
  3. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    Interesting problem, because in most regex flavors lookbehinds must have a fixed length.
    Matching all quotes except the opening quote of the attribute presents no problem with a negative lookbehind:
    Code:
    (?<!attribute=)"
    But you cannot add a negative lookbehind for the closing quote, because the value that precedes it can be any length.
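A minimal Python sketch of the fixed-length negative lookbehind above, assuming Python's built-in re module and a made-up sample string. It shows that every quote is converted except the one directly after attribute=, so the closing quote of the value is still hit:

```python
import re

# Hypothetical sample input, not taken from the original document.
sample = '<tag attribute="value"> She said "hi"'


def replace_unprotected_quotes(text):
    # (?<!attribute=) is a fixed-length negative lookbehind: it skips only
    # the quote that immediately follows the literal text attribute=
    return re.sub(r'(?<!attribute=)"', '&quot;', text)


print(replace_unprotected_quotes(sample))
```

The opening quote of the value survives, but the closing one does not, which is why the placeholder approach below is needed.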

    As a quick sure-fire solution to get around this, I suggest you first replace all the quotes that you do not want to replace with something unique, such as @_HEY_@
    Then replace all the quotes with your &quot; sequence.
    Then turn the @_HEY_@ back to quotes.

    1. Save your file and try this on a COPY of your file!!!

    2. First replacement:
    Code:
    Find: attribute="([^"]+)"
    Replace with: attribute=@_HEY_@$1@_HEY_@
    Note that some editors will use \1 instead of $1.

    3. Second replacement: Match all " and replace with &quot;

    4. Third replacement: Match all @_HEY_@ and replace with "
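The three replacements above can be sketched in Python, assuming the built-in re module. The function name and the sample string are made up for illustration, and the placeholder must not already occur in the document:

```python
import re

PLACEHOLDER = "@_HEY_@"


def escape_quotes_keep_attribute(text):
    # First replacement: protect the quotes around attribute="..."
    text = re.sub(r'attribute="([^"]+)"',
                  'attribute=' + PLACEHOLDER + r'\1' + PLACEHOLDER,
                  text)
    # Second replacement: every remaining quote is ordinary text
    text = text.replace('"', '&quot;')
    # Third replacement: turn the placeholders back into quotes
    return text.replace(PLACEHOLDER, '"')


# Hypothetical sample input
sample = 'He said "hello" <tag attribute="value"> and "bye"'
print(escape_quotes_keep_attribute(sample))
```

Like the editor version, this only protects an attribute literally named attribute; other attribute names would need their own first replacement.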

    Please let me know if this works for you.

    Comments on this post

    • Kravvitz agrees : That's the approach I'd use too.
  4. #3
  5. Amateur Webdev'er
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2003
    Posts
    141
    Rep Power
    11
    Works great!

    Unfortunately batching these replacements for the entire document blows up my editor (Bluefish) but that's on me -- I'll find another.

    Again, thanks ragax.
  6. #4
  7. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    You're welcome, xpatriot!

    Originally Posted by xpatriot
    Unfortunately batching these replacements for the entire document blows up my editor (Bluefish) but that's on me -- I'll find another.
    For regex search-and-replace I use EditPadPro (demo here) or Dreamweaver (expensive but mentioning it in case it's on your system). Recently I also tried AbaReplace and found the interface quite efficient. I haven't tried Notepad++ but I hear that its regex is rather rudimentary. Might work for your needs, though.
  8. #5
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    29
    Rep Power
    0
    Ragax suggested a method that works, but you will have to replace

    Code:
    Find: attribute="([^"]+)"
    Replace with: attribute=@_HEY_@$1@_HEY_@
    for each attribute. I would recommend:

    Code:
    Find: (?<=<[^>]+)"
    Replace with: @_HEY_@
    Though not all search-and-replace tools support variable-length lookbehind (mine does).

    Another idea is using an HTML parser, for example, in Python. It can handle the edge cases: unclosed quotes in attributes, > inside an attribute value, etc. But it requires writing a script (see http://www.abareplace.com/blog/html_convertor/#bad-practice).
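A rough sketch of the parser idea, assuming Python 3's standard html.parser module. The class and function names are made up for illustration. It re-emits each start tag exactly as written and only touches quotes in text nodes. Treat it as a starting point rather than a finished converter: end tags are re-emitted in lowercase, and entities written without a trailing semicolon gain one.

```python
from html.parser import HTMLParser


class QuoteEscaper(HTMLParser):
    """Re-emit HTML, changing only '"' in text nodes to '&quot;'."""

    RAW_TEXT_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=False keeps existing entities such as &amp; intact
        super().__init__(convert_charrefs=False)
        self.out = []
        self.raw_depth = 0

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it was written
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT_TAGS:
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.RAW_TEXT_TAGS and self.raw_depth:
            self.raw_depth -= 1

    def handle_data(self, data):
        # Leave script and style bodies alone; escape quotes elsewhere
        if self.raw_depth:
            self.out.append(data)
        else:
            self.out.append(data.replace('"', "&quot;"))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def escape_text_quotes(html):
    parser = QuoteEscaper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(escape_text_quotes('Hey you, "Dont_touch_my_quotes!" <div class="value">'))
```

Because the tag text is copied verbatim, a > inside a quoted attribute value does not confuse the replacement the way it would with the regex approaches.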
  10. #6
  11. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    Hi abareplace,

    You are quite right that my solution was specific to an attribute called "attribute". Reading more closely (like you did), I see that a more general solution is more desirable! Totally agree about that.

    I think we need to tweak the tweak you made to my Find expression.

    Three reasons:
    1. As mentioned in my first post, lookbehinds must have a fixed length in most regex flavors (.NET is an exception). For instance, the plus sign in your lookbehind won't compile in PHP. xpatriot says his editor accepts Perl or POSIX expressions. I believe POSIX doesn't support lookarounds and Perl doesn't support variable-length lookbehinds.

    2. [Edit: I was wrong about this. See 3.] If the lookbehind worked, the Find expression you suggested would only replace one of the two quotes of "value" (presumably, you were aiming for the first one?): the other quote would be replaced with &quot; on xpatriot's next replace operation, which is what he is trying to avoid.

    3. [Edit: I was wrong about this. A few posts below, abareplace explains the matching process of variable-length lookbehinds] If the lookbehind worked, because [^>]+ is greedy, it would actually eat the opening quote! It would go all the way to the first > and backtrack. So the quote you'd replace would be the closing quote of "value".
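Reason 1 is easy to check in any flavor that enforces the restriction. A small sketch using Python's built-in re module (not one of the tools discussed in this thread), which accepts fixed-width lookbehinds and rejects variable-width ones when the pattern is compiled. The helper name is made up:

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-width lookbehind from the first post
print(compiles(r'(?<!attribute=)"'))
# Variable-width lookbehind: the + makes the length unbounded
print(compiles(r'(?<=<[^>]+)"'))
```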

    I still don't know a generic solution that will work in xpatriot's editors. But I can suggest a solution that will work in PHP. xpatriot, can you process the files where you want to do the search-and-replace in PHP? This might overcome your editor limitations. Take a look at this second, "generic" solution and see if it might work better for you.

    Input:
    Hey you, "Dont_touch_my_quotes!" <div class="value"> <a href="whatever">

    Code:
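    <?php
    $string = 'Hey you, "Dont_touch_my_quotes!" <div class="value"> <a href="whatever">';
    // Commas delimit the pattern. \K discards everything matched so far,
    // so only the quoted value that follows it is replaced.
    $pattern = ',(?s)<[^>]+?\K"([^"]+)",';
    $replace = '@_HEY_@\1@_HEY_@';
    $s = preg_replace($pattern, $replace, $string);
    // htmlentities() keeps the tags visible when the result is viewed in a browser.
    echo htmlentities($s).'<br />';
    ?>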
    Output:
    Hey you, "Dont_touch_my_quotes!" <div class=@_HEY_@value@_HEY_@> <a href=@_HEY_@whatever@_HEY_@>
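For readers without PHP, roughly the same replacement can be sketched in Python. Python's built-in re module has no \K, so this version captures the text before the quote in a group and writes it back. The (?s) flag is dropped because the pattern contains no dot. The function name is made up for illustration:

```python
import re


def protect_first_attribute(text, placeholder="@_HEY_@"):
    # Group 1 stands in for \K: the part of the tag before the quote is
    # captured and re-emitted unchanged. Group 2 is the attribute value.
    pattern = re.compile(r'(<[^>]+?)"([^"]+)"')
    return pattern.sub(
        lambda m: m.group(1) + placeholder + m.group(2) + placeholder,
        text)


sample = 'Hey you, "Dont_touch_my_quotes!" <div class="value"> <a href="whatever">'
print(protect_first_attribute(sample))
```

On this input, where each tag has a single attribute, the result should match the PHP output shown above.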

    Good brainstorm, aba, glad you brought up the idea of a general solution...
    xpatriot, let us know if we can help you further!

    Wishing you both a fun weekend.
    Last edited by ragax; January 14th, 2012 at 03:18 PM. Reason: Tags places that are wrong
  12. #7
  13. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93

    Added bug report about replacement


    Addendum:

    1. I just ran a test in aba's very own abareplace tool. Guess what... It supports variable-length lookbehinds! Very cool!!!

    2. I should add that only a few days ago, aba made me aware of silly errors I had written... You never know when you're going to be half asleep at the computer, so with regex it's good to have multiple pairs of eyes.

    3. In light of #1, we can craft a solution that works in the abareplace search-and-replace tool, to give xpatriot a second option. I'd suggest starting with aba's solution, making the lookbehind lazy rather than greedy, and adding the rest of the string up to the second quote.

    Find: (?<=<[^>]+?)"([^"]+)"
    Replace: @_HEY_@\1@_HEY_@

    Test string:
    Hey you, "Dont_touch_my_quotes!" <div class="value"> <a href="whatever">


    Tested it in abareplace... The match is perfect!!!

    Edit: So is the replacement. At first I said "The \1 in the replacement is not working... I must have the wrong back reference syntax."
    Now it works. There seems to be a slight bug. At first, the replacement was not showing in the highlighted replacement text (it just showed @_HEY_@@_HEY_@ with nothing in the middle). I closed ABA, reopened it: the replacement text was now showing, e.g. @_HEY_@value@_HEY_@.

    Aba, you may want to look into this. Next, when I delete the \1, the replacement text disappears. When I put the \1 back, the replacement text does not reappear (unless I close and reopen again). Small bug.

    The bottom line is that this replacement tool is very powerful. I'm impressed with the variable-length lookbehind, that's very sweet!


    xpatriot, I hope all this helps... I think you're home free.
  14. #8
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    29
    Rep Power
    0
    Thank you very much for reporting the bug in my software.

    Originally Posted by ragax
    If the lookbehind worked, because [^>]+ is greedy, it would actually eat the opening quote!
    Not quite. The regex with lookbehind works like this:

    1) It finds all quotes in the text.
    2) It steps back from each quote and checks whether <[^>]+ matches immediately before it.

    The quote is checked first, then the lookbehind. So there are no problems with greediness or replacing one of the two quotes.
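The two steps can be imitated by hand in a flavor without variable-length lookbehind. This is an emulation sketched in Python, not how any regex engine is actually implemented: it finds each quote, then tests whether the text before it ends with something matching <[^>]+. The function name is made up:

```python
import re

# The text before the quote must end with '<' plus one or more non-'>' chars
BEFORE_QUOTE = re.compile(r"<[^>]+\Z")


def protect_tag_quotes(text, placeholder="@_HEY_@"):
    out = []
    last = 0
    # Step 1: find every quote in the text
    for m in re.finditer('"', text):
        i = m.start()
        out.append(text[last:i])
        # Step 2: look back from the quote; replace it only inside a tag
        if BEFORE_QUOTE.search(text[:i]):
            out.append(placeholder)
        else:
            out.append('"')
        last = i + 1
    out.append(text[last:])
    return "".join(out)


sample = ('Hey you, "Dont_touch_my_quotes!" '
          '<div class="value"> <a href="whatever" title="what\'s it">')
print(protect_tag_quotes(sample))
```

Each quote is tested independently, so both quotes of every attribute are replaced and greediness inside the lookbehind never consumes a quote.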

    You can try it in .NET, too:
    Code:
    using System;
    using System.Text.RegularExpressions;
    
    
    class Example
    {
        static void Main(string[] args)
        {
            Console.WriteLine(Regex.Replace("Hey you, \"Dont_touch_my_quotes!\" " +
                "<div class=\"value\"> <a href=\"whatever\" title=\"what's it\">",
                "(?<=<[^>]+)\"", "@_HEY_@"));
        }
    }
    outputs:
    Code:
    Hey you, "Dont_touch_my_quotes!" <div class=@_HEY_@value@_HEY_@> <a href=@_HEY_@whatever@_HEY_@ title=@_HEY_@what's it@_HEY_@>

    Originally Posted by ragax
    <[^>]+?\K"([^"]+)"
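    There may be more than one attribute in a tag. For example, in <img src="logo.png" alt="My logo"> this regex will replace only the first two quotes, because the next match attempt starts after the text already consumed and finds no further < inside the tag. Unfortunately, it's a limitation of \K in PCRE.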
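One way around the multiple-attribute case, sketched in Python, uses a different technique from \K or lookbehind: match each whole tag with <[^>]+> and swap the quotes inside it in a callback. The function names are made up, and like the regexes in this thread it assumes there is no > inside an attribute value:

```python
import re

PLACEHOLDER = "@_HEY_@"
TAG = re.compile(r"<[^>]+>")


def protect_all_tag_quotes(text):
    # Replace every quote inside each tag, however many attributes it has
    return TAG.sub(lambda m: m.group(0).replace('"', PLACEHOLDER), text)


def escape_text_quotes_only(text):
    text = protect_all_tag_quotes(text)       # protect quotes inside tags
    text = text.replace('"', "&quot;")        # convert quotes in the text
    return text.replace(PLACEHOLDER, '"')     # restore the protected quotes


print(protect_all_tag_quotes('<img src="logo.png" alt="My logo">'))
print(escape_text_quotes_only('My "logo": <img src="logo.png" alt="My logo">'))
```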

    Comments on this post

    • ragax agrees : few people know about the details of variable length lookbehinds
  16. #9
  17. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    93
    Originally Posted by abareplace
    Thank you very much for reporting the bug in my software.
    Well, the main thing I was reporting is that your software can perform this complex job, which is really cool! The bug is minor but I thought you should know.

    Thanks for making me more familiar with the details of matching in variable-length lookbehinds. I was wrong. I haven't had experience with variable-length lookbehinds as they don't exist in PHP. Through RB, I see how your example works in .NET.

    Originally Posted by abareplace
    There may be more than one attribute in a tag.
    Great point! Not in the original question, but a great extension of it. This never crossed my mind, awesome that you thought of it.

    Wishing you all a fun weekend.

