#1
    Regex to replace white space between brackets in Perl


    Hi folks,

    This is not a problem that I really need to solve, but I was trying to answer the question of someone in another section of this forum and got partly stuck.

    In a question posted in the Regex section of this forum (http://forums.devshed.com/regex-programming-147/regex-to-replace-white-space-between-brackets-933871.html), the original poster asked the following question:

    Originally Posted by benwenger
    I know how to replace everything between brackets but not how to replace parts of it. I need a regex to replace all white space between curly brackets with  

    example
    $string="lorum {ipsum dolor sit} et amed {nucas nullum} est";
    after regex
    lorum {ipsum dolor sit} et amed {nucas nullum} est
    I warned that this was a bit complicated for a single regex, and came up with a rather tedious solution: iterate over the string with a while loop, extract each {...} substring, apply a simple substitution to it to replace the spaces, put the modified substring back in place of the original, and repeat with the next {...} substring until the job is done.

    Something that works, but looks tedious and rather ugly. I also advised that I personally would probably not do it with regexes, but rather find the substrings with the index function, extract the substring, modify it with a regex, and replace the substring in place, doing the whole thing in a while loop until the job is done.
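
    For readers who want to see the index/substr approach described above, here is a minimal sketch. The helper name fill_braces is hypothetical, and the '&nbsp;' replacement text is an assumption about the intended output (the replacement is not visible in the thread as rendered). It also assumes the brackets are not nested.

```perl
use strict;
use warnings;

# Hypothetical helper: replace spaces inside each {...} group using
# index/substr to locate the groups instead of one large regex.
# Assumes non-nested brackets.
sub fill_braces {
    my ($str, $fill) = @_;
    my $pos = 0;
    while ((my $open = index($str, '{', $pos)) >= 0) {
        my $close = index($str, '}', $open);
        last if $close < 0;                   # no closing bracket: stop
        my $len   = $close - $open + 1;
        my $chunk = substr($str, $open, $len);
        $chunk =~ s/ /$fill/g;                # simple substitution on the extract
        substr($str, $open, $len) = $chunk;   # put the modified group back in place
        $pos = $open + length $chunk;         # resume after the rewritten group
    }
    return $str;
}

my $test = "lorum {ipsum dolor sit} et amed {nucas nullum} est";
# '&nbsp;' is an assumed replacement string
print fill_braces($test, '&nbsp;'), "\n";
```

    Only the text inside the brackets is touched; the spaces outside the {...} groups are left alone.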

    But then another poster made an illuminating remark about something that had not occurred to me for one second (I knew it in theory, but I have probably never used this functionality, so it had not come to mind):

    Originally Posted by Jacques1
    What you're doing is completely unnecessary effort. PHP (and I'm sure also Perl) can replace patterns with the return value of a callback function.
    Jacques1 then gave a piece of code in PHP to achieve the required result (the original question was about PHP), which is irrelevant here.

    Of course. This is sooooh much better.

    So, for the fun of it, I tried to do it in Perl, but found that it was not as easy as I thought.

    I finally succeeded in doing it this way, using a (sort of callback) function:

    Code:
    sub remove_sp {
        $_ = shift; 
        s/ / /g; 
        return $_;
    }
    my $test = "lorum {ipsum dolor sit} et amed {nucas nullum} est";
    $test =~ s/(\{[^}]*\})/remove_sp($1)/eg;
    This works fine, $test now contains: "lorum {ipsum dolor sit} et amed {nucas nullum} est", which was the required result.

    It is pretty good and far better than the regex progressive match constructs within a while loop that I had suggested originally.

    But I came up with that solution with a separate function definition only as a fall-back option after I tried unsuccessfully to inline what is in the remove_sp function above as an anonymous function in the replacement part of the s/// expression.

    I tried all kinds of ways to inline an anonymous function, but, for example, something like this:
    Code:
    $test =~ s/(\{[^}]*\})/{$_=$1; s/  / /g}/eg
    or
    Code:
    $test =~ s/(\{[^}]*\})/{$1 =~ s/  / /g}/eg
    gave me an "Unmatched right curly bracket" error. I played with a number of variations on that, but I still can't find how to do it. I must be missing something or perhaps making a silly mistake.



    In brief, I am fairly sure it should be possible to do it with an anonymous or inline subroutine within the replacement section of the s/// statement, and I would like to understand why I can't find the right way to do it. Does anyone have an idea on how to solve this?

    Thanks for your thoughts.
  2. #2
    Read Gory details of parsing quoted constructs. The problem is perl sees the '/' of the nested s/// as the end of the outer s///. If you use different delimiters it parses OK. Also you need to use a temporary variable since $1 is read-only. And finally you need to be running on a new enough perl that the regex engine is reentrant.
    Code:
    $test =~ s/(\{[^}]\})/(my $t = $1) =~ s! ! !g; $t/ge
    For simplicity, I'd probably ditch the inner regex and write something like this:
    Code:
    $test =~ s/(\{[^}]\})/join ' ', split '\s', $1/ge
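
    The delimiter and temporary-variable points above can be sketched as a small standalone script. It is only an illustration: the '&nbsp;' replacement is an assumption about the intended output, the search pattern uses [^}]* as in the first post, and the variable names $copy, $copy2 and $t are chosen for the example. The second form relies on the /r modifier, which needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $test = "lorum {ipsum dolor sit} et amed {nucas nullum} est";

# The outer s/// keeps '/' as its delimiter; the inner one uses '!' so the
# parser does not take the inner delimiters as the end of the outer expression.
# $1 is read-only, so the inner substitution works on a copy ($t), and the
# last expression of the /e code ($t) becomes the replacement text.
(my $copy = $test) =~ s/(\{[^}]*\})/(my $t = $1) =~ s! !&nbsp;!g; $t/ge;
print "$copy\n";

# Perl 5.14+ only: /r returns the modified string instead of changing its
# target, so no temporary variable is needed for the read-only $1.
(my $copy2 = $test) =~ s/(\{[^}]*\})/$1 =~ s! !&nbsp;!gr/ge;
print "$copy2\n";
```

    Both forms leave $test itself unchanged and only rewrite the copies, which makes it easy to compare the result with the input.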

    Comments on this post

    • Laurent_R agrees : Thank you for your enlightening ideas.
    sub{*{$::{$_}}{CODE}==$_[0]&& print for(%:: )}->(\&Meh);
  4. #3
    Thank you very much for your answer.

    I will definitely read the article you mention.

    I did try to use other delimiters, either on the inner or on the outer s/// statement, but did not succeed. Possibly I had another syntax error at the time.

    I am using Perl 5.10, but I would assume it is reentrant, since I could do it with the function call which, I imagine, would have the same problem if the regex engine were not reentrant.

    I also appreciate the split-join idea; it is a clever and simple way of doing it.

    Thanks a lot. Your post shed a lot of light on this for me and will help me make further attempts in this direction in order to improve my understanding of this whole shebang.
  6. #4
    Originally Posted by OmegaZero
    Read Gory details of parsing quoted constructs. The problem is perl sees the '/' of the nested s/// as the end of the outer s///. If you use different delimiters it parses OK. Also you need to use a temporary variable since $1 is read-only. And finally you need to be running on a new enough perl that the regex engine is reentrant.
    Code:
    $test =~ s/(\{[^}]\})/(my $t = $1) =~ s! ! !g; $t/ge
    For simplicity, I'd probably ditch the inner regex and write something like this:
    Code:
    $test =~ s/(\{[^}]\})/join ' ', split '\s', $1/ge
    Hi OmegaZero,

    I have now tried your suggestions; they did not work as posted. I thought that it had to do with the re-entrance problem (especially since I am now using a server with Perl 5.8 installed, not 5.10 as in my tests yesterday), but it turns out there is simply a small mistake in the code you presented (a + quantifier is missing in the search part of the s/// statement). For the benefit of others reading this thread and wanting to test, these are your regexes with the minor error corrected:

    Code:
    $test =~ s/(\{[^}]+\})/(my $t = $1) =~ s! ! !g; $t/ge;
    Code:
    $test =~ s/(\{[^}]+\})/join ' ', split ' ', $1/ge;
    With these minor corrections, they work exactly as expected even on Perl 5.8 (even though this Perl version, just like the 5.10 I used yesterday, is not re-entrant). So the fact that newer Perl versions are re-entrant must apply to some other functionality of the Perl regex engine.
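
    To illustrate why the missing quantifier made the original versions silently do nothing, here is a small check (a sketch only; the array names are chosen for the example). Without a quantifier, [^}] matches exactly one character, so the pattern can only match groups such as "{x}", and the test string contains none.

```perl
use strict;
use warnings;

my $test = "lorum {ipsum dolor sit} et amed {nucas nullum} est";

# No quantifier: exactly one non-'}' character is allowed between the brackets.
my @no_quant   = $test =~ /(\{[^}]\})/g;
# With '+': one or more non-'}' characters, i.e. the whole group content.
my @with_quant = $test =~ /(\{[^}]+\})/g;

printf "without quantifier: %d match(es)\n", scalar @no_quant;
printf "with quantifier:    %d match(es)\n", scalar @with_quant;
print  "matched: $_\n" for @with_quant;
```

    Since s///g with no match simply leaves the string untouched and raises no error, the symptom looks the same as a substitution that is never executed.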

    Thank you again for your input.
