#1

    Substr madness with utf8


    I have the following code. It runs inside PostgreSQL, but apart from elog (which should be self-explanatory) it is really, really simple.
    Code:
    my $from = $mime->header('from');   # From header, obtained via Email::MIME
    my $fromname;
    elog(WARNING, $from);
    if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
        elog(WARNING, $1);
        elog(WARNING, unpack('H*', $1));
        $fromname = substr($1, 0, 128);
        elog(WARNING, unpack('H*', $fromname));
        elog(WARNING, $fromname);
    }
    The program containing this piece of code is executed twice in a row with different data and results in:

    Code:
    2012-08-12 18:52:15 CEST WARNING:  Marçus Eñdberg <usenet@nueb.net>
    2012-08-12 18:52:15 CEST WARNING:  Marçus Eñdberg
    2012-08-12 18:52:15 CEST WARNING:  4d6172e775732045f16462657267
    2012-08-12 18:52:15 CEST WARNING:  4d6172e775732045f16462657267
    2012-08-12 18:52:15 CEST WARNING:  Marçus Eñdberg
    2012-08-12 18:52:15 CEST WARNING:  Rüdiger Thomas <th_sch@hotmail.de>
    2012-08-12 18:52:15 CEST WARNING:  Rüdiger Thomas
    2012-08-12 18:52:15 CEST WARNING:  52fc64696765722054686f6d6173
    2012-08-12 18:52:15 CEST WARNING:  52fc64696765722054686f6d617300
    2012-08-12 18:52:15 CEST ERROR:  invalid byte sequence for encoding "UTF8": 0x00 at line 95.
    WTF? Unfortunately I don't know much about Perl and UTF-8, but it is obvious that a) this string is not UTF-8 and b) substr() appends a 0x00 to it. Why?

    And, even worse: if I call the program only once with the second data set, it succeeds:
    Code:
    2012-08-12 18:52:17 CEST WARNING:  Rüdiger Thomas <th_sch@hotmail.de>
    2012-08-12 18:52:17 CEST WARNING:  Rüdiger Thomas
    2012-08-12 18:52:17 CEST WARNING:  52fc64696765722054686f6d6173
    2012-08-12 18:52:17 CEST WARNING:  52fc64696765722054686f6d6173
    2012-08-12 18:52:17 CEST WARNING:  Rüdiger Thomas
    Now this is what I'd expect in the first place... Can anybody tell me what is going on here and what I can do to fix it?
    Last edited by SlowFox; August 12th, 2012 at 01:50 PM. Reason: Debug statements with unpack() added
#2
    Hi,

    can you show the input data corresponding to the output you have shown, so that we can better understand how that somewhat long regular expression reacts to it?
#3
    Originally Posted by Laurent_R
    can you show the input data corresponding to the output you have shown, so that we can better understand how that somewhat long regular expression reacts to it?
    $from comes from Email::MIME and contains the From header of the message being processed. The regular expression splits this header into name and email address (pretty long, but not too complex: optional whitespace, optional double quote, name, optional double quote; whitespace; opening delimiter, local part, @ sign, host/domain part, closing delimiter, optional whitespace). The problem can appear when the name contains non-ASCII characters, but it does not occur with every such record, nor is it tied to any particular sequence of records.
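As a minimal sketch of the split described above, the same regular expression can be wrapped in a small helper. The name `split_from` and the sample headers are illustrative assumptions, not part of the original program:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): returns
# (name, address) on a match, or an empty list otherwise.
sub split_from {
    my ($from) = @_;
    if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
        return ($1, $2);
    }
    return;
}

# Made-up sample headers: quoted name, unquoted name, bare address.
for my $header ('"John Doe" <john@example.org>',
                'John Doe <john@example.org>',
                'john@example.org') {
    my ($name, $addr) = split_from($header);
    print defined $name ? "name=[$name] addr=[$addr]\n"
                        : "no match: $header\n";
}
```

Note that a bare address without a display name does not match, because the pattern requires a name followed by whitespace before the opening `<`.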

    I just noticed that unpack() is tricky for debug output: it seems to convert back to ISO-8859-1 implicitly unless given the "U0" switch (correct me if I am wrong). So I changed the code to:

    Code:
    my $from = $mime->header('from');
    elog(WARNING, "Input string: " . $from);
    elog(WARNING, "Input bytes: " . unpack('U0H*', $from));
    if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
        elog(WARNING, "Match string: " . $1);
        elog(WARNING, "Match bytes: " . unpack('U0H*', $1));
        $fromname = substr($1, 0, 128);
        elog(WARNING, "Substr result: " . unpack('U0H*', $fromname));
    }
    and get the following (again, this is the result of two subsequent calls of the function with two different emails as input data):

    Code:
    2012-08-12 22:53:26 CEST WARNING:  Input string: Marçus Eñdberg <usenet@nueb.net>
    2012-08-12 22:53:26 CEST WARNING:  Input bytes: 4d6172c3a775732045c3b16462657267203c7573656e6574406e7565622e6e65743e
    2012-08-12 22:53:26 CEST WARNING:  Match string: Marçus Eñdberg
    2012-08-12 22:53:26 CEST WARNING:  Match bytes: 4d6172c3a775732045c3b16462657267
    2012-08-12 22:53:26 CEST WARNING:  Substr result: 4d6172c3a775732045c3b16462657267
    2012-08-12 22:53:26 CEST WARNING:  Input string: Rüdiger Thomas <th_sch@hotmail.de>
    2012-08-12 22:53:26 CEST WARNING:  Input bytes: 52c3bc64696765722054686f6d6173203c74685f73636840686f746d61696c2e64653e
    2012-08-12 22:53:26 CEST WARNING:  Match string: Rüdiger Thomas
    2012-08-12 22:53:26 CEST WARNING:  Match bytes: 52c3bc64696765722054686f6d6173
    2012-08-12 22:53:26 CEST WARNING:  Substr result: 52c3bc64696765722054686f6d617300
    "Input string" matches the emails fed into the program. The byte sequence is the correct UTF-8 representation of this string. The match result is what I expect it to be. Only substr() still changes the string length (by adding this strange 0x00).
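One way to cross-check such a hex dump without relying on how unpack() treats the string internally is to encode it explicitly with the core Encode module first. This is only a sketch; the helper name `utf8_hex` is an assumption made for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 encoding of a character string.
sub utf8_hex {
    my ($text) = @_;
    return unpack('H*', encode('UTF-8', $text));
}

# "Rüdiger Thomas", written with an escape so the example does not
# depend on the encoding of the source file.
my $name = "R\x{fc}diger Thomas";
print utf8_hex($name), "\n";   # prints 52c3bc64696765722054686f6d6173
```

The result agrees with the "Match bytes" line logged for the second email, so any extra trailing byte after substr() is not explained by the encoding of the name itself.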

    If I call the function only once (i.e. feed only the second email into the program) the result is correct:

    Code:
    2012-08-12 22:53:30 CEST WARNING:  Input string: Rüdiger Thomas <th_sch@hotmail.de>
    2012-08-12 22:53:30 CEST WARNING:  Input bytes: 52c3bc64696765722054686f6d6173203c74685f73636840686f746d61696c2e64653e
    2012-08-12 22:53:30 CEST WARNING:  Match string: Rüdiger Thomas
    2012-08-12 22:53:30 CEST WARNING:  Match bytes: 52c3bc64696765722054686f6d6173
    2012-08-12 22:53:30 CEST WARNING:  Substr result: 52c3bc64696765722054686f6d6173
    I could post the actual emails here, but you would not see anything more relevant than the From-Header repeated as "input string" in the debug output.
    Last edited by SlowFox; August 12th, 2012 at 04:56 PM. Reason: accidentally deleted a line of code
#4
    Where is the Match bytes log entry coming from? It's not in your code snippet.

    Also, what happened to the Match bytes entry that should have been between the Input bytes and Substr result entries?

    Are you sure that this is the code that generated those log entries, or is it that you filtered out some of the entries when posting?
    Last edited by FishMonger; August 12th, 2012 at 04:41 PM.
#5
    Originally Posted by FishMonger
    Are you sure that this is the code that generated those log entries, or is it that you filtered out some of the entries when posting?
    Oops, sorry for that.

    a) I changed some debug statements while writing my last posting and used the wrong version

    b) yes, I have to filter the output (as I don't want to bother you with additional PostgreSQL messages), and I seem to have missed one line. All statements are copied and pasted, though, so they don't contain any typos.

    I just did another run of the program and re-edited my last posting. The code and the output should now be consistent.

    Another update: I finally managed to reproduce this outside of PostgreSQL. So now I can give you some plain Perl code (reduced to the max):

    Code:
    use utf8;
    
    sub do_it {
            my $from = $_[0];
            my $name = undef();
    
            print unpack('U0H*', $from) . "\n";
            if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
                    print unpack('U0H*', $1) . "\n";
                    $name = substr($1, 0, 128);
                    print unpack('U0H*', $name) . "\n";
            }
    
            return;
    }
    
    
    my $from = 'Marçus Eñdberg <a@b>';
    do_it($from);
    
    $from = 'Rüdiger Thomas <a@b>';
    do_it($from);
    results in:

    Code:
    4d6172c3a775732045c3b16462657267203c6140623e
    4d6172c3a775732045c3b16462657267
    4d6172c3a775732045c3b16462657267
    52c3bc64696765722054686f6d6173203c6140623e
    52c3bc64696765722054686f6d6173
    52c3bc64696765722054686f6d617300
    Note the unwanted and highly irregular 0x00 at the end of the last line. If I change anything in the input names, the 0x00 disappears.
#6
    Hi,

    this is a bit strange.

    I tried to run your code and I don't have your problem:

    Code:
    4d6172e775732045f16462657267203c6140623e
    4d6172e775732045f16462657267
    4d6172e775732045f16462657267
    52fc64696765722054686f6d6173203c6140623e
    52fc64696765722054686f6d6173
    52fc64696765722054686f6d6173
    I ran the script on Unix (HP-UX) and on VMS, and both produced the output copied above.
#7
    BTW, it may be of interest that both platforms on which I ran the test use Perl 5.8.

    I also tried on a Linux box, also running Perl 5.8, and again could not reproduce your problem.

    It looks like the cause may be something local to your computer, or perhaps a different version of Perl.
#8
    Originally Posted by Laurent_R
    I tried to run your code and I don't have your problem
    I am more and more convinced that this is a bug in Perl itself. I am using v5.10.1 (on Debian squeeze) and unfortunately have no way to try anything more recent (or older).

    In the meantime, however, I seem to have worked around the problem by replacing

    Code:
    $name = substr($1, 0, 128);
    with
    Code:
    my $temp = $1;
    $name = substr($temp, 0, 128);
    With this modification, the bug has not been triggered during the last 4 hours and approx. 300k records (compared to several tens of thousands of failures before with the same amount of data).
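The workaround can be sketched as a self-contained script that copies the capture into a lexical before calling substr() and then dumps the UTF-8 bytes of the result. The helper name `extract_name` and the explicit Encode step are assumptions made for illustration, and the script does not claim to reproduce the original failure:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the name part of a From header, limited
# to 128 characters, or undef (in scalar context) if nothing matches.
sub extract_name {
    my ($from) = @_;
    return unless $from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/;
    my $temp = $1;                  # copy the capture first
    return substr($temp, 0, 128);   # substr() then works on the copy
}

# The two names from the reduced test case, written with escapes so the
# script does not depend on the encoding of the source file.
for my $from ("Mar\x{e7}us E\x{f1}dberg <a\@b>", "R\x{fc}diger Thomas <a\@b>") {
    my $name = extract_name($from);
    print unpack('H*', encode('UTF-8', $name)), "\n";
}
```

With both inputs processed in sequence, neither hex dump should end in a trailing 00.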
