August 12th, 2012, 11:57 AM
-
Substr madness with utf8
I have the following code - it is run within postgresql, but apart from elog (which should be obvious) it is really, really simple.
Code:
my $from = [the result of something else]
my $fromname = undef();
elog(WARNING, $from);
if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
    elog(WARNING, $1);
    elog(WARNING, unpack('H*', $1));
    $fromname = substr($1, 0, 128);
    elog(WARNING, unpack('H*', $fromname));
    elog(WARNING, $fromname);
}
The program containing this piece of code is executed twice in a row with different data and results in:
Code:
2012-08-12 18:52:15 CEST WARNING: Marçus Eñdberg <usenet@nueb.net>
2012-08-12 18:52:15 CEST WARNING: Marçus Eñdberg
2012-08-12 18:52:15 CEST WARNING: 4d6172e775732045f16462657267
2012-08-12 18:52:15 CEST WARNING: 4d6172e775732045f16462657267
2012-08-12 18:52:15 CEST WARNING: Marçus Eñdberg
2012-08-12 18:52:15 CEST WARNING: Rüdiger Thomas <th_sch@hotmail.de>
2012-08-12 18:52:15 CEST WARNING: Rüdiger Thomas
2012-08-12 18:52:15 CEST WARNING: 52fc64696765722054686f6d6173
2012-08-12 18:52:15 CEST WARNING: 52fc64696765722054686f6d617300
2012-08-12 18:52:15 CEST ERROR: invalid byte sequence for encoding "UTF8": 0x00 at line 95.
WTF? Unfortunately I don't know too much about Perl and UTF-8, but it is obvious that a) this string is not UTF-8 and b) a 0x00 is appended to it by substr(). Why is that?
And, even worse: if I call the program only once with the second data set, it succeeds:
Code:
2012-08-12 18:52:17 CEST WARNING: Rüdiger Thomas <th_sch@hotmail.de>
2012-08-12 18:52:17 CEST WARNING: Rüdiger Thomas
2012-08-12 18:52:17 CEST WARNING: 52fc64696765722054686f6d6173
2012-08-12 18:52:17 CEST WARNING: 52fc64696765722054686f6d6173
2012-08-12 18:52:17 CEST WARNING: Rüdiger Thomas
Now this is what I'd expect in the first place... Can anybody tell me what is going on here and what I can do to fix it?
Last edited by SlowFox; August 12th, 2012 at 12:50 PM.
Reason: Debug statements with unpack() added
August 12th, 2012, 02:04 PM
-
Hi,
can you show the input data corresponding to the output you have shown, so that we can better understand how that somewhat long regular expression reacts to it?
August 12th, 2012, 03:31 PM
-
Originally Posted by Laurent_R
can you show the input data corresponding to the output you have shown, so that we can better understand how that somewhat long regular expression reacts to it?
$from is a result of Email::MIME and contains the From header of the message being processed. The regular expression splits this header into name and email address (pretty long, but not too complex: optional whitespace, optional double quote, name, optional double quote; whitespace; opening delimiter, local part, @-sign, host/domain part, closing delimiter, optional whitespace). The problem only shows up when the name contains non-ASCII characters, but it does not hit every such record, and I cannot tie it to any particular sequence of records.
I just noticed that unpack() is tricky for debug output: it seems to convert back to ISO-8859-1 implicitly unless given the "U0" switch (correct me if I am wrong). So I changed the code to:
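For readers who find the one-line pattern hard to follow, here is a minimal sketch of the same split written with the /x modifier, so each part described above gets its own comment. The helper name split_from_header and the sample address are made up for illustration; the pattern itself is the one from the post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): split a From header
# of the form  "Name" <local@domain>  into display name and address.
sub split_from_header {
    my ($from) = @_;
    my ($name, $address) = $from =~ m{
        ^ \s*                       # optional leading whitespace
        "? (.+?) "?                 # display name, optionally double-quoted
        \s+                         # whitespace between name and address
        < ( [^<>@]+ @ [^<>@]+ ) >   # local part, at-sign, host/domain part
        \s* $                       # optional trailing whitespace
    }x;
    return ($name, $address);       # both undef if the header did not match
}

my ($name, $address) = split_from_header('"Jane Doe" <jane@example.org>');
print "$name | $address\n";         # Jane Doe | jane@example.org
```

Because the match is done in list context, the captures land directly in lexical variables and $1 is never read afterwards.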
Code:
my $from = $mime->header('from');
elog(WARNING, "Input string: " . $from);
elog(WARNING, "Input bytes: " . unpack('U0H*', $from));
if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
    elog(WARNING, "Match string: " . $1);
    elog(WARNING, "Match bytes: " . unpack('U0H*', $1));
    $fromname = substr($1, 0, 128);
    elog(WARNING, "Substr result: " . unpack('U0H*', $fromname));
}
and get the following (again, this is the result of two subsequent calls of the function with two different emails as input data):
Code:
2012-08-12 22:53:26 CEST WARNING: Input string: Marçus Eñdberg <usenet@nueb.net>
2012-08-12 22:53:26 CEST WARNING: Input bytes: 4d6172c3a775732045c3b16462657267203c7573656e6574406e7565622e6e65743e
2012-08-12 22:53:26 CEST WARNING: Match string: Marçus Eñdberg
2012-08-12 22:53:26 CEST WARNING: Match bytes: 4d6172c3a775732045c3b16462657267
2012-08-12 22:53:26 CEST WARNING: Substr result: 4d6172c3a775732045c3b16462657267
2012-08-12 22:53:26 CEST WARNING: Input string: Rüdiger Thomas <th_sch@hotmail.de>
2012-08-12 22:53:26 CEST WARNING: Input bytes: 52c3bc64696765722054686f6d6173203c74685f73636840686f746d61696c2e64653e
2012-08-12 22:53:26 CEST WARNING: Match string: Rüdiger Thomas
2012-08-12 22:53:26 CEST WARNING: Match bytes: 52c3bc64696765722054686f6d6173
2012-08-12 22:53:26 CEST WARNING: Substr result: 52c3bc64696765722054686f6d617300
"Input string" matches the emails fed into the program. The byte sequence is the correct UTF-8 representation of this string. The match result is what I expect it to be; only substr() still changes the string length (by adding this strange 0x00).
If I call the function only once (i.e. feed only the second email into the program) the result is correct:
Code:
2012-08-12 22:53:30 CEST WARNING: Input string: Rüdiger Thomas <th_sch@hotmail.de>
2012-08-12 22:53:30 CEST WARNING: Input bytes: 52c3bc64696765722054686f6d6173203c74685f73636840686f746d61696c2e64653e
2012-08-12 22:53:30 CEST WARNING: Match string: Rüdiger Thomas
2012-08-12 22:53:30 CEST WARNING: Match bytes: 52c3bc64696765722054686f6d6173
2012-08-12 22:53:30 CEST WARNING: Substr result: 52c3bc64696765722054686f6d6173
I could post the actual emails here, but you would not see anything more relevant than the From-Header repeated as "input string" in the debug output.
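As a cross-check on the debug output that does not depend on how unpack() treats the "U0" switch, the string can be encoded explicitly with the core Encode module before it is hex-dumped. This is only a sketch: the helper name utf8_hex is made up, and the sample name is written with an escape so the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical debug helper: hex dump of the UTF-8 encoding of a
# character string, using an explicit encode() step instead of U0.
sub utf8_hex {
    my ($string) = @_;
    return unpack('H*', encode('UTF-8', $string));
}

# "Rüdiger" with the u-umlaut given as an escape (U+00FC).
my $name = "R\x{fc}diger";
print utf8_hex($name), "\n";    # 52c3bc6469676572
print length($name), "\n";      # 7 (characters, not bytes)
```

A trailing "00" in such a dump would mean the NUL is really part of the string and not something introduced by the dump itself.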
Last edited by SlowFox; August 12th, 2012 at 03:56 PM.
Reason: accidentally deleted a line of code
August 12th, 2012, 03:37 PM
-
Where is the Match bytes log entry coming from? It's not in your code snippet.
Also, what happened to the Match bytes entry that should have been between the Input bytes and Substr result entries?
Are you sure that this is the code that generated those log entries, or is it that you filtered out some of the entries when posting?
Last edited by FishMonger; August 12th, 2012 at 03:41 PM.
August 12th, 2012, 04:00 PM
-
Originally Posted by FishMonger
Are you sure that this is the code that generated those log entries, or is it that you filtered out some of the entries when posting?
Oops, sorry for that.
a) I changed some debug statements while writing my last posting and used the wrong version
b) yes, I have to filter the output (as I don't want to bother you with additional postgresql messages), I seem to have missed one line. All statements are copy/paste though so they don't contain any typos.
I just had another run of the program and re-edited my last posting. Now the code and the output should be consistent.
Another update: I finally managed to reproduce this thing outside of PostgreSQL. So now I can give you some native perl code (reduced to the max):
Code:
use utf8;
sub do_it {
    my $from = $_[0];
    my $name = undef();
    print unpack('U0H*', $from) . "\n";
    if ($from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/) {
        print unpack('U0H*', $1) . "\n";
        $name = substr($1, 0, 128);
        print unpack('U0H*', $name) . "\n";
    }
    return;
}
my $from = 'Marçus Eñdberg <a@b>';
do_it($from);
$from = 'Rüdiger Thomas <a@b>';
do_it($from);
results in:
Code:
4d6172c3a775732045c3b16462657267203c6140623e
4d6172c3a775732045c3b16462657267
4d6172c3a775732045c3b16462657267
52c3bc64696765722054686f6d6173203c6140623e
52c3bc64696765722054686f6d6173
52c3bc64696765722054686f6d617300
Note the unwanted and highly irregular 0x00 at the end of the last line. If I change anything with the input names, the 0x00 disappears.
August 13th, 2012, 03:46 AM
-
Hi,
this is a bit strange.
I tried to run your code and I don't have your problem:
Code:
4d6172e775732045f16462657267203c6140623e
4d6172e775732045f16462657267
4d6172e775732045f16462657267
52fc64696765722054686f6d6173203c6140623e
52fc64696765722054686f6d6173
52fc64696765722054686f6d6173
I ran the script on Unix (HP-UX) and on VMS, and both produced the output copied above.
August 13th, 2012, 04:07 AM
-
BTW, it may be of interest that both my platforms on which I ran the test use Perl 5.8.
I also tried on a Linux box, also running Perl 5.8, and I again don't reproduce your problem.
It looks like there may be something local to your computer, or perhaps another version of Perl.
August 13th, 2012, 04:21 AM
-
Originally Posted by Laurent_R
I tried to run your code and I don't have your problem
I am more and more convinced that this is a bug inside perl. Here I can use v5.10.1 (within Debian squeeze) and unfortunately don't have a chance to try anything more recent (or older).
However, meanwhile I most likely circumvented the problem by replacing
Code:
$name = substr($1, 0, 128);
with
Code:
my $temp = $1;
$name = substr($temp, 0, 128);
With this modification, the bug has not been triggered during the last 4 hours and approx. 300k records (compared to several tens of thousands of failures before with the same amount of data).
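The same idea (never hand $1 to substr() directly) can also be written without a temporary variable by doing the match in list context, which copies the captures into lexicals. The sketch below is a hypothetical variant of do_it() named extract_name; it has not been verified against the 5.10.1 build that showed the problem, so treat it as an equivalent way of spelling the workaround, not as a confirmed fix.

```perl
use strict;
use warnings;

# Hypothetical variant of do_it(): the list-context match copies the
# first capture into $name, so substr() works on that copy, not on $1.
sub extract_name {
    my ($from) = @_;
    my ($name) = $from =~ m/^\s*"?(.+?)"?\s+<([^<>@]+@[^<>@]+)>\s*$/
        or return;                  # header did not match
    return substr($name, 0, 128);   # truncate the copy to 128 characters
}

# The two names from the test case, with the non-ASCII characters
# written as escapes.
for my $from ("Mar\x{e7}us E\x{f1}dberg <a\@b>", "R\x{fc}diger Thomas <a\@b>") {
    my $name = extract_name($from);
    print length($name), "\n";      # 14 for both names
}
```

If the stray 0x00 were still present, the second length would come out as 15 instead of 14.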