Posted to modperl@perl.apache.org by Rob Mueller <ro...@fastmail.fm> on 2004/05/18 20:27:48 UTC
mod_perl and utf-8 data...
I'm not sure if this is a mod_perl problem or not, but I can't reproduce it
under regular perl, so I thought I'd post here. Anyway it's apache 1.3.29,
mod_perl 1.29 and perl 5.8.4.
The problem is occurring in the following piece of code. I've tried creating
a test case, but I can't seem to narrow it down. Just creating a basic
handler to test this seems to work, but when it's used like this buried deep
in some code, it fails. Always a bugger of a problem to track down.
Anyway, the problem seems to be with using "join" where the array has utf-8
strings in it. The resultant string does NOT have the utf-8 flag set. The
basic problem code is this:
$BodyText = join("\n", @Lines[0 .. (@Lines < 3 ? @Lines-1 : 2)]) . "\n";
Narrowing it down a bit, and dumping the internal structures like so:
warn '$Lines[0]: ' . $Lines[0];
warn 'utf-8 $Lines[0]: ' . is_utf8($Lines[0]);
Dump($Lines[0]);
$BodyText = join("\n", $Lines[0]);
warn '$BodyText: ' . $BodyText;
warn 'utf-8 $BodyText: ' . is_utf8($BodyText);
Dump($BodyText);
I get:
$Lines[0]: Hej mor,
utf-8 $Lines[0]: 1
SV = PV(0x9a051a4) at 0xa27f828
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa2f0008 "Hej mor,"\0 [UTF8 "Hej mor,"]
CUR = 8
LEN = 9
Which looks fine, but then the joined result:
$BodyText: Hej mor,
utf-8 $BodyText: at /home/mod_perl/hm/Data/Store/Mailbox.pm line 400.
SV = PVMG(0xa279140) at 0x8cb9228
REFCNT = 1
FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
IV = 0
NV = 0
PV = 0xa2bbf50 "Hej mor,\n"\0
CUR = 9
LEN = 408
MAGIC = 0xa397cd8
MG_VIRTUAL = &PL_vtbl_taint
MG_TYPE = PERL_MAGIC_taint(t)
Ouch, that seems wrong. No utf-8 flag, and the string seems to be marked as
tainted, even though the inputs aren't? I thought maybe it had something to
do with the fact that $BodyText had been assigned to earlier and obviously was
tainted, and wasn't losing the taint when the new value was assigned to it.
So I changed to:
$#Lines = 0;
warn '$Lines[0]: ' . $Lines[0];
warn 'utf-8 $Lines[0]: ' . is_utf8($Lines[0]);
Dump($Lines[0]);
my $NewBodyText = join("\n", $Lines[0]);
warn '$NewBodyText: ' . $NewBodyText;
warn 'utf-8 $NewBodyText: ' . is_utf8($NewBodyText);
Dump($NewBodyText);
Which gives:
$Lines[0]: Hej mor, at /home/mod_perl/hm/Data/Store/Mailbox.pm line 393.
utf-8 $Lines[0]: 1 at /home/mod_perl/hm/Data/Store/Mailbox.pm line 394.
SV = PV(0x99f7a94) at 0xa386e68
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa2bc188 "Hej mor,"\0 [UTF8 "Hej mor,"]
CUR = 8
LEN = 9
$BodyText: Hej mor,
utf-8 $BodyText: at /home/mod_perl/hm/Data/Store/Mailbox.pm line 400.
SV = PVMG(0xa3b61a8) at 0xa346cc0
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
IV = 0
NV = 0
PV = 0xa3dde10 "Hej mor,\n"\0
CUR = 9
LEN = 162
Ah, so the taint magic is now gone (though it is still a PVMG rather
than a PV?), but it still doesn't have the UTF-8 flag set. The fact that this
string doesn't have any utf-8 chars isn't the problem; it happens on all of
them, even those that do have utf-8 chars. There is no 'use bytes' or
anything at the top of the module, so I don't think that's the problem,
though I don't think that should actually affect things anyway, should it,
since it only controls how the actual source code is interpreted? I tried
explicitly doing 'use utf8' to check, but it made no difference.
Testing on a small standalone program from the command line, it does seem to
work as expected:
[root@robm root]# perl -e 'use Devel::Peek; $a="\x{1234}"; @a = ("a", $a,
"b"); $c = join "d", @a; Dump($c);'
SV = PV(0x811ee40) at 0x81318d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x812e0e8 "ad\341\210\264db"\0 [UTF8 "ad\x{1234}db"]
CUR = 7
LEN = 8
Which actually raises a general perl question I just wanted to check. If you
have two strings and concat them, and one has the utf-8 flag and the other
doesn't, the resultant string does have the utf-8 flag set? Assuming that the
non-utf8 flagged string is ASCII, this will work fine. If it has chars >
127 in it though, it'll create a rubbish string...
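A minimal sketch of the concatenation question above, run as a plain script without taint mode. As far as I know, Perl upgrades the unflagged operand by treating each byte as a Latin-1 code point, so the result carries the flag. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

my $flagged = "\x{1234}";   # code point > 255, so the UTF8 flag is on
my $ascii   = "abc";        # plain ASCII, no flag
my $high    = "\xe9";       # a single byte > 127, no flag

my $r1 = $flagged . $ascii;
my $r2 = $flagged . $high;

# Report the flag and character length of each result, plus the code
# point that the high byte turned into after the implicit upgrade.
printf "r1 flagged: %d, length %d\n", is_utf8($r1) ? 1 : 0, length $r1;
printf "r2 flagged: %d, length %d\n", is_utf8($r2) ? 1 : 0, length $r2;
printf "second char of r2: U+%04X\n", ord substr($r2, 1, 1);
```

If the unflagged bytes were really UTF-8 encoded text, they would need decoding first (for example with Encode::decode); otherwise each byte becomes its own character, which is the rubbish string case.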
Ok, so to summarise, I think I see two problems here:
1. Assigning an untainted value to a value that was previously tainted
leaves the new value tainted
2. join with utf-8 strings doesn't seem to leave the joined string with the
utf-8 flag on
Seems all a bit weird to me...
Rob
--
Report problems: http://perl.apache.org/bugs/
Mail list info: http://perl.apache.org/maillist/modperl.html
List etiquette: http://perl.apache.org/maillist/email-etiquette.html
Re: mod_perl and utf-8 data...
Posted by Stas Bekman <st...@stason.org>.
Rob Mueller wrote:
> Ok, I've tracked this down a bit more, and I think it's a perl problem.
> Basically it seems tainted variables and utf-8 don't work together. I did
> find one example of someone posting the same problem:
>
> http://groups.google.com/groups?q=taint+group:perl.unicode&hl=en&lr=&ie=UTF-8&group=perl.unicode&selm=4.2.0.58.J.20040101203406.009d32e0%40dream.big.or.jp&rnum=1
>
> Seems it's still not fixed in 5.8.4. Example code to reproduce shown
> below...
Have you tried blead-perl, Rob? In any case make sure you submit a perlbug
report, otherwise it won't get fixed. But first check whether there is one
submitted already (rt.perl.org).
--
__________________________________________________________________
Stas Bekman JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/ mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org http://ticketmaster.com
Re: mod_perl and utf-8 data...
Posted by Rob Mueller <ro...@fastmail.fm>.
Ok, I've tracked this down a bit more, and I think it's a perl problem.
Basically it seems tainted variables and utf-8 don't work together. I did
find one example of someone posting the same problem:
http://groups.google.com/groups?q=taint+group:perl.unicode&hl=en&lr=&ie=UTF-8&group=perl.unicode&selm=4.2.0.58.J.20040101203406.009d32e0%40dream.big.or.jp&rnum=1
Seems it's still not fixed in 5.8.4. Example code to reproduce shown
below...
Rob
-----
#!/usr/bin/perl -T
package main;
use Encode qw(is_utf8 _utf8_on);
use Scalar::Util qw(tainted);
use strict;
sub handler {
open(F, ">/tmp/tainttest") || die "could not open: $!";
print F "aaa";
close(F);
my $a = "\x{1234}";
warn '$a is utf8: ' . (is_utf8($a) ? 1 : 0) . " (expect 1)\n";
warn '$a is tainted: ' . (tainted($a) ? 1 : 0) . " (expect 0)\n";
open(F, "/tmp/tainttest") || die "could not open: $!";
my $b = <F>;
close(F);
warn '$b is utf8: ' . (is_utf8($b) ? 1 : 0) . " (expect 0)\n";
warn '$b is tainted: ' . (tainted($b) ? 1 : 0) . " (expect 1)\n";
my $c = $a . $b;
warn '$c is utf8: ' . (is_utf8($c) ? 1 : 0) . " (expect 1)\n";
warn '$c is tainted: ' . (tainted($c) ? 1 : 0) . " (expect 1)\n";
_utf8_on($c);
warn '$c is utf8: ' . (is_utf8($c) ? 1 : 0) . " (expect 1)\n";
warn '$c is tainted: ' . (tainted($c) ? 1 : 0) . " (expect 1)\n";
my ($d) = ($b =~ /(.*)/);
warn '$d is utf8: ' . (is_utf8($d) ? 1 : 0) . " (expect 0)\n";
warn '$d is tainted: ' . (tainted($d) ? 1 : 0) . " (expect 0)\n";
my $e = $a . $d;
warn '$e is utf8: ' . (is_utf8($e) ? 1 : 0) . " (expect 1)\n";
warn '$e is tainted: ' . (tainted($e) ? 1 : 0) . " (expect 0)\n";
$c = $a . $d;
warn '$c is utf8: ' . (is_utf8($c) ? 1 : 0) . " (expect 1)\n";
warn '$c is tainted: ' . (tainted($c) ? 1 : 0) . " (expect 0)\n";
my @a = ($a, $b);
my $f = "@a";
warn '$f is utf8: ' . (is_utf8($f) ? 1 : 0) . " (expect 1)\n";
warn '$f is tainted: ' . (tainted($f) ? 1 : 0) . " (expect 1)\n";
@a = ($a, $d);
$f = "@a";
warn '$f is utf8: ' . (is_utf8($f) ? 1 : 0) . " (expect 1)\n";
warn '$f is tainted: ' . (tainted($f) ? 1 : 0) . " (expect 0)\n";
}
handler();
-----
$a is utf8: 1 (expect 1)
$a is tainted: 0 (expect 0)
$b is utf8: 0 (expect 0)
$b is tainted: 1 (expect 1)
$c is utf8: 0 (expect 1)
$c is tainted: 1 (expect 1)
$c is utf8: 0 (expect 1)
$c is tainted: 1 (expect 1)
$d is utf8: 0 (expect 0)
$d is tainted: 0 (expect 0)
$e is utf8: 1 (expect 1)
$e is tainted: 0 (expect 0)
$c is utf8: 1 (expect 1)
$c is tainted: 0 (expect 0)
$f is utf8: 0 (expect 1)
$f is tainted: 1 (expect 1)
$f is utf8: 1 (expect 1)
$f is tainted: 0 (expect 0)
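The repeated pairs of warn lines in the script above could be folded into one small helper. This is only a sketch: flag_report is a hypothetical name, not part of the original script, and the sample call runs without -T, so nothing is tainted here.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);
use Scalar::Util qw(tainted);

# Hypothetical helper (not in the original script): returns one report
# line per flag, in the same "(expect N)" format as the warnings above.
sub flag_report {
    my ($name, $value, $expect_utf8, $expect_tainted) = @_;
    my $utf8  = is_utf8($value) ? 1 : 0;
    my $taint = tainted($value) ? 1 : 0;
    return (
        "$name is utf8: $utf8 (expect $expect_utf8)",
        "$name is tainted: $taint (expect $expect_tainted)",
    );
}

# Sample call with a made-up string.
my $str = "\x{1234}";
warn "$_\n" for flag_report('$str', $str, 1, 0);
```

Each check then becomes a single call, which makes it harder for the expected values in the message to drift from the ones intended.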
Re: mod_perl and utf-8 data...
Posted by Stas Bekman <st...@stason.org>.
Rob Mueller wrote:
[...]
> Ok, so to summarise, I think I see two problems here:
> 1. Assigning an untainted value to a value that was previously tainted
> leaves the new value tainted
It's hard to tell or even try to reproduce that, since you didn't show a real
test case. What kind of variable is that? my, our, global? Usually a variable
that once became PVMG, won't go back to a lower type PV. But it should not
have the MG flags set.
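The point about a variable staying PVMG can be illustrated with a small sketch. It uses pos() magic instead of taint magic, since taint needs -T; that substitution is an assumption made for illustration, and sv_class is a hypothetical helper built on the core B module.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: returns the B class of the scalar behind a
# reference, e.g. "B::PV" or "B::PVMG".
sub sv_class { return ref B::svref_2object($_[0]) }

my $s = "Hej mor,";
my $before = sv_class(\$s);

# Setting pos() attaches magic to $s, which needs a magic-capable
# body, so the scalar is upgraded.
pos($s) = 1;
my $during = sv_class(\$s);

# Assigning a fresh plain string does not downgrade the body again.
$s = "new value";
my $after = sv_class(\$s);

print "$before -> $during -> $after\n";
```

Seeing PVMG in a Dump() is therefore not a problem in itself; what matters is whether the GMG/SMG flags and the MAGIC entries are still active.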
> 2. join with utf-8 strings doesn't seem to leave the joined string with the
> utf-8 flag on
I'd try to minimize your setup to exclude everything but the code at fault.
This can be done easily using Geoff's bugreport skeleton available from here
[1]. See first if you still have a problem using it. It could be that some
other code that you load affects your whole setup. And you can't reproduce it
with a shell script since you don't load the same setup there.
If you could submit a separate bugreport tarball [1] for each of the issues
you have raised, each including the shortest possible code, then it'll be much
easier for us to try to reproduce your case and only then try to find a fix.
Thanks.
[1] http://perl.apache.org/docs/1.0/guide/help.html#How_to_Report_Problems
(scroll down a bit)