Posted to modperl@perl.apache.org by Rob Mueller <ro...@fastmail.fm> on 2004/05/18 20:27:48 UTC

mod_perl and utf-8 data...

I'm not sure if this is a mod_perl problem or not, but I can't reproduce it
under regular perl, so I thought I'd post here. Anyway it's apache 1.3.29,
mod_perl 1.29 and perl 5.8.4.

The problem is occurring in the following piece of code. I've tried creating
a test case, but I can't seem to narrow it down. Just creating a basic
handler to test this seems to work, but when it's used like this buried deep
in some code, it fails. Always a bugger of a problem to track down.

Anyway, the problem seems to be with using "join" where the array has utf-8
strings in it. The resultant string does NOT have the utf-8 flag set. The
basic problem code is this:

        $BodyText = join("\n", @Lines[0 .. (@Lines < 3 ? @Lines-1 : 2)]) . "\n";

Narrowing it down a bit, and dumping the internal structures like so:

        warn '$Lines[0]: ' . $Lines[0];
        warn 'utf-8 $Lines[0]: ' . is_utf8($Lines[0]);
        Dump($Lines[0]);

        $BodyText = join("\n", $Lines[0]);

        warn '$BodyText: ' . $BodyText;
        warn 'utf-8 $BodyText: ' . is_utf8($BodyText);
        Dump($BodyText);

I get:

$Lines[0]: Hej mor,
utf-8 $Lines[0]: 1
SV = PV(0x9a051a4) at 0xa27f828
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa2f0008 "Hej mor,"\0 [UTF8 "Hej mor,"]
  CUR = 8
  LEN = 9

Which looks fine, but then the joined result:

$BodyText: Hej mor,
utf-8 $BodyText:  at /home/mod_perl/hm/Data/Store/Mailbox.pm line 400.
SV = PVMG(0xa279140) at 0x8cb9228
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
  IV = 0
  NV = 0
  PV = 0xa2bbf50 "Hej mor,\n"\0
  CUR = 9
  LEN = 408
  MAGIC = 0xa397cd8
    MG_VIRTUAL = &PL_vtbl_taint
    MG_TYPE = PERL_MAGIC_taint(t)

Ouch, that seems wrong. There's no utf-8 flag, and the string seems to be
marked as tainted, even though the inputs aren't? I thought maybe it was
because $BodyText had been assigned to earlier, was obviously tainted at that
point, and wasn't losing the taint when the new value was assigned to it.
So I changed to:

        $#Lines = 0;
        warn '$Lines[0]: ' . $Lines[0];
        warn 'utf-8 $Lines[0]: ' . is_utf8($Lines[0]);
        Dump($Lines[0]);

        my $NewBodyText = join("\n", $Lines[0]);

        warn '$NewBodyText: ' . $NewBodyText;
        warn 'utf-8 $NewBodyText: ' . is_utf8($NewBodyText);
        Dump($NewBodyText);

Which gives:

$Lines[0]: Hej mor, at /home/mod_perl/hm/Data/Store/Mailbox.pm line 393.
utf-8 $Lines[0]: 1 at /home/mod_perl/hm/Data/Store/Mailbox.pm line 394.
SV = PV(0x99f7a94) at 0xa386e68
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa2bc188 "Hej mor,"\0 [UTF8 "Hej mor,"]
  CUR = 8
  LEN = 9
$BodyText: Hej mor,
utf-8 $BodyText:  at /home/mod_perl/hm/Data/Store/Mailbox.pm line 400.
SV = PVMG(0xa3b61a8) at 0xa346cc0
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0xa3dde10 "Hej mor,\n"\0
  CUR = 9
  LEN = 162

Ah, so the taint magic is now gone (though it is still a PVMG rather
than a PV?), but it still doesn't have the UTF-8 flag set. The fact that this
string doesn't have any utf-8 chars isn't the problem; it happens on all of
them, even those that do have utf-8 chars. There is no 'use bytes' or
anything at the top of the module, so I don't think that's the problem,
though I don't think that should actually affect things, should it, since it
only controls how the actual source code is interpreted? I tried explicitly
doing 'use utf8' to check, but it made no difference.
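For anyone repeating these checks, the two tests done by hand above can be wrapped in a small helper. This is only a sketch: sv_state() is a hypothetical name that is not part of the original code, and it assumes is_utf8 comes from Encode and tainted from Scalar::Util. Without -T, tainted() will always report 0.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);
use Scalar::Util qw(tainted);

# Hypothetical helper (not from the original post): report the UTF8 flag
# and the taint state of a scalar in one line.
sub sv_state {
    my ($label, $value) = @_;
    return sprintf '%s: utf8=%d tainted=%d',
        $label, (is_utf8($value) ? 1 : 0), (tainted($value) ? 1 : 0);
}

# Illustrative data only; the second element forces the UTF8 flag on.
my @Lines = ("Hej mor,", "\x{1234}");
warn sv_state("\$Lines[$_]", $Lines[$_]), "\n" for 0 .. $#Lines;

my $joined = join("\n", @Lines);
warn sv_state('$joined', $joined), "\n";
```

Calling it before and after the join makes it easy to spot the point where the flag disappears.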

Testing on a small standalone program from the command line, it does seem to
work as expected:

[root@robm root]# perl -e 'use Devel::Peek; $a="\x{1234}"; @a = ("a", $a, "b"); $c = join "d", @a; Dump($c);'
SV = PV(0x811ee40) at 0x81318d0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x812e0e8 "ad\341\210\264db"\0 [UTF8 "ad\x{1234}db"]
  CUR = 7
  LEN = 8

Which actually raises a general perl question I just wanted to check. If you
have two strings and concat them, and one has the utf-8 flag and the other
doesn't, the resultant string does have the utf-8 flag set? Assuming that the
non-utf8 flagged string is ASCII, this will work fine. If it has chars >
127 in it though, it'll create a rubbish string...
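A small sketch of what happens in that concat case, outside mod_perl and without taint mode. As far as I know, perl upgrades the unflagged operand as if it were Latin-1, so the result is flagged, and bytes above 127 only turn into rubbish when they were really UTF-8 encoded octets that had not been decoded yet. The describe() helper and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(is_utf8 decode);

# Hypothetical helper: show the UTF8 flag and the character length.
sub describe {
    my ($s) = @_;
    return sprintf 'utf8=%d length=%d', (is_utf8($s) ? 1 : 0), length $s;
}

my $wide   = "\x{1234}";   # one character, UTF8 flag on
my $ascii  = "abc";        # flag off, ASCII only
my $latin1 = "\xe9";       # flag off, one byte above 127
my $octets = "\xc3\xa9";   # flag off, the UTF-8 encoding of U+00E9

# ASCII operand: nothing changes for the ASCII characters.
print 'wide . ascii:   ', describe($wide . $ascii), "\n";

# Byte above 127: upgraded as Latin-1, so 0xE9 becomes U+00E9.
print 'wide . latin1:  ', describe($wide . $latin1), "\n";

# Undecoded UTF-8 octets: each byte becomes its own character.
print 'wide . octets:  ', describe($wide . $octets), "\n";

# Decoding first gives the intended single character.
print 'wide . decoded: ', describe($wide . decode('UTF-8', $octets)), "\n";
```

The third line is the "rubbish string" case; the fix is to decode the octets before concatenating rather than to flip the flag by hand.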

Ok, so to summarise, I think I see two problems here:
1. Assigning an untainted value to a value that was previously tainted
leaves the new value tainted
2. join with utf-8 strings doesn't seem to leave the joined string with the
utf-8 flag on

Seems all a bit weird to me...

Rob


-- 
Report problems: http://perl.apache.org/bugs/
Mail list info: http://perl.apache.org/maillist/modperl.html
List etiquette: http://perl.apache.org/maillist/email-etiquette.html


Re: mod_perl and utf-8 data...

Posted by Stas Bekman <st...@stason.org>.
Rob Mueller wrote:
> Ok, I've tracked this down a bit more, and I think it's a perl problem.
> Basically it seems tainted variables and utf-8 don't work together. I did
> find one example of someone posting the same problem:
> 
> http://groups.google.com/groups?q=taint+group:perl.unicode&hl=en&lr=&ie=UTF-8&group=perl.unicode&selm=4.2.0.58.J.20040101203406.009d32e0%40dream.big.or.jp&rnum=1
> 
> Seems it's still not fixed in 5.8.4. Example code to reproduce shown
> below...

Have you tried blead-perl, Rob? In any case make sure you submit a perlbug 
report, otherwise it won't get fixed. But first check whether there is one 
submitted already (rt.perl.org).


-- 
__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com



Re: mod_perl and utf-8 data...

Posted by Rob Mueller <ro...@fastmail.fm>.
Ok, I've tracked this down a bit more, and I think it's a perl problem.
Basically it seems tainted variables and utf-8 don't work together. I did
find one example of someone posting the same problem:

http://groups.google.com/groups?q=taint+group:perl.unicode&hl=en&lr=&ie=UTF-8&group=perl.unicode&selm=4.2.0.58.J.20040101203406.009d32e0%40dream.big.or.jp&rnum=1

Seems it's still not fixed in 5.8.4. Example code to reproduce shown
below...

Rob

-----

#!/usr/bin/perl -T

package main;
use Encode qw(is_utf8 _utf8_on);
use Scalar::Util qw(tainted);
use strict;
sub handler {
  open(F, ">/tmp/tainttest") || die "could not open: $!";
  print F "aaa";
  close(F);

  my $a = "\x{1234}";
  warn '$a is utf8: ' . (is_utf8($a) ? 1 : 0) . " (expect 1)\n";
  warn '$a is tainted: ' . (tainted($a) ? 1 : 0) . " (expect 0)\n";

  open(F, "/tmp/tainttest") || die "could not open: $!";
  my $b = <F>;
  close(F);
  warn '$b is utf8: ' . (is_utf8($b) ? 1 : 0) . " (expect 0)\n";
  warn '$b is tainted: ' . (tainted($b) ? 1 : 0) . " (expect 1)\n";

  my $c = $a . $b;

  warn '$c is utf8: ' . (is_utf8($c) ? 1 : 0) . " (expect 1)\n";
  warn '$c is tainted: ' . (tainted($c) ? 1 : 0) . " (expect 1)\n";

  _utf8_on($c);

  warn '$c is utf8: ' . (is_utf8($c) ? 1 : 0) . " (expect 1)\n";
  warn '$c is tainted: ' . (tainted($c) ? 1 : 0) . " (expect 1)\n";

  my ($d) = ($b =~ /(.*)/);
  warn '$d is utf8: ' . (is_utf8($d) ? 1 : 0) . " (expect 0)\n";
  warn '$d is tainted: ' . (tainted($d) ? 1 : 0) . " (expect 0)\n";

  my $e = $a . $d;
  warn '$e is utf8: ' . (is_utf8($e) ? 1 : 0) . " (expect 1)\n";
  warn '$e is tainted: ' . (tainted($e) ? 1 : 0) . " (expect 0)\n";

  $c = $a . $d;
  warn '$c is utf8: ' . (is_utf8($c) ? 1 : 0) . " (expect 1)\n";
  warn '$c is tainted: ' . (tainted($c) ? 1 : 0) . " (expect 0)\n";

  my @a = ($a, $b);
  my $f = "@a";
  warn '$f is utf8: ' . (is_utf8($f) ? 1 : 0) . " (expect 1)\n";
  warn '$f is tainted: ' . (tainted($f) ? 1 : 0) . " (expect 1)\n";

  @a = ($a, $d);
  $f = "@a";
  warn '$f is utf8: ' . (is_utf8($f) ? 1 : 0) . " (expect 1)\n";
  warn '$f is tainted: ' . (tainted($f) ? 1 : 0) . " (expect 0)\n";

}
handler();

-----

$a is utf8: 1 (expect 1)
$a is tainted: 0 (expect 0)
$b is utf8: 0 (expect 0)
$b is tainted: 1 (expect 1)
$c is utf8: 0 (expect 1)
$c is tainted: 1 (expect 1)
$c is utf8: 0 (expect 1)
$c is tainted: 1 (expect 1)
$d is utf8: 0 (expect 0)
$d is tainted: 0 (expect 0)
$e is utf8: 1 (expect 1)
$e is tainted: 0 (expect 0)
$c is utf8: 1 (expect 1)
$c is tainted: 0 (expect 0)
$f is utf8: 0 (expect 1)
$f is tainted: 1 (expect 1)
$f is utf8: 1 (expect 1)
$f is tainted: 0 (expect 0)






Re: mod_perl and utf-8 data...

Posted by Stas Bekman <st...@stason.org>.
Rob Mueller wrote:
[...]
> Ok, so to summarise, I think I see two problems here:
> 1. Assigning an untainted value to a value that was previously tainted
> leaves the new value tainted

It's hard to tell or even try to reproduce that, since you didn't show a real
test case. What kind of variable is that? my, our, global? Usually a variable
that has once become a PVMG won't go back to a lower type such as PV. But it
should not have the MG flags set.
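A rough way to see the PVMG point from plain perl, as a sketch only. It uses pos() magic instead of taint magic to force the upgrade (an assumption made so that it runs without -T), and sv_class() is a hypothetical helper built on B::svref_2object.

```perl
use strict;
use warnings;
use B qw(svref_2object);

# Hypothetical helper: name of the B:: class that matches the scalar's
# internal type (for example B::PV or B::PVMG).
sub sv_class {
    my ($ref) = @_;
    return ref(svref_2object($ref));
}

my $s = "abc";
print 'fresh:        ', sv_class(\$s), "\n";

# Attaching magic (here via pos(), not taint) upgrades the scalar body.
pos($s) = 1;
print 'after pos():  ', sv_class(\$s), "\n";

# A plain assignment replaces the value but does not downgrade the type.
$s = "a new value";
print 'after assign: ', sv_class(\$s), "\n";
```

So seeing PVMG in the Dump output after the reassignment is not suspicious in itself; the GMG/SMG flags and the MAGIC entry are the part worth checking.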

> 2. join with utf-8 strings doesn't seem to leave the joined string with the
> utf-8 flag on

I'd try to minimize your setup to exclude everything but the code at fault. 
This can be done easily using geoff's bugreport skeleton available from here 
[1]. See first if you still have a problem using it. It could be that some 
other code that you load affects your whole setup. And you can't reproduce it 
with a shell script since you don't load the same setup there.

If you could submit a separate bugreport tarball [1] for each of the issues
you have raised, each including the shortest possible code, then it'll be much
easier for us to try to reproduce your case and only then try to find a fix.

Thanks.

[1] http://perl.apache.org/docs/1.0/guide/help.html#How_to_Report_Problems
(scroll down a bit)

