Posted to modperl@perl.apache.org by John Siracusa <si...@mindspring.com> on 2002/05/07 16:51:10 UTC
HTML::Entities chokes on XML::Parser strings
I ran into this problem during mod_perl development, and I'm posting it to
this list hoping that other mod_perl developers have dealt with the same
thing and have good solutions :)
I've found that strings collected while processing XML using XML::Parser do
not play nice with the HTML::Entities module. Here's the sample program
illustrating the problem:
#!/usr/bin/perl -w
use strict;
use HTML::Entities;
use XML::Parser;
my $buffer;
my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
          chr(0xE9) . '</test>';
$p->parse($xml);
print encode_entities($buffer), "\n";
sub xml_char
{
  my($expat, $string) = @_;
  $buffer .= $string;
}
The output unfortunately looks like this:
&Atilde;&copy;
Which makes very little sense, since the correct entity for 0xE9 is:
&eacute;
My current work-around is to run the buffer through a (lossy!?) pack/unpack
cycle:
my $buffer2 = pack("C*", unpack("U*", $buffer));
print encode_entities($buffer2), "\n";
This works and prints:
&eacute;
I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
will maul UTF-8 or UTF-16. This seems like quite an evil hack.
So, what is the Right Thing to do here? Which module, if any, is at fault?
Is there some combination of Perl Unicode-related "use" statements that will
help me here? Has anyone else run into this problem?
-John
Re: HTML::Entities chokes on XML::Parser strings
Posted by Paul Lindner <li...@inuus.com>.
On Tue, May 07, 2002 at 11:13:43AM -0400, John Siracusa wrote:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (&Atilde; is a
> > commonly seen UTF-8 escape sequence). XML::Parser converts all
> > incoming text into UTF-8. You will need to convert it back to
> > iso-8859-1.
> >
> > My favorite is Text::Iconv
> >
> > use Text::Iconv;
> > my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
> >
> > my $buffer_latin1 = $converter->convert($buffer);
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)? What if
> I have actual UTF-8 data? Won't conversion to ISO8859-1 in service of
> HTML::Entities result in data loss?
Yes, HTML::Entities is based on ISO8859-1 input only. BTW, for better
performance in mod_perl, consider using Apache::Util::escape_html():
escape_html
This routine replaces unsafe characters in $string
with their entity representation.
my $esc = Apache::Util::escape_html($html);
Anyway, back to character entities..
Text::Iconv will fail if you try to convert unconvertible text, so at
least you can test for that condition (and adjust accordingly).
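The same "test for failure" idea can be sketched with the core Encode module instead of Text::Iconv. This is a swapped-in technique, not the code Paul describes: Encode ships with Perl 5.8 and later, and the helper name to_latin1 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns ISO-8859-1 bytes, or undef when the
# input contains a character that ISO-8859-1 cannot represent.
sub to_latin1 {
    my ($chars) = @_;
    # FB_CROAK makes encode() die on an unmappable character;
    # LEAVE_SRC keeps encode() from modifying its argument.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { encode('ISO-8859-1', $chars, $check) };
}

for my $text ("caf\x{E9}", "smile \x{263A}") {
    my $bytes = to_latin1($text);
    print defined $bytes ? "convertible\n" : "not convertible\n";
}
```

A caller can then fall back to numeric character references, or reject the input, when the helper returns undef.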
BasisTech sells a comprehensive Unicode library called Rosette that
knows how to automatically convert to a target character set while
incorporating SGML entities for any character set. Perhaps it's time
for an open implementation of that.
Also see http://rf.net/~james/perli18n.html for a Perl i18n FAQ.
--
Paul Lindner lindner@inuus.com ||||| | | | | | | | | |
mod_perl Developer's Cookbook http://www.modperlcookbook.org/
Human Rights Declaration http://www.unhchr.ch/udhr/
Re: HTML::Entities chokes on XML::Parser strings
Posted by John Siracusa <si...@mindspring.com>.
On 5/7/02 11:25 AM, Gisle Aas wrote:
> John Siracusa <si...@mindspring.com> writes:
>> On 5/7/02 10:58 AM, Paul Lindner wrote:
>>> The output from your example looks like UTF-8 data (&Atilde; is a
>>> commonly seen UTF-8 escape sequence). XML::Parser converts all
>>> incoming text into UTF-8. You will need to convert it back to
>>> iso-8859-1.
>>>
>>> My favorite is Text::Iconv
>>>
>>> use Text::Iconv;
>>> my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>>>
>>> my $buffer_latin1 = $converter->convert($buffer);
>>
>> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?
>
> Not true. But the unicode support in perl-5.6.x has many bugs. With
> 5.8 things will be better. It is a bad idea for XML::Parser to give
> out strings with the UTF8 flag set.
Well, I'll let you guys figure it out (all fixed in 5.8, right? :) In the
meantime, I guess I'll stick with the workaround(s) posted... :)
-John
Re: HTML::Entities chokes on XML::Parser strings
Posted by Gisle Aas <gi...@ActiveState.com>.
John Siracusa <si...@mindspring.com> writes:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (&Atilde; is a
> > commonly seen UTF-8 escape sequence). XML::Parser converts all
> > incoming text into UTF-8. You will need to convert it back to
> > iso-8859-1.
> >
> > My favorite is Text::Iconv
> >
> > use Text::Iconv;
> > my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
> >
> > my $buffer_latin1 = $converter->convert($buffer);
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?
Not true. But the unicode support in perl-5.6.x has many bugs. With
5.8 things will be better. It is a bad idea for XML::Parser to give
out strings with the UTF8 flag set.
Regards,
Gisle
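As a rough illustration of what "better in 5.8" can look like, here is a minimal sketch for Perl 5.8 or later, assuming HTML::Entities is installed and that the input arrives as UTF-8 bytes. The sample data and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Assumption: $bytes holds UTF-8 encoded text from some outside source.
my $bytes = "\xC3\xA9 \xE2\x98\xBA";   # e-acute, space, U+263A

# Decode once at the boundary so Perl works with characters, not bytes.
my $chars = decode('UTF-8', $bytes);

# Characters that have a named entity get it; other characters outside
# the default safe set come out as numeric character references.
my $html = encode_entities($chars);
print "$html\n";
```

Because the entity encoding happens on characters, no conversion to ISO-8859-1 is needed and nothing outside Latin-1 is lost.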
Re: HTML::Entities chokes on XML::Parser strings
Posted by John Siracusa <si...@mindspring.com>.
On 5/7/02 10:58 AM, Paul Lindner wrote:
> The output from your example looks like UTF-8 data (&Atilde; is a
> commonly seen UTF-8 escape sequence). XML::Parser converts all
> incoming text into UTF-8. You will need to convert it back to
> iso-8859-1.
>
> My favorite is Text::Iconv
>
> use Text::Iconv;
> my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>
> my $buffer_latin1 = $converter->convert($buffer);
So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)? What if
I have actual UTF-8 data? Won't conversion to ISO8859-1 in service of
HTML::Entities result in data loss?
-John
Re: HTML::Entities chokes on XML::Parser strings
Posted by Paul Lindner <li...@inuus.com>.
The output from your example looks like UTF-8 data (&Atilde; is a
commonly seen UTF-8 escape sequence). XML::Parser converts all
incoming text into UTF-8. You will need to convert it back to
iso-8859-1.
My favorite is Text::Iconv
use Text::Iconv;
my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
my $buffer_latin1 = $converter->convert($buffer);
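To see where the stray &Atilde; comes from, it can help to look at the bytes. The following is a minimal sketch using the core Encode module (Perl 5.8 and later, so not the 5.6 setup discussed in this thread); the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, e-acute, as it appears in the iso-8859-1 document.
my $char = chr(0xE9);

# The same character encoded as UTF-8 takes two bytes.
my $utf8_bytes = encode('UTF-8', $char);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;

# Reading each of those bytes as an ISO-8859-1 character yields
# A-tilde and the copyright sign, hence &Atilde;&copy; in the output.
my $misread = decode('ISO-8859-1', $utf8_bytes);
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $misread;

print "UTF-8 bytes:     $hex\n";          # C3 A9
print "Read as Latin-1: $code_points\n";  # U+00C3 U+00A9
```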
On Tue, May 07, 2002 at 10:51:10AM -0400, John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to
> this list hoping that other mod_perl developers have dealt with the same
> thing and have good solutions :)
>
> I've found that strings collected while processing XML using XML::Parser do
> not play nice with the HTML::Entities module. Here's the sample program
> illustrating the problem:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> use HTML::Entities;
> use XML::Parser;
>
> my $buffer;
>
> my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
>
> my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
> chr(0xE9) . '</test>';
>
> $p->parse($xml);
>
> print encode_entities($buffer), "\n";
>
> sub xml_char
> {
> my($expat, $string) = @_;
>
> $buffer .= $string;
> }
>
> The output unfortunately looks like this:
>
> &Atilde;&copy;
>
> Which makes very little sense, since the correct entity for 0xE9 is:
>
> &eacute;
>
> My current work-around is to run the buffer through a (lossy!?) pack/unpack
> cycle:
>
> my $buffer2 = pack("C*", unpack("U*", $buffer));
> print encode_entities($buffer2), "\n";
>
> This works and prints:
>
> &eacute;
>
> I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
> will maul UTF-8 or UTF-16. This seems like quite an evil hack.
>
> So, what is the Right Thing to do here? Which module, if any, is at fault?
> Is there some combination of Perl Unicode-related "use" statements that will
> help me here? Has anyone else run into this problem?
>
> -John
--
Paul Lindner lindner@inuus.com ||||| | | | | | | | | |
mod_perl Developer's Cookbook http://www.modperlcookbook.org/
Human Rights Declaration http://www.unhchr.ch/udhr/
Re: HTML::Entities chokes on XML::Parser strings
Posted by John Siracusa <si...@mindspring.com>.
On 5/7/02 11:06 AM, Rafael Garcia-Suarez wrote:
> The workaround I used is to write the handler like this:
>
> sub xml_char
> {
>   my ($expat) = @_;
>   $buffer .= $expat->original_string;
> }
>
> Since this reads the original string, there is no need to convert
> UTF-8 back to iso-8859-1.
Doh! I dunno why I didn't think of that, since I've used that expat method
plenty of times before. This seems safer than forcing a conversion from
UTF-8 to something else (although the other technique is nice to know too :)
-John
Re: HTML::Entities chokes on XML::Parser strings
Posted by Rafael Garcia-Suarez <ra...@hexaflux.com>.
John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to
> this list hoping that other mod_perl developers have dealt with the same
> thing and have good solutions :)
I did ;-)
> I've found that strings collected while processing XML using XML::Parser do
> not play nice with the HTML::Entities module. Here's the sample program
> illustrating the problem:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> use HTML::Entities;
> use XML::Parser;
>
> my $buffer;
>
> my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
>
> my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
> chr(0xE9) . '</test>';
>
> $p->parse($xml);
>
> print encode_entities($buffer), "\n";
>
> sub xml_char
> {
> my($expat, $string) = @_;
>
> $buffer .= $string;
> }
>
> The output unfortunately looks like this:
>
> &Atilde;&copy;
>
> Which makes very little sense, since the correct entity for 0xE9 is:
>
> &eacute;
That's an XML::Parser issue.
XML::Parser gives UTF-8 to your Char handler, as specified in the manpage:
"Whatever the encoding of the string in the original document,
this is given to the handler in UTF-8."
The workaround I used is to write the handler like this:
sub xml_char
{
  my ($expat) = @_;
  $buffer .= $expat->original_string;
}
Since this reads the original string, there is no need to convert
UTF-8 back to iso-8859-1.
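One thing worth checking before relying on original_string: it returns the text verbatim from the document, so entity references in the character data are not expanded. A small sketch, assuming XML::Parser is installed; the sample document is invented for the illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer = '';

my $p = XML::Parser->new(Handlers => { Char => \&xml_char });

# The character data here contains an entity reference.
$p->parse('<?xml version="1.0" encoding="iso-8859-1"?><test>a &amp; b</test>');

# The buffer holds the source text as written, reference included.
print "$buffer\n";

sub xml_char
{
    my ($expat) = @_;
    # original_string: the document text that triggered this handler
    # call, in the document's own encoding.
    $buffer .= $expat->original_string;
}
```

Passing such a buffer to encode_entities would escape the ampersand a second time (giving &amp;amp;), so this workaround fits best when the character data is known to contain no entity references.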
> My current work-around is to run the buffer through a (lossy!?) pack/unpack
> cycle:
>
> my $buffer2 = pack("C*", unpack("U*", $buffer));
> print encode_entities($buffer2), "\n";
>
> This works and prints:
>
> &eacute;
>
> I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
> will maul UTF-8 or UTF-16. This seems like quite an evil hack.
>
> So, what is the Right Thing to do here? Which module, if any, is at fault?
> Is there some combination of Perl Unicode-related "use" statements that will
> help me here? Has anyone else run into this problem?
>
> -John
>
--
Rafael Garcia-Suarez