Posted to modperl@perl.apache.org by John Siracusa <si...@mindspring.com> on 2002/05/07 16:51:10 UTC

HTML::Entities chokes on XML::Parser strings

I ran into this problem during mod_perl development, and I'm posting it to
this list hoping that other mod_perl developers have dealt with the same
thing and have good solutions :)

I've found that strings collected while processing XML using XML::Parser do
not play nicely with the HTML::Entities module.  Here's a sample program
illustrating the problem:

    #!/usr/bin/perl -w

    use strict;

    use HTML::Entities;
    use XML::Parser;

    my $buffer;

    my $p = XML::Parser->new(Handlers => { Char  => \&xml_char });

    my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
              chr(0xE9) . '</test>';

    $p->parse($xml);

    print encode_entities($buffer), "\n";

    sub xml_char
    {
      my($expat, $string) = @_;
  
      $buffer .= $string;
    }

The output unfortunately looks like this:

    &Atilde;&copy;

Which makes very little sense, since the correct entity for 0xE9 is:

    &eacute;

My current work-around is to run the buffer through a (lossy!?) pack/unpack
cycle:

    my $buffer2 = pack("C*", unpack("U*", $buffer));
    print encode_entities($buffer2), "\n";

This works and prints:

    &eacute;

I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
will maul UTF-8 or UTF-16.  This seems like quite an evil hack.
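
A minimal sketch of where the pack/unpack round trip stops being safe,
assuming a Perl where unpack("U*") on a character string returns its code
points. The test string and variable names are illustrative only and are not
taken from the program above.

```perl
use strict;
use warnings;

# Illustrative character string: e-acute (U+00E9) plus the euro sign (U+20AC).
my $chars = "\x{E9}\x{20AC}";

my $mauled;
{
    # pack("C") warns when it has to wrap a value above 255.
    no warnings 'pack';
    $mauled = pack("C*", unpack("U*", $chars));
}

# U+00E9 fits in one octet and survives; U+20AC is cut down to its low
# eight bits (0xAC), so the euro sign is lost.
print join(' ', map { sprintf '%02X', ord($_) } split //, $mauled), "\n";
```

Anything in the ISO-8859-1 range comes through unchanged, which is probably
why the workaround looks fine with the sample document.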

So, what is the Right Thing to do here?  Which module, if any, is at fault?
Is there some combination of Perl Unicode-related "use" statements that will
help me here?  Has anyone else run into this problem?

-John

Re: HTML::Entities chokes on XML::Parser strings

Posted by Paul Lindner <li...@inuus.com>.

On Tue, May 07, 2002 at 11:13:43AM -0400, John Siracusa wrote:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (&Atilde; is a
> > commonly seen UTF-8 escape sequence).  XML::Parser converts all
> > incoming text into UTF-8.  You will need to convert it back to
> > iso-8859-1.
> > 
> > My favorite is Text::Iconv
> > 
> >        use Text::Iconv;
> >        my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
> > 
> >        my $buffer_latin1 = $converter->convert($buffer);
> 
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?  What if
> I have actual UTF-8 data?  Won't conversion to ISO8859-1 in service of
> HTML::Entities result in data loss?

Yes, HTML::Entities is based on ISO8859-1 input only.  BTW, for better
performance in mod_perl consider using Apache::Util::escape_html():


 escape_html
           This routine replaces unsafe characters in $string
           with their entity representation.

            my $esc = Apache::Util::escape_html($html);


Anyway, back to character entities..

Text::Iconv will fail if you try to convert unconvertible text, so at
least you can test for that condition (and adjust accordingly).
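
A rough sketch of testing for that condition. Note that it swaps in the core
Encode module (bundled with Perl 5.8 and later) in place of Text::Iconv, and
that to_latin1_or_undef is a hypothetical helper name, not part of either
module. It assumes the input is a string of UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 encoded bytes and returns ISO-8859-1
# bytes, or undef if the input is malformed or contains a character that
# has no ISO-8859-1 equivalent.
sub to_latin1_or_undef {
    my ($utf8_bytes) = @_;

    # FB_CROAK makes Encode die on bad input; LEAVE_SRC keeps it from
    # modifying the source string in place.
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    my $chars = eval { decode('UTF-8', $utf8_bytes, $check) };
    return undef unless defined $chars;

    # eval yields undef here if encode died on an unmappable character.
    my $latin1 = eval { encode('iso-8859-1', $chars, $check) };
    return $latin1;
}

# e-acute converts cleanly; the euro sign does not exist in ISO-8859-1.
for my $input ("\xC3\xA9", "\xE2\x82\xAC") {
    my $out = to_latin1_or_undef($input);
    print defined $out ? "converted\n" : "not convertible\n";
}
```

When the helper returns undef, the caller can fall back to something else,
such as numeric character references for the characters that did not fit.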

BasisTech sells a comprehensive Unicode library called Rosette that
knows how to automatically convert to a target character set while
incorporating SGML entities for any character set.  Perhaps it's time
for an open implementation of that..

Also see http://rf.net/~james/perli18n.html for a Perl i18n FAQ.




-- 
Paul Lindner    lindner@inuus.com   ||||| | | | |  |  |  |   |   |

    mod_perl Developer's Cookbook   http://www.modperlcookbook.org/
         Human Rights Declaration   http://www.unhchr.ch/udhr/

Re: HTML::Entities chokes on XML::Parser strings

Posted by John Siracusa <si...@mindspring.com>.

On 5/7/02 11:25 AM, Gisle Aas wrote:
> John Siracusa <si...@mindspring.com> writes:
>> On 5/7/02 10:58 AM, Paul Lindner wrote:
>>> The output from your example looks like UTF-8 data (&Atilde; is a
>>> commonly seen UTF-8 escape sequence).  XML::Parser converts all
>>> incoming text into UTF-8.  You will need to convert it back to
>>> iso-8859-1.
>>> 
>>> My favorite is Text::Iconv
>>> 
>>>        use Text::Iconv;
>>>        my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>>> 
>>>        my $buffer_latin1 = $converter->convert($buffer);
>> 
>> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?
> 
> Not true.  But the unicode support in perl-5.6.x has many bugs.  With
> 5.8 things will be better.  It is a bad idea for XML::Parser to give
> out strings with the UTF8 flag set.

Well, I'll let you guys figure it out (all fixed in 5.8, right? :)  In the
meantime, I guess I'll stick with the workaround(s) posted... :)

-John

Re: HTML::Entities chokes on XML::Parser strings

Posted by Gisle Aas <gi...@ActiveState.com>.

John Siracusa <si...@mindspring.com> writes:

> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (&Atilde; is a
> > commonly seen UTF-8 escape sequence).  XML::Parser converts all
> > incoming text into UTF-8.  You will need to convert it back to
> > iso-8859-1.
> > 
> > My favorite is Text::Iconv
> > 
> >        use Text::Iconv;
> >        my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
> > 
> >        my $buffer_latin1 = $converter->convert($buffer);
> 
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?

Not true.  But the unicode support in perl-5.6.x has many bugs.  With
5.8 things will be better.  It is a bad idea for XML::Parser to give
out strings with the UTF8 flag set.

Regards,
Gisle

Re: HTML::Entities chokes on XML::Parser strings

Posted by John Siracusa <si...@mindspring.com>.

On 5/7/02 10:58 AM, Paul Lindner wrote:
> The output from your example looks like UTF-8 data (&Atilde; is a
> commonly seen UTF-8 escape sequence).  XML::Parser converts all
> incoming text into UTF-8.  You will need to convert it back to
> iso-8859-1.
> 
> My favorite is Text::Iconv
> 
>        use Text::Iconv;
>        my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
> 
>        my $buffer_latin1 = $converter->convert($buffer);

So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?  What if
I have actual UTF-8 data?  Won't conversion to ISO8859-1 in service of
HTML::Entities result in data loss?

-John

Re: HTML::Entities chokes on XML::Parser strings

Posted by Paul Lindner <li...@inuus.com>.

The output from your example looks like UTF-8 data (&Atilde; is a
commonly seen UTF-8 escape sequence).  XML::Parser converts all
incoming text into UTF-8.  You will need to convert it back to
iso-8859-1.

My favorite is Text::Iconv

         use Text::Iconv;
         my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");

         my $buffer_latin1 = $converter->convert($buffer);
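
To see where the &Atilde;&copy; pair comes from, here is a minimal sketch
using the core Encode module (bundled with Perl 5.8 and later, so an
assumption relative to a 5.6-era setup). The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 (e-acute) as a one-character string.
my $char = "\x{E9}";

# Its UTF-8 encoding is the two bytes 0xC3 0xA9.
my $utf8_bytes = encode('UTF-8', $char);

# Reading each byte as its own ISO-8859-1 character gives U+00C3 and
# U+00A9, whose HTML entities are &Atilde; and &copy;.
my @misread = map { sprintf 'U+%04X', ord($_) } split //, $utf8_bytes;

print "@misread\n";
```

This prints "U+00C3 U+00A9", which matches the two entities in the output
when the UTF-8 bytes are treated as ISO-8859-1 characters.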



-- 
Paul Lindner    lindner@inuus.com   ||||| | | | |  |  |  |   |   |

    mod_perl Developer's Cookbook   http://www.modperlcookbook.org/
         Human Rights Declaration   http://www.unhchr.ch/udhr/

Re: HTML::Entities chokes on XML::Parser strings

Posted by John Siracusa <si...@mindspring.com>.

On 5/7/02 11:06 AM, Rafael Garcia-Suarez wrote:
> The workaround I used is to write the handler like this :
> 
> sub xml_char
> {
>  my ($expat) = @_;
>  $buffer .= $expat->original_string;
> }
> 
> Reading the original string, no need to convert UTF-8 back to iso-8859-1.

Doh!  I dunno why I didn't think of that, since I've used that expat method
plenty of times before.  This seems safer than forcing a conversion from
UTF-8 to something else (although the other technique is nice to know too :)

-John

Re: HTML::Entities chokes on XML::Parser strings

Posted by Rafael Garcia-Suarez <ra...@hexaflux.com>.

John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to
> this list hoping that other mod_perl developers have dealt with the same
> thing and have good solutions :)

I did ;-)

> I've found that strings collected while processing XML using XML::Parser do
> not play nice with the HTML::Entities module.  Here's the sample program
> illustrating the problem:
> 
>     #!/usr/bin/perl -w
> 
>     use strict;
> 
>     use HTML::Entities;
>     use XML::Parser;
> 
>     my $buffer;
> 
>     my $p = XML::Parser->new(Handlers => { Char  => \&xml_char });
> 
>     my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
>               chr(0xE9) . '</test>';
> 
>     $p->parse($xml);
> 
>     print encode_entities($buffer), "\n";
> 
>     sub xml_char
>     {
>       my($expat, $string) = @_;
>   
>       $buffer .= $string;
>     }
> 
> The output unfortunately looks like this:
> 
>     &Atilde;&copy;
> 
> Which makes very little sense, since the correct entity for 0xE9 is:
> 
>     &eacute;

That's an XML::Parser issue.
XML::Parser gives UTF-8 to your Char handler, as specified in the manpage:
"Whatever the encoding of the string in the original document,
this is given to the handler in UTF-8."

The workaround I used is to write the handler like this:

sub xml_char
{
   my ($expat) = @_;
   $buffer .= $expat->original_string;
}

Reading the original string, no need to convert UTF-8 back to iso-8859-1.




-- 
Rafael Garcia-Suarez