Posted to p-dev@xerces.apache.org by "Jason E. Stewart" <ja...@openinformatics.com> on 2001/10/27 02:55:35 UTC

Re: Bug in UTF-8 output

"Jiri Fiser" <Fi...@seznam.cz> writes:

> I want to use your XML Perl module (1.5.1) for processing XML documents
> which are written in Czech (diploma theses of my students). These
> documents will be written in ISO Latin 2 (Linux) or CP 1250 (Windows)
> encodings and I want to transform them to UTF-8 encoding before the
> processing.
> 
> I have tried your XML module and under standard conditions (Linux RH 7.2,
> locale = cs_CZ) it's all OK, all strings were converted from UTF-8 to
> ISO Latin 2 (8859-2) (without error). Unfortunately I need output in
> UTF-8. When I tried the locale cs_CZ.utf8 with the utf8 option in
> Perl 5.6.0, the output was in UTF-8, but strings with multibyte (=2 bytes
> for Czech) UTF-8 characters are reduced (shortened) at their end (probably
> by one character for each multibyte character in string

It seems that I never got around to answering your question. The
answer is that I bollocksed the Unicode support for Xerces.

<technical-digression>
In order to get perl to talk to Xerces-C++ we build a layer of C++
glue code that talks both to the Perl C API and to the Xerces C++ API:

XML::Xerces <=> Perl C API <=> Glue Code <=> Xerces C++ API

The XML::Xerces <=> Perl C API step is done automatically by perl. The glue
code is generated automatically for us by SWIG. In order to make it
nice and simple to write applications using XML::Xerces, we allow (or force)
people to use generic perl strings instead of the nasty brutish
DOMString's or the XMLCh*'s that the C++ API uses. For every
method that passes string arguments to Xerces, SWIG inserts
some code that converts Perl's strings into the DOMString objects or
the XMLCh*'s that Xerces wants. And for every method that returns a
DOMString or an XMLCh*, SWIG converts it to a Perl string.

</technical-digression>

What I bollocksed was the conversion routines: I mashed everything
into plain old vanilla char*'s... Same old stupid American programmer
mistake. Until recently all the test files I had were either ASCII or
ISO-8859-1, so I didn't realize I had screwed up.

Here's an example of the typemap converter code I mentioned:

  %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
    if (  SvPOK( $source )  ) {
      char *ptr = (char *)SvPV( $source,PL_na);
  //  ^^^^^^^^^^^^^^^^^^^^      
  //  HERE THERE BE DRAGONS
      $target = temp_qualifiedName = XMLString::transcode(ptr);
    } else {
      croak("Type error in argument 2 of $name, Expected perl-string.");
      XSRETURN(1);
    }
  }

What I'll need to do is to put a test in there so the code looks like:

  %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
    if (  SvPOK( $source )  ) {
      if (SvUTF8($source)) {
        // turn it into a UTF8 XMLCh*
      } else {
        // turn it into a ISO-8859-1 XMLCh*
      }
    } else {
      croak("Type error in argument 2 of $name, Expected perl-string.");
      XSRETURN(1);
    }
  }

I'm not sure of the exact code for the first case, so I'll have to
check. Luckily, because I use SWIG, I only have to change the code in
a few places.
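
For the two branches above, a minimal standalone C++ sketch of the
conversions might look like the following. It assumes XMLCh is a 16-bit
UTF-16 code unit (modeled here as char16_t). The helper names
latin1_to_utf16, utf8_to_utf16 and to_utf16 are hypothetical, the is_utf8
flag stands in for the result of SvUTF8(), and none of this is the actual
XML::Xerces typemap code. It does not reject overlong encodings or
UTF-8-encoded surrogates, which a production decoder should.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen ISO-8859-1 bytes to UTF-16. Every Latin-1
// byte maps to the Unicode code point with the same numeric value.
std::u16string latin1_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in) out.push_back(static_cast<char16_t>(c));
    return out;
}

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Throws std::runtime_error on malformed or truncated input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of Unicode range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Mirrors the if/else in the typemap: pick the decoder from the flag.
std::u16string to_utf16(const std::string& bytes, bool is_utf8) {
    return is_utf8 ? utf8_to_utf16(bytes) : latin1_to_utf16(bytes);
}
```

With this, a Czech "c with caron" arriving as the two UTF-8 bytes 0xC4 0x8D
becomes the single code unit 0x010D when the flag is set, and two separate
Latin-1 characters when it is not, which is why the SvUTF8 test matters.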

Thanks for figuring this out,
jas.

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-p-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-p-dev-help@xml.apache.org


Re: Unicode compliance for Xerces.pm

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.
I've had a chance to play around with Perl's Unicode support and the
issues of transcoding between Perl's UTF-8 and Xerces' UTF-16. 

"Jason E. Stewart" <ja...@openinformatics.com> writes:

> 1. When Xerces returns a string that has high-order UTF-8 characters
>    (i.e. when the chars are outside the ASCII range 0-127) I'll have to
>    transcode from UTF-16 into UTF-8.
>   
> 2. This will involve a lot of back and forth transcoding if the
>    document contains a significant amount of high-order information. I
>    don't know what effect this will have on the running time of the
>    code. 

OK. These two are clearly dumb, because I have to transcode no matter
what. UTF-16 can't take straight ASCII bytes as-is because it's a
two-byte format. So there's no *more* transcoding with UTF-8 than there
was with just ASCII or ISO-8859-1.
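
To make that concrete, here is a small C++ sketch of the return direction
(UTF-16 back to UTF-8). The helper name utf16_to_utf8 is hypothetical,
char16_t stands in for XMLCh, and replacing unpaired surrogates with U+FFFD
is a choice made for this sketch, not something Xerces or Perl prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];

        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        // Anything still in the surrogate range is unpaired.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Even a pure-ASCII string such as "abc" occupies six bytes as UTF-16, so it
has to be converted on the way out whether or not any high-order
characters are present.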

> 3. To keep the glue code that passes arguments from perl to Xerces
>    simple, I will always have to call the XMLCh* interfaces instead
>    of the char* ones. 

This I think is a good thing. No one will notice.

> 4. Currently, if a Xerces API method returns a DOMString object or an
>    XMLCh*, there is no way to keep that object, the glue code converts
>    all of them into perl strings for 'convenience'. I think this is a
>    feature, but it might turn out to have performance benefits to
>    allow users to keep them around. 
> 
> 5. All of this is going to require the use of Perl-5.6.0 or better. I
>    get a lot of notices from people still using 5.004 and 5.005, so
>    this is going to mean upgrading for a lot of people. It is possible
>    that I can make the code conditional and people with 5.005/4 could
>    compile XML::Xerces but just not get unicode support. 
> 
>      I WILL NOT ATTEMPT THIS WITHOUT ASSISTANCE => it's a lot of
>      work.

Seriously. This is a lot of work. I'm not even going to think about
this unless someone pipes up.

> The issue with 4 is tricky. Perl is great about giving lots of
> information about what context a method is being called in. For
> example the following all look different to perl:
> 
> I believe that it can be handled with a pragma similar to that of 
> 
>   use utf8;
> 
> maybe 
> 
>   use utf16;

I think this is dumb. It should just work. All methods that return a
DOMString object in C++ should do the same in Perl, except that in
Perl you should be able to use a DOMString object in *exactly* the
same way that you would use a Perl string:

my $dom_string = $element->getAttribute('foo');
$dom_string .= 'bar'; # concatenation
print STDERR "The new value for foo is: $dom_string\n"; # stringify

etc...

The same for XMLCh*. There should just be a class, maybe
XML::Xerces::XMLCh or XML::Xerces::XMLString and any method that
returns an XMLCh* would just return an XML::Xerces::XMLString
instance. 

This would mean a lot less transcoding, AND it would solve some of the
issues with Perl and Unicode. Both Perl 5.6.0 and 5.6.1 have
significant Unicode bugs. It's not until the 5.7.2 development series
that many really key patches come in.

jas.


Unicode compliance for Xerces.pm

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.
"Jason E. Stewart" <ja...@openinformatics.com> writes:

> What I'll need to do is to put a test in there so the code looks like:
> 
>   %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
>     if (  SvPOK( $source )  ) {
>       if (SvUTF8($source)) {
>         // turn it into a UTF8 XMLCh*
>       } else {
>         // turn it into a ISO-8859-1 XMLCh*
>       }
>     } else {
>       croak("Type error in argument 2 of $name, Expected perl-string.");
>       XSRETURN(1);
>     }
>   }

There are a couple of noteworthy consequences to this:

1. When Xerces returns a string that has high-order UTF-8 characters
   (i.e. when the chars are outside the ASCII range 0-127) I'll have to
   transcode from UTF-16 into UTF-8.
  
2. This will involve a lot of back and forth transcoding if the
   document contains a significant amount of high-order information. I
   don't know what effect this will have on the running time of the
   code. 
  
3. To keep the glue code that passes arguments from perl to Xerces
   simple, I will always have to call the XMLCh* interfaces instead
   of the char* ones. 
  
4. Currently, if a Xerces API method returns a DOMString object or an
   XMLCh*, there is no way to keep that object, the glue code converts
   all of them into perl strings for 'convenience'. I think this is a
   feature, but it might turn out to have performance benefits to
   allow users to keep them around. 

5. All of this is going to require the use of Perl-5.6.0 or better. I
   get a lot of notices from people still using 5.004 and 5.005, so
   this is going to mean upgrading for a lot of people. It is possible
   that I can make the code conditional and people with 5.005/4 could
   compile XML::Xerces but just not get unicode support. 

     I WILL NOT ATTEMPT THIS WITHOUT ASSISTANCE => it's a lot of work.

The issue with 4 is tricky. Perl is great about giving lots of
information about what context a method is being called in. For
example the following all look different to perl:

$a = foo(); # scalar context
@a = foo(); # list context
foo();      # void context

So within any method, I can figure out exactly what value to return to
best satisfy the user. It is tricky because these look identical to
Perl: 

my $dom_string  = $element->getAttribute('foo');
my $perl_string = $element->getAttribute('foo');

Even though the first should return a reference to an
XML::Xerces::DOMString object, and the second should return a vanilla
perl string, there is no way to tell them apart. So I'll need more
help from the user.

I believe that it can be handled with a pragma similar to that of 

  use utf8;

maybe 

  use utf16;

The user could then turn it on for different pieces of the
application by using code blocks:

  # now we get perl strings from all functions
  my $perl_string = $element->getAttribute('foo');
  $perl_string .= 'nothing up my sleeve';
  $element->setAttribute('foo', $perl_string);
  {
    use utf16;
    # now we get DOMString's from all functions
    my $dom_string = $element->getAttribute('baz');
    $dom_string->appendData($perl_string);
    $element->setAttribute('baz', $dom_string);
  }
  # now we get perl strings from all functions
  my $perl_string2 = $element->getAttribute('bar');

jas.





