Posted to p-dev@xerces.apache.org by "Jason E. Stewart" <ja...@openinformatics.com> on 2001/10/27 17:27:22 UTC

Unicode compliance for Xerces.pm

"Jason E. Stewart" <ja...@openinformatics.com> writes:

> What I'll need to do is to put a test in there so the code looks like:
> 
>   %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
>     if (  SvPOK( $source )  ) {
>       if (SvUTF8($source)) {
>         // turn it into a UTF8 XMLCh*
>       } else {
>         // turn it into a ISO-8859-1 XMLCh*
>       }
>     } else {
>       croak("Type error in argument 2 of $name, Expected perl-string.");
>       XSRETURN(1);
>     }
>   }

There are a couple of noteworthy consequences to this:

1. When Xerces returns a string that has high-order UTF-8 characters
   (i.e. when the chars are outside the ASCII range 0-127) I'll have to
   transcode from UTF-16 into UTF-8.

2. This will involve a lot of back and forth transcoding if the
   document contains a significant amount of high-order information. I
   don't know what effect this will have on the running time of the
   code. 

3. To keep the glue code that passes arguments from Perl to Xerces
   simple, I will always have to call the XMLCh* interfaces instead
   of the char* ones. 

4. Currently, if a Xerces API method returns a DOMString object or an
   XMLCh*, there is no way to keep that object: the glue code converts
   all of them into perl strings for 'convenience'. I think this is a
   feature, but it might turn out to have performance benefits to
   allow users to keep them around. 

5. All of this is going to require the use of Perl-5.6.0 or better. I
   get a lot of notices from people still using 5.004 and 5.005, so
   this is going to mean upgrading for a lot of people. It is possible
   that I can make the code conditional and people with 5.005/4 could
   compile XML::Xerces but just not get unicode support. 

     I WILL NOT ATTEMPT THIS WITHOUT ASSISTANCE => it's a lot of work.
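
The SvUTF8() branch in the typemap has a Perl-level counterpart that
makes the distinction easy to see. A minimal sketch, assuming Perl
5.8.1 or later (where utf8::is_utf8 is built in), which is newer than
the 5.6.0 discussed here; the variable names are illustrative only:

```perl
use strict;
use warnings;

# A byte string: every character fits in ISO-8859-1, so the UTF-8 flag stays off.
my $latin1 = "caf\xe9";

# A character above 0xFF forces Perl to store the string as UTF-8 internally.
my $wide = "caf\xe9 \x{263a}";

# utf8::is_utf8 reports the same internal flag that SvUTF8() tests in XS code.
my $latin1_flag = utf8::is_utf8($latin1) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# Upgrading changes the internal storage, not the characters themselves.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
my $same_text = ($upgraded eq $latin1) ? 1 : 0;

print "latin1 flag: $latin1_flag, wide flag: $wide_flag, same text: $same_text\n";
```

The two branches of the typemap correspond to the two flag values: a
flagged string has to be read as UTF-8, an unflagged one as
ISO-8859-1.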

The issue with 4 is tricky. Perl is great about giving lots of
information about what context a method is being called in. For
example the following all look different to perl:

$a = foo(); # scalar context
@a = foo(); # list context
foo();      # void context

So within any method, I can figure out exactly what value to return to
best satisfy the user. It is tricky because these look identical to
Perl: 

my $dom_string  = $element->getAttribute('foo');
my $perl_string = $element->getAttribute('foo');

Even though the first should return a reference to an
XML::Xerces::DOMString object, and the second should return a vanilla
perl string, there is no way to tell them apart. So I'll need more
help from the user.
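
To make the limitation concrete, here is a small sketch using Perl's
built-in wantarray. The foo() below is a stand-in, not a Xerces
method, and the @log array exists only to record what each call saw:

```perl
use strict;
use warnings;

my @log;

# Stand-in method: records the context it was called in.
sub foo {
    my $ctx = wantarray          ? 'list'
            : defined(wantarray) ? 'scalar'
            :                      'void';
    push @log, $ctx;
    return 'value';
}

my $scalar = foo();        # scalar context
my @list   = foo();        # list context
foo();                     # void context

# Both of these are plain scalar assignments; the variable name on the
# left-hand side is invisible to the called code.
my $dom_string  = foo();
my $perl_string = foo();

print join(', ', @log), "\n";
```

The last two calls log the same context, which is why the return type
cannot be chosen from the call site alone.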

I believe that it can be handled with a pragma similar to that of 

  use utf8;

maybe 

  use utf16;

The user could then turn it on for different pieces of the
application by using code blocks:

  # now we get perl strings from all functions
  my $perl_string = $element->getAttribute('foo');
  $perl_string .= 'nothing up my sleeve';
  $element->setAttribute('foo', $perl_string);
  {
    use utf16;
    # now we get DOMStrings from all functions
    my $dom_string = $element->getAttribute('baz');
    $dom_string->appendData($perl_string);
    $element->setAttribute('baz', $dom_string);
  }
  # now we get perl strings from all functions
  my $perl_string2 = $element->getAttribute('bar');
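
A block-scoped switch like this could be sketched with Perl's lexical
hint hash. This assumes Perl 5.10 or later, where %^H is visible to
called code through caller(); it would not work on 5.6. The pragma
name utf16_sketch and the get_attribute() stand-in are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical pragma, defined inline so the sketch fits in one file.
BEGIN {
    package utf16_sketch;
    $INC{'utf16_sketch.pm'} = __FILE__;    # pretend it was loaded from disk
    sub import   { $^H{'utf16_sketch/on'} = 1 }    # run by "use utf16_sketch"
    sub unimport { $^H{'utf16_sketch/on'} = 0 }    # run by "no utf16_sketch"
}

# Stand-in for a wrapped Xerces method: it inspects the lexical hints
# that were in effect at the place it was called from.
sub get_attribute {
    my $hints = (caller(0))[10] || {};
    return $hints->{'utf16_sketch/on'} ? 'DOMString object' : 'perl string';
}

my $outside = get_attribute();
my $inside;
{
    use utf16_sketch;
    $inside = get_attribute();
}
my $after = get_attribute();

print "$outside / $inside / $after\n";
```

The hint is set at compile time and scoped to the enclosing block, so
the calls before and after the block are unaffected.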

jas.

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-p-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-p-dev-help@xml.apache.org

Re: Unicode compliance for Xerces.pm

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

I've had a chance to play around with Perl's Unicode support and the
issues of transcoding between Perl's UTF-8 and Xerces' UTF-16. 

"Jason E. Stewart" <ja...@openinformatics.com> writes:

> 1. When Xerces returns a string that has high-order UTF-8 characters
>    (i.e. when the chars are outside the ASCII range 0-127) I'll have to
>    transcode from UTF-16 into UTF-8.
>   
> 2. This will involve a lot of back and forth transcoding if the
>    document contains a significant amount of high-order information. I
>    don't know what effect this will have on the running time of the
>    code. 

OK. These two are clearly dumb, because I have to transcode no matter
what. UTF-16 can't hold straight ASCII as-is because it's a two-byte
format. So there's no *more* transcoding with UTF-8 than there was
with just ASCII or ISO-8859-1.
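
A quick way to see this from Perl, as a sketch that assumes the Encode
module (bundled with Perl 5.8 and later, so not available on 5.6) and
uses it in place of Xerces' own transcoder:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $ascii    = 'foo';          # pure ASCII
my $accented = "caf\x{e9}";    # one character outside ASCII

# Both strings have to be converted: every character becomes a
# two-byte code unit, ASCII or not.
my $ascii_utf16    = encode('UTF-16LE', $ascii);
my $accented_utf16 = encode('UTF-16LE', $accented);

printf "%d chars -> %d bytes\n", length($ascii),    length($ascii_utf16);
printf "%d chars -> %d bytes\n", length($accented), length($accented_utf16);

# Going back the other way is the same amount of work for either string.
my $round_trip = decode('UTF-16LE', $accented_utf16);
print "round trip ok\n" if $round_trip eq $accented;
```

The ASCII string doubles in size just like the accented one, so the
conversion cost is there regardless of content.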

> 3. To keep the glue code that passes arguments from Perl to Xerces
>    simple, I will always have to call the XMLCh* interfaces instead
>    of the char* ones. 

This I think is a good thing. No one will notice.

> 4. Currently, if a Xerces API method returns a DOMString object or an
>    XMLCh*, there is no way to keep that object: the glue code converts
>    all of them into perl strings for 'convenience'. I think this is a
>    feature, but it might turn out to have performance benefits to
>    allow users to keep them around. 
> 
> 5. All of this is going to require the use of Perl-5.6.0 or better. I
>    get a lot of notices from people still using 5.004 and 5.005, so
>    this is going to mean upgrading for a lot of people. It is possible
>    that I can make the code conditional and people with 5.005/4 could
>    compile XML::Xerces but just not get unicode support. 
> 
>      I WILL NOT ATTEMPT THIS WITHOUT ASSISTANCE => it's a lot of
>      work.

Seriously. This is a lot of work. I'm not even going to think about
this unless someone pipes up.

> The issue with 4 is tricky. Perl is great about giving lots of
> information about what context a method is being called in. For
> example the following all look different to perl:
> 
> I believe that it can be handled with a pragma similar to that of 
> 
>   use utf8;
> 
> maybe 
> 
>   use utf16;

I think this is dumb. It should just work. All methods that return a
DOMString object in C++ should do the same in Perl, except that in
Perl you should be able to use a DOMString object in *exactly* the
same way that you would use a Perl string:

my $dom_string = $element->getAttribute('foo');
$dom_string .= 'bar'; # concatenation
print STDERR "The new value for foo is: $dom_string\n"; # stringify

etc...

The same for XMLCh*. There should just be a class, maybe
XML::Xerces::XMLCh or XML::Xerces::XMLString and any method that
returns an XMLCh* would just return an XML::Xerces::XMLString
instance. 
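
One way this could work is Perl's overload pragma. The sketch below
uses a hypothetical StringLike class holding a plain Perl string; a
real XML::Xerces::DOMString wrapper would hold the C++ object instead,
and nothing here reflects the actual XML::Xerces code:

```perl
use strict;
use warnings;

{
    package StringLike;    # hypothetical stand-in for a DOMString wrapper

    sub new {
        my ($class, $text) = @_;
        return bless { data => $text }, $class;
    }

    # Called when the object is used where a string is expected.
    sub as_string { return $_[0]{data} }

    # Called for "." and, via fallback, for ".=" as well.
    sub concat {
        my ($self, $other, $swapped) = @_;
        my $text   = "$other";    # force a plain string, even for objects
        my $joined = $swapped ? $text . $self->{data}
                              : $self->{data} . $text;
        return (ref $self)->new($joined);
    }

    use overload
        '""'     => \&as_string,
        '.'      => \&concat,
        fallback => 1;
}

my $dom_string = StringLike->new('foo');
$dom_string .= 'bar';                                      # concatenation
print STDERR "The new value for foo is: $dom_string\n";    # stringify
```

After the .= the variable still holds an object, so no conversion to a
Perl string happens until something actually needs the text.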

This would mean a lot less transcoding, AND it would solve some of the
issues with Perl and Unicode. Both Perl 5.6.0 and 5.6.1 have
significant Unicode bugs. It's not until the 5.7.2 development series
that many of the really key patches come in. 

jas.
