Posted to p-dev@xerces.apache.org by Fredrick Paul Eisele <ph...@netarx.com> on 2002/02/26 14:19:13 UTC

Transcode produces Unicode, which has bugs in perl 5.6.1

I would like some advice, and possibly a change to xerces-perl.
I found a bug in perl-5.6.1 which is related to unicode (actually I found 
several).
An example follows:
==================

use Devel::Peek; 

#======================= 
# Something to try out the pattern 
sub try_it { 
  my $pattern = shift; 

  Dump( $pattern ); 

  print STDERR "\ncompiled:\n"; 
  my $re = qr/$pattern/; 
  Dump( $re ); 

  my $match = "Jan 11 14:50:01 10.1.0.1 CRON[15021]: (root) CMD (/usr/libexec/atrun)";

  if ($match =~ m/$re/) {
    return "Matched\n";
  }
  return "Not Matched\n";
}
#======================= 
    
# This first example forces a unicode encoding by pushing a smiley onto the
# string. The smiley is then removed.
#
FAILURE: { 
 my $failure =
   "(?sx)  ( \\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3} )    \\s   CRON .+? CMD \\s \\( (\\S+?) \\)  "
 . "\x{263A}";
 chop $failure; 
 print "Pattern which should match but does not\n", try_it( $failure ), "\n"; 
} 

# This sample "works" if the previous sample is commented out. 
# If you didn't notice, the difference is the unicode character. 
# 
SUCCESS: { 
 my $success =
   "(?sx)  ( \\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3} )    \\s   CRON .+? CMD \\s \\( (\\S+?) \\)  ";
 print "Pattern which should match and does\n", try_it( $success ), "\n"; 
} 

==================
The problem, in this case, causes bad regular expressions to be built.
I have been putting patterns into CDATA sections of xml files and
subsequently compiling them into regular expressions.
I have also had some problems with pack when used to create
snmp streams.
The response to this bug follows:
==================
Bugids(1)
20020225.013
Modified
2002-02-25 21:35:43 

Subject
Re: [ID 20020225.013] Unicode vs. Regex 

Source
andreas.koenig@anima.de

Thanks for your bugreport. The bug you describe has been fixed in the
current development branch and the fix will be in perl 5.8.0.

Numerous bugs in the Unicode sphere have been fixed since 5.6.1. If
you're interested to try out the current development branch, see
perldoc perlhack or just pick a recent snapshot from
    ftp://ftp.funet.fi/pub/languages/perl/snap
and test your code with it.

-- 
andreas

==================
Given that the fixes to these bugs will not be generally available for
a while, what can be done in the meantime? (I would rather not use
a perl snapshot.)
I am thinking that the perl strings returned by the xerces functions 
should be stripped of their UTF-8 nature.
This could be done by supplying a function which does this, much
like transcode already does.
Or maybe a global option which controls the behavior of transcode?
What do you think?
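
To make the first idea concrete, here is a minimal sketch of such a helper. The name strip_utf8 is hypothetical and not part of XML::Xerces. It rebuilds the string from its character codes as plain bytes, and leaves the string alone if any character does not fit in ISO-8859-1. It is written with only ord, split and pack so that it should not depend on anything newer than 5.6.1, but it has only been reasoned through against current perls; the test lines use utf8::is_utf8, which needs 5.8.1 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of XML::Xerces): return a byte-string copy
# of $str when every character fits in ISO-8859-1; otherwise return $str
# unchanged, since it cannot be represented without loss.
sub strip_utf8 {
    my ($str) = @_;
    my @codes = map { ord } split //, $str;
    return $str if grep { $_ > 255 } @codes;
    return pack 'C*', @codes;    # rebuild the string as plain bytes
}

# Same pattern as the FAILURE case: append a smiley, then chop it off,
# which leaves the string internally encoded as UTF-8.
my $pattern =
    "(?sx) ( \\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3} ) \\s CRON .+? "
  . "CMD \\s \\( (\\S+?) \\) "
  . "\x{263A}";
chop $pattern;

my $clean = strip_utf8($pattern);
my $line  = "Jan 11 14:50:01 10.1.0.1 CRON[15021]: (root) CMD (/usr/libexec/atrun)";

my $result = ( $line =~ /$clean/ ) ? 'Matched' : 'Not Matched';
print "$result\n";
```

A global option inside transcode would do the same conversion once, at the point where the string is created, instead of at every call site.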

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-p-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-p-dev-help@xml.apache.org

Re: Transcode produces Unicode, which has bugs in perl 5.6.1

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

"Fredrick Paul Eisele" <ph...@netarx.com> writes:

> I would like some advice, and possibly a change to xerces-perl.
> I found a bug in perl-5.6.1 which is related to unicode (actually I found 
> several).

[snip]

> Given that the fixes to these bugs will not be generally available for
> a while, what can be done in the meantime? (I would rather not use
> a perl snapshot.)
>
> I am thinking that the perl strings returned by the xerces functions
> should be stripped of their UTF-8 nature.  This could be done by
> supplying a function which does this, much like transcode already
> does.  Or maybe a global option which controls the behavior of
> transcode?  What do you think?

Hey Fredrick Paul,

Yes, as Andreas pointed out, Unicode support works but has bugs in
5.6.1. But as far as I can tell, it works flawlessly in 5.7.2.

First, if anyone wants to use Unicode seriously (including
ISO-8859-1), I would suggest that you upgrade to Perl-5.7.2.

Second, I'm happy to add some kind of support to Xerces to control
the global behavior of transcoding. Xerces-P already has an ISO-8859-1
transcoder built in, but I just don't use it. So there could easily
be a global variable that any user can set that controls whether
Unicode is used or not.

Understand however, that this is rather low on my priority list. If
either you or Harwin would like to modify the code, I'd be happy to
test it and include it in the next XML::Xerces snapshot. The code that
you need is all in typemaps.i. You can find an example of how to get
SWIG to wrap a C variable for Perl in Xerces.i:

  bool DEBUG_UTF8_OUT;
  bool DEBUG_UTF8_IN;

Any variable outside of a %{ ... %} block gets wrapped as a global Perl
variable:

  package XML::Xerces;

  *DEBUG_UTF8_OUT = *XML::Xercesc::DEBUG_UTF8_OUT;
  *DEBUG_UTF8_IN = *XML::Xercesc::DEBUG_UTF8_IN;
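
For anyone unfamiliar with the typeglob assignment above, here is a small self-contained sketch of what it does. The package names My::Lowlevel and My::Wrapper and the flag USE_UNICODE are stand-ins invented for illustration; they are not the real SWIG-generated code and no such flag exists in XML::Xerces today.

```perl
use strict;
use warnings;

package My::Lowlevel;    # stand-in for the SWIG-generated package
our $USE_UNICODE = 1;    # hypothetical flag, on by default

package My::Wrapper;     # stand-in for the user-facing package
# Alias the glob: both package names now refer to the same variable.
*USE_UNICODE = *My::Lowlevel::USE_UNICODE;

package main;

# A user sets the flag through the friendly name...
$My::Wrapper::USE_UNICODE = 0;

# ...and the low-level package sees the change immediately.
print "low-level flag is now: $My::Lowlevel::USE_UNICODE\n";
```

Because the two names share one variable, the typemap code only ever has to check the low-level one.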

Please add some tests in the t/ directory.
jas.
