Posted to p-dev@xerces.apache.org by Stephen Collyer <sc...@netspinner.co.uk> on 2005/10/05 22:03:43 UTC

UTF-8 problems with CGI script

OK

I've run into a pretty nasty problem with XML::Xerces
when parsing in a CGI script using a SAX2 parser: in brief,
the parser seems to ignore totally any character data in a
document that contains UTF-8 characters.

So, when the characters callback is invoked, it is passed an
empty string as the 2nd argument, and a rather incorrect-looking
value as the final length argument.

Bizarrely, if I invoke the same parser from a command
line script, with the same data, it works fine and UTF-8
data is returned correctly.

Any ideas? I have to admit that I'm completely stumped with
this one, and I would have thought that someone would have
noticed it before me, so maybe, just maybe, I'm doing something
v. silly, but I can't think what at the moment.

-- 
Regards

Stephen Collyer
Netspinner Ltd

---------------------------------------------------------------------
To unsubscribe, e-mail: p-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: p-dev-help@xerces.apache.org

Re: UTF-8 problems with CGI script

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

Stephen Collyer <sc...@netspinner.co.uk> writes:

>> While preparing for the 2.6 release I stumbled upon an obvious bug in
>> the callback handlers - i.e. SAX2 character parsers. They are
>> transcoding into ASCII by default, and I have not provided a way to
>> override that, so everything will get tossed.
>
> Right. That wouldn't explain why the parser only fails in a CGI
> script though, would it? I seem to have no problems in a command
> line environment, though I'm baffled as to why.

Oh, sorry, I didn't see that it only fails in CGI...

That sounds very odd. That could mean some important ENV variables are
not being set for the CGI user (like LANG?).
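
One way to check the environment theory is to print the locale-related
variables from both the command line and the web server and compare the
two outputs. The following is only a minimal sketch in Python, not part
of XML::Xerces: the variable list (LANG, LC_ALL, LC_CTYPE) and the
helper names locale_env and format_report are assumptions made for
illustration, and the same check is easy to write in Perl with %ENV.

```python
import os

# Locale-related variables that may differ between a login shell and a
# web server's CGI environment (this list is an assumption).
LOCALE_VARS = ("LANG", "LC_ALL", "LC_CTYPE")


def locale_env(environ=None):
    """Return a dict of the locale variables, with None for unset ones."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in LOCALE_VARS}


def format_report(values):
    """Render the values as plain text, one variable per line."""
    return "\n".join(
        f"{name}={'<unset>' if value is None else value}"
        for name, value in values.items()
    )


if __name__ == "__main__":
    # Emit a CGI header only when running under a web server.
    if "GATEWAY_INTERFACE" in os.environ:
        print("Content-Type: text/plain\n")
    print(format_report(locale_env()))
```

Run it once from a shell and once as a CGI script under the same server
that hosts the failing parser. If LANG shows as unset only in the CGI
output, that would be consistent with the guess above, though it would
not by itself prove that the locale is the cause.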

>> Stephen, could you send an example file and a short program that
>> demonstrates the problem? That would make it even simpler for me to
>> test that things are working as they should.
>
> Sure. I'll put together an LWP based sending script, and a corresponding
> CGI script, together with a command line script that works OK.
>
> May be a day or two till I do it though.

Sure, that would be great. I'll be back in India in a few days and
will have better computer time then.

Cheers,
jas.

Re: UTF-8 problems with CGI script

Posted by Stephen Collyer <sc...@netspinner.co.uk>.

Jason E. Stewart wrote:
> Stephen Collyer <sc...@netspinner.co.uk> writes:
> 
> 
>>I've run into a pretty nasty problem with XML::Xerces
>>when parsing in a CGI script using a SAX2 parser: in brief,
>>the parser seems to ignore totally any character data in a
>>document that contains UTF-8 characters.
> 
> 
> Hi Stephen,
> 
> Yes, I think I know exactly what this is.
> 
> While preparing for the 2.6 release I stumbled upon an obvious bug in
> the callback handlers - i.e. SAX2 character parsers. They are
> transcoding into ASCII by default, and I have not provided a way to
> override that, so everything will get tossed.

Right. That wouldn't explain why the parser only fails in a CGI
script though, would it? I seem to have no problems in a command
line environment, though I'm baffled as to why.

> Stephen, could you send an example file and a short program that
> demonstrates the problem? That would make it even simpler for me to
> test that things are working as they should.

Sure. I'll put together an LWP based sending script, and a corresponding
CGI script, together with a command line script that works OK.

May be a day or two till I do it though.

Thanks for the rapid response.

-- 
Regards

Stephen Collyer
Netspinner Ltd

Re: UTF-8 problems with CGI script

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

Stephen Collyer <sc...@netspinner.co.uk> writes:

> I've run into a pretty nasty problem with XML::Xerces
> when parsing in a CGI script using a SAX2 parser: in brief,
> the parser seems to ignore totally any character data in a
> document that contains UTF-8 characters.

Hi Stephen,

Yes, I think I know exactly what this is.

While preparing for the 2.6 release I stumbled upon an obvious bug in
the callback handlers - i.e. SAX2 character parsers. They are
transcoding into ASCII by default, and I have not provided a way to
override that, so everything will get tossed.

There is a reasonably simple fix for this, but it is in the C++ code,
not the Perl code, and it involves re-running SWIG. 
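
To illustrate why an ASCII target encoding loses data, here is a small
sketch in Python. It does not reproduce the Xerces-C transcoder or the
XML::Xerces callback code (the reported symptom there is an empty
string, which is a harsher failure); the helper name transcode_to_ascii
is hypothetical and only shows that non-ASCII characters cannot survive
a round trip through ASCII.

```python
def transcode_to_ascii(text, errors="replace"):
    """Encode text to ASCII and decode it back, mimicking a lossy transcode.

    errors="replace" turns each unencodable character into "?";
    errors="ignore" drops it entirely.
    """
    return text.encode("ascii", errors=errors).decode("ascii")


sample = "na\u00efve caf\u00e9"  # "naive cafe" with i-diaeresis and e-acute

print(transcode_to_ascii(sample))            # na?ve caf?
print(transcode_to_ascii(sample, "ignore"))  # nave caf
print(transcode_to_ascii("plain text"))      # plain text (pure ASCII survives)
```

Pure ASCII input passes through unchanged, which is why a bug of this
kind can go unnoticed until a document contains other characters.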

It is a *serious* problem, as serious as the memory leaks, and must be
fixed. I will *not* have time to devote over the next 5 days, but
after that I'll be on a long plane ride back to India, and so I will
make sure it is fixed then.

Stephen, could you send an example file and a short program that
demonstrates the problem? That would make it even simpler for me to
test that things are working as they should.

Thanks,
jas.
