Posted to c-users@xalan.apache.org by Paul Kelly <pk...@virtual.org.uk> on 2003/06/17 10:48:20 UTC

Xalan-C++ and UTF-8 with non ascii characters

Hi,
    I am using Xalan-C++ to perform XPath queries on an XML document. All
works fine, except that some non-ASCII characters encoded as UTF-8 cause an
exception in theliaison->parseXMLStream();

An example of a problematic character is the German umlaut. The XML is
transported over HTTP/SOAP from a VB application to Xalan-C++ using gSOAP.
Looking at the encoding of the umlaut character shows it is sent from VB as
two bytes, (hex) C3 84 or (decimal) 195 132. However, if I return the same
character created from the Xerces-C++ DOM, it is encoded as &#195;&#132;.

Strangely, some other characters that are encoded as more than one byte do
load into Xalan OK, an example being the Euro symbol.
I'm using Xalan 1.5 without ICU and Xerces 2.2.0.

Any advice really appreciated,

Thanks
Paul


Re: Xalan-C++ and UTF-8 with non ascii characters

Posted by Paul Kelly <pk...@virtual.org.uk>.
Thanks for replying,
    I managed to work this out. It seems the gSOAP framework was converting
the 8-bit UTF-8 bytes into 7-bit strings, i.e. &#195;&#132;, causing Xalan to
fail to load the document. The solution was to set the following in gSOAP:

soap_init2(&soap, SOAP_C_UTFSTRING,SOAP_C_UTFSTRING);

This stops gSOAP from altering 8-bit chars.
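
To make the failure mode concrete, here is a minimal sketch of the kind of
byte-wise escaping described above. It is not gSOAP's actual code;
escapeHighBytes is a hypothetical helper, and the assumption is simply that
each byte above 0x7F is written out as its own decimal character reference.

```cpp
#include <string>

// Hypothetical helper, NOT part of gSOAP: escape every byte above 0x7F as a
// decimal numeric character reference, one reference per byte.
std::string escapeHighBytes(const std::string& utf8)
{
    std::string out;
    for (unsigned char byte : utf8) {
        if (byte < 0x80) {
            // 7-bit ASCII passes through unchanged.
            out += static_cast<char>(byte);
        } else {
            // Each high byte becomes its own reference, so a two-byte
            // UTF-8 sequence turns into two separate "characters".
            out += "&#" + std::to_string(static_cast<unsigned>(byte)) + ";";
        }
    }
    return out;
}
```

Under that assumption, escapeHighBytes("\xC3\x84") returns "&#195;&#132;",
which an XML parser reads as two code points rather than one umlaut.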

Cheers
Paul


Re: Xalan-C++ and UTF-8 with non ascii characters

Posted by da...@us.ibm.com.



> Hi,
>     I am using Xalan-C++ to perform XPath queries on an XML document. All
> works fine, except that some non-ASCII characters encoded as UTF-8 cause
> an exception in theliaison->parseXMLStream();

I suggest you catch the exception and take a look at the error message.
Without that, it will be impossible to diagnose the problem.  Start with
catching SAXParseException, because that's probably what's being thrown.

> An example of a problematic character is the German umlaut. The XML is
> transported over HTTP/SOAP from a VB application to Xalan-C++ using gSOAP.
> Looking at the encoding of the umlaut character shows it is sent from VB
> as two bytes, (hex) C3 84 or (decimal) 195 132.

The two bytes C3 84 in UTF-8 encode the Unicode character U+00C4, Latin
Capital Letter A With Diaeresis, or capital A with an umlaut.  Is that the
character you're expecting?
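
The arithmetic behind that decoding can be sketched as follows. This is an
illustration only; decodeTwoByteUtf8 is a hypothetical helper, not a Xerces
or Xalan function, and it handles just the two-byte form of UTF-8.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper for illustration: decode one two-byte UTF-8 sequence
// (bit pattern 110xxxxx 10xxxxxx) into its Unicode code point.
std::uint32_t decodeTwoByteUtf8(unsigned char lead, unsigned char trail)
{
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80) {
        throw std::invalid_argument("not a two-byte UTF-8 sequence");
    }
    // Five payload bits from the lead byte, six from the trail byte.
    const std::uint32_t codePoint = ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
    if (codePoint < 0x80) {
        // Values below U+0080 must use the one-byte form.
        throw std::invalid_argument("overlong encoding");
    }
    return codePoint;
}
```

For the bytes in question, decodeTwoByteUtf8(0xC3, 0x84) yields 0xC4: the
lead byte contributes 0x03 shifted left by six bits (0xC0) and the trail
byte contributes 0x04.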

> However, if I return the same character created from the Xerces-C++ DOM,
> it is encoded as &#195;&#132;.

What do you mean by "if I return the same character created from the
Xerces-C++ DOM"?  How did you create this instance?  Did you parse it?  If
not, that DOM instance probably isn't relevant to the discussion.  Do you
mean you are serializing an instance of the DOM, and you are getting those
two characters?  If that's the case, you have an encoding problem, because,
in UTF-16, you are getting U+00C3 (Latin Capital Letter A With Tilde) and
U+0084 (decimal 132), which is a control character.

My understanding of VB, which is extremely limited, is that strings are
encoded in UCS-2, not UTF-8.  You may have a problem with parsing a
document which contains an encoding declaration asserting the document is
in UTF-8, when it really is UCS-2.
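
To show how the two encodings differ for the same character, here is a
rough sketch that converts a single UCS-2 code unit to UTF-8 bytes.
ucs2UnitToUtf8 is a hypothetical helper written for this illustration; it
assumes the input is a plain BMP code point and does not validate surrogate
values.

```cpp
#include <string>

// Hypothetical helper: encode one UCS-2 code unit (a BMP code point) as
// UTF-8. Surrogate values (0xD800-0xDFFF) are not checked in this sketch.
std::string ucs2UnitToUtf8(char16_t unit)
{
    std::string out;
    if (unit < 0x80) {
        // One byte: plain ASCII.
        out += static_cast<char>(unit);
    } else if (unit < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx.
        out += static_cast<char>(0xC0 | (unit >> 6));
        out += static_cast<char>(0x80 | (unit & 0x3F));
    } else {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xE0 | (unit >> 12));
        out += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (unit & 0x3F));
    }
    return out;
}
```

With this sketch, the code unit 0x00C4 becomes the two bytes C3 84, while
the Euro sign, 0x20AC, becomes the three bytes E2 82 AC. A document whose
bytes are really UCS-2 will therefore not parse correctly if it is declared
as UTF-8.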

Dave