Posted to c-users@xerces.apache.org by Jaya Nageswar <ja...@gmail.com> on 2008/09/02 19:21:47 UTC
xerces c 1.7.0 ICU for unicode
Hi,
I am using Xerces-C 1.7.0 (ICU build) to parse XML files. The XML file
contains some Chinese characters, so I am using the ICU build to support
Unicode. I declared the encoding as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
Part of the XML file contains the following Chinese characters:
<Convert>
  <FromValue>TRUE</FromValue>
  <ToValue>您是如</ToValue>
</Convert>
<Convert>
  <FromValue>FALSE</FromValue>
  <ToValue>您好</ToValue>
</Convert>
I am using DOM to parse the XML file, with the following code:
    static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
    DOMImplementation *impl =
        DOMImplementationRegistry::getDOMImplementation(gLS);
    DOMBuilder *CtlParser =
        ((DOMImplementationLS*)impl)->createDOMBuilder(
            DOMImplementationLS::MODE_SYNCHRONOUS, 0);

    CtlParser->setFeature(XMLUni::fgDOMNamespaces, true);
    CtlParser->setFeature(XMLUni::fgXercesSchema, true);
    CtlParser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);
    CtlParser->setFeature(XMLUni::fgDOMValidateIfSchema, true);

    // create our error handler and install it
    XMLErrorHandler errorHandler;
    CtlParser->setErrorHandler(&errorHandler);
    CtlDoc = CtlParser->parseURI(XMLFilePath);
    if (errorHandler.getSawErrors())
    {
        cout << errorHandler.ReturnErrorMessage();
    }
I am getting the following error.
Message: An exception occurred! Type:UTFDataFormatException,
Message:invalid byte 2 (�) of a 2-byte sequence.
I do not understand why I am getting this error even though I am using the
Xerces-C ICU build, which is supposed to work with Unicode characters.
If I remove the Chinese characters, the file parses without any error
message.
If anybody has worked with Unicode in Xerces-C, please help me. Did I miss
any parser settings for Unicode?
Thanks in advance,
Jaya Nageswar.
Re: xerces c 1.7.0 ICU for unicode
Posted by David Bertoni <db...@apache.org>.
Jaya Nageswar wrote:
> Hi David,
>
> Thanks for the update. I converted the characters from UCS-2 to UTF-8 using
> C APIs. I had taken these Chinese characters (您如是) from Google Translate
> and used them in the XML file to test Unicode support. After the
> conversion, I got these characters (귦꺡髧„), and I no longer get errors
> from the Xerces parser.
I don't think you "got" any characters from the transcoding APIs. Also,
you need to be careful when associating the glyphs you see on a
display device with a particular character, since they depend on
the font and on the encoding assumed by the application and rendering system.
>
> But I have a question. Do the characters themselves change from one format
> to another? If I have a string "abcd", will it change from one format
> to another? I understand that the encoding differs between formats, but
> I do not understand why the characters themselves are changing from one
> format to another. Any information on this would be a great help to me.
I suggest you read this article on Wikipedia:
http://en.wikipedia.org/wiki/Character_encoding
Dave
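As a rough illustration of the point above (the character stays the same and only its byte representation differs between encodings), here is a minimal C++ sketch. The helper names toUtf8, toUtf16BE and hex are made up for this example and are not part of Xerces-C or ICU, and the code assumes its input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Encode one Unicode code point as UTF-8 bytes.
// Assumption: cp is a valid scalar value (no surrogates, <= 0x10FFFF).
std::vector<std::uint8_t> toUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encode the same code point as UTF-16, big-endian byte order.
std::vector<std::uint8_t> toUtf16BE(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint8_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(cp >> 8));
        out.push_back(static_cast<std::uint8_t>(cp & 0xFF));
    } else {
        // Code points above U+FFFF need a surrogate pair.
        const std::uint32_t v = cp - 0x10000;
        const std::uint32_t hi = 0xD800 | (v >> 10);
        const std::uint32_t lo = 0xDC00 | (v & 0x3FF);
        out.push_back(static_cast<std::uint8_t>(hi >> 8));
        out.push_back(static_cast<std::uint8_t>(hi & 0xFF));
        out.push_back(static_cast<std::uint8_t>(lo >> 8));
        out.push_back(static_cast<std::uint8_t>(lo & 0xFF));
    }
    return out;
}

// Format bytes as space-separated upper-case hex, e.g. "E6 82 A8".
std::string hex(const std::vector<std::uint8_t>& bytes) {
    std::string s;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(bytes[i]));
        if (i != 0) s += ' ';
        s += buf;
    }
    return s;
}
```

With these helpers, U+60A8 (您) comes out as E6 82 A8 in UTF-8 and 60 A8 in UTF-16BE, while 'a' (U+0061) is 61 in UTF-8 and 00 61 in UTF-16BE. The character is identical in each case. A viewer that guesses the wrong encoding for the bytes will simply display different glyphs.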
Re: xerces c 1.7.0 ICU for unicode
Posted by Jaya Nageswar <ja...@gmail.com>.
Hi David,
Thanks for the update. I converted the characters from UCS-2 to UTF-8 using
C APIs. I had taken these Chinese characters (您是如) from Google Translate
and used them in the XML file to test Unicode support. After the
conversion, I got these characters (귦꺡髧„), and I no longer get errors
from the Xerces parser.
But I have a question. Do the characters themselves change from one format
to another? If I have a string "abcd", will it change from one format
to another? I understand that the encoding differs between formats, but
I do not understand why the characters themselves are changing from one
format to another. Any information on this would be a great help to me.
Thanks,
Jaya Nageswar.
On Wed, Sep 3, 2008 at 3:18 AM, David Bertoni <db...@apache.org> wrote:
> Jaya Nageswar wrote:
>> I am getting the following error.
>> Message: An exception occurred! Type:UTFDataFormatException,
>> Message:invalid byte 2 (�) of a 2-byte sequence.
>>
> This indicates your file is not really encoded in UTF-8.
>
>
>> I do not understand why I am getting this error even though I am using the
>> Xerces-C ICU build, which is supposed to work with Unicode characters.
>> If I remove the Chinese characters, the file parses without any error
>> message.
>>
> Xerces-C supports UTF-8 even without using the ICU transcoders.
>
>
>> If anybody has worked with Unicode in Xerces-C, please help me. Did I miss
>> any parser settings for Unicode?
>>
> Your file is not encoded in UTF-8, so the parser reports an error. You can
> either fix the file so it's properly encoded, or update the encoding in the
> XML declaration to reflect the actual encoding.
>
> Dave
>
Re: xerces c 1.7.0 ICU for unicode
Posted by David Bertoni <db...@apache.org>.
Jaya Nageswar wrote:
> I am getting the following error.
> Message: An exception occurred! Type:UTFDataFormatException,
> Message:invalid byte 2 (�) of a 2-byte sequence.
This indicates your file is not really encoded in UTF-8.
>
> I do not understand why I am getting this error even though I am using the
> Xerces-C ICU build, which is supposed to work with Unicode characters.
> If I remove the Chinese characters, the file parses without any error
> message.
Xerces-C supports UTF-8 even without using the ICU transcoders.
>
> If anybody has worked with Unicode in Xerces-C, please help me. Did I miss
> any parser settings for Unicode?
Your file is not encoded in UTF-8, so the parser reports an error. You
can either fix the file so it's properly encoded, or update the encoding
in the XML declaration to reflect the actual encoding.
Dave
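One way to check whether a file really is UTF-8, as suggested above, is to scan its bytes for the first ill-formed sequence. The following is a minimal C++ sketch under that assumption. firstInvalidUtf8 is a hypothetical helper, not a Xerces-C API, and it does not claim to reproduce how the parser itself validates input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the offset of the first byte that breaks UTF-8 well-formedness,
// or std::string::npos if the whole buffer is well-formed UTF-8.
// A bad continuation byte is reported at its own offset; a bad lead byte,
// a truncated sequence, an overlong form, a surrogate, or a value above
// U+10FFFF is reported at the offset of the lead byte.
std::size_t firstInvalidUtf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return i;                  // stray continuation or invalid lead

        if (i + len > n) return i;      // sequence is cut off by end of data

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return i + k;  // not a 10xxxxxx byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Smallest code point allowed for each sequence length (rejects
        // overlong forms such as C0 80).
        static const unsigned long minCp[] = { 0, 0, 0x80, 0x800, 0x10000 };
        assert(len >= 1 && len <= 4);
        if (cp < minCp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;

        i += len;
    }
    return std::string::npos;
}
```

For example, the two bytes 0xC4 0xFA (an arbitrary pair of the kind a legacy double-byte encoding might produce) fail at offset 1: 0xC4 announces a 2-byte sequence, but 0xFA is not a continuation byte, which is the same kind of condition the parser's message describes. The UTF-8 bytes E6 82 A8 pass the check.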