Posted to c-users@xerces.apache.org by Jaya Nageswar <ja...@gmail.com> on 2008/09/02 19:21:47 UTC

xerces c 1.7.0 ICU for unicode

Hi,

I am using Xerces-C 1.7.0 (ICU build) to parse XML files. My XML file contains
some Chinese characters, so I am using the ICU build to support Unicode. I
declared the encoding as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

Part of the XML file contains the following Chinese characters:

        <Convert>
            <FromValue>TRUE</FromValue>
            <ToValue>您是如</ToValue>
        </Convert>
        <Convert>
            <FromValue>FALSE</FromValue>
            <ToValue>您好</ToValue>
        </Convert>

I am using DOM to parse the XML file, with the following code:

    // The feature string "LS" requests a DOM implementation with Load/Save support.
    static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
    DOMImplementation *impl =
        DOMImplementationRegistry::getDOMImplementation(gLS);
    DOMBuilder *CtlParser = ((DOMImplementationLS*)impl)->createDOMBuilder(
        DOMImplementationLS::MODE_SYNCHRONOUS, 0);

    // Enable namespace and schema processing.
    CtlParser->setFeature(XMLUni::fgDOMNamespaces, true);
    CtlParser->setFeature(XMLUni::fgXercesSchema, true);
    CtlParser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);
    CtlParser->setFeature(XMLUni::fgDOMValidateIfSchema, true);

    // Create our error handler and install it.
    XMLErrorHandler errorHandler;
    CtlParser->setErrorHandler(&errorHandler);

    CtlDoc = CtlParser->parseURI(XMLFilePath);
    if (errorHandler.getSawErrors())
    {
        cout << errorHandler.ReturnErrorMessage();
    }


I am getting the following error:

Message: An exception occurred! Type:UTFDataFormatException,
Message:invalid byte 2 (�) of a 2-byte sequence.

I do not understand why I am getting this error even though I am using the
Xerces-C ICU build, which is supposed to work with Unicode characters. If I
remove the Chinese characters, I do not get any error message while parsing.

If anybody has worked with Unicode in Xerces-C, please help me. Did I miss any
parser settings for Unicode?

Thanks in advance,
Jaya Nageswar.

Re: xerces c 1.7.0 ICU for unicode

Posted by David Bertoni <db...@apache.org>.
Jaya Nageswar wrote:
> Hi David,
> 
> Thanks for the update. I translated the characters from UCS-2 to UTF-8 using
> C APIs. Actually, I took these Chinese characters (您是如) from Google Translate
> and used them in the XML file to test the Unicode support. When I translated
> these characters from UCS-2 to UTF-8 using C APIs, I got these characters
> (귦꺡髧„). Now I am not getting the errors from the Xerces parser.
I don't think you "got" any characters from the transcoding APIs.  Also,
you need to be careful when associating the glyphs you see on a
display device with a particular character, since they depend on
the font and the encoding assumed by the application and rendering system.

> 
> But I have a question. Will the characters themselves change from one format
> to another? If I have a string "abcd", will it change from one format to
> another? I understand that the encoding differs between formats, but I do
> not understand why the characters themselves are changing from one format
> to another. Any information related to this will be a great help to me.
I suggest you read this article on Wikipedia:

http://en.wikipedia.org/wiki/Character_encoding
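
As a rough illustration of the difference between characters and their
encoded bytes, the sketch below decodes the same bytes under two different
assumptions. The names decodeUtf8 and decodeLatin1 are hypothetical helpers
written for this example, not Xerces-C or ICU APIs, and the UTF-8 decoder
assumes well-formed input. For "abcd" both interpretations agree, because
ASCII bytes mean the same thing in each; for the three UTF-8 bytes of
U+60A8 (您) they do not.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Interpret every byte as one ISO-8859-1 (Latin-1) character.
std::vector<std::uint32_t> decodeLatin1(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Interpret the bytes as UTF-8.
// Assumption: the input is well-formed; this sketch does no validation.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

Decoding the bytes E6 82 A8 as UTF-8 yields the single code point U+60A8,
while decoding them as Latin-1 yields three unrelated code points (U+00E6,
U+0082, U+00A8). The bytes are identical; only the assumed encoding differs.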

Dave

Re: xerces c 1.7.0 ICU for unicode

Posted by Jaya Nageswar <ja...@gmail.com>.
Hi David,

Thanks for the update. I translated the characters from UCS-2 to UTF-8 using
C APIs. Actually, I took these Chinese characters (您是如) from Google Translate
and used them in the XML file to test the Unicode support. When I translated
these characters from UCS-2 to UTF-8 using C APIs, I got these characters
(귦꺡髧„). Now I am not getting the errors from the Xerces parser.
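
The message does not say which C APIs were used for the conversion, so the
following is only a hand-written sketch of what a UCS-2/UTF-16 to UTF-8
conversion does. The names appendUtf8 and utf16ToUtf8 are hypothetical, and
the input is assumed to be a sequence of 16-bit code units already in host
byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to 'out' as a UTF-8 byte sequence.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 code units to UTF-8 bytes.
// Plain UCS-2 input never contains surrogate pairs, so for it the pair
// handling below is simply never triggered.
// Assumption: unpaired surrogates are not checked for in this sketch.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With this sketch, the two code units U+60A8 U+597D (您好) become the six
bytes E6 82 A8 E5 A5 BD, while "abcd" becomes the same four bytes it has in
ASCII. The characters stay the same; only their byte representation changes.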

But I have a question. Will the characters themselves change from one format
to another? If I have a string "abcd", will it change from one format to
another? I understand that the encoding differs between formats, but I do
not understand why the characters themselves are changing from one format
to another. Any information related to this will be a great help to me.

Thanks,
Jaya Nageswar.


Re: xerces c 1.7.0 ICU for unicode

Posted by David Bertoni <db...@apache.org>.
Jaya Nageswar wrote:
> Hi,
> 
> I am using Xerces-C 1.7.0 (ICU build) to parse XML files. My XML file contains
> some Chinese characters, so I am using the ICU build to support Unicode. I
> declared the encoding as UTF-8:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> Part of the XML file contains the following Chinese characters:
> 
>         <Convert>
>             <FromValue>TRUE</FromValue>
>             <ToValue>您是如</ToValue>
>         </Convert>
>         <Convert>
>             <FromValue>FALSE</FromValue>
>             <ToValue>您好</ToValue>
>         </Convert>
> 
> I am using DOM to parse the XML file, with the following code:
> 
>     // The feature string "LS" requests a DOM implementation with Load/Save support.
>     static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
>     DOMImplementation *impl =
>         DOMImplementationRegistry::getDOMImplementation(gLS);
>     DOMBuilder *CtlParser = ((DOMImplementationLS*)impl)->createDOMBuilder(
>         DOMImplementationLS::MODE_SYNCHRONOUS, 0);
> 
>     // Enable namespace and schema processing.
>     CtlParser->setFeature(XMLUni::fgDOMNamespaces, true);
>     CtlParser->setFeature(XMLUni::fgXercesSchema, true);
>     CtlParser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);
>     CtlParser->setFeature(XMLUni::fgDOMValidateIfSchema, true);
> 
>     // Create our error handler and install it.
>     XMLErrorHandler errorHandler;
>     CtlParser->setErrorHandler(&errorHandler);
> 
>     CtlDoc = CtlParser->parseURI(XMLFilePath);
>     if (errorHandler.getSawErrors())
>     {
>         cout << errorHandler.ReturnErrorMessage();
>     }
> 
> I am getting the following error:
> 
> Message: An exception occurred! Type:UTFDataFormatException,
> Message:invalid byte 2 (�) of a 2-byte sequence.
This indicates your file is not really encoded in UTF-8.
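
One way to confirm this independently of the parser is to scan the raw file
bytes for the first sequence that is not well-formed UTF-8. The sketch below
is a hand-written check, not part of Xerces-C; the name firstInvalidUtf8 is
hypothetical, and reading the file into a std::string is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Return the offset of the first malformed UTF-8 sequence in 's',
// or std::string::npos if the whole buffer is well-formed UTF-8.
std::size_t firstInvalidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp;
        if (b < 0x80) { ++i; continue; }                       // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0)     { len = 3; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; }
        else return i;  // stray continuation byte or an invalid lead byte

        for (std::size_t k = 1; k < len; ++k) {
            if (i + k >= s.size()) return i;                   // truncated
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return i;                  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return i;
        i += len;
    }
    return std::string::npos;
}
```

For example, the bytes C4 FA (a made-up pair of the kind a legacy
double-byte encoding might produce) are rejected because FA is not a valid
continuation byte, which is the same class of problem as the "invalid byte 2
of a 2-byte sequence" error above.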

> 
> I do not understand why I am getting this error even though I am using the
> Xerces-C ICU build, which is supposed to work with Unicode characters. If I
> remove the Chinese characters, I do not get any error message while parsing.
Xerces-C supports UTF-8 even without using the ICU transcoders.

> 
> If anybody has worked with Unicode in Xerces-C, please help me. Did I miss any
> parser settings for Unicode?
Your file is not encoded in UTF-8, so the parser reports an error.  You 
can either fix the file so it's properly encoded, or update the encoding 
in the XML declaration to reflect the actual encoding.
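
If you are unsure what the actual encoding is, a byte-order mark at the
start of the file is one hint. The sketch below only inspects a leading BOM;
the name encodingFromBom is hypothetical, and a file without a BOM (common
for legacy multi-byte encodings) needs to be identified some other way.

```cpp
#include <cstddef>
#include <string>

// Guess an encoding name from a leading byte-order mark.
// Returns an empty string when no known BOM is present.
std::string encodingFromBom(const std::string& bytes) {
    auto startsWith = [&](const char* bom, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, bom, n) == 0;
    };
    if (startsWith("\xEF\xBB\xBF", 3))     return "UTF-8";
    // The UTF-32 marks are checked before UTF-16 because the UTF-32LE
    // BOM begins with the same two bytes as the UTF-16LE BOM.
    if (startsWith("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith("\xFF\xFE", 2))         return "UTF-16LE";
    if (startsWith("\xFE\xFF", 2))         return "UTF-16BE";
    return "";
}
```

If this reports UTF-16 for a file whose declaration says encoding="UTF-8",
either the declaration or the file needs to change so that the two agree.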

Dave