Posted to c-users@xerces.apache.org by Anna Simbirtsev <as...@ca.afilias.info> on 2008/09/16 21:24:34 UTC

Problems with xerces-c version 1.7.0 and UTF-8

Hello,

I compiled xerces-c 1.7.0 with ICU 4.0 to be able to handle UTF-8
strings. Now the parser takes in a UTF-8 string, but when it comes out it's
truncated by a couple of characters. Can anybody help?

Thank you
Anna

RE: Understanding Problem with XMLString::transcode("LS", tempStr, 99);

Posted by Da...@ptb.de.

Thank you Alberto and Jesse for the quick and very good answers.

RE: Understanding Problem with XMLString::transcode("LS", tempStr, 99);

Posted by Jesse Pelton <js...@PKC.com>.

The point of that snippet is to get a class that implements the DOM with
specific features.  Specifying that you want an "LS" implementation
indicates that you want support for the DOM 3 Load and Save features.
See
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#DOMFeatures
for some background.

I don't see any indication in the document you linked to that "*" is an
allowed value for getDOMImplementation().  Two of the calls to
getDOMImplementation() specify the "LS" feature, and one specifies
"Range."  "*" doesn't make a lot of sense to me.  I want to know that an
implementation has the features I require, but "*" is usually a wildcard
matching anything.  If you obtained an implementation that way, how
would you know what it could do?

I agree that it would be useful to have a list of feature strings that
you can pass to getDOMImplementation().  I'd think there must be one
somewhere, but if so, I couldn't find it.  Anyone...?

-----Original Message-----
From: David.Sander@ptb.de [mailto:David.Sander@ptb.de] 
Sent: Thursday, September 18, 2008 8:25 AM
To: c-users@xerces.apache.org
Subject: Understanding Problem with XMLString::transcode("LS", tempStr, 99);

Hello everyone,

Can somebody tell me what the following code does? (See the questions in
the comment.)

        XMLCh tempStr[100];
        XMLString::transcode("LS", tempStr, 99);
        /* What does "LS" stand for, and why can't I use "*", which is
           also possible according to
           http://old.hki.uni-koeln.de/teach/ws0506/hs/tag7/xerces-c.pdf ?
           Is there a document that lists all the options and says when
           each one can be used? */

        DOMImplementation *impl =
            DOMImplementationRegistry::getDOMImplementation(tempStr);

        DOMBuilder* parser =
            ((DOMImplementationLS*)impl)->createDOMBuilder(
                DOMImplementationLS::MODE_SYNCHRONOUS, 0);

Thanks David

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

Thank you very much, it's working.

On Mon, 2008-09-22 at 11:36 +0300, Lucian Cosoi wrote:
> 2008/9/19 Anna Simbirtsev <as...@ca.afilias.info>
>         Thank you.
>         I think I can just take it out completely, since I want to
>         keep it in
>         UTF-8 and just display to the user, not to convert to local
>         code page.
>         And all I need a parser to do is parse a document that is in
>         UTF-8 so it
>         should be ok.
>         
>  
> 
> If I understand correctly, you need to read-in a UTF-8 encoded XML
> file and keep using this encoding after the Xerces Parser is done with
> it.
> 
> I found only one way to accomplish this.
> First, read in the XML with the correct encoding: 
> 
>    XMLString::transcode("UTF-8", tempStr, xercesMaxString_ - 1);
>    domInputSource->setEncoding(tempStr);
>    
> Then create a UTF-8 Transcoder object to encode back (!) to UTF-8 the
> strings the Xerces Parser will keep internally in UTF-16:
> 
>   XMLTransService::Codes returnCode;
>   XMLTranscoder * utf8Transcoder_ =
>       XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8",
> returnCode, xercesMaxStringLength_);
>       
> Use this Transcoder on the strings the Xerces Parser returns:
> 
> std::string StringTranscoder::TranscodeToUTF8(const XMLCh * str,
> unsigned int inputSize)
> {
>    XMLByte * resultingString = new XMLByte[xercesMaxStringLength_ -
> 1];
> 
>    unsigned int charsEaten;
>    unsigned int resultingSize = utf8Transcoder_->transcodeTo(str,
> inputSize, resultingString,
>       xercesMaxStringLength_ - 1, charsEaten,
> XMLTranscoder::UnRep_RepChar);
>    
>    std::string resultValue(resultingString, resultingString +
> resultingSize);
> 
>    delete[] resultingString;
>    return resultValue;
> } 
> 
> If there is a better way, I am interested as well.
> 
> Best regards,
> Lucian
> 
>

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Lucian Cosoi <lu...@gmail.com>.

2008/9/19 Anna Simbirtsev <as...@ca.afilias.info>

> Thank you.
> I think I can just take it out completely, since I want to keep it in
> UTF-8 and just display to the user, not to convert to local code page.
> And all I need a parser to do is parse a document that is in UTF-8 so it
> should be ok.
>


If I understand correctly, you need to read-in a UTF-8 encoded XML file and
keep using this encoding after the Xerces Parser is done with it.

I found only one way to accomplish this.
First, read in the XML with the correct encoding:

   XMLString::transcode("UTF-8", tempStr, xercesMaxString_ - 1);
   domInputSource->setEncoding(tempStr);

Then create a UTF-8 Transcoder object to encode back (!) to UTF-8 the
strings the Xerces Parser will keep internally in UTF-16:

  XMLTransService::Codes returnCode;
  XMLTranscoder * utf8Transcoder_ =
      XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8",
          returnCode, xercesMaxStringLength_);

Use this Transcoder on the strings the Xerces Parser returns:

std::string StringTranscoder::TranscodeToUTF8(const XMLCh * str,
                                              unsigned int inputSize)
{
   XMLByte * resultingString = new XMLByte[xercesMaxStringLength_ - 1];

   unsigned int charsEaten;
   unsigned int resultingSize = utf8Transcoder_->transcodeTo(str, inputSize,
      resultingString, xercesMaxStringLength_ - 1, charsEaten,
      XMLTranscoder::UnRep_RepChar);

   std::string resultValue(resultingString, resultingString + resultingSize);

   // The buffer was allocated with new[], so it must be released with delete[].
   delete[] resultingString;
   return resultValue;
}
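
For readers without Xerces at hand, the UTF-16 to UTF-8 step can be sketched with the standard library alone. This is only an illustration of the conversion a UTF-8 transcoder performs, not the Xerces-C implementation: utf16ToUtf8 and appendUtf8 are hypothetical names, std::u16string (C++11) stands in for an array of UTF-16 code units, and replacing an unpaired surrogate with U+FFFD is an assumption made here. Because a growable std::string is used, no fixed-size output buffer has to be guessed; the worst case is three bytes per UTF-16 code unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace {

// Append the UTF-8 encoding of a single code point to `out`.
void appendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

} // namespace

// Convert UTF-16 code units to UTF-8 bytes (hypothetical helper).
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    out.reserve(in.size() * 3); // worst case: 3 bytes per code unit

    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow  = cp >= 0xDC00 && cp <= 0xDFFF;

        if (isHigh && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (isHigh || isLow) {
            cp = 0xFFFD; // assumption: substitute unpaired surrogates
        }
        appendUtf8(out, cp);
    }

    assert(out.size() <= in.size() * 3); // the sizing bound stated above
    return out;
}
```

Calling utf16ToUtf8(u"\u00E9dia") yields the five bytes C3 A9 64 69 61, one more byte than there are characters in the input.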

If there is a better way, I am interested as well.

Best regards,
Lucian

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

Thank you.
I think I can just take it out completely, since I want to keep it in
UTF-8 and just display to the user, not to convert to local code page.
And all I need a parser to do is parse a document that is in UTF-8 so it
should be ok.

On Fri, 2008-09-19 at 12:27 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > Hi,
> > 
> > Do you know if you can give me an example of how to transcode utf-8
> > string to unicode and back? I think if I get the string in utf-8
> > encoding, I need to convert it to unicode before I pass it into xerces
> > parser?
> UTF-8 is an encoding of Unicode, so I'm not sure I understand your 
> question.  Xerces-C uses UTF-16 internally, so you would need to 
> transcode strings from UTF-8 to UTF-16 for APIs that expect arrays of 
> UTF-16 code units, such as DOMDocument::createElement(const XMLCh* 
> tagName). You can, however, parse UTF-8 documents without transcoding them.
> 
> There was a thread last week that discussed some of the issues with 
> local code page transcoding and you will find a link to an earlier 
> thread that has some transcoding code snippets.
> 
> Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by David Bertoni <db...@apache.org>.

Anna Simbirtsev wrote:
> Hi,
> 
> Do you know if you can give me an example of how to transcode utf-8
> string to unicode and back? I think if I get the string in utf-8
> encoding, I need to convert it to unicode before I pass it into xerces
> parser?
UTF-8 is an encoding of Unicode, so I'm not sure I understand your 
question.  Xerces-C uses UTF-16 internally, so you would need to 
transcode strings from UTF-8 to UTF-16 for APIs that expect arrays of 
UTF-16 code units, such as DOMDocument::createElement(const XMLCh* 
tagName). You can, however, parse UTF-8 documents without transcoding them.

There was a thread last week that discussed some of the issues with 
local code page transcoding and you will find a link to an earlier 
thread that has some transcoding code snippets.

Dave

RE: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Jesse Pelton <js...@PKC.com>.

You shouldn't need to use std::wstring, and in fact, you shouldn't.
Whether you can use std::string depends on whether std::string is
capable of understanding UTF-8.  That may depend on the implementation.
(Or not.  I haven't used any implementation of the class myself.)
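
On the std::string question: std::string is a container of char values and attaches no encoding to them, so it can carry UTF-8 bytes unchanged. What differs from plain ASCII is that size() counts bytes, not characters. A small sketch, assuming well-formed UTF-8 input (utf8CodePointCount is a hypothetical helper name, not part of any library discussed here):

```cpp
#include <cstddef>
#include <string>

// Count the code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count; // a lead byte or an ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the UTF-8 bytes C3 A9 64 69 61 (an e-acute followed by "dia"), size() is 5 while utf8CodePointCount returns 4.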

Note that UTF-8 *is* Unicode.  ("UTF" stands for "Unicode Transformation
Format.")  Perhaps you are asking how to transcode between UTF-8 and
UTF-16, which is the encoding Xerces uses internally.  (UTF-16 is also
the native format of Windows NT-based operating systems, and is
unfortunately referred to by Microsoft as "Unicode."  This perpetuates
the confusion between Unicode and its encodings.)  If so, there are many
messages in the mailing list archives that address this question.  See,
for example, http://marc.info/?l=xerces-c-users&m=119514889329902&w=2.
The archives are listed at
http://xerces.apache.org/xerces-c/mailing-lists.html.

-----Original Message-----
From: Anna Simbirtsev [mailto:asimbirt@ca.afilias.info] 
Sent: Friday, September 19, 2008 11:21 AM
To: c-users@xerces.apache.org
Subject: Re: Problems with xerces-c version 1.7.0 and UTF-8

Also do I need to use std::wstring to store UTF-8 strings or I will be
ok with std::string?

Thank you

On Fri, 2008-09-19 at 09:40 -0400, Anna Simbirtsev wrote:
> Hi,
> 
> Do you know if you can give me an example of how to transcode utf-8
> string to unicode and back? I think if I get the string in utf-8
> encoding, I need to convert it to unicode before I pass it into xerces
> parser?
> 
> On Wed, 2008-09-17 at 09:58 -0700, David Bertoni wrote:
> > Anna Simbirtsev wrote:
> > > When I print it in hex format, I get
> > >  : 0xffffffd0
> > >  : 0xffffffb1
> > >  : 0xffffffd0
> > >  : 0xffffffb1
> > >  : 0xffffffd0
> > >  : 0xffffffb1
> > > 
> > > Which I am not even sure what format, but maybe my shell does not
> > > know what it is.
> > You need to understand the limitations of any library you use.  Here is 
> > a snippet of the source code from the domtools library you're using:
> > 
> > string domtools::toString(const DOMString s)
> > {
> >     char * t = s.transcode();
> >     if (!t) return "";
> >     string tmp = t;
> >     delete [] t;
> >     return tmp;
> > }
> > 
> > You can see the call to DOMString::transcode().  This will fail when 
> > characters in the DOMString are not representable in the local code 
> > page.  This is likely what's happening, and I suggest you find another 
> > library to use, because this one is broken.
> > 
> > Alternately, if you always want to transcode data to UTF-8, you can 
> > modify the library to use a UTF-8 transcoder.  There was another thread 
> > late last week and this week on this topic.
> > 
> > Dave
>

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

Also do I need to use std::wstring to store UTF-8 strings or I will be
ok with std::string?

Thank you

On Fri, 2008-09-19 at 09:40 -0400, Anna Simbirtsev wrote:
> Hi,
> 
> Do you know if you can give me an example of how to transcode utf-8
> string to unicode and back? I think if I get the string in utf-8
> encoding, I need to convert it to unicode before I pass it into xerces
> parser?
> 
> On Wed, 2008-09-17 at 09:58 -0700, David Bertoni wrote:
> > Anna Simbirtsev wrote:
> > > When I print it in hex format, I get
> > > �: 0xffffffd0
> > > �: 0xffffffb1
> > > �: 0xffffffd0
> > > �: 0xffffffb1
> > > �: 0xffffffd0
> > > �: 0xffffffb1
> > > 
> > > Which I am not even sure what format, but maybe my shell does not
> > > know what it is.
> > You need to understand the limitations of any library you use.  Here is 
> > a snippet of the source code from the domtools library you're using:
> > 
> > string domtools::toString(const DOMString s)
> > {
> >     char * t = s.transcode();
> >     if (!t) return "";
> >     string tmp = t;
> >     delete [] t;
> >     return tmp;
> > }
> > 
> > You can see the call to DOMString::transcode().  This will fail when 
> > characters in the DOMString are not representable in the local code 
> > page.  This is likely what's happening, and I suggest you find another 
> > library to use, because this one is broken.
> > 
> > Alternately, if you always want to transcode data to UTF-8, you can 
> > modify the library to use a UTF-8 transcoder.  There was another thread 
> > late last week and this week on this topic.
> > 
> > Dave
>

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

Hi,

Do you know if you can give me an example of how to transcode utf-8
string to unicode and back? I think if I get the string in utf-8
encoding, I need to convert it to unicode before I pass it into xerces
parser?

On Wed, 2008-09-17 at 09:58 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > When I print it in hex format, I get
> > �: 0xffffffd0
> > �: 0xffffffb1
> > �: 0xffffffd0
> > �: 0xffffffb1
> > �: 0xffffffd0
> > �: 0xffffffb1
> > 
> > Which I am not even sure what format, but maybe my shell does not
> > know what it is.
> You need to understand the limitations of any library you use.  Here is 
> a snippet of the source code from the domtools library you're using:
> 
> string domtools::toString(const DOMString s)
> {
>     char * t = s.transcode();
>     if (!t) return "";
>     string tmp = t;
>     delete [] t;
>     return tmp;
> }
> 
> You can see the call to DOMString::transcode().  This will fail when 
> characters in the DOMString are not representable in the local code 
> page.  This is likely what's happening, and I suggest you find another 
> library to use, because this one is broken.
> 
> Alternately, if you always want to transcode data to UTF-8, you can 
> modify the library to use a UTF-8 transcoder.  There was another thread 
> late last week and this week on this topic.
> 
> Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

I think they come in as UTF-8 from the server, all I need is to parse
them.

On Fri, 2008-09-19 at 16:07 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > Do you know if I receive utf-8 string, can I just take out s.transcode
> > completely and keep the string in utf-8? DOMString is capable of
> > containing utf-8 strings?
> No, Xerces-C always uses UTF-16 internally to encode character data. 
> When you supply a document that is not encoded in UTF-16, it uses a 
> transcoder to convert the byte stream to UTF-16 before parsing it.
> 
> You seemed to be confused about the differences between UTF-8 and 
> UTF-16.  Both are encodings that can represent all of the characters in 
> Unicode.  UTF-8 is an 8-bit encoding that is compatible with the char 
> data type in C.  UTF-16 is a 16-bit encoding, so it's not compatible 
> with the char data type.
> 
> Is there some reason you need strings encoded in UTF-8?
> 
> Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by David Bertoni <db...@apache.org>.

Anna Simbirtsev wrote:
> Do you know if I receive utf-8 string, can I just take out s.transcode
> completely and keep the string in utf-8? DOMString is capable of
> containing utf-8 strings?
No, Xerces-C always uses UTF-16 internally to encode character data. 
When you supply a document that is not encoded in UTF-16, it uses a 
transcoder to convert the byte stream to UTF-16 before parsing it.

You seemed to be confused about the differences between UTF-8 and 
UTF-16.  Both are encodings that can represent all of the characters in 
Unicode.  UTF-8 is an 8-bit encoding that is compatible with the char 
data type in C.  UTF-16 is a 16-bit encoding, so it's not compatible 
with the char data type.
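
To make the difference between 8-bit and 16-bit code units concrete, here is a rough standard-library sketch of the UTF-8 to UTF-16 conversion that a transcoder performs on the incoming byte stream. It is not Xerces-C code: utf8ToUtf16 is a hypothetical name, std::u16string (C++11) stands in for an array of UTF-16 code units, and the decoder is deliberately simplified (it does not reject overlong forms or encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units (simplified sketch).
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;

    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: a surrogate pair (two code units).
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The two bytes D0 B1 decode to the single code unit 0x0431, which is why a UTF-16 string cannot simply be handed to code that expects char data.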

Is there some reason you need strings encoded in UTF-8?

Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

Do you know if I receive utf-8 string, can I just take out s.transcode
completely and keep the string in utf-8? DOMString is capable of
containing utf-8 strings?

On Wed, 2008-09-17 at 09:58 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > When I print it in hex format, I get
> > �: 0xffffffd0
> > �: 0xffffffb1
> > �: 0xffffffd0
> > �: 0xffffffb1
> > �: 0xffffffd0
> > �: 0xffffffb1
> > 
> > Which I am not even sure what format, but maybe my shell does not
> > know what it is.
> You need to understand the limitations of any library you use.  Here is 
> a snippet of the source code from the domtools library you're using:
> 
> string domtools::toString(const DOMString s)
> {
>     char * t = s.transcode();
>     if (!t) return "";
>     string tmp = t;
>     delete [] t;
>     return tmp;
> }
> 
> You can see the call to DOMString::transcode().  This will fail when 
> characters in the DOMString are not representable in the local code 
> page.  This is likely what's happening, and I suggest you find another 
> library to use, because this one is broken.
> 
> Alternately, if you always want to transcode data to UTF-8, you can 
> modify the library to use a UTF-8 transcoder.  There was another thread 
> late last week and this week on this topic.
> 
> Dave

Re: Understanding Problem with XMLString::transcode("LS", tempStr, 99);

Posted by Alberto Massari <am...@datadirect.com>.

"LS" means "Load & Save"; the W3C DOM allows multiple DOM implementation 
to coexist, so DOMImplementationRegistry::getDOMImplementation will 
look through the available implementations and pick one that has the 
specified features (this allows a low-footprint DOM when you don't 
need advanced features).

Core = it supports just the basic DOM features
Range = createRange is supported
Traversal = createTreeWalker and createNodeIterator are supported
LS = Load & Save -> createDOMBuilder and createDOMWriter are supported

Having said that, Xerces has just one DOM model, so regardless of how 
many features you request, you will be given the entire set of features.

Alberto

David.Sander@ptb.de wrote:
> Hello everyone,
>
> Can somebody tell me what the following code does? (See the questions in
> the comment.)
>
>         XMLCh tempStr[100];
>         XMLString::transcode("LS", tempStr, 99);
>         /* What does "LS" stand for, and why can't I use "*", which is
>            also possible according to
>            http://old.hki.uni-koeln.de/teach/ws0506/hs/tag7/xerces-c.pdf ?
>            Is there a document that lists all the options and says when
>            each one can be used? */
>
>         DOMImplementation *impl =
>             DOMImplementationRegistry::getDOMImplementation(tempStr);
>
>         DOMBuilder* parser =
>             ((DOMImplementationLS*)impl)->createDOMBuilder(
>                 DOMImplementationLS::MODE_SYNCHRONOUS, 0);
>
> Thanks David
>
>  
>

Understanding Problem with XMLString::transcode("LS", tempStr, 99);

Posted by Da...@ptb.de.

Hello everyone,

Can somebody tell me what the following code does? (See the questions in
the comment.)

        XMLCh tempStr[100];
        XMLString::transcode("LS", tempStr, 99);
        /* What does "LS" stand for, and why can't I use "*", which is
           also possible according to
           http://old.hki.uni-koeln.de/teach/ws0506/hs/tag7/xerces-c.pdf ?
           Is there a document that lists all the options and says when
           each one can be used? */

        DOMImplementation *impl =
            DOMImplementationRegistry::getDOMImplementation(tempStr);

        DOMBuilder* parser =
            ((DOMImplementationLS*)impl)->createDOMBuilder(
                DOMImplementationLS::MODE_SYNCHRONOUS, 0);

Thanks David

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by David Bertoni <db...@apache.org>.

Anna Simbirtsev wrote:
> When I print it in hex format, I get
> �: 0xffffffd0
> �: 0xffffffb1
> �: 0xffffffd0
> �: 0xffffffb1
> �: 0xffffffd0
> �: 0xffffffb1
> 
> Which I am not even sure what format, but maybe my shell does not
> know what it is.
You need to understand the limitations of any library you use.  Here is 
a snippet of the source code from the domtools library you're using:

string domtools::toString(const DOMString s)
{
    char * t = s.transcode();
    if (!t) return "";
    string tmp = t;
    delete [] t;
    return tmp;
}

You can see the call to DOMString::transcode().  This will fail when 
characters in the DOMString are not representable in the local code 
page.  This is likely what's happening, and I suggest you find another 
library to use, because this one is broken.

Alternately, if you always want to transcode data to UTF-8, you can 
modify the library to use a UTF-8 transcoder.  There was another thread 
late last week and this week on this topic.

Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

When I print it in hex format, I get
�: 0xffffffd0
�: 0xffffffb1
�: 0xffffffd0
�: 0xffffffb1
�: 0xffffffd0
�: 0xffffffb1

Which I am not even sure what format, but maybe my shell does not
know what it is.


On Wed, 2008-09-17 at 15:39 +0200, Alberto Massari wrote:
> Hi Anna,
> if I am not mistaken, the code you attached doesn't have the sample data 
> you are trying to parse (e.g. parseString is used to parse the result of 
> a toXML call on an extension object).
> However, you say "in the dom_wrapper.c I print the string before it is 
> passed to the xerces-c parser [...] and my value in utf-8 looks fine"; 
> in the code you write
> 
>    cout << "parseString: " << str << endl;
>    return parseMemory(str.c_str(),(int)str.length());
> 
> But the fact that your console prints the data as you expect doesn't 
> imply that the std::string contains real UTF-8; your shell could be 
> using a Japanese locale, and be able to print correctly 
> Shift_JIS-encoded strings (while failing to print UTF-8-encoded strings).
> If you want to really see what you are considering UTF-8, replace that 
> cout << str with this code
> 
> for(int i=0;i<str.length();i++)
>   cout << "0x" << hex << (int)str[i] << " ";
> cout << endl;
> 
> Alberto
> 
> Anna Simbirtsev wrote:
> > In the epp_eppXMLbase.cc in function createDOMDocument it calls
> > parseString function from domtools::XercesParser. In the dom_wrapper.c I
> > print the string before it is passed to the xerces-c parser in
> > domtools::XercesParser::parseMemory function and my value in utf-8 looks
> > fine. When it gets back from xerces-c a DOM_document, it uses XercesNode
> > object(defined in dom_wrapper.h) to store the DOM_document and break it
> > into nodes. Then in epp_eppXMLbase.cc, in function
> > eppobject::epp::addExtensionElements(EPP_output & outputobject, const
> > epp_extension_ref_seq_ref & extensions)
> >
> > it calls
> > DomPrint dp(outputobject);
> > dp.putDOMTree(extensionDoc);
> >
> > from dom_print.cc where I actually print the value in putDOMTree
> > function. Here the value looks truncated.
> > The entire source code of domtools is available on
> > http://sourceforge.net/project/showfiles.php?group_id=26675
> >
> > Thank you very much for your help.
> >
> > On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> >   
> >> Anna Simbirtsev wrote:
> >>     
> >>> I pass just plain xml string to the DOMParser, so I don't use the
> >>> transcode function.
> >>>
> >>> [...]
> >>> I just copy utf-8 strings from wikipedia.org and paste it right
> >>> into the code to test. After I compiled the parser with ICU, it returns
> >>> the string, but shorter. My xml has UTF-8 encoding set: <?xml
> >>> version='1.0' encoding='UTF-8'?>.
> >>>   
> >>>       
> >> If you just used cut & paste from your browser to your C++ code editor, 
> >> I can bet you are not pasting UTF-8 codepoints, but something in your 
> >> local code page. Can you attach your source code to this e-mail 
> >> (attached, not copied)?
> >>
> >> Alberto
> >>     
>

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Alberto Massari <am...@datadirect.com>.

Hi Anna,
if I am not mistaken, the code you attached doesn't have the sample data 
you are trying to parse (e.g. parseString is used to parse the result of 
a toXML call on an extension object).
However, you say "in the dom_wrapper.c I print the string before it is 
passed to the xerces-c parser [...] and my value in utf-8 looks fine"; 
in the code you write

   cout << "parseString: " << str << endl;
   return parseMemory(str.c_str(),(int)str.length());

But the fact that your console prints the data as you expect doesn't 
imply that the std::string contains real UTF-8; your shell could be 
using a Japanese locale, and be able to print correctly 
Shift_JIS-encoded strings (while failing to print UTF-8-encoded strings).
If you want to really see what you are considering UTF-8, replace that 
cout << str with this code

for(int i=0;i<str.length();i++)
  cout << "0x" << hex << (int)str[i] << " ";
cout << endl;
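
A note on reading the output of the loop above: on platforms where plain char is signed, every byte at or above 0x80 is negative, so the (int) cast sign-extends it and values such as 0xffffffd0 are printed, as reported elsewhere in this thread. Only the low byte is data; D0 B1, for instance, is the UTF-8 encoding of U+0431. A minimal standard-library sketch of a dump that avoids the sign extension follows; hexDump is a hypothetical helper name, not part of Xerces-C or domtools.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Format every byte of `s` as a two-digit hex value, separated by spaces.
std::string hexDump(const std::string& s)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');

    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) {
            out << ' ';
        }
        // Convert to unsigned char first so that a byte such as 0xD0
        // is not sign-extended to 0xffffffd0 when widened.
        out << "0x" << std::setw(2)
            << static_cast<unsigned int>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}
```

With this helper, a string holding the bytes D0 B1 is shown as "0xd0 0xb1", which can be compared directly against a UTF-8 table.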

Alberto

Anna Simbirtsev wrote:
> In the epp_eppXMLbase.cc in function createDOMDocument it calls
> parseString function from domtools::XercesParser. In the dom_wrapper.c I
> print the string before it is passed to the xerces-c parser in
> domtools::XercesParser::parseMemory function and my value in utf-8 looks
> fine. When it gets back from xerces-c a DOM_document, it uses XercesNode
> object(defined in dom_wrapper.h) to store the DOM_document and break it
> into nodes. Then in epp_eppXMLbase.cc, in function
> eppobject::epp::addExtensionElements(EPP_output & outputobject, const
> epp_extension_ref_seq_ref & extensions)
>
> it calls
> DomPrint dp(outputobject);
> dp.putDOMTree(extensionDoc);
>
> from dom_print.cc where I actually print the value in putDOMTree
> function. Here the value looks truncated.
> The entire source code of domtools is available on
> http://sourceforge.net/project/showfiles.php?group_id=26675
>
> Thank you very much for your help.
>
> On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
>   
>> Anna Simbirtsev wrote:
>>     
>>> I pass just plain xml string to the DOMParser, so I don't use the
>>> transcode function.
>>>
>>> [...]
>>> I just copy utf-8 strings from wikipedia.org and paste it right
>>> into the code to test. After I compiled the parser with ICU, it returns
>>> the string, but shorter. My xml has UTF-8 encoding set: <?xml
>>> version='1.0' encoding='UTF-8'?>.
>>>   
>>>       
>> If you just used cut & paste from your browser to your C++ code editor, 
>> I can bet you are not pasting UTF-8 codepoints, but something in your 
>> local code page. Can you attach your source code to this e-mail 
>> (attached, not copied)?
>>
>> Alberto
>>

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

In the epp_eppXMLbase.cc in function createDOMDocument it calls
parseString function from domtools::XercesParser. In the dom_wrapper.c I
print the string before it is passed to the xerces-c parser in
domtools::XercesParser::parseMemory function and my value in utf-8 looks
fine. When it gets back from xerces-c a DOM_document, it uses XercesNode
object(defined in dom_wrapper.h) to store the DOM_document and break it
into nodes. Then in epp_eppXMLbase.cc, in function
eppobject::epp::addExtensionElements(EPP_output & outputobject, const
epp_extension_ref_seq_ref & extensions)

it calls
DomPrint dp(outputobject);
dp.putDOMTree(extensionDoc);

from dom_print.cc where I actually print the value in putDOMTree
function. Here the value looks truncated.
The entire source code of domtools is available on
http://sourceforge.net/project/showfiles.php?group_id=26675

Thank you very much for your help.

On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> Anna Simbirtsev wrote:
> > I pass just plain xml string to the DOMParser, so I don't use the
> > transcode function.
> >
> > [...]
> > I just copy utf-8 strings from wikipedia.org and paste it right
> > into the code to test. After I compiled the parser with ICU, it returns
> > the string, but shorter. My xml has UTF-8 encoding set: <?xml
> > version='1.0' encoding='UTF-8'?>.
> >   
> 
> If you just used cut & paste from your browser to your C++ code editor, 
> I can bet you are not pasting UTF-8 codepoints, but something in your 
> local code page. Can you attach your source code to this e-mail 
> (attached, not copied)?
> 
> Alberto

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

Here is an example of what I see before I pass the string to the
xerces-c parser and after:

<Returned_XML>
<parseme
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><ipr:create
xmlns:ipr="urn:afilias:params:xml:ns:ipr-1.1 "
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:afilias:params:xml:ns:ipr-1.1
ipr-1.1.xsd"><ipr:name>édia</ipr:name><ipr:number>12345566</ipr:number><ipr:ccLocality>CA</ipr:ccLocality><ipr:regDate>2001-01-01</ipr:regDate><ipr:appDate>2002-01-01</ipr:appDate><ipr:class>1</ipr:class><ipr:entitlement>owner</ipr:entitlement><ipr:form>corporation</ipr:form><ipr:preVerified>code</ipr:preVerified><ipr:type>sunrise</ipr:type></ipr:create>
</parseme>

</Returned_XML>

The utf-8 string is <ipr:name>édia</ipr:name>.

When it comes back from the parser in the form of a DOM_document and I
extract the node and print its value, I see:

value: édi
value: 12345566
value: CA
value: 2001-01-01
value: 2002-01-01
value: 1
value: owner
value: corporation
value: code
value: sunrise

So it dropped the last character, 'a', in édia.

On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> Anna Simbirtsev wrote:
> > I pass just plain xml string to the DOMParser, so I don't use the
> > transcode function.
> >
> > [...]
> > I just copy utf-8 strings from wikipedia.org and paste it right
> > into the code to test. After I compiled the parser with ICU, it returns
> > the string, but shorter. My xml has UTF-8 encoding set: <?xml
> > version='1.0' encoding='UTF-8'?>.
> >   
> 
> If you just used cut & paste from your browser to your C++ code editor, 
> I can bet you are not pasting UTF-8 codepoints, but something in your 
> local code page. Can you attach your source code to this e-mail 
> (attached, not copied)?
> 
> Alberto

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Alberto Massari <am...@datadirect.com>.

Anna Simbirtsev wrote:
> I pass just plain xml string to the DOMParser, so I don't use the
> transcode function.
>
> [...]
> I just copy utf-8 strings from wikipedia.org and paste it right
> into the code to test. After I compiled the parser with ICU, it returns
> the string, but shorter. My xml has UTF-8 encoding set: <?xml
> version='1.0' encoding='UTF-8'?>.
>   

If you just used cut & paste from your browser to your C++ code editor, 
I can bet you are not pasting UTF-8 codepoints, but something in your 
local code page. Can you attach your source code to this e-mail 
(attached, not copied)?

Alberto

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by David Bertoni <db...@apache.org>.

Anna Simbirtsev wrote:
> I pass just plain xml string to the DOMParser, so I don't use the
> transcode function.
> 
>  const void * const buffer = str.c_str();
> 
>    ::DOMParser parser;
>    parser.setDoNamespaces(true);
>    parser.setToCreateXMLDeclTypeNode(false);
>    MemBufInputSource* memBufIS = new MemBufInputSource
>      (
>       (const XMLByte*)buffer
>       , length
>       , "domtools"
>       , false
>       );
> 
>    try {
>       parser.parse(*memBufIS);
>       DOM_Document doc = parser.getDocument();
>       delete memBufIS;
>       if (!doc.isNull()) return new XercesNode(doc);
>    } catch(...) {
>       delete memBufIS;
>    };
>    return new XercesNode();
> 
> When I had no ICU, it was returning an empty string instead of utf-8
> string. I just copy utf-8 strings from wikipedia.org and paste it right
> into the code to test. After I compiled the parser with ICU, it returns
> the string, but shorter. My xml has UTF-8 encoding set: <?xml
> version='1.0' encoding='UTF-8'?>.
You just posted the exact reply to this list that you posted to Jesse on 
the developer list, but you've not included the necessary information so 
someone can help you.

There is nothing in the code snippet that you posted where you access 
any data in the document, so I don't understand how you can tell any 
strings are truncated.

Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.

I pass just plain xml string to the DOMParser, so I don't use the
transcode function.

 const void * const buffer = str.c_str();

   ::DOMParser parser;
   parser.setDoNamespaces(true);
   parser.setToCreateXMLDeclTypeNode(false);
   MemBufInputSource* memBufIS = new MemBufInputSource
     (
      (const XMLByte*)buffer
      , length
      , "domtools"
      , false
      );

   try {
      parser.parse(*memBufIS);
      DOM_Document doc = parser.getDocument();
      delete memBufIS;
      if (!doc.isNull()) return new XercesNode(doc);
   } catch(...) {
      delete memBufIS;
   };
   return new XercesNode();

When I had no ICU, it was returning an empty string instead of utf-8
string. I just copy utf-8 strings from wikipedia.org and paste it right
into the code to test. After I compiled the parser with ICU, it returns
the string, but shorter. My xml has UTF-8 encoding set: <?xml
version='1.0' encoding='UTF-8'?>.

On Tue, 2008-09-16 at 12:47 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > Hello,
> > 
> > I compiled xerces-c 1.7.0 with ICU 4.0 to be able to handle UTF-8
> > strings. Now the parser takes in a UTF-8 string, but when it comes out it's
> > truncated by a couple of characters. Can anybody help?
> Note that Xerces-C can parse documents encoded in UTF-8 _without_ 
> integrating the ICU.
> 
> Perhaps you are calling XMLString::transcode() or 
> DOMString::transcode()?  If so, please search the archives of the 
> mailing list, as this problem comes up often (in fact, just last week).
> 
> If not, then please provide more information about what you mean by 
> "when it comes out" and what characters are truncated.
> 
> Dave

Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by David Bertoni <db...@apache.org>.

Anna Simbirtsev wrote:
> Hello,
> 
> I compiled xerces-c 1.7.0 with ICU 4.0 to be able to handle UTF-8
> strings. Now the parser takes in a UTF-8 string, but when it comes out it's
> truncated by a couple of characters. Can anybody help?
Note that Xerces-C can parse documents encoded in UTF-8 _without_ 
integrating the ICU.

Perhaps you are calling XMLString::transcode() or 
DOMString::transcode()?  If so, please search the archives of the 
mailing list, as this problem comes up often (in fact, just last week).

If not, then please provide more information about what you mean by 
"when it comes out" and what characters are truncated.

Dave