Posted to c-dev@xerces.apache.org by Anna Simbirtsev <as...@ca.afilias.info> on 2008/09/16 21:12:46 UTC

Problems with xerces-c version 1.7.0 and UTF-8

Hello,

I compiled xerces-c 1.7.0 with ICU 4.0 to be able to handle UTF-8
strings. Now the parser takes in a UTF-8 string, but when it comes out
it's truncated by a couple of characters. Can anybody help?

Thank you
Anna



---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org


Re: Problems with xerces-c version 1.7.0 and UTF-8

Posted by David Bertoni <db...@apache.org>.
Anna Simbirtsev wrote:
> It just stores DOM_Document and has functions like getFirstChildElement
> and getNodeData.
As a courtesy, please do not post your question to multiple mailing
lists.  This question is more appropriate for the user list, where I just
replied to your message.

Since you've yet to show us any code that manipulates the character data 
of the DOM document, it's impossible to offer you any definitive help. 
However, since you indicate you're calling getNodeData() I suspect you 
are indeed transcoding data from UTF-16 to the local code page.

Please reply to my posting on the Xerces-C user list.

Dave
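
To illustrate the transcoding point above: a DOM keeps character data as
UTF-16, so getting UTF-8 back out requires an explicit UTF-16 to UTF-8
conversion rather than a conversion to the local code page. The following
is a minimal, standard-library-only sketch of that difference. It is not
Xerces code: utf16ToUtf8 and toAsciiLossy are hypothetical helpers, and
toAsciiLossy is only a stand-in for a code page that cannot represent
every character. How a real local-code-page transcoder treats such
characters depends on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Xerces API): encode UTF-16 code units as UTF-8.
// Unpaired surrogates are passed through unchanged to keep the sketch short.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // A UTF-16 sequence can never decode past U+10FFFF.
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical stand-in for a lossy "local code page" conversion:
// keeps ASCII and silently drops everything else.
std::string toAsciiLossy(const std::u16string& in) {
    std::string out;
    for (char16_t c : in) {
        if (c < 0x80) out += static_cast<char>(c);
    }
    return out;
}
```

With these helpers, a string containing one accented letter comes back one
character short from toAsciiLossy but intact (and one byte longer) from
utf16ToUtf8, which is the kind of symptom a wrong target encoding can
produce.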



RE: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.
It just stores DOM_Document and has functions like getFirstChildElement
and getNodeData.

On Tue, 2008-09-16 at 15:34 -0400, Jesse Pelton wrote:
> What does the XercesNode(doc) constructor do? 


RE: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Jesse Pelton <js...@PKC.com>.
What does the XercesNode(doc) constructor do? 


RE: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Anna Simbirtsev <as...@ca.afilias.info>.
I just pass a plain XML string to the DOMParser.

 // str holds the XML text; length must be the buffer size in bytes.
 const void * const buffer = str.c_str();

   ::DOMParser parser;
   parser.setDoNamespaces(true);
   parser.setToCreateXMLDeclTypeNode(false);
   // The final 'false' means the input source does not adopt the buffer.
   MemBufInputSource* memBufIS = new MemBufInputSource
     (
      (const XMLByte*)buffer
      , length
      , "domtools"
      , false
      );

   try {
      parser.parse(*memBufIS);
      DOM_Document doc = parser.getDocument();
      delete memBufIS;
      if (!doc.isNull()) return new XercesNode(doc);
   } catch(...) {
      // Parsing failed: free the input source and fall through.
      delete memBufIS;
   }
   return new XercesNode();

Without ICU, it returned an empty string instead of the UTF-8 string.
To test, I copy UTF-8 strings from wikipedia.org and paste them directly
into the code. After I compiled the parser with ICU, it returns the
string, but shorter. My XML declares UTF-8 encoding: <?xml
version='1.0' encoding='UTF-8'?>.
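
One thing worth checking, though nothing in the thread confirms it: the
snippet does not show how length is computed. MemBufInputSource takes the
buffer size in bytes, and in UTF-8 the byte count exceeds the character
count whenever the text contains non-ASCII characters. The sketch below
uses only the standard library; utf8CodePoints is a hypothetical helper,
not part of Xerces, and it assumes its input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
std::size_t utf8CodePoints(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a std::string, size() already returns the byte count, so passing
str.size() as the length is the safe choice. A character count would cut
the buffer short by one or more bytes per non-ASCII character.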




RE: Problems with xerces-c version 1.7.0 and UTF-8

Posted by Jesse Pelton <js...@PKC.com>.
First, that's a truly ancient version of Xerces.  (Its successor was
released over six years ago.)  You might get more and better help if you
could use a more recent version.  Note that you don't need ICU to handle
UTF-8.

Second, you might search the list for questions relating to transcoding.
Odds are good that you're not transcoding to the encoding you think you
are, or something similar.

And finally, if the search doesn't yield an answer, a brief code sample
and sample document (attached to your message, not pasted into the
message body) may help diagnose the problem.
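
A quick way to test the "not the encoding you think" theory is to check
whether the bytes coming back are well-formed UTF-8 at all. The following
is a minimal sketch using only the standard library; isValidUtf8 is a
hypothetical diagnostic helper, not a Xerces or ICU function. Output in a
single-byte local code page, or UTF-8 cut off mid-character, will
typically fail this check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical diagnostic helper: returns true if s is well-formed UTF-8.
bool isValidUtf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long minCp[] = {0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // code point being decoded
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return false;       // stray continuation or invalid lead byte
        // The sequence must not run past the end of the buffer.
        if (i + extra >= s.size()) return false;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < minCp[extra]) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;
        i += extra + 1;
    }
    return true;
}
```

Running the retrieved node data through a check like this separates two
cases: valid but shorter UTF-8 suggests characters were dropped during
transcoding, while invalid bytes suggest the data is in another encoding
or was truncated at a byte boundary.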
