Posted to c-users@xerces.apache.org by pundog <pu...@gmail.com> on 2007/11/14 20:57:46 UTC

Converting XMLCh* to std::string with encoding

I've got an XML file that contains some Hebrew characters instead of English.
I made sure that I saved the file in UTF-8 encoding (using VS.Net), and I
also added an XML declaration line with the encoding attribute set to
UTF-8.

When I parse the XML, I get an XMLCh* that contains my Hebrew characters,
and when I browse its value in the debugger, I can see the actual Hebrew
characters. However, when I try to convert the XMLCh* to a std::string using
the "transcode" method, my string is filled with "????" characters instead
of Hebrew.

I'm completely stuck right now, and any idea would be helpful.
Thanks
-- 
View this message in context: http://www.nabble.com/Converting-XMLCh*-to-std%3A%3Astring-with-encoding-tf4807661.html#a13755245
Sent from the Xerces - C - Users mailing list archive at Nabble.com.


Re: Converting XMLCh* to std::string with encoding

Posted by David Bertoni <db...@apache.org>.
Alberto Massari wrote:
> pundog wrote:
>> Hi, I've noticed the leak when running the following code:
>>
>> while(true)
>> {
>> XMLTranscoder* utf8Transcoder;
>> XMLTransService::Codes failReason;
>> utf8Transcoder =
>> XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8",
>> failReason,
>> 16*1024);
>>
>> size_t len = XMLString::stringLen(my_hebrew_string);
>> XMLByte* utf8 = new XMLByte[(len*4)+1];
>> unsigned int eaten;
>> unsigned int utf8Len = utf8Transcoder->transcodeTo(my_hebrew_string, len,
>> utf8, len*4, eaten, XMLTranscoder::UnRep_Throw);
>>
>> utf8[utf8Len] = '\0';
>> string str = (char*)utf8;
>>
>> delete[] utf8;
>>   
> You also need to release the transcoder
> 
>    utf8Transcoder->release();
> 
> BTW, I would create it just once outside the 'while' loop.
> 
> As for the std::string -> XMLCh* conversion, just use the transcodeFrom 
> method exposed by the same transcoder.
Note that this works only with the proper transcoder for the encoding of 
the bytes.  Obviously, if you transcoded from UTF-16 to UTF-8 to create the 
std::string instance, you're OK.  But if you got the std::string from 
somewhere else, you'll need to know the encoding for the bytes and create 
the appropriate transcoder to convert them to UTF-16.
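
To illustrate why the decoder has to match the bytes, here is a minimal
sketch that uses only the standard library, not Xerces. fromLatin1 and
fromUtf8 are hypothetical helpers, and fromUtf8 assumes well-formed input.
The same two bytes decode to one Hebrew letter under UTF-8 but to two
unrelated characters under ISO-8859-1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: treat every byte as one ISO-8859-1 character.
std::u16string fromLatin1(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) out += static_cast<char16_t>(b);
    return out;
}

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Simplified: assumes well-formed input and does no validation.
std::u16string fromUtf8(const std::string& bytes) {
    std::u16string out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= bytes.size());  // precondition: no truncated sequence
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

With the bytes 0xD7 0xA9, fromUtf8 yields the single code unit U+05E9,
while fromLatin1 yields the two code units U+00D7 and U+00A9.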

Dave

Re: Converting XMLCh* to std::string with encoding

Posted by Alberto Massari <am...@datadirect.com>.
Hi,
you need to allocate at least one extra character (i.e. stringLength+1) 
to store the NULL character.
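
A minimal sketch of that sizing rule, assuming a hand-written UTF-8 decoder
in place of the Xerces transcoder (utf8ToNewUtf16 is a hypothetical helper,
not a Xerces API, and it assumes well-formed input). The buffer gets one
slot more than the largest possible character count, the terminator is
written explicitly at the converted length, and memory obtained with new[]
is freed with delete[].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into a newly allocated, null-terminated UTF-16 buffer.
// The caller owns the result and must free it with delete[].
char16_t* utf8ToNewUtf16(const std::string& bytes) {
    // Each UTF-8 byte yields at most one UTF-16 code unit, so bytes.size()
    // slots hold the text; the +1 reserves room for the terminator.
    char16_t* out = new char16_t[bytes.size() + 1];
    std::size_t n = 0;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out[n++] = static_cast<char16_t>(0xD800 + (cp >> 10));
            out[n++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out[n++] = static_cast<char16_t>(cp);
        }
    }
    assert(n <= bytes.size());  // the terminator slot is still free
    out[n] = 0;                 // without this, readers run past the text
    return out;
}
```

Converting "Hello" this way gives five characters followed by a zero, so
nothing trails the text.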

Alberto



Re: Converting XMLCh* to std::string with encoding

Posted by pundog <pu...@gmail.com>.


Alberto Massari wrote:
> 
> As you can read here 
> http://xerces.apache.org/xerces-c/apiDocs/classXMLTranscoder.html#b9d5409d562aa54f99dc01617091c457, 
> the signature is
> 
> virtual unsigned int XMLTranscoder::transcodeFrom(const XMLByte* const
> srcData, const unsigned int srcCount, XMLCh* const toFill, const
> unsigned int maxChars, unsigned int& bytesEaten, unsigned char* const
> charSizes)
> 
> It takes an XMLByte* (i.e. unsigned char*, which you can get from
> std::string::c_str()) and fills an XMLCh buffer of maxChars.
> 
> Alberto
> 

Hi, first of all, thanks for your reply.
I tried to use the "transcodeFrom" function using the following code:

XMLByte* xmlBytes = (unsigned char*)MY_STRING.c_str();
size_t stringLength = MY_STRING.length();
XMLCh* xmlChars = new XMLCh[stringLength];

XMLTranscoder* utf8Transcoder;
XMLTransService::Codes failReason;
utf8Transcoder =
XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason,
16*1024);

unsigned int eaten;
unsigned char* charSizes = new unsigned char[stringLength];
unsigned int xmlCharsLength = utf8Transcoder->transcodeFrom(xmlBytes,
stringLength, xmlChars, stringLength, eaten, charSizes);

delete[] charSizes;
delete utf8Transcoder;
...
XMLString::release(xmlChars);

The problem with this code is that for each string that I transcode, I get
some weird chars on the right side. For example, if I transcode the string
"Hello", then my XMLCh* will be "Hello$$$$".
I tried to place a null ('\0') at the end of the XMLCh buffer, but it causes
an exception to be thrown when I try to call XMLString::release to deallocate
its memory.

How can I solve this?
Again, thanks for your help :)


Re: Converting XMLCh* to std::string with encoding

Posted by Alberto Massari <am...@datadirect.com>.
pundog wrote:
>
> Alberto Massari wrote:
>> As for the std::string -> XMLCh* conversion, just use the transcodeFrom
>> method exposed by the same transcoder.
>>
>> Alberto
>>
>
> But the "transcodeFrom" function receives an XMLCh* as the source string,
> but I've got a std::string. How can I convert my std::string to XMLCh*
> using a specific (UTF-8) encoding?
As you can read here 
http://xerces.apache.org/xerces-c/apiDocs/classXMLTranscoder.html#b9d5409d562aa54f99dc01617091c457, 
the signature is

virtual unsigned int XMLTranscoder::transcodeFrom(
    const XMLByte* const srcData,
    const unsigned int srcCount,
    XMLCh* const toFill,
    const unsigned int maxChars,
    unsigned int& bytesEaten,
    unsigned char* const charSizes)

It takes an XMLByte* (i.e. unsigned char*, which you can get from
std::string::c_str()) and fills an XMLCh buffer of maxChars.

Alberto


Re: Converting XMLCh* to std::string with encoding

Posted by pundog <pu...@gmail.com>.

Alberto Massari wrote:
> 
> As for the std::string -> XMLCh* conversion, just use the transcodeFrom 
> method exposed by the same transcoder.
> 
> Alberto
> 

But the "transcodeFrom" function receives an XMLCh* as the source string,
but I've got a std::string. How can I convert my std::string to XMLCh*
using a specific (UTF-8) encoding?

Thanks :)


Re: Converting XMLCh* to std::string with encoding

Posted by Alberto Massari <am...@datadirect.com>.
pundog wrote:
> Hi, I've noticed the leak when running the following code:
>
> while(true)
> {
> XMLTranscoder* utf8Transcoder;
> XMLTransService::Codes failReason;
> utf8Transcoder =
> XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason,
> 16*1024);
>
> size_t len = XMLString::stringLen(my_hebrew_string);
> XMLByte* utf8 = new XMLByte[(len*4)+1];
> unsigned int eaten;
> unsigned int utf8Len = utf8Transcoder->transcodeTo(my_hebrew_string, len,
> utf8, len*4, eaten, XMLTranscoder::UnRep_Throw);
>
> utf8[utf8Len] = '\0';
> string str = (char*)utf8;
>
> delete[] utf8;
>   
You also need to release the transcoder

    utf8Transcoder->release();

BTW, I would create it just once outside the 'while' loop.
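
A minimal sketch of that ownership pattern, using only the standard library.
FakeTranscoder and convertMany are hypothetical stand-ins, not Xerces types;
how a real transcoder is released is as described in this thread. The
long-lived object is created once before the loop and freed exactly once,
while the per-iteration buffer is freed on every pass.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a heap-allocated transcoder. It counts live
// instances so that a leak would show up as a non-zero count.
struct FakeTranscoder {
    static int live;
    FakeTranscoder() { ++live; }
    ~FakeTranscoder() { --live; }
};
int FakeTranscoder::live = 0;

// Runs `iterations` passes and returns the total scratch bytes allocated;
// the point is the ownership pattern, not the conversion itself.
std::size_t convertMany(int iterations, std::size_t textLength) {
    // Created once, before the loop; destroyed when the function returns.
    std::unique_ptr<FakeTranscoder> transcoder(new FakeTranscoder);
    std::size_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        // Per-iteration buffer; std::vector frees it at the end of each pass.
        std::vector<unsigned char> utf8(textLength * 4 + 1);
        assert(FakeTranscoder::live == 1);  // never more than one transcoder
        total += utf8.size();
    }
    return total;
}
```

After convertMany returns, FakeTranscoder::live is back to zero, which is
the property the leaking loop was missing.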

As for the std::string -> XMLCh* conversion, just use the transcodeFrom 
method exposed by the same transcoder.

Alberto
> }
>
> Also, how can I accomplish the opposite task of converting a std::string to
> XMLCh* without using the regular "transcode" function?
> My string contains non-English characters and I need to convert it to an
> XMLCh*. How can I do that?
>
> Thanks again


Re: Converting XMLCh* to std::string with encoding

Posted by pundog <pu...@gmail.com>.
Hi, I've noticed the leak when running the following code:

while(true)
{
XMLTranscoder* utf8Transcoder;
XMLTransService::Codes failReason;
utf8Transcoder =
XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason,
16*1024);

size_t len = XMLString::stringLen(my_hebrew_string);
XMLByte* utf8 = new XMLByte[(len*4)+1];
unsigned int eaten;
unsigned int utf8Len = utf8Transcoder->transcodeTo(my_hebrew_string, len,
utf8, len*4, eaten, XMLTranscoder::UnRep_Throw);

utf8[utf8Len] = '\0';
string str = (char*)utf8;

delete[] utf8;
}

Also, how can I accomplish the opposite task of converting a std::string to
XMLCh* without using the regular "transcode" function?
My string contains non-English characters and I need to convert it to an
XMLCh*. How can I do that?

Thanks again




Re: Converting XMLCh* to std::string with encoding

Posted by David Bertoni <db...@apache.org>.
pundog wrote:
> Hi,
> I tried the code you posted. I only modified the line in which you create
> the XMLByte*: instead of using "()" I replaced it with "[]". (When I used
> the "()", an exception was thrown when I tried to delete the XMLByte*.)
Yes, that was a typo, it should have been:

XMLByte* utf8 = new XMLByte[(len * 4) + 1];

> 
> The problem I've got now is that this code causes a memory leak. When I
> tried to run it in a "while(true)" loop, it produced a serious leak. How can
> I fix it?

I have no idea why your code would be leaking.  If you post a minimal 
sample that exhibits the problem, perhaps someone can help you.

> 
> And another thing, is it OK to convert an XMLByte* to char*? Or is there a
> better way of converting an XMLByte* to a std::string?

There's no problem with casting an XMLByte* to a char*.  However, since 
you're using a std::string as a return value, the best way to do this is to 
use a fixed size buffer for the transcode() call, then transcode and append 
each buffer to the result std::string in a loop.  The parameters for each 
call of transcode, and the return value will tell you how much of the 
source string has been transcoded and how many bytes have been placed in 
the output buffer.
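
The loop described above can be sketched as follows, using only the
standard library. encodeChunk is a hypothetical stand-in whose parameters
loosely mirror a transcodeTo-style call (source, source count, output
buffer, output capacity, and an out-parameter for how much input was
consumed); it is not the Xerces function. toUtf8 is the loop itself: it
encodes into a small fixed buffer, appends the bytes produced, and advances
by the amount eaten.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a transcodeTo-style call: encodes as many whole
// characters as fit in dst, returns the bytes written, and reports the
// number of UTF-16 code units consumed through `eaten`.
std::size_t encodeChunk(const char16_t* src, std::size_t srcCount,
                        unsigned char* dst, std::size_t maxBytes,
                        std::size_t& eaten) {
    static const unsigned char leadBits[] = {0x00, 0x00, 0xC0, 0xE0, 0xF0};
    std::size_t written = 0;
    eaten = 0;
    while (eaten < srcCount) {
        char32_t cp = src[eaten];
        std::size_t units = 1;
        if (cp >= 0xD800 && cp <= 0xDBFF && eaten + 1 < srcCount &&
            src[eaten + 1] >= 0xDC00 && src[eaten + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[eaten + 1] - 0xDC00);
            units = 2;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
        }
        std::size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (written + need > maxBytes) break;  // next character does not fit
        // Fill continuation bytes from last to first, then the lead byte.
        for (std::size_t k = need - 1; k > 0; --k) {
            dst[written + k] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
            cp >>= 6;
        }
        dst[written] = static_cast<unsigned char>(leadBits[need] | cp);
        written += need;
        eaten += units;
    }
    return written;
}

// Transcode in fixed-size chunks and append each chunk to the result.
std::string toUtf8(const std::u16string& text) {
    unsigned char buf[16];  // deliberately small to force several passes
    std::string result;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t eaten = 0;
        std::size_t written = encodeChunk(text.data() + pos, text.size() - pos,
                                          buf, sizeof buf, eaten);
        assert(written <= sizeof buf);
        if (eaten == 0) break;  // no progress; avoid looping forever
        result.append(reinterpret_cast<const char*>(buf), written);
        pos += eaten;
    }
    return result;
}
```

No worst-case allocation is needed here, because the result string grows
only by the bytes each pass actually produced.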

Dave

Re: Converting XMLCh* to std::string with encoding

Posted by pundog <pu...@gmail.com>.
Hi,
I tried the code you posted. I only modified the line in which you create
the XMLByte*: instead of using "()" I replaced it with "[]". (When I used
the "()", an exception was thrown when I tried to delete the XMLByte*.)

The problem I've got now is that this code causes a memory leak. When I
tried to run it in a "while(true)" loop, it produced a serious leak. How can
I fix it?

And another thing, is it OK to convert an XMLByte* to char*? Or is there a
better way of converting an XMLByte* to a std::string?

Thanks again :)




Re: Converting XMLCh* to std::string with encoding

Posted by David Bertoni <db...@apache.org>.
pundog wrote:
> Thanks for the quick reply,
> I searched the archive and came up with this:
> 
> const XMLCh* text = MY_HEBREW_TEXT; // initialized by the parser
> 
> XMLTranscoder* utf8Transcoder;
> XMLTransService::Codes failReason;
> utf8Transcoder =
> XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason,
> 16*1024);
> 
> int len = XMLString::stringLen(text);
> XMLByte* utf8 = new XMLByte(); // ?
> unsigned int eaten;
> unsigned int utf8Len = utf8Transcoder->transcodeTo(text, len, utf8, len,
> eaten, XMLTranscoder::UnRep_Throw);
> 
> utf8[utf8Len] = '\0';
> string str = (char*)utf8;
> 
> return str;
> 
> It looks like it works, but the problem is that I'm getting a serious memory
> leak from this code, and I don't really know why.
> I tried to delete the XMLByte*, but when I try to do that, I get a nasty
> exception.
The transcoder does not allocate a target buffer for transcoding.  Please 
make sure you read the comments for any functions you try to use:

/** Converts from the encoding of the service to the internal XMLCh* encoding
   *
   * @param srcData the source buffer to be transcoded
   * @param srcCount number of bytes in the source buffer
   * @param toFill the destination buffer
   * @param maxChars the max number of characters in the destination buffer

Since you allocated a single byte, but probably passed in a larger value, 
your code suffers from a buffer overrun error.  The exception is probably a 
result of your code trashing some heap control information.  Or perhaps you 
used "delete", instead of "delete []".

Search the code for other uses of this functionality, because it's more 
complicated than just making a single call to the transcoder, if you want 
reasonable efficiency.

If you want a simple, but potentially inefficient implementation, you can 
just assume 4 bytes of UTF-8 for every XMLCh of the input and allocate a 
buffer accordingly.

size_t len = XMLString::stringLen(text);
XMLByte* utf8 = new XMLByte((len * 4) + 1); // ?
unsigned int eaten;
unsigned int utf8Len = utf8Transcoder->transcodeTo(text, len, utf8, len * 4,
eaten, XMLTranscoder::UnRep_Throw);

  utf8[utf8Len] = '\0';
  string str = (char*)utf8;

  delete [] utf8;

Dave

Re: Converting XMLCh* to std::string with encoding

Posted by pundog <pu...@gmail.com>.
Thanks for the quick reply,
I searched the archive and came up with this:

const XMLCh* text = MY_HEBREW_TEXT; // initialized by the parser

XMLTranscoder* utf8Transcoder;
XMLTransService::Codes failReason;
utf8Transcoder =
XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason,
16*1024);

int len = XMLString::stringLen(text);
XMLByte* utf8 = new XMLByte(); // ?
unsigned int eaten;
unsigned int utf8Len = utf8Transcoder->transcodeTo(text, len, utf8, len,
eaten, XMLTranscoder::UnRep_Throw);

utf8[utf8Len] = '\0';
string str = (char*)utf8;

return str;

It looks like it works, but the problem is that I'm getting a serious memory
leak from this code, and I don't really know why.
I tried to delete the XMLByte*, but when I try to do that, I get a nasty
exception.

Is the code I wrote here supposed to work? And how can I release the
memory allocated by this code?
Thanks again.




Re: Converting XMLCh* to std::string with encoding

Posted by David Bertoni <db...@apache.org>.
pundog wrote:
> I've got an XML file that contains some Hebrew characters instead of English.
> I made sure that I saved the file in UTF-8 encoding (using VS.Net), and I
> also added an XML declaration line with the encoding attribute set to
> UTF-8.
> 
> When I parse the XML, I get an XMLCh* that contains my Hebrew characters,
> and when I browse its value in the debugger, I can see the actual Hebrew
> characters. However, when I try to convert the XMLCh* to a std::string using
> the "transcode" method, my string is filled with "????" characters instead
> of Hebrew.

You need to read the documentation regarding what transcode() does.  Since 
it transcodes to the local code page, it's likely your code page doesn't 
support those characters.  You probably want to transcode to UTF-8, instead 
of the local code page.
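
As a rough illustration of what transcoding to UTF-8 produces, here is a
minimal sketch that uses only the standard library. utf16ToUtf8 is a
hypothetical hand-written encoder standing in for a UTF-8 transcoder, not a
Xerces API. Each Hebrew letter becomes two bytes, which a single-byte code
page without Hebrew could not represent; that would be consistent with the
"????" output.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode UTF-16 code units as UTF-8. Unpaired surrogates are replaced
// with U+FFFD.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    // At most 3 bytes come out per UTF-16 code unit (a surrogate pair is
    // 2 units and 4 bytes), so a len * 4 buffer is always large enough.
    assert(out.size() <= in.size() * 3);
    return out;
}
```

For the four-letter word U+05E9 U+05DC U+05D5 U+05DD this returns the eight
bytes D7 A9 D7 9C D7 95 D7 9D, while plain ASCII text passes through
unchanged.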

If you search the archives of the mailing list, you'll find many postings 
regarding this issue.

Dave