Posted to c-users@xerces.apache.org by Matthew Boulter <mb...@technisyst.com.au> on 2008/09/09 07:21:20 UTC
Losing UTF-8 characters at the end of a string
Hi all, I would like some guidance on where to focus my investigation
of this problem.
I have a MySQL database that contains the names of some Polish tram stops, which I am
extracting and encoding as WBXML for transmission.
The names are intact when they come out of the database; the problem appears once they
reach our DomToWbxml task.
If a string contains one Polish character, it loses one character from the end;
if it contains two, it loses two, and so on.
I read that Xerces uses UTF-16 internally. If so, am I losing something (other than my mind) going back to UTF-8?
Any help is greatly appreciated.
Regards,
Matthew
Below is an extract from a logfile:
Next stop name == Łukowa
In operator () - node_line.c_str() == 40,676,51.71878333,19.43771667,busstop,620,5,80,25,10,exit,none,,,never, Łukowa
DomToWbxml::Process
AddContent - c_str(40,676,51.71878333,19.43771667,busstop,620,5,80,25,10,exit,none,,,never, Łukow)
AddContent - none (40,676,51.71878333,19.43771667,busstop,620,5,80,25,10,exit,none,,,never, Łukow)
--[Into DOM Code Extract]-----------------------
DOM_Element node_element = vss_doc->createElement("node");
cout << "In operator () - node_line.c_str() == " << node_line.c_str() << endl;
node_element.appendChild(vss_doc->createTextNode(node_line.c_str()));
service_element.appendChild(node_element);
--[/Into DOM Code Extract]-----------------------
There is obviously a lot of code not shown, but I believe the issue is either in the extract above
or in the hand-off to the WBXML parser:
--[Into WBXML Code Extract]-----------------------
bool WbxmlParser::convert_to_wbxml(const DOM_Node & node)
{
    string node_name = transcode( node.getNodeName() );
    string node_value = transcode( node.getNodeValue() );

Using ...

string WbxmlParser::transcode(const DOMString& xstr)
{
    if (xstr == 0)
        return string();
    char * cstr = xstr.transcode();
    string result(cstr);
    delete [] cstr;
    return result;
}
--[/Into WBXML Code Extract]-----------------------
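One possible reading of the log extract in the message above, not confirmed anywhere in this thread: "Łukowa" is 6 characters but 7 bytes in UTF-8, because Ł (U+0141) encodes as two bytes. If some step measures a length in characters and then keeps that many bytes, one trailing character disappears for every two-byte character, which would match the symptom. The sketch below shows only that arithmetic in standard C++. The helper names (utf8_code_points, keep_first_bytes, kStopName) are hypothetical, and none of this is Xerces-C API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string: every byte except a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t utf8_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char ch : utf8) {
        if ((ch & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical bug, for illustration only: keep `length` bytes of the
// string when `length` was really measured in characters.
std::string keep_first_bytes(const std::string& utf8, std::size_t length)
{
    assert(length <= utf8.size());
    return utf8.substr(0, length);
}

// "Łukowa" written as explicit UTF-8 bytes (U+0141 encodes as C5 81).
const std::string kStopName = "\xC5\x81ukowa";
```

With kStopName, size() is 7 while utf8_code_points returns 6, and keep_first_bytes(kStopName, 6) yields "Łukow", the same truncation that shows up in the AddContent log lines.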
Re: Losing UTF-8 characters at the end of a string
Posted by David Bertoni <db...@apache.org>.
Matthew Boulter wrote:
> Hi all, I would like some guidance on where to focus my investigation
> of this problem.
>
> I have a MySQL database that contains the names of some Polish tram stops, which I am
> extracting and encoding as WBXML for transmission.
>
> The names are intact when they come out of the database; the problem appears once they
> reach our DomToWbxml task.
>
> If a string contains one Polish character, it loses one character from the end;
> if it contains two, it loses two, and so on.
>
> I read that Xerces uses UTF-16 internally. If so, am I losing something (other than my mind) going back to UTF-8?
>
> Any help is greatly appreciated.
This is probably the number one problem people experience when using
Xerces-C.
Please read the documentation carefully, as the transcoding API you're
using is _not_ transcoding to UTF-8. Rather, it is transcoding to the
local code page, so the disappearing characters are probably not
representable in the local code page. Instead of using
DOMString::transcode(), you need to create a UTF-8 transcoder and use that.
Also, you're using the deprecated DOM, which will disappear in Xerces-C
3.0. I would suggest you update your code to use the new DOM.
For more information, please search the mailing list archives for
"transcoding." Here's a good place to start:
http://marc.info/?l=xerces-c-users&m=119514889329902&w=2
Dave
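To illustrate what a UTF-8 transcoder does, as opposed to a conversion to the local code page, here is a minimal sketch in standard C++ that encodes UTF-16 code units as UTF-8 bytes. It is not the Xerces-C transcoder API: the function name utf16_to_utf8 is hypothetical, replacing unpaired surrogates with U+FFFD is an assumption of this sketch, and real code should use the library's own UTF-8 transcoder as the reply advises.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode UTF-16 code units as UTF-8 bytes.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    out.reserve(in.size() * 3);  // upper bound: 3 bytes per UTF-16 code unit

    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with a following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no preceding high surrogate
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // Two bytes: this is the range that covers Polish letters such as U+0141.
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // The output is sized in bytes, never in input characters.
    assert(out.size() <= in.size() * 3);
    return out;
}
```

For example, utf16_to_utf8(u"\u0141ukowa") returns the 7 bytes C5 81 75 6B 6F 77 61 for a 6-character input. The point of the design is that the result length comes from the bytes actually produced, so a two-byte character cannot push trailing characters off the end.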