Posted to c-users@xerces.apache.org by Matthew Boulter <mb...@technisyst.com.au> on 2008/09/09 07:21:20 UTC

Losing UTF-8 characters at the end of a string

Hi all, I'd like some guidance on where to focus my investigation
of this problem.

 

I have a MySQL database that contains names of some Polish tram stops that I am 
extracting and encoding as WBXML for transmission.

The names are intact when I read them from the database; the problem only appears
once they reach our DomToWbxml task.

 

If the string contains one Polish character, it loses one character from the end of the string;
if it contains two, it loses two, and so on.

I read that Xerces is UTF-16? If so, am I losing something (other than my mind) going back to UTF-8?

 

Any help is greatly appreciated.

Regards, 

Matthew

 

 

Below is an extract from a logfile:

 

Next stop name == Łukowa

In operator () - node_line.c_str() == 40,676,51.71878333,19.43771667,busstop,620,5,80,25,10,exit,none,,,never, Łukowa

DomToWbxml::Process

AddContent - c_str(40,676,51.71878333,19.43771667,busstop,620,5,80,25,10,exit,none,,,never, Łukow)

AddContent - none (40,676,51.71878333,19.43771667,busstop,620,5,80,25,10,exit,none,,,never, Łukow)
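
One mechanism that would produce exactly this log output is a length mismatch: "Łukowa" is 6 characters but 7 bytes in UTF-8, so any step that measures the string in characters and then copies that many bytes drops one trailing byte per two-byte character. This is only a guess at the cause, not something the thread confirms. The sketch below uses made-up helper names (`utf8_code_points`, `copy_by_char_count`) and no Xerces code; it only reproduces the symptom.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical buggy copy (an assumption for illustration): the length is
// measured in characters, but that many *bytes* are copied.
std::string copy_by_char_count(const std::string& utf8) {
    return utf8.substr(0, utf8_code_points(utf8));
}
```

Passing the UTF-8 bytes of "Łukowa" to `copy_by_char_count` yields the bytes of "Łukow", matching the AddContent lines above.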

 

--[Into DOM Code Extract]-----------------------

 

        DOM_Element node_element = vss_doc->createElement("node");
        cout << "In operator () - node_line.c_str() == " << node_line.c_str() << endl;
        node_element.appendChild(vss_doc->createTextNode(node_line.c_str()));
        service_element.appendChild(node_element);

 

--[/Into DOM Code Extract]-----------------------

 

There is obviously a lot of code not shown, but I believe the above is where the issue is,
or perhaps it occurs on the way into the WBXML parser:

 

--[Into WBXML Code Extract]-----------------------

 

bool WbxmlParser::convert_to_wbxml(const DOM_Node & node)
{
  string node_name  = transcode( node.getNodeName() );
  string node_value = transcode( node.getNodeValue() );

 

Using ...

 

string WbxmlParser::transcode(const DOMString& xstr)
{
  if (xstr == 0)
    return string();

  char * cstr = xstr.transcode();
  string result(cstr);
  delete [] cstr;
  return result;
}

 

--[/Into WBXML Code Extract]-----------------------

Re: Losing UTF-8 characters at the end of a string

Posted by David Bertoni <db...@apache.org>.

Matthew Boulter wrote:
> Hi all, I'd like some guidance on where to focus my investigation
> of this problem.
> 
> I have a MySQL database that contains names of some Polish tram stops that I am 
> extracting and encoding as WBXML for transmission.
> 
> The names are intact when I read them from the database; the problem only appears
> once they reach our DomToWbxml task.
> 
> If the string contains one Polish character, it loses one character from the end of the string;
> if it contains two, it loses two, and so on.
> 
> I read that Xerces is UTF-16? If so, am I losing something (other than my mind) going back to UTF-8?
> 
> Any help is greatly appreciated.
This is probably the number one problem people experience when using 
Xerces-C.

Please read the documentation carefully, as the transcoding API you're 
using is _not_ transcoding to UTF-8.  Rather, it is transcoding to the 
local code page, so the disappearing characters are probably not 
representable in the local code page.  Instead of using 
DOMString::transcode(), you need to create a UTF-8 transcoder and use that.
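
To show what a UTF-8 transcoder does with Xerces' UTF-16 data, here is a minimal standard-library sketch of the UTF-16 to UTF-8 step. It is not the Xerces API: the function name `utf16_to_utf8` is made up for illustration, and `std::u16string` stands in for a buffer of XMLCh code units. In real code you would ask the library's transcoding service for a "UTF-8" transcoder rather than hand-rolling the conversion.

```cpp
#include <cstddef>
#include <string>

// Encode a sequence of UTF-16 code units as UTF-8 bytes.
// Hypothetical helper for illustration; not part of Xerces-C.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: substitute U+FFFD.
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);                       // 1 byte
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));         // 2 bytes
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));        // 3 bytes
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));        // 4 bytes
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this encoding, "Łukowa" (U+0141 followed by five ASCII letters) becomes seven bytes, C5 81 then "ukowa", so no character has to be representable in the local code page.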

Also, you're using the deprecated DOM, which will disappear in Xerces-C 
3.0.  I would suggest you update your code to use the new DOM.

For more information, please search the mailing list archives for 
"transcoding."  Here's a good place to start:

http://marc.info/?l=xerces-c-users&m=119514889329902&w=2

Dave