Posted to c-users@xerces.apache.org by Robert Parker <Ro...@evolving.com> on 2005/06/07 13:53:21 UTC
utf-8 encoded attribute values
Hi
I am parsing an XML string that is encoded in UTF-8 and I am using the
following code to view element attributes:
DOM_NamedNodeMap NodeMap = node.getAttributes();
if ( NodeMap != NULL) {
    unsigned int len = NodeMap.getLength();
    for ( unsigned int i = 0; i < len; ++i) {
        DOM_Node attr = NodeMap.item(i);

        // Attribute name, transcoded to the local code page
        DOMString tag = attr.getNodeName();
        char *t = tag.transcode();
        printf (" %s=", t );
        delete [] t;

        // Attribute value, transcoded the same way
        DOMString value = attr.getNodeValue();
        t = value.transcode();
        printf ("%s\n", t );
        delete [] t;

        // Dump each 16-bit code unit of the untranscoded value
        for ( unsigned int j = 0; j < value.length() ; j++ )
        {
            printf( " AT %u %c %02x\n", j, value.charAt(j), value.charAt(j) );
        }
    }
}
Both the transcode'd value and the "raw" value.charAt() show my parsed
attribute value as Latin-1.
It seems to me that Xerces converts the UTF-8 encoded attribute values
during the parse.
How can I get Xerces to return the actual UTF-8 encoded data rather than
the Latin-1 representation?
(I am using Xerces 1.5.2! I know it's old, but I'm trying to avoid a
massive upgrade exercise if at all possible.)
thanks
Robert
______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email
______________________________________________________________________
Re: utf-8 encoded attribute values
Posted by Alberto Massari <am...@datadirect.com>.
Hi Robert,
once the file has been parsed, all you see is 16-bit Unicode values
(UTF-16); if you deal only with English text, these will look the same as
Latin-1.
Usually you shouldn't care about seeing the original UTF-8 sequence, as you
should be interested in the actual character being represented; but if you
need it for a valid reason, you should instantiate the UTF8Transcoder and
tell it to transcode from Unicode to UTF-8.
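
To make the Unicode-to-UTF-8 step concrete, here is a minimal standalone sketch of what such a conversion does. It does not use the Xerces transcoder classes at all: the names Utf16Unit and utf16ToUtf8 are hypothetical, Utf16Unit is only assumed to stand in for the 16-bit values that charAt() returns, and replacing unpaired surrogates with U+FFFD is a choice made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one 16-bit UTF-16 code unit
// (the kind of value the parsed DOM hands back per character).
typedef unsigned short Utf16Unit;

// Encode a sequence of UTF-16 code units as UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16ToUtf8(const std::vector<Utf16Unit>& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate, if any.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            cp = 0xFFFD;
        }
        assert(cp <= 0x10FFFF);

        if (cp < 0x80) {
            // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: covers the rest of the Latin-1 range, e.g. U+00E9
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: rest of the Basic Multilingual Plane
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: code points that needed a surrogate pair
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the four code units 0x63 0x61 0x66 0xE9 ("café") come out as the five bytes 63 61 66 C3 A9. The single value 0xE9 that charAt() shows is the character itself, which is why it looks like Latin-1; C3 A9 is its UTF-8 encoding.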
Alberto