Posted to c-users@xerces.apache.org by Robert Parker <Ro...@evolving.com> on 2005/06/07 13:53:21 UTC

utf-8 encoded attribute values

Hi
 
I am parsing an XML string that is encoded in UTF-8 and I am using the
following code to view element attributes:
 
    DOM_NamedNodeMap NodeMap    = node.getAttributes();
    if ( NodeMap != NULL) {

        unsigned int len = NodeMap.getLength();
        for ( unsigned int i = 0; i < len; ++i) {
            DOM_Node attr = NodeMap.item(i);

            // Attribute name, transcoded to the local code page
            DOMString tag = attr.getNodeName();
            char *t = tag.transcode();
            printf ("    %s=", t );
            delete [] t;

            // Attribute value, transcoded to the local code page
            DOMString value     = attr.getNodeValue();
            t = value.transcode();
            printf ("%s\n", t );
            delete [] t;

            // Dump each 16-bit unit of the untranscoded value
            // (separate index j so the outer loop's i is not shadowed)
            for ( unsigned int j = 0; j < value.length() ; j++ ) {
                printf( " AT %u %c %02x\n", j, value.charAt(j), value.charAt(j) );
            }
        }
    }
 
Both the transcoded value and the "raw" value.charAt() output show my
parsed attribute value as latin-1.
 
It seems to me that Xerces converts the UTF-8 encoded attribute values
during the parse. 
How can I get Xerces to return the actual UTF-8 encoded data rather than
the latin-1 representation?
 
(I am using Xerces 1.5.2. I know it's old, but I'm trying to avoid a
massive upgrade exercise if at all possible.)
 
thanks
Robert 
 


______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
______________________________________________________________________

Re: utf-8 encoded attribute values

Posted by Alberto Massari <am...@datadirect.com>.

Hi Robert,
once the file has been parsed, all you see is 16-bit Unicode values 
(UTF-16). For characters in the range U+0000 to U+00FF these values are 
numerically identical to latin-1, so typical Western text will look the 
same as latin-1.
Usually you shouldn't care about seeing the original UTF-8 sequence, as you 
should be interested in the actual character being represented; but if you 
need it for a valid reason, you should instantiate the UTF8Transcoder and 
tell it to transcode from Unicode to UTF-8.

Alberto
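
To make the Unicode-to-UTF-8 step concrete, here is a minimal standalone
sketch of the conversion a UTF-8 transcoder performs on the 16-bit values
the parser hands back. It deliberately does not use any Xerces calls: the
helper name utf16ToUtf8 and the demo function are hypothetical, and the
16-bit units are assumed to have been copied out of the DOMString (for
example with charAt()). In a real program the library's own transcoder
would do this work; the exact calls depend on the Xerces version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not a Xerces API): encode UTF-16 code units as UTF-8.
// Unpaired surrogates are passed through unchecked; a real transcoder
// would reject or replace them.
std::string utf16ToUtf8(const std::vector<unsigned short>& units)
{
    std::string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        unsigned long cp = units[i];

        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.size()) {
            unsigned long lo = units[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }

        if (cp < 0x80) {                // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {        // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {      // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                        // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Demo: the parser reports U+00E9 (which equals latin-1 0xE9);
// re-encoding it yields the two UTF-8 bytes from the input document.
void demoUtf8Bytes()
{
    std::vector<unsigned short> value;
    value.push_back(0x00E9);

    std::string bytes = utf16ToUtf8(value);
    assert(bytes.size() == 2);
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::printf("%02x ", static_cast<unsigned char>(bytes[i]));
    }
    std::printf("\n"); // prints "c3 a9 "
}
```

This also shows why the parsed value looks like latin-1: the single value
0xE9 that charAt() reports is the decoded character, while the two bytes
c3 a9 exist only in the encoded form.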
