Posted to j-users@xerces.apache.org by "Johnson, Wayne" <Wa...@bmc.com> on 2014/10/02 17:26:21 UTC

Problem writing UTF-8 XML with an Umlaut

I have a Java program that is writing information from a database to an XML file.  I create a DOM document, add an element, and set the value with:

    Element id=parent.createElement("EventRuleInputDefinition");
...
    id.setAttribute("Value", getVal());

Later, I write the Document with:

        StringWriter sw = new StringWriter();
...
            TransformerFactory transformerFactory =
                TransformerFactory.newInstance();
...
                Transformer transformer = transformerFactory.newTransformer();
                transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                // Puts each stanza on a new line
                transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                DOMSource source = new DOMSource(node); // node is the document root.
                StreamResult result = new StreamResult(sw);
                transformer.transform(source, result);

The XML file is properly generated with the header:
<?xml version="1.0" encoding="UTF-8"?>
But the element is written with an actual umlaut character, which generates this error when the file is read back in:

[Fatal Error] :564:103: Invalid byte 1 of 1-byte UTF-8 sequence.
org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.

We're using Xerces 2.8.1 (don't laugh, I know it's a bit old).  Could this be an issue in Xerces, or am I doing something wrong?

Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


RE: Problem writing UTF-8 XML with an Umlaut

Posted by "Johnson, Wayne" <Wa...@bmc.com>.
Thanks for the response.

The missing piece is 
        return sw.toString();

I changed sw to be a ByteArrayOutputStream and it seemed to do the trick.  
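
For readers of the archive, here is a minimal sketch of that change, assuming the Transformer setup from the original post. The class name, the helper methods, and the sample attribute value are hypothetical and only serve to make the example self-contained; the real code builds its Document from database rows.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class Utf8XmlWriteSketch {

    // Serializes to a byte stream so the Transformer itself produces
    // the UTF-8 bytes that the XML declaration promises.
    static byte[] toUtf8Bytes(Document doc) {
        try {
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    // Hypothetical sample document with an umlaut in an attribute value.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element id = doc.createElement("EventRuleInputDefinition");
            id.setAttribute("Value", "M\u00fcller");
            doc.appendChild(id);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build document", e);
        }
    }

    // Parses the bytes back in; the parser reads the declared encoding.
    static Document parse(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        Document reread = parse(toUtf8Bytes(sampleDocument()));
        String value = reread.getDocumentElement().getAttribute("Value");
        // True when the umlaut survives the write/read round trip.
        System.out.println("M\u00fcller".equals(value));
    }
}
```

The bytes returned by toUtf8Bytes can be written to a file unchanged. Converting them to a String and back would bring the charset question up again.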

Thanks.

-----Original Message-----
From: Michael Glavassevich [mailto:mrglavas@ca.ibm.com] 
Sent: Thursday, October 02, 2014 11:09 AM
To: j-users@xerces.apache.org
Subject: Re: Problem writing UTF-8 XML with an Umlaut

In the code snippet you've shown you're writing to a StringWriter. That is a character stream which collects its output into a StringBuffer so you're not actually writing UTF-8 byte sequences anywhere here. Perhaps there's some conversion code (which you haven't shown) which takes that String and encodes it into the bytes of some other encoding that isn't UTF-8.
It's an error if your document declares that it has a certain encoding (e.g. UTF-8) but is encoded as something else (e.g. Windows-1252).

Unless you have a good reason to be using the StringWriter I'd recommend using an OutputStream instead. That would give the Transformer the responsibility of getting the encoding right.

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org




Re: Problem writing UTF-8 XML with an Umlaut

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
In the code snippet you've shown you're writing to a StringWriter. That is 
a character stream which collects its output into a StringBuffer so you're 
not actually writing UTF-8 byte sequences anywhere here. Perhaps there's 
some conversion code (which you haven't shown) which takes that String 
and encodes it into the bytes of some other encoding that isn't UTF-8. 
It's an error if your document declares that it has a certain encoding 
(e.g. UTF-8) but is encoded as something else (e.g. Windows-1252).
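
As a small illustration of that mismatch, the sketch below encodes the same umlaut two ways. It uses ISO-8859-1 as a stand-in for Windows-1252, an assumption made only because ISO-8859-1 is guaranteed to exist on every JVM; both encode U+00FC as the single byte FC. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchSketch {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String umlaut = "\u00fc"; // LATIN SMALL LETTER U WITH DIAERESIS

        // UTF-8 needs two bytes for this character.
        System.out.println(hex(umlaut.getBytes(StandardCharsets.UTF_8)));      // C3 BC

        // A single-byte Latin encoding uses one byte, which is not
        // a valid UTF-8 sequence on its own.
        System.out.println(hex(umlaut.getBytes(StandardCharsets.ISO_8859_1))); // FC
    }
}
```

A parser that trusts the encoding="UTF-8" declaration and then meets a lone FC byte has no valid way to decode it, which is consistent with the "Invalid byte 1 of 1-byte UTF-8 sequence" error.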

Unless you have a good reason to be using the StringWriter I'd recommend 
using an OutputStream instead. That would give the Transformer the 
responsibility of getting the encoding right.
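
If there is a good reason to keep the StringWriter, one possible approach is to name the charset explicitly wherever the String is turned into bytes, rather than relying on the platform default. This is only a sketch under that assumption; the thread does not show how the String was actually written out, and the class, method names, and temp file below are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetSketch {

    // Writes already-serialized XML text as UTF-8, independent of the
    // platform default charset.
    static void writeUtf8(Path target, String xml) {
        try {
            Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes to a temporary file and returns the bytes that landed on disk.
    static byte[] roundTrip(String xml) {
        try {
            Path tmp = Files.createTempFile("utf8-sketch", ".xml");
            try {
                writeUtf8(tmp, xml);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] written = roundTrip("<a v=\"\u00fc\"/>");
        // 10 characters become 11 bytes because the umlaut takes two.
        System.out.println(written.length);
    }
}
```

The same idea applies to a Writer: wrapping the output stream in an OutputStreamWriter constructed with StandardCharsets.UTF_8 avoids the default-charset FileWriter behaviour.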

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

