Posted to user@struts.apache.org by Ashish Kulkarni <as...@gmail.com> on 2007/04/16 20:12:22 UTC

[OT] How to handle non UTF characters in XML

Hi,
I have a Java class which creates an XML file from a SQL result set.
It works fine in the USA, but I am having issues when this process runs in
Germany, where they have non UTF characters in their database like ü or á.
How do we handle this kind of situation in an XML file? I set the XML file to
be of UTF-8 type.

The Java code which creates the XML file is below:

Document document = builder.newDocument();
Element root = document.createElement(rootElement);
document.appendChild(root);
// create element with ResultSetMetaData Name
Element record = document.createElement(rm.getColumnName(i));

// add text node with the actual value
record.appendChild(document.createTextNode(rs.getString(k)));


Ashish

Re: [OT] How to handle non UTF characters in XML

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ashish,

Ashish Kulkarni wrote:
> Here is the code where I read the DOM tree and convert it to a String,
> then convert this String into a byte array, and then use
> DocumentBuilder().parse to parse it.

Silly question... why are you doing that? Is this just a test?

> I get error in factory.newDocumentBuilder().parse(byteArray);
> 
> TransformerFactory tFactory =
>            TransformerFactory.newInstance();
>        Transformer transformer = tFactory.newTransformer();
>        StringWriter writer = new StringWriter();
>        DOMSource source = new DOMSource(doc);
>        transformer.transform(source, new StreamResult(writer));
>        String obj = writer.toString();

Everything up to here is fine, except that there might be a problem with
the XML emitter knowing what type of encoding you /will be/ using. Note
that there are no encoding issues handled above, but you are converting
the DOM tree into a String of characters. How does your XML library know
what to put into "<?xml encoding="????" ?>"?

> ByteArrayInputStream byteArray = new ByteArrayInputStream(obj.getBytes());

Here's one problem: you fail to specify the character encoding of the
bytes returned by String.getBytes(). You should explicitly pass "UTF-8"
(or something like that... ISO-8859-1 should work, too, as that includes
Latin characters like the ones you mentioned).

If you don't specify the character encoding (see the javadoc for
String.getBytes()), I believe you get the default file encoding of the
currently running JVM, which could be anything :(
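A minimal sketch of the difference, for illustration only. The class name
GetBytesDemo and the hex() helper are made up for this example, and
StandardCharsets needs Java 7 or later (on older JVMs you would pass the
charset name as a String instead).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the code discussed in this thread.
class GetBytesDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00FC is the u-umlaut; the escape keeps this source file pure ASCII.
        String text = "\u00FC";

        // Explicit charsets give the same bytes on every machine.
        System.out.println("UTF-8:      " + hex(text.getBytes(StandardCharsets.UTF_8)));      // C3 BC
        System.out.println("ISO-8859-1: " + hex(text.getBytes(StandardCharsets.ISO_8859_1))); // FC

        // No charset argument: the result depends on the JVM's default charset.
        System.out.println("default (" + Charset.defaultCharset() + "): " + hex(text.getBytes()));
    }
}
```

The last line is the one that can differ between a machine in the USA and
one in Germany, which is why the explicit forms are the safer habit.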

> Document doc = factory.newDocumentBuilder().parse(byteArray);

This should work as long as you have properly converted characters into
bytes above. The DOM parser ought to sniff the encoding from the
"encoding" attribute of the XML declaration.
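Here is a rough sketch of the same round trip with the encoding pinned at
both steps. It assumes UTF-8 is what you want; the class name RoundTripDemo
and the element name "city" are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: DOM -> String -> bytes -> DOM with one agreed encoding.
class RoundTripDemo {
    static String roundTrip(String value) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement("city"); // made-up element name
            root.appendChild(doc.createTextNode(value));
            doc.appendChild(root);

            // Tell the serializer which encoding to name in the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));

            // Convert characters to bytes using the SAME encoding the declaration names.
            byte[] bytes = writer.toString().getBytes(StandardCharsets.UTF_8);

            Document parsed = builder.parse(new ByteArrayInputStream(bytes));
            return parsed.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String original = "M\u00FCnchen";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

The only real change from the quoted code is that getBytes() now receives a
charset matching the declaration, so the parser sees the bytes it was told
to expect.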

- -chris

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGJAFK9CaO5/Lv0PARAs1gAJ9a3udjyBpZtJ74VFFx4ldTcJ8nqgCfVShQ
UiYGO31v+SdQCsKaeam1KvI=
=nQFP
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@struts.apache.org
For additional commands, e-mail: user-help@struts.apache.org


Re: [OT] How to handle non UTF characters in XML

Posted by Joe Germuska <jo...@germuska.com>.
See, the problem is that you're not handling the character encoding
correctly in general.  You should use String's getBytes method only when you
know what you're doing, because the whole point of character encodings is
that you can represent any given string with different sequences of bytes.

I'd suggest doing more research on encoding in general. Here's one popular
piece, although not Java-centric:
The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)
<http://www.joelonsoftware.com/articles/Unicode.html>

From there, you may want to review the APIs for java.io.Reader and
java.io.Writer, which are specifically designed to help smooth over the
issues involved in serializing Java strings to bytes.
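As a small illustration of those APIs (a sketch, not the only way to do it),
OutputStreamWriter and InputStreamReader both take the charset explicitly.
The class and method names below are invented for the example, and in-memory
streams stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing charset-aware Writer and Reader usage.
class ReaderWriterDemo {
    // Characters -> bytes through a Writer bound to a specific charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // writer is closed (and flushed) by now
    }

    // Bytes -> characters through a Reader bound to a specific charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E1"; // a-acute
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);

        // Same charset on both sides: the text survives.
        System.out.println(text.equals(decode(utf8, StandardCharsets.UTF_8)));
        // Different charset on the reading side: two wrong characters come back.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

The same string becomes two bytes in UTF-8 and one byte in ISO-8859-1, so
the reading side has to be told which one was used.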

This looks like it's going way too far off topic to be something that should
be discussed much further on the Struts list.

Best,
  Joe




-- 
Joe Germuska
Joe@Germuska.com * http://blog.germuska.com

"The truth is that we learned from João forever to be out of tune."
-- Caetano Veloso

Re: [OT] How to handle non UTF characters in XML

Posted by Ashish Kulkarni <as...@gmail.com>.
Hi,
Here is the code where I read the DOM tree and convert it to a String,
then convert this String into a byte array, and then use
DocumentBuilder().parse to parse it.

I get an error in factory.newDocumentBuilder().parse(byteArray);


TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
StringWriter writer = new StringWriter();
DOMSource source = new DOMSource(doc);
transformer.transform(source, new StreamResult(writer));
String obj = writer.toString();
ByteArrayInputStream byteArray = new ByteArrayInputStream(obj.getBytes());
Document doc = factory.newDocumentBuilder().parse(byteArray);


Ashish

Re: [OT] How to handle non UTF characters in XML

Posted by Joe Germuska <jo...@germuska.com>.
On 4/16/07, Christopher Schultz <ch...@christopherschultz.net> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Ashish,
>
> Ashish Kulkarni wrote:
> > I have a Java class which creates an XML file from SQL resultset,
> > It works fine in USA, but I am having issues when this process runs in
> > Germany where they have non UTF characters in their database like ü or á.
>
> I think you mean non-lower-ASCII. These characters are certainly covered
> by UTF-8.
>
> > How do we handle this kind of situation in XML file, I set the XML file to
> > be of UTF-8 type.
>
> How do you set the file "type" to UTF-8?


I'm assuming Ashish is talking about the "encoding" attribute of the XML
declaration in the first line of the file.

Chris is correct that the real magic happens when you serialize the DOM to a
file, but you should be sure to use the same encoding with the writer that
actually creates the file as you do in the XML declaration.  If the bytes
you write aren't UTF-8, then don't declare UTF-8.  Any decent XML reading
software will recognize the encoding when the file is read.
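To make that concrete, here is a small sketch in which the declaration and
the actual bytes either agree or disagree. The class name, method name, and
sample document are all invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: the parser trusts whatever the declaration says.
class DeclaredEncodingDemo {
    // Declares one encoding in the XML text, encodes the text with `actual`,
    // then parses the bytes and returns the root element's text.
    static String parseText(String declared, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>"
                + "<name>J\u00FCrgen</name>"; // made-up sample content
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(actual)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Declaration and bytes agree: both variants read back correctly.
        System.out.println(parseText("UTF-8", StandardCharsets.UTF_8));
        System.out.println(parseText("ISO-8859-1", StandardCharsets.ISO_8859_1));

        // Declaration says ISO-8859-1 but the bytes are UTF-8: no error is
        // raised, yet the umlaut comes back as two wrong characters.
        System.out.println(parseText("ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

The opposite mismatch (declaring UTF-8 while writing ISO-8859-1 bytes) is
the one that usually shows up as a parse error, because a lone 0xFC byte is
not valid UTF-8.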

Joe

-- 
Joe Germuska
Joe@Germuska.com * http://blog.germuska.com

"The truth is that we learned from João forever to be out of tune."
-- Caetano Veloso

Re: [OT] How to handle non UTF characters in XML

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ashish,

Ashish Kulkarni wrote:
> I have a Java class which creates an XML file from SQL resultset,
> It works fine in USA, but I am having issues when this process runs in
> Germany where they have non UTF characters in their database like ü or á.

I think you mean non-lower-ASCII. These characters are certainly covered
by UTF-8.

> How do we handle this kind of situation in XML file, i set the XML file to
> be of UTF-8 type.

How do you set the file "type" to UTF-8?

> the java code which creates the XML file is as below

This code is not relevant to the character encoding. How do you convert
your DOM tree to XML text? That's where the magic happens. Do you create
your own Writer object?
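For what it's worth, one way to sidestep the Writer question is to hand the
serializer an OutputStream and let it do the character encoding itself. The
following is only a sketch: the class and method names and the "record"
element are made up, and a ByteArrayOutputStream stands in for the
FileOutputStream you would use for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: DOM -> bytes with no intermediate String.
class DomSerializeDemo {
    // Builds a one-element document and serializes it in the given encoding.
    static byte[] serialize(String value, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element record = doc.createElement("record"); // made-up element name
            record.appendChild(doc.createTextNode(value));
            doc.appendChild(record);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // One setting drives both the XML declaration and the byte encoding.
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    // Parses serialized bytes and returns the root element's text.
    static String parse(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String value = "Stra\u00DFe";
        System.out.println(value.equals(parse(serialize(value, "UTF-8"))));
        System.out.println(value.equals(parse(serialize(value, "ISO-8859-1"))));
    }
}
```

Because no String-to-bytes conversion happens in your own code, there is no
place for the platform default charset to sneak in.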

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGI9bO9CaO5/Lv0PARAm+CAJ9PHzvjbl7ftLyzHwTCG7aZ8r2RYQCgqtxU
MfGdL4vBq8g9K7eFJJxOR6Y=
=Yd6n
-----END PGP SIGNATURE-----
