Posted to j-users@xalan.apache.org by Cameron McCormack <cl...@csse.monash.edu.au> on 2002/05/29 12:19:01 UTC
Xalan losing my characters
[Originally posted to comp.lang.xml, hoping for better luck here]
Hi everyone.
I'm doing a simple XSLT transformation in Java, using Xalan. When I do
the transformation, though, my non-Latin characters get converted to
question marks (?). Is there some option I have to set for it to
output these characters properly?
This is my test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<test>abc&#x4E00;xyz</test>
This is my test.xsl (just an identity transformation):
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
This is my test.java (the code to do the transformation):
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import java.io.StringWriter;

class test
{
    public static void main(String args[])
        throws java.lang.Exception
    {
        // Get a DocumentBuilder
        DocumentBuilderFactory dFactory =
            DocumentBuilderFactory.newInstance();
        dFactory.setNamespaceAware(true);
        dFactory.setIgnoringElementContentWhitespace(false);
        DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
        // Get the XSL
        Document xslDoc = dBuilder.parse("test.xsl");
        DOMSource xslDomSource = new DOMSource(xslDoc);
        xslDomSource.setSystemId("test.xsl");
        // Get the XML
        Document xmlDoc = dBuilder.parse("test.xml");
        DOMSource xmlDocSource = new DOMSource(xmlDoc);
        xmlDocSource.setSystemId("test.xml");
        // A Document for the output
        Document docResult = dBuilder.newDocument();
        // A Transformer to do the transformation
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer(xslDomSource);
        transformer.transform(xmlDocSource, new DOMResult(docResult));
        // Serialize the output XML
        StringWriter sw = new StringWriter();
        OutputFormat format = new OutputFormat(xmlDoc, "UTF-8", true);
        format.setIndent(2);
        XMLSerializer serializer = new XMLSerializer(sw, format);
        serializer.serialize(docResult);
        // Print out the serialized output XML
        System.out.print(sw.getBuffer().toString());
    }
}
And when I run the program like this:
$ java test > output.xml
my output.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<test>abc?xyz</test>
What's going on here?
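One thing that may be worth ruling out (nothing in this thread confirms it) is the final print step rather than the transformation. System.out is a PrintStream, and a PrintStream converts characters to bytes with a platform-dependent charset; a character that charset cannot represent is written as '?'. The sketch below only illustrates that behaviour in isolation. The class and method names are made up for the example, and US-ASCII merely stands in for whatever charset the platform happens to use.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamEncoding {
    // Prints text through a PrintStream that encodes with the named charset
    // and returns the bytes that actually reach the underlying stream.
    static byte[] printWith(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "abc\u4E00xyz";
        // U+4E00 has no US-ASCII representation, so it is replaced by '?' (3F).
        System.out.println(hex(printWith(text, "US-ASCII"))); // 61 62 63 3F 78 79 7A
        // With UTF-8 the character survives as three bytes.
        System.out.println(hex(printWith(text, "UTF-8")));    // 61 62 63 E4 B8 80 78 79 7A
    }
}
```

If the first line of output matches what ends up in output.xml, the character would have been lost when the String was printed, not during the XSLT step.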
Thanks,
Cameron
--
Cameron McCormack
// clm@csse.monash.edu.au
// http://www.csse.monash.edu.au/~clm/
// icq 26955922
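If the '?' does turn out to come from a character-to-byte conversion on the way to the file (an assumption, not something established in this thread), one way to sidestep it is to let the serializer write bytes directly instead of building a String and printing it. The following is a minimal sketch using only the standard JAXP classes, not the org.apache.xml.serialize classes from the post; SerializeToBytes and identityToUtf8 are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SerializeToBytes {
    // Runs an identity transform and serializes straight to UTF-8 bytes,
    // so no platform-default charset is involved at any point.
    static byte[] identityToUtf8(byte[] xml) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new StreamSource(new ByteArrayInputStream(xml)),
                        new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "<test>abc\u4E00xyz</test>".getBytes(StandardCharsets.UTF_8);
        byte[] output = identityToUtf8(input);
        // Write the raw bytes; PrintStream.write(byte[], int, int) does no charset conversion.
        System.out.write(output, 0, output.length);
        System.out.flush();
    }
}
```

Run as `java SerializeToBytes > output.xml`, the redirected file receives exactly the bytes the serializer produced.".getBytes(java.nio.charset.StandardCharsets.UTF_8);
String s = new String(SerializeToBytes.identityToUtf8(in), java.nio.charset.StandardCharsets.UTF_8);
if (!s.contains("abc\u4E00xyz")) throw new AssertionError("unexpected output: " + s);
if (s.contains("abc?xyz")) throw new AssertionError("character was replaced: " + s);
</test>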
Re: Xalan losing my characters
Posted by Cameron McCormack <cl...@csse.monash.edu.au>.
Hi Peter.
Peter Davis:
> I'm no expert on character-encoding issues, but did you try looking at the
> output in a Hex editor? Make sure that the sequence for the '?' is actually
> 0x3f before blaming Xalan.
Yep, I checked. I opened the output file in vim. It definitely was an
actual '?' character. If it had been the &#x4E00; character, there
should've been the three bytes E4 B8 80 in there (that's the UTF-8
encoding of U+4E00).
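The expected byte sequence can be double-checked with a few lines of Java. This is only a sketch; Utf8Bytes and hex are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Returns the UTF-8 encoding of s as space-separated uppercase hex pairs.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u4E00")); // E4 B8 80
        System.out.println(hex("?"));      // 3F
    }
}
```

U+4E00 encodes to the three bytes E4 B8 80, while a literal question mark is the single byte 3F, so the two cases are easy to tell apart in a hex view.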
> Since you are using a character entity in the source doc, it seems unlikely
> that your text editor for the source could be the cause, and since the output
> encoding is UTF-8 it seems unlikely that the character is lost on output. My
> guess is that the editor you are using to view the output either just doesn't
> understand UTF-8 or doesn't have a font that includes the offending
> characters.
Wish it was such a simple error on my part.
Thanks,
Cameron
Re: Xalan losing my characters
Posted by Peter Davis <pe...@pdavis.cx>.
I'm no expert on character-encoding issues, but did you try looking at the
output in a Hex editor? Make sure that the sequence for the '?' is actually
0x3f before blaming Xalan.
Since you are using a character entity in the source doc, it seems unlikely
that your text editor for the source could be the cause, and since the output
encoding is UTF-8 it seems unlikely that the character is lost on output. My
guess is that the editor you are using to view the output either just doesn't
understand UTF-8 or doesn't have a font that includes the offending
characters.
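For anyone without a hex editor to hand, the check suggested above can also be done programmatically. The sketch below looks in a file for the UTF-8 bytes of U+4E00 and for the literal sequence "abc?xyz"; the default file name output.xml comes from the original post, while FindBytes and indexOf are hypothetical helper names.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FindBytes {
    // Returns the index of the first occurrence of needle in data, or -1 if absent.
    static int indexOf(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args.length > 0 ? args[0] : "output.xml");
        if (!Files.exists(path)) {
            System.out.println("No such file: " + path);
            return;
        }
        byte[] data = Files.readAllBytes(path);
        byte[] ideograph = "\u4E00".getBytes(StandardCharsets.UTF_8); // E4 B8 80
        byte[] replaced = "abc?xyz".getBytes(StandardCharsets.US_ASCII); // contains 3F
        System.out.println("UTF-8 bytes of U+4E00 at offset: " + indexOf(data, ideograph));
        System.out.println("Literal \"abc?xyz\" at offset: " + indexOf(data, replaced));
    }
}
```

An offset of -1 on the first line together with a non-negative offset on the second would mean the file really does contain 0x3F, which rules out a display or font problem in the viewer.".getBytes(java.nio.charset.StandardCharsets.UTF_8);
byte[] bad = "<test>abc?xyz</test>".getBytes(java.nio.charset.StandardCharsets.UTF_8);
byte[] ideograph = new byte[] {(byte) 0xE4, (byte) 0xB8, (byte) 0x80};
if (FindBytes.indexOf(good, ideograph) != 9) throw new AssertionError("good: " + FindBytes.indexOf(good, ideograph));
if (FindBytes.indexOf(bad, ideograph) != -1) throw new AssertionError("bad: " + FindBytes.indexOf(bad, ideograph));
if (FindBytes.indexOf(bad, new byte[] {0x3F}) != 9) throw new AssertionError("qmark: " + FindBytes.indexOf(bad, new byte[] {0x3F}));
</test>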
On Wednesday 29 May 2002 03:19, Cameron McCormack wrote:
> And when I run the program like this:
>
> $ java test > output.xml
>
> my output.xml contains:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <test>abc?xyz</test>
>
> What's going on here?
--
Peter Davis