Posted to j-users@xalan.apache.org by Cameron McCormack <cl...@csse.monash.edu.au> on 2002/05/29 12:19:01 UTC

Xalan losing my characters

[Originally posted to comp.lang.xml, hoping for better luck here]

Hi everyone.

I'm doing a simple XSLT transformation in Java, using Xalan.  When I do
the transformation, though, my non-Latin characters get converted to
question marks (&#x3f;).  Is there some option I have to set for it to
output these characters properly?

This is my test.xml:

  <?xml version="1.0" encoding="UTF-8"?>
  <test>abc&#x4e00;xyz</test>

This is my test.xsl (just an identity transformation):

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
      <xsl:copy-of select="."/>
    </xsl:template>
  </xsl:stylesheet>

This is my test.java (the code to do the transformation):

  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.dom.DOMResult;
  import javax.xml.transform.dom.DOMSource;
  import org.w3c.dom.Document;
  import org.apache.xml.serialize.OutputFormat;
  import org.apache.xml.serialize.XMLSerializer;
  import java.io.StringWriter;
  
  class test
  {
      public static void main(String args[])
          throws java.lang.Exception
      {
          // Get a DocumentBuilder
          DocumentBuilderFactory dFactory =
              DocumentBuilderFactory.newInstance();
          dFactory.setNamespaceAware(true);
          dFactory.setIgnoringElementContentWhitespace(false);
          DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
  
          // Get the XSL
          Document xslDoc = dBuilder.parse("test.xsl");
          DOMSource xslDomSource = new DOMSource(xslDoc);
          xslDomSource.setSystemId("test.xsl");
  
          // Get the XML
          Document xmlDoc = dBuilder.parse("test.xml");
          DOMSource xmlDocSource = new DOMSource(xmlDoc);
          xmlDocSource.setSystemId("test.xml");
  
          // A Document for the output
          Document docResult = dBuilder.newDocument();
  
          // A Transformer to do the transformation
          TransformerFactory tFactory = TransformerFactory.newInstance();
          Transformer transformer = tFactory.newTransformer(xslDomSource);
          transformer.transform(xmlDocSource, new DOMResult(docResult));
  
          // Serialize the output XML
          StringWriter sw = new StringWriter();
          OutputFormat format = new OutputFormat(xmlDoc, "UTF-8", true);
          format.setIndent(2);
          XMLSerializer serializer = new XMLSerializer(sw, format);
          serializer.serialize(docResult);
  
          // Print out the serialized output XML
          System.out.print(sw.getBuffer().toString());
      }
  }

And when I run the program like this:

  $ java test > output.xml

my output.xml contains:

  <?xml version="1.0" encoding="UTF-8"?>
  <test>abc?xyz</test>

What's going on here?

Thanks,

Cameron

-- 
Cameron McCormack
  // clm@csse.monash.edu.au
  // http://www.csse.monash.edu.au/~clm/
  // icq 26955922

Re: Xalan losing my characters

Posted by Cameron McCormack <cl...@csse.monash.edu.au>.

Hi Peter.

Peter Davis:
> I'm no expert on character-encoding issues, but did you try looking at the 
> output in a Hex editor?  Make sure that the sequence for the '?' is actually 
> 0x3f before blaming Xalan.

Yep, I checked.  I opened the output file in vim.  It definitely was an
actual '?' character (byte 3F).  If it was the &#x4E00; character, there
should've been the three bytes E4 B8 80 in there (that's the UTF-8
encoding of U+4E00).
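
For reference, the byte values can be checked with a few lines of Java.
This is a minimal sketch, not part of the original program: the class
name Utf8Bytes and the helper hex() are made up for illustration.  It
prints the UTF-8 bytes of the test string, and also what the same string
looks like when encoded into a charset that has no mapping for U+4E00
(US-ASCII is used here as a stand-in for such a charset).

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes
{
    // Hypothetical helper: format bytes as space-separated upper-case hex
    static String hex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        String text = "abc\u4e00xyz";

        // UTF-8 can represent U+4E00, as a three-byte sequence
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));

        // US-ASCII cannot; String.getBytes substitutes the charset's
        // replacement byte, which is '?' (3F)
        System.out.println(hex(text.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The first line printed contains E4 B8 80 between the bytes for "abc" and
"xyz"; the second has a single 3F in that position.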

> Since you are using a character entity in the source doc, it seems unlikely 
> that your text editor for the source could be the cause, and since the output 
> encoding is UTF-8 it seems unlikely that the character is lost on output.  My 
> guess is that the editor you are using to view the output either just doesn't 
> understand UTF-8 or doesn't have a font that includes the offending 
> characters.

Wish it was such a simple error on my part.

Thanks,

Cameron


Re: Xalan losing my characters

Posted by Peter Davis <pe...@pdavis.cx>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm no expert on character-encoding issues, but did you try looking at the 
output in a Hex editor?  Make sure that the sequence for the '?' is actually 
0x3f before blaming Xalan.

Since you are using a character entity in the source doc, it seems unlikely 
that your text editor for the source could be the cause, and since the output 
encoding is UTF-8 it seems unlikely that the character is lost on output.  My 
guess is that the editor you are using to view the output either just doesn't 
understand UTF-8 or doesn't have a font that includes the offending 
characters.
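
One way to test the guess that nothing is lost on output is to look at
the bytes a PrintStream actually writes.  The sketch below is an
assumption to check, not a confirmed diagnosis of the program in this
thread: the class PrintStreamEncoding and its methods are hypothetical,
and US-ASCII stands in for whatever default encoding the JVM behind
System.out happens to use.  It prints the same string through two
PrintStreams, one per encoding, and dumps the resulting bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamEncoding
{
    // Print text through a PrintStream that uses the named encoding
    // and return the raw bytes that reached the underlying stream
    static byte[] printWith(String text, String encoding)
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding);
        }
        return sink.toByteArray();
    }

    // Format bytes as space-separated upper-case hex
    static String hex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        String text = "abc\u4e00xyz";
        System.out.println("US-ASCII: " + hex(printWith(text, "US-ASCII")));
        System.out.println("UTF-8:    " + hex(printWith(text, "UTF-8")));
    }
}
```

If the US-ASCII line shows 3F where the UTF-8 line shows E4 B8 80, then a
PrintStream can turn the character into a real '?' byte regardless of
what the XML declaration says, so the stream's encoding is worth ruling
out alongside the viewer.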

On Wednesday 29 May 2002 03:19, Cameron McCormack wrote:
> And when I run the program like this:
>
>   $ java test > output.xml
>
> my output.xml contains:
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <test>abc?xyz</test>
>
> What's going on here?

- -- 
Peter Davis
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE89LFvNSZCJx7tYycRAlFNAJ91JOLN31ZDpm8MOWOrXTVXDigSFQCffq7j
QyU9pIm9K2kr2JEtZnGcqjU=
=t9Ti
-----END PGP SIGNATURE-----