Posted to j-users@xalan.apache.org by Thangalin <th...@gmail.com> on 2022/04/01 06:05:47 UTC

XML Entities

Hi all!

Back in 2013, a question was asked about how to preserve entities (e.g.,
Unicode and emojis) when transforming:

"My XSLT transformations have been successful for months until I ran across
an XML file with Unicode characters (emoji characters). I need to preserve
the Unicode but XSLT is converting it to HTML Entities. I thought that
setting the encoding to UTF-8 would solve my problem but I'm still having
issues."

The answer was to look at the 'xalan:entities' serializer:

http://xml.apache.org/xalan-j/usagepatterns.html#outputprops

I've switched from Xalan to Saxon to handle the conversion flawlessly,
using a single line of code:

      System.setProperty(
        "javax.xml.transform.TransformerFactory",
        "net.sf.saxon.TransformerFactoryImpl" );

The downside is adding 6 MB to the application just to encode emojis, which
Xalan already does, only not quite as needed (&#55357;&#56397; is generated
instead of &#x1F44D;, for example).

Is there an example showing how to use the xalan:entities serializer to
preserve entities?

Thank you!

Re: XML Entities

Posted by Thangalin <th...@gmail.com>.
Thank you, Stanimir.

Changing the output encoding from UTF-8 to UTF-16 produces the desired
results using Xalan-J:

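      // Assumes static imports of OutputKeys.ENCODING and
      // StandardCharsets.UTF_16; run the assignment inside a method or
      // initializer, since newTransformer() throws a checked exception.
      private static Transformer sTransformer;
      sTransformer = TransformerFactory.newInstance().newTransformer();
      sTransformer.setOutputProperty( ENCODING, UTF_16.toString() );

Much appreciated.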
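
For anyone who wants to try the UTF-16 approach in isolation, here is a minimal self-contained sketch. The class and method names (Utf16OutputDemo, transformToUtf16) are made up for illustration, and it uses whatever TransformerFactory is on the classpath (the JDK's bundled one if nothing else is configured), so results with a particular Xalan-J release may differ:

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of Xalan or the JDK.
public class Utf16OutputDemo {

    /** Runs an identity transform with UTF-16 output and returns the decoded text. */
    public static String transformToUtf16(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Ask the serializer for UTF-16 so non-ASCII characters need no escaping.
            transformer.setOutputProperty(
                    OutputKeys.ENCODING, StandardCharsets.UTF_16.name());

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            transformer.transform(
                    new StreamSource(new StringReader(xml)),
                    new StreamResult(bytes));
            // Decode the serialized bytes back into a Java String for inspection.
            return new String(bytes.toByteArray(), StandardCharsets.UTF_16);
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Whether the emoji displays correctly depends on the console encoding.
        System.out.println(transformToUtf16("<foo>&#x1F44D;</foo>"));
    }
}
```

Writing to a byte stream rather than a Writer keeps the encoding decision with the serializer, which is the part being tested here.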

Re: XML Entities

Posted by Stanimir Stamenkov <s7...@netscape.net>.
Thu, 31 Mar 2022 23:05:47 -0700, /Thangalin/:

> Back in 2013, a question was asked about how to preserve entities (e.g., 
> unicode and emojis) when transforming:
> 
> "My XSLT transformations have been successful for months until I ran 
> across an XML file with Unicode characters (emoji characters). I need to 
> preserve the Unicode but XSLT is converting it to HTML Entities. I 
> thought that setting the encoding to UTF-8 would solve my problem but 
> I'm still having issues."
> 
> The answer was to look at the 'xalan:entities' serializer:
> 
> http://xml.apache.org/xalan-j/usagepatterns.html#outputprops
> 
> I've switched from Xalan to Saxon to handle the conversion flawlessly, 
> using a single line of code:
> 
>        System.setProperty(
>          "javax.xml.transform.TransformerFactory",
>          "net.sf.saxon.TransformerFactoryImpl" );
> 
> The downside is adding 6MB to encode emojis, which Xalan is already 
> doing, just not quite as needed (&#55357;&#56397; is generated instead 
> of &#x1F44D;, for example).
> 
> Is there an example showing how to use the xalan:entities serializer to 
> preserve entities?

To clarify: &#55357;, &#56397; and &#x1F44D; are character (Unicode code
point) references, not (named) entity references.  For setting up
your own xalan:entities I guess you could have a look at the source:

   * http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/XMLEntities.properties?view=markup
   * http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/HTMLEntities.properties?view=markup

You may notice these provide a mapping from character (code point) to the
entity name to substitute in the result.  However, your problem appears to
be that Xalan doesn't support non-BMP code points (those past the Basic
Multilingual Plane, i.e. > hex FFFF, dec 65535).  The Java char type can't
represent every Unicode code point; it is just a UTF-16 code unit.  Thus a
non-BMP character is encoded as two char values, a surrogate pair.
Java 5 introduced APIs for decoding these into a Unicode code point, for
example:

   * https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-

but Xalan still doesn't seem to support non-BMP characters:

*   https://issues.apache.org/jira/browse/XALANJ-2595
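
To make the char versus code point distinction concrete, here is a small sketch. The class and method names are invented for illustration and are unrelated to Xalan's internals; it only shows how iterating per char yields the two surrogate values, while iterating with String.codePointAt yields the single code point:

```java
// Hypothetical demo class illustrating UTF-16 code units vs. code points.
public class SurrogatePairDemo {

    /** Emits one decimal character reference per UTF-16 code unit (char). */
    public static String perCharReferences(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            out.append("&#").append((int) text.charAt(i)).append(';');
        }
        return out.toString();
    }

    /** Emits one decimal character reference per Unicode code point. */
    public static String perCodePointReferences(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            out.append("&#").append(codePoint).append(';');
            // Advance by 2 for a surrogate pair, by 1 otherwise.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String thumbsUp = new String(Character.toChars(0x1F44D));
        System.out.println(thumbsUp.length());                // 2
        System.out.println(perCharReferences(thumbsUp));      // &#55357;&#56397;
        System.out.println(perCodePointReferences(thumbsUp)); // &#128077;
    }
}
```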

FWIW, the following example works as expected with the forked Xalan 
version included in the Oracle/OpenJDK:

import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformTest {

     public static void main(String[] args) throws Exception {
         TransformerFactory tf = TransformerFactory.newInstance();
         Transformer transformer = tf.newTransformer();
         transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
         transformer.setOutputProperty(OutputKeys.INDENT, "yes");
         transformer.setOutputProperty(
                 OutputKeys.OMIT_XML_DECLARATION, "yes");

         String xmlSource = "<foo>&#x1F44D;</foo>";
         transformer.transform(
                 new StreamSource(new StringReader(xmlSource)),
                 new StreamResult(System.out));
     }

}

I'm getting a result of:

     <foo>&#128077;</foo>

Plugging in the official Xalan, I'm getting:

     <foo>&#55357;&#56397;</foo>
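
If switching implementations or encodings is not an option, one possible workaround (my own assumption, not a Xalan feature) is to post-process the serialized text and merge adjacent surrogate references into a single code point reference. The sketch below uses a made-up class name, handles decimal references only, and does a plain textual rewrite, so it would also touch matching text inside comments or CDATA sections:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of Xalan or the JDK.
public class SurrogateReferenceFixer {

    // Two adjacent decimal references; group 2 is the whole second reference.
    private static final Pattern PAIR =
            Pattern.compile("&#(\\d{1,7});(&#(\\d{1,7});)");

    private static boolean isSurrogatePair(int high, int low) {
        return high >= 0xD800 && high <= 0xDBFF
                && low >= 0xDC00 && low <= 0xDFFF;
    }

    /** Rewrites e.g. "&#55357;&#56397;" as "&#128077;". */
    public static String mergeSurrogateReferences(String serialized) {
        StringBuilder out = new StringBuilder();
        Matcher m = PAIR.matcher(serialized);
        int copied = 0; // input before this index is already in 'out'
        int from = 0;   // where the next search starts
        while (m.find(from)) {
            int high = Integer.parseInt(m.group(1));
            int low = Integer.parseInt(m.group(3));
            if (isSurrogatePair(high, low)) {
                out.append(serialized, copied, m.start());
                int codePoint = Character.toCodePoint((char) high, (char) low);
                out.append("&#").append(codePoint).append(';');
                copied = m.end();
                from = m.end();
            } else {
                // Not a pair: retry with the second reference as a possible high half.
                from = m.start(2);
            }
        }
        out.append(serialized, copied, serialized.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <foo>&#128077;</foo>
        System.out.println(mergeSurrogateReferences("<foo>&#55357;&#56397;</foo>"));
    }
}
```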

-- 
Stanimir