Posted to j-users@xalan.apache.org by Thangalin <th...@gmail.com> on 2022/04/01 06:05:47 UTC
XML Entities
Hi all!
Back in 2013, a question was asked about how to preserve entities (e.g.,
unicode and emojis) when transforming:
"My XSLT transformations have been successful for months until I ran across
an XML file with Unicode characters (emoji characters). I need to preserve
the Unicode but XSLT is converting it to HTML Entities. I thought that
setting the encoding to UTF-8 would solve my problem but I'm still having
issues."
The answer was to look at the 'xalan:entities' serializer:
http://xml.apache.org/xalan-j/usagepatterns.html#outputprops
I've switched from Xalan to Saxon to handle the conversion flawlessly,
using a single line of code:
    System.setProperty(
        "javax.xml.transform.TransformerFactory",
        "net.sf.saxon.TransformerFactoryImpl" );
The downside is adding 6MB to encode emojis, which Xalan is already doing,
just not quite as needed (&#55357;&#56397; is generated instead of
&#128077;, for example).
Is there an example showing how to use the xalan:entities serializer to
preserve entities?
Thank you!
Re: XML Entities
Posted by Thangalin <th...@gmail.com>.
Thank you, Stanimir.
Changing the output encoding from UTF-8 to UTF-16 produces the desired
results using Xalan-J:
    // Assumes static imports of OutputKeys.ENCODING and StandardCharsets.UTF_16.
    private static Transformer sTransformer;

    sTransformer = TransformerFactory.newInstance().newTransformer();
    sTransformer.setOutputProperty( ENCODING, UTF_16.toString() );
Much appreciated.
Re: XML Entities
Posted by Stanimir Stamenkov <s7...@netscape.net>.
Thu, 31 Mar 2022 23:05:47 -0700, /Thangalin/:
> Back in 2013, a question was asked about how to preserve entities (e.g.,
> unicode and emojis) when transforming:
>
> "My XSLT transformations have been successful for months until I ran
> across an XML file with Unicode characters (emoji characters). I need to
> preserve the Unicode but XSLT is converting it to HTML Entities. I
> thought that setting the encoding to UTF-8 would solve my problem but
> I'm still having issues."
>
> The answer was to look at the 'xalan:entities' serializer:
>
> http://xml.apache.org/xalan-j/usagepatterns.html#outputprops
>
> I've switched from Xalan to Saxon to handle the conversion flawlessly,
> using a single line of code:
>
> System.setProperty(
> "javax.xml.transform.TransformerFactory",
> "net.sf.saxon.TransformerFactoryImpl" );
>
> The downside is adding 6MB to encode emojis, which Xalan is already
> doing, just not quite as needed (&#55357;&#56397; is generated instead
> of &#128077;, for example).
>
> Is there an example showing how to use the xalan:entities serializer to
> preserve entities?
Let's clarify: &#55357; &#56397; &#128077; are character (Unicode code
point) references and not (named) entity references. For setting up
your own xalan:entities I guess you could have a look at the source:
* http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/XMLEntities.properties?view=markup
* http://svn.apache.org/viewvc/xalan/java/trunk/src/org/apache/xml/serializer/HTMLEntities.properties?view=markup
You may notice these provide a mapping between a character (code point)
and the entity name to substitute in the result. However, your problem
appears to be that Xalan doesn't support non-BMP (past the Basic
Multilingual Plane) code points > Hex: FFFF (Dec: 65535). The Java char
type can't represent every Unicode code point – it is just a UTF-16 code
unit. Thus a non-BMP character is encoded into two char values – a
surrogate pair. Java 5 introduced APIs for decoding these to a Unicode
code point, for example:
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-
but Xalan doesn't seem to support non-BMP characters currently/still:
* https://issues.apache.org/jira/browse/XALANJ-2595
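To make the surrogate-pair point concrete, here is a minimal sketch using only
standard JDK calls. It is not taken from Xalan; the class name SurrogateDemo and
the helper codeUnits are made up for illustration. It shows that U+1F44D is
stored as the two char values 55357 and 56397, and that the code-point APIs
recover the single value 128077:

```java
final class SurrogateDemo {
    // Hypothetical helper: lists a string's UTF-16 code units as decimal values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F44D (thumbs up), written with escapes so the source stays ASCII.
        String thumbsUp = "\uD83D\uDC4D";

        System.out.println(thumbsUp.length());        // 2 (char values, not characters)
        System.out.println(codeUnits(thumbsUp));      // 55357 56397
        System.out.println(thumbsUp.codePointAt(0));  // 128077
        System.out.println(
            thumbsUp.codePointCount(0, thumbsUp.length()));  // 1
        System.out.println(Character.toCodePoint(
            thumbsUp.charAt(0), thumbsUp.charAt(1)));        // 128077
    }
}
```

A serializer that emits one character reference per char value would produce
the two surrogate values; one that works on code points would produce 128077.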
FWIW, the following example works as expected with the forked Xalan
version included in the Oracle/OpenJDK:
import java.io.StringReader;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformTest {

    public static void main(String[] args) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");

        String xmlSource = "<foo>👍</foo>";
        transformer.transform(
                new StreamSource(new StringReader(xmlSource)),
                new StreamResult(System.out));
    }

}
I'm getting a result of:
<foo>&#128077;</foo>
Plugging in the official Xalan, I'm getting:
<foo>&#55357;&#56397;</foo>
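If staying on the official Xalan is a requirement, one possible stopgap is to
post-process the serialized text and merge each pair of surrogate character
references into a single code point reference. The sketch below is only an
assumption about how that could be done, not anything Xalan provides: the class
SurrogateReferenceFixer and the method mergeSurrogateReferences are hypothetical
names. It handles decimal references only and is not XML-aware, so it would
also rewrite matching text inside comments or CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SurrogateReferenceFixer {
    // Two adjacent five-digit decimal references; surrogates are 55296..57343.
    private static final Pattern PAIR =
        Pattern.compile("&#(\\d{5});&#(\\d{5});");

    // Hypothetical helper: rewrites "&#high;&#low;" as one code point reference.
    static String mergeSurrogateReferences(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = PAIR.matcher(xml);
        int copied = 0;  // text before this index is already in 'out'
        int from = 0;    // index where the next search starts
        while (m.find(from)) {
            int high = Integer.parseInt(m.group(1));
            int low = Integer.parseInt(m.group(2));
            boolean isPair =
                high >= Character.MIN_HIGH_SURROGATE
                    && high <= Character.MAX_HIGH_SURROGATE
                    && low >= Character.MIN_LOW_SURROGATE
                    && low <= Character.MAX_LOW_SURROGATE;
            if (isPair) {
                out.append(xml, copied, m.start());
                out.append("&#")
                   .append(Character.toCodePoint((char) high, (char) low))
                   .append(';');
                copied = m.end();
                from = m.end();
            } else {
                // Not a pair: retry from the second reference, which may
                // itself be the high half of a following pair.
                from = m.start(2) - 2;
            }
        }
        out.append(xml, copied, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            mergeSurrogateReferences("<foo>&#55357;&#56397;</foo>"));
        // prints <foo>&#128077;</foo>
    }
}
```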
--
Stanimir