
Posted to dev@xalan.apache.org by mi...@ca.ibm.com on 2002/12/12 17:33:51 UTC

Merging Xalan-J Interpretive and XSLTC serialization code

Hi,
I will be working on merging the two serializers in Xalan and XSLTC.  This
would cover stream, SAX and DOM output.

The primary goals are:
      - consistent output between Xalan-J Interpretive and XSLTC
      - have one serializer to support rather than two
      - have the serializer not depend on either Xalan-J Interpretive or
XSLTC, although Xalan and XSLTC can depend on the serializer.

A secondary goal is some performance improvement.  For example, less
namespace processing would be done for HTML output.

I have done an analysis of the following items. If you have any comments
please feel free to add them to the list.
-----------------------------------------------
1. Packaging
The serializer used by Xalan-J Interpretive has dependencies on classes in
Xalan, as well as error messages.  The same is true for the serializer used
by XSLTC with respect to classes and error messages in XSLTC. The new
serializer would be in its own package with no dependencies on either side.
Helper classes and supported interfaces would be defined within the new
serializer package.  In this way the new serializer can be used by either
XSLTC or Xalan-J Interpretive without dragging in unrelated code.
-----------------------------------------------
2. Interfaces
The major interfaces used in Xalan-J Interpretive are: ContentHandler,
LexicalHandler, DeclHandler, Serializer, DOMSerializer. These are standard
interfaces, except for the last two.  Serializer provides a way to set or
get the output stream or writer, to set properties (e.g. cdata-sections),
and to obtain a ContentHandler for the serializer.

The DOMSerializer interface has one method, serialize(Node), which I don't
see being called from within Xalan-J Interpretive. This can either be
dropped, or if it is used externally, it could be moved into
org.apache.xml.utils because it only uses the ContentHandler interface of
the serializer. This method has dependencies on org.apache.xpath.DOM2Helper
(deprecated) and on org.apache.xml.utils.TreeWalker, and I'm a little
afraid of what else this can pull in.

The interface used by XSLTC is the non-standard TransletOutputHandler,
which is a sort of mixture of the interfaces used by Xalan-J Interpretive,
with the major difference being that it has SAX-like ContentHandler
methods, not true ContentHandler methods.

In order to satisfy both Xalan-J Interpretive and XSLTC without impacting
either side too much I'm suggesting that the new serializer support both
the standard SAX interfaces and some extensions. So the new serializer
could implement a single interface, e.g.  "XalanOutputHandler", which
itself is composed of ContentHandler (plus extensions), LexicalHandler
(plus extensions), DeclHandler and Serializer.
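
As a rough illustration of the composition described above, here is a
minimal sketch in Java. Only the standard SAX interfaces are real; the
"Sketch" interfaces, their method sets, and the particular extension
methods (startElement with only a name, addAttribute, String-based
characters and comment) are assumptions made for illustration, not the
actual design.

```java
import java.io.OutputStream;
import java.io.Writer;
import java.util.Properties;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.ext.LexicalHandler;

// Hypothetical stand-in for the Serializer interface: set/get the output
// stream or writer, set properties, and obtain a ContentHandler.
interface SerializerSketch {
    void setOutputStream(OutputStream output);
    OutputStream getOutputStream();
    void setWriter(Writer writer);
    Writer getWriter();
    void setOutputFormat(Properties format);
    ContentHandler asContentHandler();
}

// ContentHandler plus assumed SAX-like extensions of the kind XSLTC emits.
interface ExtendedContentHandlerSketch extends ContentHandler {
    void startElement(String qName) throws SAXException;      // name only, no attributes
    void addAttribute(String qName, String value) throws SAXException;
    void characters(String chars) throws SAXException;
}

// LexicalHandler plus an assumed String-based comment call.
interface ExtendedLexicalHandlerSketch extends LexicalHandler {
    void comment(String text) throws SAXException;
}

// The single interface the new serializer would implement.
interface XalanOutputHandler
        extends ExtendedContentHandlerSketch, ExtendedLexicalHandlerSketch,
                DeclHandler, SerializerSketch {
}
```

Because the extensions are plain overloads, a caller holding a
XalanOutputHandler can still pass it anywhere a standard ContentHandler,
LexicalHandler or DeclHandler is expected.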
-----------------------------------------------
3. Stream Output

Stream output would have an inheritance tree as follows:

SerializerBase
      ToStream
            ToXMLStream
            ToHTMLStream
            ToTextStream
            ToUnknownStream
The XML, HTML and Text streams are obvious and would be a modified version
of the current Xalan-J Interpretive classes SerializerToXML,
SerializerToHTML and SerializerToText.  The modifications would be to more
easily handle the type of calls that XSLTC emits via the
TransletOutputHandler interface, such as a startElement() call with only
the element name, but no attributes.

The ToUnknownStream would wrap either a ToXMLStream or ToHTMLStream. It
would have to make that decision based on whether the first element is
<html> or not.  Any events up to that point would be cached and emitted to
the wrapped ToXMLStream or ToHTMLStream once that is decided.  This code
serves the same purpose as the "serializer switcher" code in Xalan-J
Interpretive.
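
The deferred decision described above could be sketched roughly as
follows. This is a simplified illustration, not the proposed class: the
OutputSketch interface and all class names are hypothetical, and the test
for <html> is reduced to a case-insensitive name comparison with no
namespace check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal SAX-like output interface used only for this sketch.
interface OutputSketch {
    void comment(String text);
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

// Records the events it receives; stands in for ToXMLStream / ToHTMLStream.
class RecordingOutput implements OutputSketch {
    final List<String> log = new ArrayList<>();
    public void comment(String text) { log.add("comment:" + text); }
    public void startElement(String name) { log.add("start:" + name); }
    public void characters(String text) { log.add("chars:" + text); }
    public void endElement(String name) { log.add("end:" + name); }
}

// Caches events until the first element arrives, then picks a target,
// replays the cache into it, and forwards everything else directly.
class UnknownOutputSketch implements OutputSketch {
    private final OutputSketch xml;
    private final OutputSketch html;
    private OutputSketch chosen;                       // null until decided
    private final List<Runnable> pending = new ArrayList<>();

    UnknownOutputSketch(OutputSketch xml, OutputSketch html) {
        this.xml = xml;
        this.html = html;
    }

    public void comment(String text) {
        if (chosen == null) pending.add(() -> chosen.comment(text));
        else chosen.comment(text);
    }

    public void characters(String text) {
        if (chosen == null) pending.add(() -> chosen.characters(text));
        else chosen.characters(text);
    }

    public void startElement(String name) {
        if (chosen == null) {
            // Simplified decision: first element named "html" selects HTML.
            chosen = name.equalsIgnoreCase("html") ? html : xml;
            for (Runnable event : pending) event.run(); // replay cached events
            pending.clear();
        }
        chosen.startElement(name);
    }

    public void endElement(String name) {
        // In a well-formed event stream a startElement has already been seen.
        chosen.endElement(name);
    }
}
```

For example, a comment followed by startElement("HTML") would leave the
XML target untouched and deliver both events, in order, to the HTML target.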
-----------------------------------------------
4. SAX Output
SAX Output would have an inheritance tree as follows:

SerializerBase
      ToSAXHandler
            ToXMLSAXHandler
            ToHTMLSAXHandler
            ToTextSAXHandler

These classes would provide a bridge from the SAX-like calls to true SAX
calls of the wrapped SAX handler.  These would be modifications of the
XSLTC classes SAXXMLOutput, SAXHTMLOutput and SAXTextOutput.

A wrapped SAX handler would not receive every SAX call: calls that are
irrelevant to the specified output type (text, html or xml) would be
filtered out. This would be a change in the current behavior
for Xalan-J Interpretive, where all SAX calls are propagated to the SAX
handler.  For example, the ToHTMLSAXHandler would absorb, and not emit, calls
related to namespaces, such as startPrefixMapping.
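
To make the filtering concrete, here is a small sketch built on the
standard org.xml.sax.helpers.XMLFilterImpl, which forwards every event to
its ContentHandler unless a method is overridden. The class names are
hypothetical, and the real ToHTMLSAXHandler would presumably filter more
than the two prefix-mapping calls shown.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Forwards SAX events to the wrapped handler, but absorbs namespace events.
class HtmlSaxFilterSketch extends XMLFilterImpl {
    HtmlSaxFilterSketch(ContentHandler wrapped) {
        setContentHandler(wrapped);
    }
    @Override public void startPrefixMapping(String prefix, String uri) { /* absorbed */ }
    @Override public void endPrefixMapping(String prefix) { /* absorbed */ }
}

// Stand-in for a user's SAX handler: records which events reach it.
class EventLogSketch extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    @Override public void startPrefixMapping(String prefix, String uri) {
        events.add("startPrefixMapping:" + prefix);
    }
    @Override public void endPrefixMapping(String prefix) {
        events.add("endPrefixMapping:" + prefix);
    }
    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        events.add("startElement:" + qName);
    }
    @Override public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }
}

class HtmlSaxFilterDemo {
    // Sends four events through the filter and returns what the user saw.
    static List<String> run() {
        EventLogSketch user = new EventLogSketch();
        HtmlSaxFilterSketch filter = new HtmlSaxFilterSketch(user);
        try {
            filter.startPrefixMapping("x", "urn:example");
            filter.startElement("", "p", "p", new AttributesImpl());
            filter.endElement("", "p", "p");
            filter.endPrefixMapping("x");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return user.events;
    }
}
```

In this sketch the user's handler sees only the element events; the
prefix-mapping calls never reach it.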

There is a potential issue here: arguably a user's SAX handler should
receive all events and decide for itself what to throw away.  If this is
the case, then perhaps a ToUnknownSAXHandler is needed.

-----------------------------------------------
5. DOM Output

DOM Output would be achieved by using
org.apache.xml.utils.DOMBuilder as the receiver of the SAX calls from a
ToSAXHandler.
This would allow a DOM to be built up either via true ContentHandler
methods (for Xalan-J Interpretive) or via SAX-like calls (XSLTC).  It may
be possible that some events are not propagated through to the DOMBuilder.
This would be a change in the current behavior for Xalan-J Interpretive,
where all events are propagated to DOMBuilder. For example, if the output
method is text, a ToTextSAXHandler in front of it would filter out all SAX
calls to DOMBuilder except for the characters() ones.
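
The general idea of building a DOM from a stream of SAX calls can be
sketched with the standard JAXP classes alone. Note the substitution: this
uses an identity TransformerHandler writing to a DOMResult as a stand-in
for org.apache.xml.utils.DOMBuilder, and the class name SaxToDomSketch is
hypothetical.

```java
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SaxToDomSketch {
    // Feeds a tiny SAX event stream into a DOM-building receiver and
    // returns the resulting Document.
    static Document build(String rootName, String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            // Identity handler: plays the role of the SAX-to-DOM receiver.
            TransformerHandler receiver = factory.newTransformerHandler();
            DOMResult result = new DOMResult();
            receiver.setResult(result);

            receiver.startDocument();
            receiver.startElement("", rootName, rootName, new AttributesImpl());
            receiver.characters(text.toCharArray(), 0, text.length());
            receiver.endElement("", rootName, rootName);
            receiver.endDocument();

            return (Document) result.getNode();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A filtering handler placed in front of such a receiver would simply not
forward the calls it considers irrelevant, so the DOM would be built only
from the events that pass through.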
-----------------------------------------------
6. User defined Stream serializer

Currently in the property files output_xml.properties,
output_html.properties and output_text.properties there are content-handler
properties that specify the class name of the serializer for the given
output method (xml, html, text).   This class needs to implement Serializer
and ContentHandler, at a minimum. This is documented in
http://xml.apache.org/xalan-j/usagepatterns.html#outputprops.
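
A minimal sketch of how such a property could be resolved to a handler
instance follows. The bare key "content-handler", the class names, and the
loading logic are assumptions for illustration; the sketch checks only for
ContentHandler, whereas the real requirement also includes Serializer.

```java
import java.util.Properties;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical user-defined serializer; DefaultHandler supplies ContentHandler.
class UserSerializerSketch extends DefaultHandler {
}

class SerializerLookupSketch {
    // Assumed property key; the key used in the real property files may differ.
    static final String KEY = "content-handler";

    // Reads the class name from the properties and instantiates it reflectively.
    static ContentHandler load(Properties props) {
        String className = props.getProperty(KEY);
        if (className == null) {
            throw new IllegalArgumentException("no " + KEY + " property");
        }
        try {
            Object instance =
                    Class.forName(className).getDeclaredConstructor().newInstance();
            if (!(instance instanceof ContentHandler)) {
                throw new IllegalArgumentException(
                        className + " does not implement ContentHandler");
            }
            return (ContentHandler) instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + className, e);
        }
    }
}
```

Usage would be along the lines of setting the key to
UserSerializerSketch.class.getName() in a Properties object and passing it
to SerializerLookupSketch.load.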

This support could go unchanged, except that the Serializer interface would
move inside of the new serializer package (i.e. the package name of the
Serializer interface would change).  Such a user defined class would be
wrapped by a ToSAXHandler, so that the ToSAXHandler would collect SAX-like
events in Xalan-J Interpretive or XSLTC and emit pure SAX calls to the
user's serializer.  Again, there is a possibility here that some
"irrelevant" SAX calls for the given output type (text, html or xml) will
be filtered from the user's SAX handler, such as startPrefixMapping() in the
case of html output.  If this is the case, then perhaps a
ToUnknownSAXHandler is needed.

Another possibility is to drop supporting user defined stream handlers.  Do
users fully define their own stream handlers, or extend existing ones such
as org.apache.xalan.serialize.SerializerToHTML and only slightly modify a
few methods?

-----------------------------------------------
7. Error Messages

Both of the current serializers emit a few error messages. These would be
moved into messages in the new serializer package.
-----------------------------------------------
8. Helper classes

There are a number of classes currently in Xalan-J Interpretive that would
need to move into the new serializer, in order to cut dependencies.  These
classes would include:
      - CharInfo
      - ElemDesc
      - EncodingInfo
      - Encodings
      - WriterToASCI
      - WriterToUTF8
      - WriterToUTF8Buffered
-----------------------------------------------
9. Migration

From an end user's point of view, the areas of change are as follows.  The
TransletOutputHandler interface would be replaced with
XalanOutputHandler, so there would be some binary incompatibility with
previously compiled stylesheets.  Also, the movement of the Serializer
interface into the new serializer package would break those who currently
define their own stream serializers in Xalan-J Interpretive.

For Xalan-J Interpretive the DOM or SAX output from the new serializer
would differ in that some irrelevant calls (depending on output method
type) would be filtered out.

Users of the DOMSerializer interface (serialize(Node) method on the Xalan-J
Interpretive serializer) might be impacted.

-----------------------------------------------

Comments and new ideas are welcome.




Brian Minchau
XSLT Development, IBM Toronto
e-mail:        minchau@ca.ibm.com

Re: Merging Xalan-J Interpretive and XSLTC serialization code

Posted by sc...@us.ibm.com.




I don't think there's any question that these changes will have to be
backed up by perf measurements.  I'm pretty sure that's a major part of
Brian's strategy.

-scott

"Santiago Pericas-Geertsen" <Sa...@sun.com> wrote on
12/12/2002 01:49:43 PM:

> From: "Joseph Kesselman" <ke...@us.ibm.com>
> > Just one quick point: Serialization performance is critical to the
> > performance of Xalan overall. Be careful when working on this code;
> > cleaner and slower is not necessarily a good trade-off!
> >
> > ______________________________________
> > Joe Kesselman  / IBM Research
> >
>
> I second that. Brian, I think you should compare the performance of the
> two processors before and after the integration with the new output
> system. (As a side note, I believe instead of "SerializerBase" you should
> call that class "OutputHandlerBase" or something like that. In my opinion,
> the use of the term "serializer" is a misnomer when the output is not a
> stream).
>
> -- Santiago
>
>

Re: Merging Xalan-J Interpretive and XSLTC serialization code

Posted by Santiago Pericas-Geertsen <Sa...@sun.com>.

From: "Joseph Kesselman" <ke...@us.ibm.com>
> Just one quick point: Serialization performance is critical to the
> performance of Xalan overall. Be careful when working on this code; cleaner
> and slower is not necessarily a good trade-off!
>
> ______________________________________
> Joe Kesselman  / IBM Research
>

I second that. Brian, I think you should compare the performance of the two
processors before and after the integration with the new output system. (As
a side note, I believe instead of "SerializerBase" you should call that
class "OutputHandlerBase" or something like that. In my opinion, the use of
the term "serializer" is a misnomer when the output is not a stream).

-- Santiago

Re: Merging Xalan-J Interpretive and XSLTC serialization code

Posted by sc...@us.ibm.com.




Besides code-reuse and output consistency, one of the things driving this
refactoring is indeed performance.  Xalan interpretive writes out to
ResultTreeHandler first and resolves namespaces, collects attributes
(needed because of xsl:attribute), sets up start/endPrefixMapping, etc.  A
lot of this work doesn't need to be done in the case of stream
serialization.  XSLTC, on the other hand, writes directly to the
TransletOutputHandler interface, and then delegates to the true SAX
interfaces when it is interfacing to a ContentHandler, etc.    I think
XSLTC probably got it right.  Brian didn't mention this in his note, or it
didn't come across as such, but the idea is for Xalan to write directly to
the new XalanOutputHandler, thus skipping the layer of indirection for
streams.  As Brian noted to me offline, we don't really know how much of a
perf improvement this will be for Xalan interpretive until we measure it,
but, in theory, it could be a big help.

Also, having a single code base for the serializers in XSLTC and Xalan
interpretive will allow us to better focus on general perf improvements in
the serializers.  I suspect the "cleaner" part of what he's proposing will
also help us in measuring performance... as you know, the overhead of
ResultTreeHandler has always been a complicating factor in our profiling.

-scott

Joseph Kesselman <ke...@us.ibm.com> wrote on 12/12/2002 01:15:39 PM:

> Just one quick point: Serialization performance is critical to the
> performance of Xalan overall. Be careful when working on this code;
> cleaner and slower is not necessarily a good trade-off!
>
> ______________________________________
> Joe Kesselman  / IBM Research
>

Re: Merging Xalan-J Interpretive and XSLTC serialization code

Posted by Joseph Kesselman <ke...@us.ibm.com>.

Just one quick point: Serialization performance is critical to the
performance of Xalan overall. Be careful when working on this code; cleaner
and slower is not necessarily a good trade-off!

______________________________________
Joe Kesselman  / IBM Research