Posted to fop-commits@xmlgraphics.apache.org by pb...@apache.org on 2003/03/12 15:41:16 UTC
cvs commit: xml-fop/src/documentation/content/xdocs/design/alt.design xml-parsing.ehtml
pbwest 2003/03/12 06:41:16
Added: src/documentation/content/xdocs/design/alt.design
xml-parsing.ehtml
Log:
Replacement for xml-parsing.xml
Revision Changes Path
1.1 xml-fop/src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml
Index: xml-parsing.ehtml
===================================================================
<?xml version="1.0"?>
<html>
<body text="#000000" bgcolor="#FFFFFF">
<script type="text/javascript" src="codedisplay.js" />
<div class="content">
<h1>Implementing Pull Parsing</h1>
<p>
<font size="-2">by Peter B. West</font>
</p>
<ul class="minitoc">
<li>
<a href="#An+alternative+parsing+methodology">An alternative
parsing methodology</a>
<ul class="minitoc">
<li>
<a href="#Structure+of+SAX+parsing">Structure of SAX parsing</a>
</li>
<li>
<a href="#Cluttered+callbacks">Cluttered callbacks</a>
</li>
<li>
<a href="#From+">From push to pull parsing</a>
</li>
<li>
<a href="#FoXMLEvent+methods">FoXMLEvent methods</a>
</li>
<li>
<a href="#FOP+modularisation">FOP modularisation</a>
</li>
</ul>
</li>
</ul>
<a name="N101C5"></a><a name="An+alternative+parsing+methodology"></a>
<h3>An alternative parsing methodology</h3>
<div style="margin-left: 0 ; border: 2px">
<p>
This note proposes an alternative method of integrating the
output of the SAX parsing of the Flow Object (FO) tree into
FOP processing. The purpose of the proposed changes is to
provide for:
</p>
<ul>
<li>
better decomposition of FOP into processing phases
</li>
<li>
top-down FO tree building, providing
</li>
<li>
integrated validation of FO tree input.
</li>
</ul>
<a name="N101DA"></a><a name="Structure+of+SAX+parsing"></a>
<h4>Structure of SAX parsing</h4>
<div style="margin-left: 0 ; border: 2px">
<p>
Figure 1 is a schematic representation of the process of
SAX parsing of an input source. SAX parsing involves the
registration, with an object implementing the <span
class="codefrag">XMLReader</span> interface, of a <span
class="codefrag">ContentHandler</span> which contains a
callback routine for each of the event types encountered
by the parser, e.g., <span
class="codefrag">startDocument()</span>, <span
class="codefrag">startElement()</span>, <span
class="codefrag">characters()</span>, <span
class="codefrag">endElement()</span> and <span
class="codefrag">endDocument()</span>. Parsing is
initiated by a call to the <span
class="codefrag">parse()</span> method of the <span
class="codefrag">XMLReader</span>. Note that the call to
<span class="codefrag">parse()</span> and the calls to
individual callback methods are synchronous: <span
class="codefrag">parse()</span> will only return when the
last callback method returns, and each callback must
complete before the next is called.<br/> <br/>
<strong>Figure 1</strong>
</p>
<div align="center">
<img class="figure" alt="SAX parsing schematic"
src="images/design/alt.design/SAXParsing.png" /></div>
<p>
In the process of parsing, the hierarchical structure of the
original FO tree is flattened into a number of streams of
events of the same type which are reported in the sequence
in which they are encountered. Apart from that, the API
imposes no structure or constraint which expresses the
relationship between, e.g., a startElement event and the
endElement event for the same element. To the extent that
such relationship information is required, it must be
managed by the callback routines.
</p>
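<p>
The following is a minimal sketch of the arrangement described
above, not FOP code. The class name <span
class="codefrag">SaxPushSketch</span> and the method <span
class="codefrag">trace()</span> are hypothetical. It registers a
handler with an <span class="codefrag">XMLReader</span>, and the
handler keeps its own stack because SAX does not pair a
startElement event with its endElement event.
</p>

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it is not part of FOP.
class SaxPushSketch {

    // Parses the XML text and returns a trace of the element callbacks received.
    static List<String> trace(String xml) {
        final List<String> log = new ArrayList<>();
        // The relationship between start and end events is tracked here,
        // in the callbacks, because the SAX API does not express it.
        final Deque<String> open = new ArrayDeque<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                open.push(qName);
                log.add("start " + qName + " depth=" + open.size());
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String expected = open.pop();
                log.add("end " + qName + " matches=" + expected.equals(qName));
            }
        };

        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // parse() is synchronous: it returns only after the last callback has returned.
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : trace("<root><page-sequence><flow/></page-sequence></root>")) {
            System.out.println(line);
        }
    }
}
```

<p>
Everything the application learns about the document arrives
through the handler, so any structural bookkeeping ends up inside
the callbacks.
</p>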
<p>
The most direct approach here is to build the tree
"invisibly"; to bury within the callback routines the
necessary code to construct the tree. In the simplest
case, the whole of the FO tree is built within the call
to <span class="codefrag">parse()</span>, and that
in-memory tree is subsequently processed to (a) validate
the FO structure, and (b) construct the Area tree. The
problem with this approach is the potential size of the
FO tree in memory. FOP has suffered from this problem
in the past.
</p>
</div>
<a name="N10218"></a><a name="Cluttered+callbacks"></a>
<h4>Cluttered callbacks</h4>
<div style="margin-left: 0 ; border: 2px">
<p>
On the other hand, the callback code may become
increasingly complex as tree validation and the triggering
of the Area tree processing and subsequent rendering are
moved into the callbacks, typically the <span
class="codefrag">endElement()</span> method. In order to
overcome acute memory problems, the FOP code was recently
modified in this way, to trigger Area tree building and
rendering in the <span
class="codefrag">endElement()</span> method, when the end
of a page-sequence was detected.
</p>
<p>
The drawback with such a method is that it becomes difficult
to determine the order of events and the circumstances in
which any particular processing events are triggered. When
the processing events are inherently self-contained, this is
irrelevant. But the more complex and context-dependent the
relationships are among the processing elements, the more
obscurity is engendered in the code by such "side-effect"
processing.
</p>
</div>
<a name="N1022B"></a><a name="From+"></a>
<h4>From push to pull parsing</h4>
<div style="margin-left: 0 ; border: 2px">
<p>
In order to solve the simultaneous problems of exposing
the structure of the processing and minimising in-memory
requirements, the experimental code separates the
parsing of the input source from the building of the FO
tree and all downstream processing. The callback
routines become minimal, consisting of the creation and
buffering of <span class="codefrag">XMLEvent</span>
objects as a <em>producer</em>. All of these objects
are effectively merged into a single event stream, in
strict event order, for subsequent access by the FO tree
building process, acting as a <em>consumer</em>. This,
essentially, is the difference between <em>push</em> and
<em>pull</em> parsing. In itself, this does not reduce
the memory footprint. That reduction comes when the approach
is generalised to modularise FOP processing.<br/> <br/>
<strong>Figure 2</strong>
</p>
<div align="center">
<img class="figure" alt="XML event buffer"
src="images/design/alt.design/pull-parsing.png" /></div>
<p>
The most useful change that this brings about is the switch
from <em>passive</em> to <em>active</em> XML element
processing. The process of parsing now becomes visible to
the controlling process. All local validation requirements,
all object and data structure building, are initiated by the
process(es) <em>get</em>ting from the queue - in the case
above, the FO tree builder.
</p>
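<p>
The producer and consumer arrangement can be sketched roughly as
follows. This is an illustration under assumptions, not the
experimental code: <span class="codefrag">PullSketch</span>, its
<span class="codefrag">Event</span> class and the use of a
<span class="codefrag">java.util.concurrent.BlockingQueue</span>
as the buffer are all stand-ins chosen for the example. The SAX
callbacks only create and buffer events, and the consumer decides
when to take the next one.
</p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and buffer type are assumptions, not FOP's.
class PullSketch {
    enum Type { START_ELEMENT, END_ELEMENT, END_DOCUMENT }

    static final class Event {
        final Type type;
        final String name;
        Event(Type type, String name) { this.type = type; this.name = name; }
        @Override public String toString() { return type + (name == null ? "" : " " + name); }
    }

    // Producer: runs the SAX parser in its own thread; callbacks only enqueue events.
    static Thread startProducer(final String xml, final BlockingQueue<Event> queue) {
        final DefaultHandler handler = new DefaultHandler() {
            private void put(Type type, String name) throws SAXException {
                try {
                    queue.put(new Event(type, name)); // blocks while the buffer is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new SAXException(e);
                }
            }
            @Override public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException { put(Type.START_ELEMENT, qName); }
            @Override public void endElement(String uri, String local, String qName)
                    throws SAXException { put(Type.END_ELEMENT, qName); }
            @Override public void endDocument() throws SAXException { put(Type.END_DOCUMENT, null); }
        };
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), handler);
            } catch (Exception e) {
                // Sketch only: a real buffer would also signal the failure to the consumer.
                throw new RuntimeException(e);
            }
        }, "sax-producer");
        producer.setDaemon(true);
        producer.start();
        return producer;
    }

    // Consumer: actively pulls events, in strict event order, until the document ends.
    // Assumes well-formed input; otherwise END_DOCUMENT never arrives.
    static List<String> consume(String xml) {
        BlockingQueue<Event> queue = new ArrayBlockingQueue<>(2);
        startProducer(xml, queue);
        List<String> seen = new ArrayList<>();
        try {
            Event event;
            do {
                event = queue.take(); // blocks while the buffer is empty
                seen.add(event.toString());
            } while (event.type != Type.END_DOCUMENT);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return seen;
    }
}
```

<p>
The small buffer capacity is deliberate in this sketch: the parser
cannot run far ahead of the consumer, so only a few events are
held in memory at any time.
</p>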
</div>
<a name="N10260"></a><a name="FoXMLEvent+methods"></a>
<h4>FoXMLEvent methods</h4>
<div style="margin-left: 0 ; border: 2px">
<a name="FoXMLEvent-methods"></a>
<p>
The experimental code uses a class <span id = "span00"
/><span class = "codefrag" ><a
href="javascript:toggleCode( 'span00',
'FoXMLEvent.html#FoXMLEventClass', '400', '100%'
)">FoXMLEvent</a></span > to provide the objects which are
placed in the queue. <em>FoXMLEvent</em> includes a
variety of methods to access elements in the queue.
Namespace URIs encountered in parsing are maintained in an
<span id = "span01" /><span class="codefrag"><a
href="javascript:toggleCode( 'span01',
'XMLNamespaces.html#XMLNamespacesClass', '400', '100%'
)">XMLNamespaces</a></span> object where they are
associated with a unique integer index. This integer
value is used in the signature of some of the access
methods.
</p>
<p>
The class which manages the buffer is <span id = "span02"
/><span class = "codefrag" ><a href =
"javascript:toggleCode( 'span02',
'SyncedFoXmlEventsBuffer.html#SyncedFoXmlEventsBufferClass',
'400', '100%' )" >SyncedFoXmlEventsBuffer</a></span >.
</p>
<dl>
<dt>
<span id = "span03" /><a href="javascript:toggleCode(
'span03', 'SyncedFoXmlEventsBuffer.html#getEvent',
'400', '100%' )">FoXMLEvent
getEvent(SyncedCircularBuffer events)</a>
</dt>
<dd>
This is the basis of all of the queue access methods. It
returns the next element from the queue, which may be a
pushback element.
</dd>
<dt>
<span id = "span04" /><a href="javascript:toggleCode(
'span04', 'SyncedFoXmlEventsBuffer.html#getTypedEvent',
'400', '100%' )">FoXMLEvent getTypedEvent()</a>
</dt>
<dd>
A series of these methods provide for the recovery only
of events of a particular event type, and possibly other
specific characteristics. <em>Get</em> methods discard
input which does not meet the requirements. E.g.
<dl>
<dt>
<span id = "span040" /><a
href="javascript:toggleCode( 'span040',
'SyncedFoXmlEventsBuffer.html#getEndDocument',
'400', '100%' )">FoXMLEvent getEndDocument()</a>
</dt>
<dd>
Discard input until an EndDocument event occurs.
Return this event.
</dd>
<dt>
<span id = "span041" /><a
href="javascript:toggleCode( 'span041',
'SyncedFoXmlEventsBuffer.html#getStartElement',
'400', '100%' )">FoXMLEvent getStartElement()</a>
</dt>
<dd>
A series of <span class = "codefrag"
>getStartElement</span > methods provide for
discarding input until a StartElement event of the
appropriate type occurs. This event is returned.
This series of methods includes some which accept a
list of Element specifiers.
</dd>
</dl>
</dd>
<dt>
<span id = "span05" /><a href="javascript:toggleCode(
'span05',
'SyncedFoXmlEventsBuffer.html#expectTypedEvent', '400',
'100%' )">FoXMLEvent expectTypedEvent()</a>
</dt>
<dd>
A series of these methods provide for the recovery only
of events of a particular event type, and possibly other
specific characteristics. <em>Expect</em> methods throw
an exception on input which does not meet the
requirements. <em>Expect</em> methods generally take a
<span class = "codefrag" >boolean</span> argument
specifying whitespace treatment. Examples include:
<dl>
<dt>
<span id = "span050" /><a
href="javascript:toggleCode( 'span050',
'SyncedFoXmlEventsBuffer.html#expectEndDocument',
'400', '100%' )">FoXMLEvent expectEndDocument()</a>
</dt>
<dd>
Expect an EndDocument event. Return this event.
</dd>
<dt>
<span id = "span051" /><a
href="javascript:toggleCode( 'span051',
'SyncedFoXmlEventsBuffer.html#expectStartElement',
'400', '100%' )">FoXMLEvent expectStartElement()</a>
</dt>
<dd>
A series of <span class = "codefrag"
>expectStartElement</span > methods provide for
examining the pending input for a StartElement
event of the appropriate type. This event is
returned. This series of methods includes some
which accept a list of Element specifiers.
</dd>
</dl>
</dd>
</dl>
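<p>
The difference between the <em>get</em> and <em>expect</em>
families can be illustrated with a deliberately simplified,
single-threaded sketch. It is not the <span
class="codefrag">SyncedFoXmlEventsBuffer</span> API: the class
<span class="codefrag">GetExpectSketch</span>, the use of plain
strings as events, and the choice to push a rejected event back
before throwing are assumptions made for the example. The
whitespace argument of the real <em>expect</em> methods is not
modelled.
</p>

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical demo class; events are plain strings such as "start:root" or "chars".
class GetExpectSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private String pushedBack; // at most one pushback event in this sketch

    GetExpectSketch(String... events) {
        queue.addAll(Arrays.asList(events));
    }

    // Basis of all access methods: the next event, which may be a pushback event.
    String getEvent() {
        if (pushedBack != null) {
            String event = pushedBack;
            pushedBack = null;
            return event;
        }
        return queue.remove(); // throws NoSuchElementException when the queue is empty
    }

    void pushBack(String event) {
        pushedBack = event;
    }

    // "get": discard input until a matching StartElement event occurs, then return it.
    String getStartElement(String name) {
        String wanted = "start:" + name;
        String event;
        do {
            event = getEvent();
        } while (!event.equals(wanted));
        return event;
    }

    // "expect": the next event must match; otherwise it is pushed back
    // (an assumption of this sketch) and an exception is thrown.
    String expectStartElement(String name) {
        String event = getEvent();
        if (!event.equals("start:" + name)) {
            pushBack(event);
            throw new IllegalStateException("expected start of " + name + " but found " + event);
        }
        return event;
    }
}
```

<p>
A tree builder written against such an interface can use
<em>expect</em> calls where the FO structure requires a particular
child, which is how validation becomes part of top-down tree
building.
</p>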
</div>
<a name="N102FE"></a><a name="FOP+modularisation"></a>
<h4>FOP modularisation</h4>
<div style="margin-left: 0 ; border: 2px">
<p>
This same principle can be extended to the other major
sub-systems of FOP processing. In each case, while it is
possible to hold a complete intermediate result in memory,
the memory costs of that approach are too high. The
sub-systems - XML parsing, FO tree construction, Area tree
construction and rendering - must run in parallel if the
footprint is to be kept manageable. By creating a series of
producer-consumer pairs linked by synchronized buffers,
logical isolation can be achieved while rates of processing
remain coupled. By introducing feedback loops conveying
information about the completion of processing of the
elements, sub-systems can dispose of or precis those
elements without having to be tightly coupled to downstream
processes.
<br/>
<br/>
<strong>Figure 3</strong>
</p>
<div align="center">
<img class="figure" alt="FOP modularisation"
src="images/design/alt.design/processPlumbing.png" />
</div>
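<p>
A chain of producer-consumer pairs might be sketched as below.
This is only an illustration of the plumbing: <span
class="codefrag">PipelineSketch</span>, its <span
class="codefrag">stage()</span> helper, the string payloads and
the end-of-input marker are all hypothetical, and the feedback
loops are not modelled. Each stage runs in its own thread, and the
bounded buffers keep the rates of the stages coupled.
</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical demo class; stages and payloads are stand-ins for FOP's sub-systems.
class PipelineSketch {
    // End-of-input marker; real items must never equal this string.
    static final String EOF = "<EOF>";

    // Starts a stage that consumes from 'in', applies 'work', and produces to a new bounded buffer.
    static BlockingQueue<String> stage(String name, BlockingQueue<String> in,
                                       Function<String, String> work) {
        BlockingQueue<String> out = new ArrayBlockingQueue<>(2);
        Thread thread = new Thread(() -> {
            try {
                String item;
                while (!(item = in.take()).equals(EOF)) {
                    out.put(work.apply(item)); // blocks if the downstream stage is slower
                }
                out.put(EOF); // pass the end marker downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        thread.setDaemon(true);
        thread.start();
        return out;
    }

    // Pushes the inputs through three stages and collects the final results in order.
    static List<String> run(List<String> pageSequences) {
        BlockingQueue<String> events = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> foTrees = stage("fo-tree", events, s -> "fo(" + s + ")");
        BlockingQueue<String> areaTrees = stage("area-tree", foTrees, s -> "area(" + s + ")");
        BlockingQueue<String> pages = stage("render", areaTrees, s -> "page(" + s + ")");

        // The source runs in its own thread so that a full buffer cannot deadlock the collector.
        Thread source = new Thread(() -> {
            try {
                for (String ps : pageSequences) {
                    events.put(ps);
                }
                events.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "xml-parse");
        source.setDaemon(true);
        source.start();

        List<String> rendered = new ArrayList<>();
        try {
            String page;
            while (!(page = pages.take()).equals(EOF)) {
                rendered.add(page);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return rendered;
    }
}
```

<p>
Because every buffer is bounded, a slow renderer eventually stalls
the stages upstream of it, so no stage can accumulate a complete
intermediate result in memory.
</p>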
<p>
In the case of communication between the FO tree
building process and the layout process, feedback is
required in order to parse expressions containing
lengths expressed as a percentage of some enclosing
area. This communication is incorporated within the
general model of inter-phase communication discussed above.
<br/><br/>
<strong>Figure 4</strong>
</p>
<div align="center">
<img class="figure" alt="FO - layout interaction"
src="images/design/alt.design/fo-layout-interaction.png" />
</div>
</div>
</div>
</div>
</body>
</html>