You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@xalan.apache.org by Christian Haul <ha...@dvs1.informatik.tu-darmstadt.de> on 2003/01/11 21:32:35 UTC

pruning - current status?

Hi.
What is the current status of pruning? The last time I found it mentioned
on this list is Sep 02. Is someone working on it? Is there a special branch
dedicated to that? 
I'm especially interested in doing pruning and stylesheet analysis with a 
DTD / XMLSchema present, perhaps in combination with XSLTC.

	Chris.
-- 
C h r i s t i a n       H a u l
haul@informatik.tu-darmstadt.de
    fingerprint: 99B0 1D9D 7919 644A 4837  7D73 FEF9 6856 335A 9E08

Re: pruning - current status?

Posted by Christian Haul <ha...@dvs1.informatik.tu-darmstadt.de>.
On 13.Jan.2003 -- 11:29 AM, Santiago Pericas-Geertsen wrote:
> From: "Joseph Kesselman" <ke...@us.ibm.com>
> >
> > I've been out of the office for a month, but my understanding is that we
> > currently have two to three branches active -- main line development
> > (HEAD), XSLT 2.0 prep/development ("xslt20"), and -- if I remember
> > correctly -- the xsltc folks had also forked off a branch for some of
> > their own early development, though that may have been folded back in by
> > now. (At some point, of course, all the changes have to be reconciled.)
> >
> 
>  The XSLTC folks are using the "xslt20" branch as well, so that should be a
> good starting point for Christian's work.

Thank you all for the encouraging words! Especially to Joe for the
great round up of the previous discussions.

I've checked out the xslt20 branch now and will work with that. I'll
be looking first into analysing the stylesheet to eliminate unused
templates (ie cannot match according to schema) and possibly
simplifying templates with the help of a schema. But don't hold your
breath ;-)

Later I hope to find time to look at pruning. But since that seems to
touch fundamentals, I'd like to have that with a lower priority.

BTW loading the xslt20 branch into eclipse finds some 3000 unused
imports and a number of "static method XYZ() from type Foo should be
accessed in a static way" eg
org.apache.xalan.client.XSLTProcessorApplet.java line 755:

        m_trustedWorker.yield();

m_trustedWorker is of type Thread and 

		public static native void yield();

Most of that could be fixed automatically by eclipse requesting
"organize imports". If you care for a patch (and don't mind some
reordering of imports), I could send it.

	Chris.
-- 
C h r i s t i a n       H a u l
haul@informatik.tu-darmstadt.de
    fingerprint: 99B0 1D9D 7919 644A 4837  7D73 FEF9 6856 335A 9E08

Re: pruning - current status?

Posted by Santiago Pericas-Geertsen <Sa...@sun.com>.
From: "Joseph Kesselman" <ke...@us.ibm.com>
>
> I've been out of the office for a month, but my understanding is that we
> currently have two to three branches active -- main line development
> (HEAD), XSLT 2.0 prep/development ("xslt20"), and -- if I remember
> correctly -- the xsltc folks had also forked off a branch for some of
> their own early development, though that may have been folded back in by
> now. (At some point, of course, all the changes have to be reconciled.)
>

 The XSLTC folks are using the "xslt20" branch as well, so that should be a
good starting point for Christian's work.

-- Santiago



Re: pruning - current status?

Posted by Joseph Kesselman <ke...@us.ibm.com>.
On Saturday, 01/11/2003 at 11:16 CET, Christian Haul 
<ha...@dvs1.informatik.tu-darmstadt.de> wrote:
> On 11.Jan.2003 -- 04:48 PM, Joseph Kesselman wrote:
> > Current status: We are *not* currently applying pruning to source
> > documents. Logic to analyse a stylesheet, find opportunities for 
pruning,
> > and apply those insights is possible but nontrivial, and we simply 
haven't
> > had time to pursue it.
> thank you for your fast answer. Actually, I'd like to work on that but 
before
> I'd like to check existing efforts. I'm most interested in the first 
point.
> So, are there any pointers you could provide that don't show up when 
searching
> the list for "pruning"?

There hasn't been a lot of discussion beyond the "it would be a good idea 
if we can do it" level.

> Which would be the best CVS branch to work with? HEAD?

I've been out of the office for a month, but my understanding is that we 
currently have two to three branches active -- main line development 
(HEAD), XSLT 2.0 prep/development ("xslt20"), and -- if I remember 
correctly -- the xsltc folks had also forked off a branch for some of 
their own early development, though that may have been folded back in by 
now. (At some point, of course, all the changes have to be reconciled.)

The xslt20 branch includes some significant changes to DTM in order to 
carry additional type information, since 2.0 will be schema-type 
sensitive, as well as some other generalizations to the XSLT data model 
(for example, much of the distinction between nodesets and temporary trees 
goes away in 2.0), So if you're planning on significantly changing DTM 
behaviors, you might want to start with that branch;. If you want to start 
by working on the stylesheet analysis code and just invoke the existing 
tail-pruning logic (or try a filtering approach, or wait until you have 
some analysis results before you start trying to implement the actual 
filter/prune operations) you ought to be able to start with HEAD with 
reasonable confidence that we'll be able to merge your work when the time 
comes. 


Brief brain-dump of pruning-related issues:


Most current implementations of DTM (those based on DTMDefaultBase) were 
explicitly designed to be a write-once-read-many data model, constructed 
in document order (though possibly incrementally) and never altered 
thereafter, and some of our traversal code made simplifying assumptions 
based on the resulting sequential numbering. 

Tail-pruning (removing the nodes most recently added) wasn't too hard to 
fit to that model, since very few of the nodes actually have to be altered 
-- basically, only the parent and previous sibling immediately preceeding 
the pruned section have to be updated, and the storage occupied by the 
removed nodes can easily be reused by overwriting them as further nodes 
arrive.


Pruning an arbitrary tree out of the center of a DTM, on the other hand, 
is much more difficult. Femoving a subtree by patching its parents and 
siblings is conceptually straightforward... but breaks some of the 
simplifying assumptions I mentioned in the traversal code. You can no 
longer assume that an element is immediately followed by its attributes 
(if any), namespace nodes (if any), and children; the next un-pruned child 
may be considerably farther on in the tables. The traversers and iterators 
would have to be checked to make sure they could handle this.

Also: Since one goal of pruning is not just to simplify the model (though 
that could improve search time) but to actually reduce storage overhead, 
we'd need to figure out how to excise some or all of the removed subtree. 
Normally, to remove rows from a table you copy the later rows down into 
the slots previously occupied by the removed rows... but that could be 
hugely expensive. My proposal was that we take advantage of the fact that 
DTMDefaultBase uses suballocated ("chunked") arrays, and excise only as 
much as could be removed as complete chunks -- so only the high-level 
chunk table would need to be updated, which would be far less expensive. 
The disadvantage would be that we'd have to compute which chunks fell 
entirely within the excised subtree -- and portions of that subtree which 
spilled over into adjacent chunks would be left over as waste space, 
though they might be recovered if/when a higher-level subtree was pruned 
later.

Also: Compaction would involve changing the DTM internal index numbers, 
and external node handles, of nodes following the prune. I beleive most 
local references within DTM are now stored as relative to the current 
node, so a subtree could be moved down without breaking its internal 
connections. But  any references from outside the DTM tree -- stored maps 
from IDs to nodes, or iterators currently in progress, or variables (!?!) 
-- would have to be found and updated (which suggests the prune should be 
done at a time when few of those references exist, if possible -- and, 
obviously, when nothing is looking inside the subtree which will be pruned 
away!).  I hadn't even started sketching that part of the code, back when 
I was analysing this proposal.


It may be that DTM, as it stands, isn't the optimal data model for a 
pruning XSLT processor. This is one of several reasons we're considering 
changing the DTM API  to access the model through cursor/iterator objects 
instead of node handles; that'd give us more freedom to play with 
alternative implementations without paying an overhead for translating 
between numeric references and what ever the native form is. (It should 
also save us some conversion overhead, which ought to be a good thing even 
for the existing DTM implementations... and it might be a bit cleaner 
architecturally.)

______________________________________
Joe Kesselman  / IBM Research


Re: pruning - current status?

Posted by Christian Haul <ha...@dvs1.informatik.tu-darmstadt.de>.
On 11.Jan.2003 -- 04:48 PM, Joseph Kesselman wrote:
> Current status: We are *not* currently applying pruning to source 
> documents. Logic to analyse a stylesheet, find opportunities for pruning, 
> and apply those insights is possible but nontrivial, and we simply haven't 
> had time to pursue it.
 
> General reminder: Xalan is open-source. If there's a feature you need that 
> we aren't spending enough time on, that may be an opportunity to 
> contribute...

Joe,
thank you for your fast answer. Actually, I'd like to work on that but before
I'd like to check existing efforts. I'm most interested in the first point.
So, are there any pointers you could provide that don't show up when searching
the list for "pruning"?
Which would be the best CVS branch to work with? HEAD?

	Chris.
-- 
C h r i s t i a n       H a u l
haul@informatik.tu-darmstadt.de
    fingerprint: 99B0 1D9D 7919 644A 4837  7D73 FEF9 6856 335A 9E08

Re: pruning - current status?

Posted by Joseph Kesselman <ke...@us.ibm.com>.
Current status: We are *not* currently applying pruning to source 
documents. Logic to analyse a stylesheet, find opportunities for pruning, 
and apply those insights is possible but nontrivial, and we simply haven't 
had time to pursue it.

"Tail-pruning" (removing the nodes most recently added to a DTM) is 
working, and is in use as part of our "shared DTM" mechanism for Temporary 
Trees/Result Tree Fragments. 

Pruning subtrees from the middle of a DTM is *not* currently implemented; 
that's considerably harder given our current data structures, though we 
made some changes to prepare for it. (I sketched out one approach to that 
code some time ago, but shelved it due to other demands on my time and 
concerns about its likely performance.)

Some of the proposed changes to our use of DTMs may make pruning a bit 
easier to implement.


General reminder: Xalan is open-source. If there's a feature you need that 
we aren't spending enough time on, that may be an opportunity to 
contribute...

______________________________________
Joe Kesselman  / IBM Research