You are viewing a plain text version of this content. The canonical link for it is here.
Posted to j-users@xalan.apache.org by Martin Dengler <ma...@xades.com> on 2003/03/20 22:17:27 UTC

Xalan sort algorithm performance?

Hi,

   Please yell if this is an FAQ or my googling skills need help.

   What sorting algorithm does Xalan use -- and what is the algorithm(s)
runtime and memory footprint in big-O notation?

   I am using Xalan to sort large XML documents: approx 3 levels deep of
a tree with 2-20,000 first child (attached to root) nodes and 30
grandchildren per child node.  Sorting is done on an attribute of the
grandchild nodes.  The performance -- especially in memory usage -- is
abysmal: 10-60 seconds sort time, memory usage is at least O(n^2).

   I've read the w3c performance pointers, but cutting the document down
and doing fewer transforms really doesn't cut it (neither are applicable
to a one-pass sort of a large document).

   Are there any ways to change the sort algortihm's memory/runtime
behavior?

Thanks,
Martin

PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
E450, 2GB real mem, 1GB JVM heap, 2x450mhz processors.

Re: Xalan sort algorithm performance?

Posted by Martin Dengler <ma...@xades.com>.
On Joe Kesselman's advice (thanks Joe!) I've provided a test case in
case anyone can figure out what's taking up time/memory.

Thanks,
Martin

Data to sort:
<DataSet>
  <ResultSetMetaData>
    <ColumnMetaData dtype="number" name="Quantity">
    <ColumnMetaData dtype="text" name="Description">
  </ResultSetMetaData>

  <DataRow type="detail">
    <column name="Quantity">255.00</column>
    <column name="Description">IBM</column>
  </DataRow>

  <DataRow type="detail">
    <column name="Quantity">100.00</column>
    <column name="Description">Sun Microsystems</column>
  </DataRow>
</DataSet>

XSL used for sorting (attachment):




On Thu, 2003-03-20 at 21:17, Martin Dengler wrote:
> Hi,
> 
>    Please yell if this is an FAQ or my googling skills need help.
> 
>    What sorting algorithm does Xalan use -- and what is the algorithm(s)
> runtime and memory footprint in big-O notation?
> 
>    I am using Xalan to sort large XML documents: approx 3 levels deep of
> a tree with 2-20,000 first child (attached to root) nodes and 30
> grandchildren per child node.  Sorting is done on an attribute of the
> grandchild nodes.  The performance -- especially in memory usage -- is
> abysmal: 10-60 seconds sort time, memory usage is at least O(n^2).
> 
>    I've read the w3c performance pointers, but cutting the document down
> and doing fewer transforms really doesn't cut it (neither are applicable
> to a one-pass sort of a large document).
> 
>    Are there any ways to change the sort algortihm's memory/runtime
> behavior?
> 
> Thanks,
> Martin
> 
> PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
> E450, 2GB real mem, 1GB JVM heap, 2x450mhz processors.


Re: Xalan sort algorithm performance?

Posted by Martin Dengler <ma...@xades.com>.
Thanks Scott & Dave.  It is great to be able to get insight and pointers
to direct my investigation so quickly!

Would it be correct to conclude that mergesort: 1) theoretically
requires linear space for the merge step that likely is prohibitive for
large XML documents (assuming linked lists / similar are not involved,
which eliminate the memory requirements); and 2) practically may require
quadractic space for the merge step due to successive merge steps not
re-using memory space allocated to prior merge steps (prior merge-step's
space is left for garbage collection -- this is obviously
implementation-specific)?

Scott/Dave/anyone, what do you think?

More practically: "Large" (XML documents in #1 above) can be a relative
term based on real and/or virtual memory available to the JVM; based on
our results I'd say a document is large if it's in-memory footprint is
above 1/4 the size of the JVM heap (default 64M in 1.2.2).  Obviously
DOM-based representations suffer hugely here.  In my example, 20,000 *
30 = 600,000 objects and with 250bytes each that's approx 143MB.  With a
1GB heap, a mergesort which just required linear space would potentially
take more than 1/4 of heap.  If my conclusion above as to practical
memory usage (need to investigate the NodeSorter class you mentioned) is
correct then the memory needed for this operation would greatly exceed
the heap.

Any comments on this welcome as well.  I'll post what I find in
NodeSorter.

Thanks,
Martin

On Fri, 2003-03-21 at 22:10, David N Bertoni/Cambridge/IBM wrote:
> 
> 
> 
> 
> Just thought I'd chime in.  One of my experiences in doing the sorting work
> in Xalan-C was, that for large data sets, the cost of caching could start
> to overwhelm the comparison cost.  As a result, I left some conditional
> code in that disabled caching, which helped for large documents where the
> sort key values were large.
> 
> Dave
> 
> 
> 
>                                                                                                                          
>                       Scott                                                                                              
>                       Boag/Cambridge/I         To:      Martin Dengler <ma...@xades.com>                                
>                       BM                       cc:      xalan-j-users@xml.apache.org, (bcc: David N                      
>                       <scott_boag@us.i         Bertoni/Cambridge/IBM)                                                    
>                       bm.com>                  Subject: Re: Xalan sort algorithm performance?                            
>                                                                                                                          
>                       03/21/2003 12:10                                                                                   
>                       PM                                                                                                 
>                                                                                                                          
> 
> 
> 
> 
> It's a mergesort (see Robert Sedgewick's Algorithms book).  An issue may be
> the match-and-select it has to do, though these should be cached so that
> once the match-and-select is done, only the comparison need be made.  At
> one time I am under the impression that our sorting was comparatively
> pretty good.  But it's been I while since I've been in that code.
> 
> You might want to root around in the
> org.apache.xalan.transformer.NodeSorter code, or do a line profile of that
> class.
> 
> -scott
> 
> Martin Dengler <ma...@xades.com> wrote on 03/20/2003 04:17:27 PM:
> 
> > Hi,
> >
> >    Please yell if this is an FAQ or my googling skills need help.
> >
> >    What sorting algorithm does Xalan use -- and what is the algorithm(s)
> > runtime and memory footprint in big-O notation?
> >
> >    I am using Xalan to sort large XML documents: approx 3 levels deep of
> > a tree with 2-20,000 first child (attached to root) nodes and 30
> > grandchildren per child node.  Sorting is done on an attribute of the
> > grandchild nodes.  The performance -- especially in memory usage -- is
> > abysmal: 10-60 seconds sort time, memory usage is at least O(n^2).
> >
> >    I've read the w3c performance pointers, but cutting the document down
> > and doing fewer transforms really doesn't cut it (neither are applicable
> > to a one-pass sort of a large document).
> >
> >    Are there any ways to change the sort algortihm's memory/runtime
> > behavior?
> >
> > Thanks,
> > Martin
> >
> > PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
> > E450, 2GB real mem, 1GB JVM heap, 2x450mhz processors.
> > [attachment "signature.asc" deleted by Scott Boag/Cambridge/IBM]


Re: Xalan sort algorithm performance?

Posted by David N Bertoni/Cambridge/IBM <da...@us.ibm.com>.



Just thought I'd chime in.  One of my experiences in doing the sorting work
in Xalan-C was, that for large data sets, the cost of caching could start
to overwhelm the comparison cost.  As a result, I left some conditional
code in that disabled caching, which helped for large documents where the
sort key values were large.

Dave



                                                                                                                         
                      Scott                                                                                              
                      Boag/Cambridge/I         To:      Martin Dengler <ma...@xades.com>                                
                      BM                       cc:      xalan-j-users@xml.apache.org, (bcc: David N                      
                      <scott_boag@us.i         Bertoni/Cambridge/IBM)                                                    
                      bm.com>                  Subject: Re: Xalan sort algorithm performance?                            
                                                                                                                         
                      03/21/2003 12:10                                                                                   
                      PM                                                                                                 
                                                                                                                         




It's a mergesort (see Robert Sedgewick's Algorithms book).  An issue may be
the match-and-select it has to do, though these should be cached so that
once the match-and-select is done, only the comparison need be made.  At
one time I am under the impression that our sorting was comparatively
pretty good.  But it's been I while since I've been in that code.

You might want to root around in the
org.apache.xalan.transformer.NodeSorter code, or do a line profile of that
class.

-scott

Martin Dengler <ma...@xades.com> wrote on 03/20/2003 04:17:27 PM:

> Hi,
>
>    Please yell if this is an FAQ or my googling skills need help.
>
>    What sorting algorithm does Xalan use -- and what is the algorithm(s)
> runtime and memory footprint in big-O notation?
>
>    I am using Xalan to sort large XML documents: approx 3 levels deep of
> a tree with 2-20,000 first child (attached to root) nodes and 30
> grandchildren per child node.  Sorting is done on an attribute of the
> grandchild nodes.  The performance -- especially in memory usage -- is
> abysmal: 10-60 seconds sort time, memory usage is at least O(n^2).
>
>    I've read the w3c performance pointers, but cutting the document down
> and doing fewer transforms really doesn't cut it (neither are applicable
> to a one-pass sort of a large document).
>
>    Are there any ways to change the sort algortihm's memory/runtime
> behavior?
>
> Thanks,
> Martin
>
> PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
> E450, 2GB real mem, 1GB JVM heap, 2x450mhz processors.
> [attachment "signature.asc" deleted by Scott Boag/Cambridge/IBM]


Re: Xalan sort algorithm performance?

Posted by Scott Boag/Cambridge/IBM <sc...@us.ibm.com>.
It's a mergesort (see Robert Sedgewick's Algorithms book).  An issue may 
be the match-and-select it has to do, though these should be cached so 
that once the match-and-select is done, only the comparison need be made. 
At one time I am under the impression that our sorting was comparatively 
pretty good.  But it's been I while since I've been in that code.

You might want to root around in the 
org.apache.xalan.transformer.NodeSorter code, or do a line profile of that 
class.

-scott

Martin Dengler <ma...@xades.com> wrote on 03/20/2003 04:17:27 PM:

> Hi,
> 
>    Please yell if this is an FAQ or my googling skills need help.
> 
>    What sorting algorithm does Xalan use -- and what is the algorithm(s)
> runtime and memory footprint in big-O notation?
> 
>    I am using Xalan to sort large XML documents: approx 3 levels deep of
> a tree with 2-20,000 first child (attached to root) nodes and 30
> grandchildren per child node.  Sorting is done on an attribute of the
> grandchild nodes.  The performance -- especially in memory usage -- is
> abysmal: 10-60 seconds sort time, memory usage is at least O(n^2).
> 
>    I've read the w3c performance pointers, but cutting the document down
> and doing fewer transforms really doesn't cut it (neither are applicable
> to a one-pass sort of a large document).
> 
>    Are there any ways to change the sort algortihm's memory/runtime
> behavior?
> 
> Thanks,
> Martin
> 
> PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
> E450, 2GB real mem, 1GB JVM heap, 2x450mhz processors.
> [attachment "signature.asc" deleted by Scott Boag/Cambridge/IBM]