Posted to j-users@xerces.apache.org by "Raghavendra, Karthik" <kr...@proxicom.com> on 2002/02/06 04:44:09 UTC

Performance Problems using large XML DOM and XPath

Hi,

I have an XML file containing approximately 14800 product catalog
records; the file is about 80 MB.  I am using Xerces 1.4.4
to validate and parse the XML document and create a DOM tree.  I am
using a Solaris box with 2 GB of RAM, so memory is not an issue.  I
am using the XPath capabilities of Xalan 2 to traverse the DOM tree and
access the required data.  I am noticing that performance degrades over
time.  For example, initial processing (the first hundred records) returns
within 800 milliseconds, while the time for the last set of records
(14000 and above) is about 77 seconds.  I am using CachedXPathAPI
instead of XPathAPI, since XPathAPI resulted in significantly
greater response times.

The relevant code snippets are below:

Main Method:
-------------
DOMParser parser = new DOMParser();
parser.parse(XMLFile);
Document doc = parser.getDocument();
NodeList list = doc.getElementsByTagName("PRODUCT");
CachedXPathAPI xPath = new CachedXPathAPI();
Node prdNode = null;

for (int index=0;index<list.getLength();index++)
{
	prdNode = list.item(index);
	processProduct(prdNode,xPath);
	...
	...	
}

ProcessProduct Method (this method uses XPath to get the relevant
information.  The product node for the current product is passed as a
parameter):
------------------------------------------------------------------------
String prdNum = getText(xPath.selectSingleNode(prdNode,"child::ProductNumber"));
String title = getText(xPath.selectSingleNode(prdNode,"child::Title"));
String language = getText(xPath.selectSingleNode(prdNode,"child::Language"));
String publisher = getText(xPath.selectSingleNode(prdNode,"child::Rights/child::Publisher"));
...
...
...

getText Method (this method concatenates all the text node information
and returns it):
------------------------------------------------------------------------
StringBuffer text = new StringBuffer();
NodeList list = node.getChildNodes();
for (int index=0;index < list.getLength();index++)
{
	if (list.item(index).getNodeType() == Node.TEXT_NODE)
		text.append(list.item(index).getNodeValue());
}
return (text.toString());
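For reference, the snippet above drops into a runnable shape like this (a
sketch only; the class and element names are mine, and StringBuilder is
swapped in for StringBuffer since no synchronization is needed):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class GetTextDemo {
    // Concatenate the values of the node's immediate text children,
    // skipping element children, comments, etc.
    static String getText(Node node) {
        StringBuilder text = new StringBuilder();
        NodeList list = node.getChildNodes();
        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index).getNodeType() == Node.TEXT_NODE)
                text.append(list.item(index).getNodeValue());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Node title = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                "<Title>Catalog 2002</Title>".getBytes("UTF-8")))
            .getDocumentElement();
        System.out.println(getText(title)); // prints "Catalog 2002"
    }
}
```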


I executed the main method as indicated in the code snippet and observed
that the processing time increased gradually.  I then reversed the loop
in the main method to start from the last product record (14800) and
noticed that the processing time was 77 seconds.  However, it dropped
significantly after a few iterations (100 records) to about 25 seconds.

What I do not understand is the reason for this increase in processing
time.  I have compared the XML records to confirm that there is nothing
wrong with the data.  The structure and content for all 14800 records is
the same.  Am I doing something wrong?  Is there an issue with using
XPath for large DOMs?  Is there an XPath bug?  I am passing the product
node in the DOM to be traversed and I would think that the XPath lookup
should be the same for product #14000 as it is for product #1.  However,
the pattern suggests that there is some kind of processing overhead when
using XPath and going deeper in the DOM to retrieve data. 

Any and all help is greatly appreciated.

Thanks in advance,
Karthik


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Performance Problems using large XML DOM and XPath

Posted by Elena Litani <el...@ca.ibm.com>.
"Raghavendra, Karthik" wrote:
> I then reversed the loop
> in the main method to start from the last product record (14800) and
> noticed that the processing time was 77 seconds.  However, it dropped
> significantly after a few iterations (100 records) to about 25 seconds.
 
> What I do not understand is the reason for this increase in processing
> time.  

Xerces by default uses a deferred DOM - the parser creates a compact
structure corresponding to the DOM in memory, and it does not create any
nodes up front. The nodes are created when the user tries to access them.
In your example, accessing the last product record triggers the fluffing
of the tree - the Xerces DOM implementation creates in memory every node
on the path to the node that you are trying to access. Given the size of
the document, that takes a long time.

On the other hand, over time your tree becomes more and more expanded
(since you access more and more nodes) and processing time drops
significantly.

If memory is not a concern for you, you might want to turn off
deferred node expansion
("http://apache.org/xml/features/dom/defer-node-expansion"), so that the
parser creates the full DOM tree in memory. The access time should
drop in this case.
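A minimal sketch of that fix, written against the JAXP API in current JDKs
(whose bundled parser is derived from Xerces and recognizes the same
feature URI - the class and element names here are mine):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FullExpansionDemo {
    // Parse with deferred node expansion turned off, so the complete
    // DOM tree is materialized at parse time instead of node-by-node
    // on first access.
    static Document parseFullyExpanded(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setFeature(
            "http://apache.org/xml/features/dom/defer-node-expansion", false);
        return dbf.newDocumentBuilder()
                  .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseFullyExpanded("<CATALOG><PRODUCT>1</PRODUCT></CATALOG>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints "CATALOG"
    }
}
```

With the org.apache.xerces.parsers.DOMParser used in the original post, the
equivalent would be calling parser.setFeature(...) with the same feature URI
and false before parser.parse(XMLFile).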



-- 
Elena Litani / IBM Toronto



Re: Performance Problems using large XML DOM and XPath

Posted by Dane Foster <df...@equitytg.com>.
Use dom4j.  It has built-in XPath support and a SAX filtering mechanism:
http://www.dom4j.org.  You can plug in the Apache parser to provide support
for Schema validation.

Dane Foster
http://www.equitytg.com
954.360.9800


Re: Performance Problems using large XML DOM and XPath

Posted by Niels Peter Strandberg <ni...@npstrandberg.com>.
I made an experimental SAX filter some time ago. It might help you!

Instead of getting the result as a W3C Document, you can supply your own 
ContentHandler that writes out a file or does whatever else you want!


 From my first presentation of the filter:
------------------------------------------------------------------------------------------------
I have made an experimental SAX XMLFilter. It allows you to "filter" out 
the information in an xml document that you want to work with - using 
XPath - and skip the rest. You can place the filter anywhere in your 
application where an XMLFilter can be used.

- I don't know if this has already been done by others?

The whole idea is to "filter" out the fragments of the xml document 
that you specify using an XPath expression, e.g. 
SaxXPathFragmentFilter(saxparser, "/cellphone/*/model[@id='1234']", 
"result").  Build a DOM tree from the result, or feed the SAX events 
into an XSLT transformer and do some XSLT transformations.

The big win is that you don't have to build a large DOM tree if you 
need only part of the information in a large xml document. You just 
specify which fragments you want using XPath, and the result will be a 
much smaller DOM tree, which requires less processing, memory, etc.

Let us say that you have a large document with spare parts for Volvo 
vehicles. You want to make a list of engine parts for the S70 car model. 
What you do is specify the XPath (location path) that you want to cut out 
of the document, e.g. "/catalog/cars/s70/parts/engine".

           // your sax parser here
           XMLReader parser =
                     XMLReaderFactory.createXMLReader(
                               "org.apache.xerces.parsers.SAXParser");

           // Get instances of your handlers
           SAXHandler jdomsaxhandler = new SAXHandler();

           String xpath = "/catalog/cars/s70/parts/engine";
           String rootName = "s70engineparts"; // this will be the new root

           // set up the SaxXPathFragmentFilter
           SaxXPathFragmentFilter xpathfilter =
                     new SaxXPathFragmentFilter(parser, xpath, rootName);
           xpathfilter.setContentHandler(jdomsaxhandler);

           // Parse the document
           xpathfilter.parse(uri);

           // get the Document
           Document doc = jdomsaxhandler.getDocument();


This SaxXPathFragmentFilter is purely experimental. It is spaghetti code. 
I just sat down with an idea and started to code, and the code is not 
very pretty. It needs to be rewritten.


The XPath support is very limited for now. Here is the XPath you can do 
today with this filter:
      "/a/b" - An absolute path.
      "/a/*/c" - An absolute path where element no. 2 ("*") can be any element.
      "/a/*/c[@att='value']" - If element c has an attribute with 'value'.
      "/a/*/c[contains='value']" - If element c's first child node is a text node that contains 'value'.
      "/a/*/c[starts-with='value']" - If element c's first child node is a text node that starts with 'value'.
      "/a/*/c[ends-with='value']" - If element c's first child node is a text node that ends with 'value'.
      "/a/*/c['value']" - If element c's first child node is a text node that is 'value'.
      "/a/*/c[is='value']" - As above.

As you can see, the XPath options are very limited. But I think that once 
I find a way to implement the "//" pattern, the filter will be even more 
powerful.
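For what it's worth, the absolute-path cases above (with "*" wildcards) can
be matched with nothing more than a stack of open element names. A minimal,
self-contained sketch of that idea as a standard XMLFilterImpl - all names
here are my own, not the actual SaxXPathFragmentFilter code, and it skips
the predicate and new-root features:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Forward SAX events only for fragments matching an absolute path
// like "/a/*/c", where "*" matches any single element name.
public class PathFragmentFilter extends XMLFilterImpl {
    private final String[] steps;                        // path split into steps
    private final List<String> stack = new ArrayList<>(); // open element names
    private int matchDepth = -1;                         // depth where a match began

    public PathFragmentFilter(XMLReader parent, String path) {
        super(parent);
        this.steps = path.substring(1).split("/");
    }

    private boolean pathMatches() {
        if (stack.size() != steps.length) return false;
        for (int i = 0; i < steps.length; i++)
            if (!steps[i].equals("*") && !steps[i].equals(stack.get(i)))
                return false;
        return true;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        stack.add(qName);
        if (matchDepth < 0 && pathMatches()) matchDepth = stack.size();
        if (matchDepth >= 0) super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (matchDepth >= 0) super.endElement(uri, local, qName);
        if (matchDepth == stack.size()) matchDepth = -1;  // left the fragment
        stack.remove(stack.size() - 1);
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (matchDepth >= 0) super.characters(ch, start, len);
    }

    // Convenience: collect the qNames of elements that survive the filter.
    public static List<String> surviving(String xml, String path) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        PathFragmentFilter filter = new PathFragmentFilter(reader, path);
        List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                names.add(q);
            }
        });
        filter.parse(new InputSource(
            new ByteArrayInputStream(xml.getBytes("UTF-8"))));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><cars><s70><parts><engine><piston/></engine>"
                   + "<body/></parts></s70></cars></catalog>";
        System.out.println(surviving(xml, "/catalog/cars/s70/parts/engine"));
        // prints [engine, piston]
    }
}
```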

I have problems building a DOM tree from the result using Xerces 
and Saxon, but with JDOM it works great. This needs to be fixed.

You cannot rely on the result always being correct, so don't use 
this in any application; use it only for experimentation.

You can find the code at: 
http://www.npstrandberg.com/projects/saxxpathfragmentfilter/saxxpathfragmentfilter.tar.gz

My goal with this filter is to keep it reliable, simple, fast and 
clean. If you want to contribute to this project, you are welcome. 
The filter will be released under some kind of open-source license 
(if we get that far!).

Test it and give me some feedback on what you think.


-----------------------------------------------------------------------------------------------


Regards, Niels Peter Strandberg



