Posted to j-users@xalan.apache.org by Thomas Maschutznig <tm...@new10.com> on 2007/09/04 11:22:13 UTC

Decreasing XPath Performance on large files?

I am using xalan-j 2.7.0 on Java 1.5 together with JAXP 1.3, similar
to the xalan-j ApplyXPathJAXP sample. I have to read data from a 20MB
XML file with approx. 3000 nodes directly below the document root;
each one of these nodes contains some sub-nodes with attributes. I
want to partially extract data from this file and create Java beans,
so I chose XPath expressions to extract exactly the tag and
attribute data I need.
First, I search for all of those 3000 nodes directly below root like
this:
   XPath xPath = XPathFactory.newInstance().newXPath();
   org.w3c.dom.NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", inputSource, XPathConstants.NODESET);

Then I go through all matching nodes in a for-loop and extract data  
from each node's content using around 5 to 10 relative XPath  
expressions.
   for (int i = 0; i < nodes.getLength(); i++) {
     System.out.println("Identity Count is : " + i);
     node = (org.w3c.dom.Element) nodes.item(i);
     firstName = xPath.evaluate("Attribute[@name='firstname']/@value", node);
     lastName = xPath.evaluate("Attribute[@name='lastname']/@value", node);
     // some more similar lines here...
   }

I can read "Identity Count is : x" for the first 60 to 90 lines very
fast, within 2 or 3 seconds, but then it starts slowing down, and at
a count of around 1500 it takes up to 10 seconds, later maybe even
more, to process a single node (even after JVM and GC options were
tuned; before that it was significantly worse).
I tuned the JVM options, maximizing heap space and resizing eden
space; I can see garbage collections happen every 20 to 30 seconds.
My JVM options (on Windows 2003 x64, JDK 1.5.0_11 64-bit) right now
are:
   -Xms4g -Xmx4g -XX:NewSize=2g -XX:ThreadStackSize=16384 -XX:+UseParallelGC -server -XX:+AggressiveOpts

Classpath is: .:IMR_Import_Lib.jar:antlr-2.7.6.jar:asm.jar:asm-attrs.jar:c3p0-0.9.1.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ejb3-persistence.jar:hibernate3.jar:jdbc2_0-stdext.jar:jta.jar:log4j-1.2.14.jar:ojdbc14.jar:serializer.jar:xalan.jar:xercesImpl.jar:xml-apis.jar:hibernate-annotations.jar:hibernate-commons-annotations.jar

(there is a .properties file on .)

I also tried a modified version of the first xPath.evaluate() call,
explicitly creating a Document object from the XML file, to no avail:
     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
     DocumentBuilder db = dbf.newDocumentBuilder();
     Document d = db.parse(new File(this.xmlFilePathName));

     XPath xPath = XPathFactory.newInstance().newXPath();

     NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", d, XPathConstants.NODESET);

I am a little stuck here with the drastically decreasing performance
about halfway through the XML file. Did I miss anything in my code?
I know that using a lot of XPath expressions like I do is very
expensive, but why would the second half of the file take five times
as long as the first one, when the first 100 /Waveset/User nodes are
processed within seconds?

  Thomas

Re: Decreasing XPath Performance on large files?

Posted by jimmy Zhang <jz...@ximpleware.com>.
You should try vtd-xml:
http://vtd-xml.sf.net
http://www.devx.com/xml/Article/34045

----- Original Message ----- 
From: "Santiago Pericas-Geertsen" <Sa...@Sun.COM>
To: "Thomas Maschutznig" <tm...@new10.com>
Cc: <xa...@xml.apache.org>
Sent: Tuesday, September 04, 2007 7:45 AM
Subject: Re: Decreasing XPath Performance on large files?


> [quoted message snipped; Santiago's reply appears in full below]


Re: Decreasing XPath Performance on large files?

Posted by Santiago Pericas-Geertsen <Sa...@Sun.COM>.
Thomas,

  I honestly think that you should try to solve this problem without
using XPath, or at the very least without the XPath API in JAXP.
Xalan does not run XPath queries directly on a W3C DOM instance;
instead it creates its own internal tree called the DTM. Because of
how the XPath API is designed, this happens every time you call
evaluate(). The memory footprint of your application must be
enormous, and increasing the heap size only helps for a while, until
the VM needs to manage/housekeep it.

  From your description, it seems that your queries are quite simple
and do not involve reverse axes. Why can't you just stream through
the document using StAX or SAX and pick up the values you need? No
matter how fast the XPath implementation, streaming will be several
times faster on large documents like yours.
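
A minimal StAX sketch of the streaming pass suggested here, assuming
the element and attribute names from the XPath expressions in the
original post (User, Attribute, name, value) and a placeholder file
name; bean construction is omitted:

   import java.io.FileInputStream;
   import javax.xml.stream.XMLInputFactory;
   import javax.xml.stream.XMLStreamConstants;
   import javax.xml.stream.XMLStreamReader;

   public class WavesetStreamer {
     public static void main(String[] args) throws Exception {
       XMLInputFactory factory = XMLInputFactory.newInstance();
       XMLStreamReader reader =
           factory.createXMLStreamReader(new FileInputStream("waveset.xml"));
       String firstName = null, lastName = null;
       while (reader.hasNext()) {
         int event = reader.next();
         if (event == XMLStreamConstants.START_ELEMENT
             && "Attribute".equals(reader.getLocalName())) {
           // Pick up only the attribute values we care about.
           String name = reader.getAttributeValue(null, "name");
           if ("firstname".equals(name)) {
             firstName = reader.getAttributeValue(null, "value");
           } else if ("lastname".equals(name)) {
             lastName = reader.getAttributeValue(null, "value");
           }
         } else if (event == XMLStreamConstants.END_ELEMENT
             && "User".equals(reader.getLocalName())) {
           // One User element is complete: build the Java bean here.
           System.out.println(firstName + " " + lastName);
           firstName = lastName = null;
         }
       }
       reader.close();
     }
   }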

-- Santiago

On Sep 4, 2007, at 5:22 AM, Thomas Maschutznig wrote:

> [original message snipped]


Re: Decreasing XPath Performance on large files?

Posted by ke...@us.ibm.com.
Alternatively, you may want to look at the CachedXPathAPI class. This keeps
a copy of the document in memory in Xalan's native format, so you only pay
the model-construction cost once for multiple XPath evaluations. If you
_alter_ the document the cached copy becomes invalid, so this isn't good
for search-and-alter-and-search-again kinds of tasks... but if you're just
retrieving data, it works well. (This is the approach Xalan's stylesheet
processing uses, for example.)
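
A minimal sketch of that approach, assuming the document and XPath
expressions from the original post (CachedXPathAPI lives in the
org.apache.xpath package; the file name is a placeholder):

   import java.io.File;
   import javax.xml.parsers.DocumentBuilderFactory;
   import org.apache.xpath.CachedXPathAPI;
   import org.w3c.dom.Document;
   import org.w3c.dom.Node;
   import org.w3c.dom.NodeList;

   public class CachedXPathDemo {
     public static void main(String[] args) throws Exception {
       Document d = DocumentBuilderFactory.newInstance()
           .newDocumentBuilder().parse(new File("waveset.xml"));

       // One CachedXPathAPI instance keeps one DTM for the whole
       // document, so the model-construction cost is paid once rather
       // than on every evaluation.
       CachedXPathAPI cachedXPath = new CachedXPathAPI();
       NodeList users = cachedXPath.selectNodeList(d, "/Waveset/User");
       for (int i = 0; i < users.getLength(); i++) {
         Node user = users.item(i);
         String firstName = cachedXPath
             .eval(user, "Attribute[@name='firstname']/@value").str();
         String lastName = cachedXPath
             .eval(user, "Attribute[@name='lastname']/@value").str();
         // ...build the bean from firstName/lastName...
       }
     }
   }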

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)

Re: Decreasing XPath Performance on large files?

Posted by Thomas Maschutznig <tm...@new10.com>.
Thank you all for your suggestions and explanations. I was already
able to increase performance a lot by skipping some very simple XPath
expressions and using the (W3C) DOM Nodes and Elements directly
instead. Some of these expressions must have been evaluated several
thousand or ten thousand times, and voilà: instead of more than 20
hours I am down to one hour on JDK 1.5. Using JDK 6 speeds things up
some more.
Now I know and understand why SAX would've been a faster way to go...
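
A sketch of what that DOM-based lookup might look like, assuming the
element and attribute names from the first post (this is an
illustration, not the poster's actual code):

   import org.w3c.dom.Element;
   import org.w3c.dom.Node;
   import org.w3c.dom.NodeList;

   final class UserDomHelper {
     // Replaces xPath.evaluate("Attribute[@name='...']/@value", user)
     // with a single pass over the User element's child elements.
     static String getAttributeValue(Element user, String attrName) {
       NodeList children = user.getChildNodes();
       for (int i = 0; i < children.getLength(); i++) {
         Node child = children.item(i);
         if (child.getNodeType() == Node.ELEMENT_NODE
             && "Attribute".equals(child.getNodeName())) {
           Element attr = (Element) child;
           if (attrName.equals(attr.getAttribute("name"))) {
             return attr.getAttribute("value");
           }
         }
       }
       return null;
     }
   }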

Cheers,
  thomas