You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jackrabbit.apache.org by Alessandro Bologna <al...@gmail.com> on 2007/03/15 17:34:50 UTC

Query performances

Hi,

We have been incurring in an interesting behavior doing searches on a quite
large repository (~1,000,000 nodes).
The test data is made of a tree of nodes of type nt:unstructured, reference
able, with two numeric properties (a sequential count of the node and a
random number between 0 and the count). Each node has a reference to the
parent, and up to 100 child nodes, and is named n<m> where m is the index of
the node, related to the parent node.
So, for instance, /load/n0 is the first node, /load/n1 the second to
/load/n99.
Then each one of them has 100 children and so on, so that a valid path, for
instance, is /load/n23/n34/n50.
One node out of 6 has attached a nt:file node as well, in order to test full
text searches. If requested, I can provide the code to create the test set.

The strange behavior that prompted me to write to this mailing list, is the
following:

Say that I am searching for a node that contains the word 'beatles' at some
level under the node /load/n40 and I use the following query:
*/jcr:root/load/n40//*[jcr:contains(.,'beatles')]* the execution time is
1672ms
If I use instead:
*/jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')]*  the execution time
is 19749ms

The second query, in theory, could execute faster than the first, because I
am providing more information (only nodes at the 4th level under /load/n40)
but takes 10 times longer to execute.
Is there a reason why?

The other, way more worrisome problem, appears to be the opposite:
I have executed the following two queries
/jcr:root/load/n50/n2/* ==> 931ms
/jcr:root/load/n50/n2/*/* ==> 661ms

The first is returning all nodes one level below /load/n50/n2 and the second
two levels below. There are no other nodes under that.
When I tried the following query, which would return the same nodes in one
operation, the result was surprising (in a bad way)
/jcr:root/load/n50/n2//* ==>*353769ms*
**
The CPU goes 100%, I see in the jackrabbit logs a lot of entries similar to:
DocNumberCache: size=1024/1024, #accesses=17039, #hits=167, #misses=16872,
cacheRatio=1% (DocNumberCache.java, line 155)

and then finally, *some 5 minutes later*, I get the result.
Even if I restrict the query, it still takes the same time:
/jcr:root/load/n50/n2/m96//* and there's maybe only an hundred nodes under
that.

I have the exact same behavior if I try with the SQL syntax: select * from
nt:base where jcr:path like '/load/n50/n2/n96/%'

The version of JR is 1.2.2. The backend is Oracle 10g, and I am running the
application on Tomcat 5.5 with jdk 1.5 and 1GB assigned to the JVM (on
Windows)

Does anybody have any idea on why is this happening and if there is a
workaround?
Thanks
Alessandro Bologna


**

Re: Query performances

Posted by Alessandro Bologna <al...@gmail.com>.
Marcel,

just wanted to get back to you (and the list as well). I downloaded
jackrabbit-webapp-1.3-SNAPSHOT and run the same tests again.
Performances are much better and queries seem to be much more optimized.
Congratulations for the improvements.
Alessandro



On 3/28/07, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Hi Alessandro,
>
> Alessandro Bologna wrote:
> > Now I have found another unusual behavior, and I was hoping you could
> > explain this too...
> > These queries have been executed in sequence (without restarting):
> >
> >
> > Executing query: /jcr:root/load/n10/n33/*[@random>10000]
> > Query execution time:10245ms
> > Number of nodes:91
> >
> >
> >
> > Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
> > @random<10000000]
> > Query execution time:20409ms
> > Number of nodes:91
> >
> >
> >
> > Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
> > @random<10000000 and @random<10000001]
> > Query execution time:30053ms
> > Number of nodes:91
> >
> >
> > I think that the execution time on the first query is already quite high
> > (an
> > equality query takes just a few millisecond),
>
> This has already been improved with
> http://issues.apache.org/jira/browse/JCR-804
>
> > but what I am more
> > disconcerted about is that the second query (with two condition, the
> second
> > being a 'dummy' one since it is true for each of the 91 nodes returned
> by
> > the second query) takes double the time, and the third query (with the
> > third
> > condition being basically the same as the first one) takes three times
> as
> > much.
> >
> > Typically I would expect an 'and' query to be executed on the results of
> > the
> > first one, and therefore to take just a little bit less.
> >
> > So the questions are:
> > 1. why does it takes so long to find 91 nodes in the first query
>
> this is caused by:
> - MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)
> - @random>10000 (probably) selects a great number of nodes, which are
> later
> excluded again because of the path constraint
>
> > 2. why the second and third query take as much time as the first times
> the
> > number of expressions?
>
> each of the expressions is evaluated independently and in a second step
> 'and'ed
> together. therefore the predominant cost in your query seems to be the
> individual expressions. because each of the range expressions selects a
> lot of
> nodes lucene cannot optimize the execution well. see above for a
> workaround.
>
> > 3. is there a workaround to do range queries?
>
> partitioning the random property into multiple properties may help. the
> basic
> idea is that you split the random number into a sum of multiple values.
>
> @random = 34045
>
> would become:
>
> @random1 = 5
> @random10 = 4
> @random100 = 0
> @random1000 = 4
> @random10000 = 3
>
> later if you search for all random properties with a value larger than
> 12000 you
> would have a query:
> //*[(@random10000 = 1 and @random1000 >= 2) or (random10000 >= 2)]
>
> because the distinct values of the split up properties are small, lucene
> can
> much better optimize the query execution.
>
> regards
> marcel
>

Re: Query performances

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Alessandro,

Alessandro Bologna wrote:
> Now I have found another unusual behavior, and I was hoping you could
> explain this too...
> These queries have been executed in sequence (without restarting):
> 
> 
> Executing query: /jcr:root/load/n10/n33/*[@random>10000]
> Query execution time:10245ms
> Number of nodes:91
> 
> 
> 
> Executing query: /jcr:root/load/n10/n33/*[@random>10000 and 
> @random<10000000]
> Query execution time:20409ms
> Number of nodes:91
> 
> 
> 
> Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
> @random<10000000 and @random<10000001]
> Query execution time:30053ms
> Number of nodes:91
> 
> 
> I think that the execution time on the first query is already quite high 
> (an
> equality query takes just a few millisecond),

This has already been improved with http://issues.apache.org/jira/browse/JCR-804

> but what I am more
> disconcerted about is that the second query (with two condition, the second
> being a 'dummy' one since it is true for each of the 91 nodes returned by
> the second query) takes double the time, and the third query (with the 
> third
> condition being basically the same as the first one) takes three times as
> much.
> 
> Typically I would expect an 'and' query to be executed on the results of 
> the
> first one, and therefore to take just a little bit less.
> 
> So the questions are:
> 1. why does it takes so long to find 91 nodes in the first query

this is caused by:
- MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)
- @random>10000 (probably) selects a great number of nodes, which are later 
excluded again because of the path constraint

> 2. why the second and third query take as much time as the first times the
> number of expressions?

each of the expressions is evaluated independently and in a second step 'and'ed 
together. therefore the predominant cost in your query seems to be the 
individual expressions. because each of the range expressions selects a lot of 
nodes lucene cannot optimize the execution well. see above for a workaround.

> 3. is there a workaround to do range queries?

partitioning the random property into multiple properties may help. the basic 
idea is that you split the random number into a sum of multiple values.

@random = 34045

would become:

@random1 = 5
@random10 = 4
@random100 = 0
@random1000 = 4
@random10000 = 3

later if you search for all random properties with a value larger than 12000 you 
would have a query:
//*[(@random10000 = 1 and @random1000 >= 2) or (random10000 >= 2)]

because the distinct values of the split up properties are small, lucene can 
much better optimize the query execution.

regards
  marcel

Re: Query performances

Posted by Alessandro Bologna <al...@gmail.com>.
Hi Marcel,

sorry it took me quite some time to reply, and thanks for your explanation.
I have done the to tests that you asked about below, being sure to shutdown
the repository every time.

Executing query: /jcr:root/load/n50/n2/*
Query execution time:501ms
Number of nodes:101



Executing query: /jcr:root/load/n50/n2/*/*
Query execution time:2824ms
Number of nodes:41



So the result is pretty much in line with what you were expecting.

Now I have found another unusual behavior, and I was hoping you could
explain this too...
These queries have been executed in sequence (without restarting):


Executing query: /jcr:root/load/n10/n33/*[@random>10000]
Query execution time:10245ms
Number of nodes:91



Executing query: /jcr:root/load/n10/n33/*[@random>10000 and @random<10000000]
Query execution time:20409ms
Number of nodes:91



Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
@random<10000000 and @random<10000001]
Query execution time:30053ms
Number of nodes:91


I think that the execution time on the first query is already quite high (an
equality query takes just a few millisecond), but what I am more
disconcerted about is that the second query (with two condition, the second
being a 'dummy' one since it is true for each of the 91 nodes returned by
the second query) takes double the time, and the third query (with the third
condition being basically the same as the first one) takes three times as
much.

Typically I would expect an 'and' query to be executed on the results of the
first one, and therefore to take just a little bit less.

So the questions are:
1. why does it takes so long to find 91 nodes in the first query
2. why the second and third query take as much time as the first times the
number of expressions?
3. is there a workaround to do range queries?

Thanks,
Alessandro Bologna






/jcr:root/load/n10/n33/*[@random%3C1000%20and%20@random%3E140]


On 3/19/07, Marcel Reutegger <ma...@gmx.net> wrote:
>
> Hi Alessandro,
>
> Alessandro Bologna wrote:
> > We have been incurring in an interesting behavior doing searches on a
> quite
> > large repository (~1,000,000 nodes).
> > The test data is made of a tree of nodes of type nt:unstructured,
> reference
> > able, with two numeric properties (a sequential count of the node and a
> > random number between 0 and the count). Each node has a reference to the
>
> > parent, and up to 100 child nodes, and is named n<m> where m is the
> > index of
> > the node, related to the parent node.
> > So, for instance, /load/n0 is the first node, /load/n1 the second to
> > /load/n99.
> > Then each one of them has 100 children and so on, so that a valid path,
> for
> > instance, is /load/n23/n34/n50.
> > One node out of 6 has attached a nt:file node as well, in order to test
> > full
> > text searches. If requested, I can provide the code to create the test
> set.
> >
> > The strange behavior that prompted me to write to this mailing list, is
> the
> > following:
> >
> > Say that I am searching for a node that contains the word 'beatles' at
> some
> > level under the node /load/n40 and I use the following query:
> > */jcr:root/load/n40//*[jcr:contains(.,'beatles')]* the execution time is
> > 1672ms
> > If I use instead:
> > */jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')]*  the execution
> time
> > is 19749ms
> >
> > The second query, in theory, could execute faster than the first,
> because I
> > am providing more information (only nodes at the 4th level under
> /load/n40)
> > but takes 10 times longer to execute.
> > Is there a reason why?
>
> there are basically two reasons why the second query takes more time to
> execute:
> - the index does not contain depth (level) information of a node. the
> depth of a
> node is not stable and may change even if the node itself is not changed.
> if a
> subtree of nodes is moved to another location the depth of all nodes in
> the
> subtree changes. the query handler would have to re-index the whole
> subtree.
> - multiple child axis with just a * as name test are not optimized.
> /jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')] is translated into
> multiple ChildAxisQuerys each resolves the context nodes and provides a
> new
> context with the nodes that are the child nodes of the previous context.
> internally the query handler will temporarily have a set of nodes that
> includes
> all nodes at level 4 under /jcr:root/load/n40. for the query
> /jcr:root/load/n40//*[jcr:contains(.,'beatles')] the index will look up
> the
> nodes that match the fulltext condition and then filter out the ones that
> do not
> have /jcr:root/load/n40 as an ancestor. that operation involves less nodes
> and
> executes faster.
>
> > The other, way more worrisome problem, appears to be the opposite:
> > I have executed the following two queries
> > /jcr:root/load/n50/n2/* ==> 931ms
> > /jcr:root/load/n50/n2/*/* ==> 661ms
>
> that's indeed strange. maybe you get this result because the cache is
> filled up
> by the first query and the second one can take advantage of the pre-filled
> cache. can you please run those queries a second time just to make sure
> that
> both run against the same cache state?
>
> > The first is returning all nodes one level below /load/n50/n2 and the
> > second
> > two levels below. There are no other nodes under that.
> > When I tried the following query, which would return the same nodes in
> one
> > operation, the result was surprising (in a bad way)
> > /jcr:root/load/n50/n2//* ==>*353769ms*
> > **
> > The CPU goes 100%, I see in the jackrabbit logs a lot of entries similar
> > to:
> > DocNumberCache: size=1024/1024, #accesses=17039, #hits=167,
> #misses=16872,
> > cacheRatio=1% (DocNumberCache.java, line 155)
> >
> > and then finally, *some 5 minutes later*, I get the result.
> > Even if I restrict the query, it still takes the same time:
> > /jcr:root/load/n50/n2/m96//* and there's maybe only an hundred nodes
> under
> > that.
>
> unfortunately those are queries that are not optimized at all and will
> result in
> a full index traversal. see below for a workaround.
>
> > I have the exact same behavior if I try with the SQL syntax: select *
> from
> > nt:base where jcr:path like '/load/n50/n2/n96/%'
>
> that's because this query is equivalent to the above XPath statement and
> if
> that's the case the lucene query, which is executed ultimately, is the
> same.
>
> > The version of JR is 1.2.2. The backend is Oracle 10g, and I am running
> the
> > application on Tomcat 5.5 with jdk 1.5 and 1GB assigned to the JVM (on
> > Windows)
> >
> > Does anybody have any idea on why is this happening and if there is a
> > workaround?
>
> I thought no one would actually execute such queries and didn't bother to
> optimize them because there's a simple workaround:
>
> Node rootNode = ...
> rootNode.getNode("load/n50/n2").accept(new TraversingItemVisitor.Default()
> {
>             protected void entering(Node node, int level)
>                                  throws RepositoryException {
>                 node.getPath();
>             }
>         });
>
> If you feel this should be done efficiently by the query handler please
> file a
> jira issue. Thanks.
>
> regards
> marcel
>

Re: Query performances

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Alessandro,

Alessandro Bologna wrote:
> We have been incurring in an interesting behavior doing searches on a quite
> large repository (~1,000,000 nodes).
> The test data is made of a tree of nodes of type nt:unstructured, reference
> able, with two numeric properties (a sequential count of the node and a
> random number between 0 and the count). Each node has a reference to the
> parent, and up to 100 child nodes, and is named n<m> where m is the 
> index of
> the node, related to the parent node.
> So, for instance, /load/n0 is the first node, /load/n1 the second to
> /load/n99.
> Then each one of them has 100 children and so on, so that a valid path, for
> instance, is /load/n23/n34/n50.
> One node out of 6 has attached a nt:file node as well, in order to test 
> full
> text searches. If requested, I can provide the code to create the test set.
> 
> The strange behavior that prompted me to write to this mailing list, is the
> following:
> 
> Say that I am searching for a node that contains the word 'beatles' at some
> level under the node /load/n40 and I use the following query:
> */jcr:root/load/n40//*[jcr:contains(.,'beatles')]* the execution time is
> 1672ms
> If I use instead:
> */jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')]*  the execution time
> is 19749ms
> 
> The second query, in theory, could execute faster than the first, because I
> am providing more information (only nodes at the 4th level under /load/n40)
> but takes 10 times longer to execute.
> Is there a reason why?

there are basically two reasons why the second query takes more time to execute:
- the index does not contain depth (level) information of a node. the depth of a 
node is not stable and may change even if the node itself is not changed. if a 
subtree of nodes is moved to another location the depth of all nodes in the 
subtree changes. the query handler would have to re-index the whole subtree.
- multiple child axis with just a * as name test are not optimized. 
/jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')] is translated into 
multiple ChildAxisQuerys each resolves the context nodes and provides a new 
context with the nodes that are the child nodes of the previous context. 
internally the query handler will temporarily have a set of nodes that includes 
all nodes at level 4 under /jcr:root/load/n40. for the query 
/jcr:root/load/n40//*[jcr:contains(.,'beatles')] the index will look up the 
nodes that match the fulltext condition and then filter out the ones that do not 
have /jcr:root/load/n40 as an ancestor. that operation involves less nodes and 
executes faster.

> The other, way more worrisome problem, appears to be the opposite:
> I have executed the following two queries
> /jcr:root/load/n50/n2/* ==> 931ms
> /jcr:root/load/n50/n2/*/* ==> 661ms

that's indeed strange. maybe you get this result because the cache is filled up 
by the first query and the second one can take advantage of the pre-filled 
cache. can you please run those queries a second time just to make sure that 
both run against the same cache state?

> The first is returning all nodes one level below /load/n50/n2 and the 
> second
> two levels below. There are no other nodes under that.
> When I tried the following query, which would return the same nodes in one
> operation, the result was surprising (in a bad way)
> /jcr:root/load/n50/n2//* ==>*353769ms*
> **
> The CPU goes 100%, I see in the jackrabbit logs a lot of entries similar 
> to:
> DocNumberCache: size=1024/1024, #accesses=17039, #hits=167, #misses=16872,
> cacheRatio=1% (DocNumberCache.java, line 155)
> 
> and then finally, *some 5 minutes later*, I get the result.
> Even if I restrict the query, it still takes the same time:
> /jcr:root/load/n50/n2/m96//* and there's maybe only an hundred nodes under
> that.

unfortunately those are queries that are not optimized at all and will result in 
a full index traversal. see below for a workaround.

> I have the exact same behavior if I try with the SQL syntax: select * from
> nt:base where jcr:path like '/load/n50/n2/n96/%'

that's because this query is equivalent to the above XPath statement and if 
that's the case the lucene query, which is executed ultimately, is the same.

> The version of JR is 1.2.2. The backend is Oracle 10g, and I am running the
> application on Tomcat 5.5 with jdk 1.5 and 1GB assigned to the JVM (on
> Windows)
> 
> Does anybody have any idea on why is this happening and if there is a
> workaround?

I thought no one would actually execute such queries and didn't bother to 
optimize them because there's a simple workaround:

Node rootNode = ...
rootNode.getNode("load/n50/n2").accept(new TraversingItemVisitor.Default() {
             protected void entering(Node node, int level)
                                  throws RepositoryException {
                 node.getPath();
             }
         });

If you feel this should be done efficiently by the query handler please file a 
jira issue. Thanks.

regards
  marcel