Posted to users@jackrabbit.apache.org by Ard Schrijvers <a....@hippo.nl> on 2008/01/22 22:17:27 UTC

Explanation and solutions of some Jackrabbit queries regarding performance

Hello Martin Zdila (regarding JCR-1196 et al.),

From time to time I see mails about the performance of queries and
about slow calls such as queryResult.getNodes().hasNext(). Some queries
can be slow, some data modelling structures can be slow, and even
seemingly trivial calls like queryResult.getNodes().hasNext() can be
slow. I write 'can' all the time, because everything can and must be
blisteringly fast with millions of documents, and most of the time the
solution is extremely simple. We just have to document some easily made
mistakes. I'll try to find some time in the near future to document the
parts I am aware of in the form of a FAQ, like the rest of this mail.
For now, just some frequently made mistakes off the top of my head:

@Martin Zdila: if you are not interested in reading the rest of this
mail, just add <param name="respectDocumentOrder" value="false"/> to the
<SearchIndex> element of your workspace.xml (and repository.xml). Also
try to avoid having 4000 child nodes (certainly same-name siblings)
under one node; create a deeper tree in which no node contains many
child nodes. It is just like your filesystem, where a folder with
thousands of entries is not fast either.


Question 1: Why is a search for the xpath '/jcr:root/a/b/c' slower than
'//c' or '//*[@someprop]'?

Answer 1: When a query with a path like '/jcr:root/a/b/c' or
'/jcr:root/a//*/c' is executed, the hierarchy manager has to check for
every found node whether its parents are correct. Since Jackrabbit does
not store hierarchical data (if it did, it could no longer move a node
efficiently, at least in the current architecture), hierarchies have to
be checked by iterating through the Lucene indexes to find the parent
nodes of a result. This is CPU intensive. Although the hierarchy has
been cached properly since Jackrabbit 1.4, returning many results is
still an expensive operation. The first execution of a query might be
slow because the hierarchy cache needs to be built up. Queries like
'//c' or '//*[@someprop]' do not need to check hierarchies, because
their results do not have to be validated against their parent nodes.

Conclusion 1: When the result set of the search is expected to be large,
try to avoid path information in the xpath. Try to distinguish on, for
example, node type or some property instead.
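
As an illustration of Conclusion 1, here is a minimal Java sketch that
builds an xpath statement selecting on node type and property instead of
on a path. The class name TypeQuery, the node type my:article and the
property my:status are made-up names for this example only; they do not
come from Jackrabbit or from JCR-1196.

```java
/** Hypothetical helper: selects by node type and property, not by path. */
final class TypeQuery {
    private TypeQuery() {}

    /** Builds e.g. //element(*, my:article)[@my:status = 'published']. */
    static String byTypeAndProperty(String nodeType, String property, String value) {
        // XPath string literals escape a single quote by doubling it.
        String escaped = value.replace("'", "''");
        return "//element(*, " + nodeType + ")[@" + property + " = '" + escaped + "']";
    }

    public static void main(String[] args) {
        // Example values only; use the node type and property of your own content model.
        System.out.println(byTypeAndProperty("my:article", "my:status", "published"));
    }
}
```

The resulting string can be handed to QueryManager.createQuery(statement,
Query.XPATH) in the usual way. Whether this is actually faster for your
data depends on how selective the type and property are.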

Question 2: My xpath is '//c' and the result size is 10,000 nodes. When
I call queryResult.getNodes().hasNext(), it takes up to minutes to
complete.

Answer 2: For Jackrabbit versions < 1.5, the default setting in the
<SearchIndex> configuration in repository.xml is
<param name="respectDocumentOrder" value="true"/>. This means that when
a query does *not* have an 'order by' clause, the result nodes are
returned in document order. Returning nodes in document order becomes
increasingly slow for many results (> 1000). You can fix this either by
setting respectDocumentOrder to false in your repository.xml (and in
workspace.xml if you already have an existing workspace) *or* by adding
an 'order by' clause to your query. A delay of minutes will drop to
0-15 ms.

Conclusion 2: When you have a lot of results, either include an 'order
by' clause or set respectDocumentOrder to false. Modelling your content
with many child nodes below one single node makes the problem even
worse when respectDocumentOrder = true and you do not define an 'order
by' clause.
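
For those who cannot change the configuration, the 'order by' route can
be sketched in Java as below. The helper OrderByHelper is hypothetical,
and the ordering expression '@jcr:score descending' is only an example;
any ordering that suits your content should have the same effect of
skipping the document-order sort.

```java
import java.util.Locale;

/** Hypothetical helper: makes sure an xpath statement has an 'order by' clause. */
final class OrderByHelper {
    private OrderByHelper() {}

    /**
     * Appends " order by <orderExpr>" unless the statement already has one.
     * The check is a plain substring test, so a string literal containing
     * " order by " inside the statement would fool it.
     */
    static String withOrderBy(String xpath, String orderExpr) {
        String trimmed = xpath.trim();
        if (trimmed.toLowerCase(Locale.ROOT).contains(" order by ")) {
            return trimmed;
        }
        return trimmed + " order by " + orderExpr;
    }

    public static void main(String[] args) {
        System.out.println(withOrderBy("//c", "@jcr:score descending"));
    }
}
```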

Question 3: My xpath is '//*[jcr:like(@propertyName, '%somevalue%')]'
and it takes minutes to complete.

Answer 3: A jcr:like with % is translated to a Lucene WildcardQuery. To
prevent extremely slow WildcardQueries, a wildcard term should not start
with one of the wildcards * or ?. So this is not a Jackrabbit
implementation detail, but a general Lucene issue (and, I think, an
issue of inverted indexes in general) [1].

Conclusion 3: Avoid % prefixes in jcr:like. Use jcr:contains when
searching for a specific word. If jcr:contains is not suitable, you can
work around the problem by creating a custom Lucene analyzer for the
specific property (see IndexingConfiguration [2], section Index
Analyzers).
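
A small Java sketch of Conclusion 3: one method flags jcr:like patterns
that begin with a wildcard, the other builds the jcr:contains
alternative for a single word. LikePatterns is a hypothetical helper
name. Note that jcr:contains has its own search-expression syntax, so
this sketch only escapes the xpath string literal and assumes a plain
word as input.

```java
/** Hypothetical helper for spotting slow jcr:like patterns. */
final class LikePatterns {
    private LikePatterns() {}

    /** True when the jcr:like pattern begins with a wildcard (% or _). */
    static boolean startsWithWildcard(String pattern) {
        if (pattern.isEmpty()) {
            return false;
        }
        char first = pattern.charAt(0);
        return first == '%' || first == '_';
    }

    /** Builds //*[jcr:contains(@property, 'word')] for a single plain word. */
    static String containsQuery(String property, String word) {
        // XPath string literals escape a single quote by doubling it.
        String escaped = word.replace("'", "''");
        return "//*[jcr:contains(@" + property + ", '" + escaped + "')]";
    }

    public static void main(String[] args) {
        String pattern = "%somevalue%";
        if (startsWithWildcard(pattern)) {
            // Fall back to a word search instead of a leading-wildcard like.
            System.out.println(containsQuery("propertyName", "somevalue"));
        }
    }
}
```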

Question 4: I am not searching through nodes but traversing them, and
this is slow.

Answer 4: Model your repository so that no node has very many child
nodes directly below it. Avoid extremely 'large folders', just as a
filesystem becomes slow with them.
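
One possible way to get such a deeper tree, sketched in Java: derive
two intermediate folder names from a hash of the node name, so that
nodes are spread over up to 256 * 256 buckets. This bucketing scheme and
the name BucketPath are my own assumption for illustration; a date-based
or otherwise meaningful hierarchy works just as well.

```java
/** Hypothetical helper: spreads nodes over a two-level bucket tree. */
final class BucketPath {
    private BucketPath() {}

    /** Returns a relative path such as "61/00/a" for the node name "a". */
    static String bucketedPath(String nodeName) {
        // Format the hash as 8 hex digits; negative values are printed
        // in two's complement, so the result is always 8 characters.
        String hex = String.format("%08x", nodeName.hashCode());
        // Use the low-order digits, which vary most for short names.
        return hex.substring(6, 8) + "/" + hex.substring(4, 6) + "/" + nodeName;
    }

    public static void main(String[] args) {
        System.out.println(bucketedPath("doc-42"));
    }
}
```

The same name always maps to the same bucket, so a node can be found
again by recomputing its path instead of by searching.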

This mail is getting too long :-) I'll come up with some extra FAQs
from time to time, and if people are interested I will make a (wiki?)
document of it. I might need some help, though, because in some areas
my knowledge is insufficient.

To be continued,

Regards Ard

[1] http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/WildcardQuery.html
[2] http://wiki.apache.org/jackrabbit/IndexingConfiguration

-- 

Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
-------------------------------------------------------------- 

RE: Explanation and solutions of some Jackrabbit queries regarding performance

Posted by Ard Schrijvers <a....@hippo.nl>.
Hello Marcel,

> 
> Hi Ard,
> 
> excellent work. this should definitively be placed on a query 
> faq wiki page.

I'll try to find time at short notice (most likely this weekend) to
create the query FAQ wiki page (I always feel it is much easier to write
an email, because on a wiki it *must* / should be correct, whereas a
mail can easily be corrected). Since my knowledge is lacking in some
areas, I would like some feedback on the things I write, but I
suppose/hope people will get back with
corrections/suggestions/enhancements. An organically growing wiki FAQ
document about queries might be beneficial to a lot of users.

-Ard




Re: Explanation and solutions of some Jackrabbit queries regarding performance

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Ard,

Excellent work. This should definitely be placed on a query FAQ wiki page.

regards
  marcel



Re: Explanation and solutions of some Jackrabbit queries regarding performance

Posted by Alessandro Bologna <al...@gmail.com>.
+1 for putting this in the wiki. It's the best explanation I have
read so far on how to optimize some queries on Jackrabbit and why
some behave unexpectedly. The //foo being faster than /bar/baz/foo was
one of them.

Thanks!
Alessandro



Re: Explanation and solutions of some Jackrabbit queries regarding performance

Posted by Martin Zdila <m....@mwaysolutions.com>.
Many thanks to Ard Schrijvers for the explanation!
+1 for putting this in the FAQ :-)

-- 
Martin Zdila 
CTO

M-Way Solutions Slovakia s.r.o.
Letna 27, 040 01 Kosice
Slovakia

tel:+421-908-363-848
mailto:m.zdila@mwaysolutions.com
http://www.mwaysolutions.com
xmpp:zdila@jabbim.sk (Jabber)
skype:m.zdila