
Posted to users@jackrabbit.apache.org by daveg0 <ba...@googlemail.com> on 2009/04/01 09:10:27 UTC

Help with query performance required

Hi,

I am trying to do "wildcard" queries that should return multiple nodes such
as:

/jcr:root/portal/wap/images//element(*,
atom:Entry)[jcr:like(@atom:titletext,'soccer%')]

Performance has degraded as the number of entries has grown, and the query
now takes nearly 8 seconds, which is unacceptable. I am aware that wildcard
queries take longer, but shouldn't this type of query create a Lucene
PrefixQuery, which is much quicker? Most of our "wildcard" queries will be
"prefix" queries, as they will typically be searches for entries that start
with a specific value, e.g. "st*".

I tried looking through the source code and I can only see WildcardQuery
being used, never Lucene PrefixQuery. Is this a design decision?

Am I missing something, or is it possible for Jackrabbit to perform a
PrefixQuery for queries like this?

I also tried to use "jcr:contains" e.g:

/jcr:root/portal/wap/images//element(*,
atom:Entry)[jcr:contains(@atom:titletext,'soccer')]

but this only returns the first matching entry. Am I
misunderstanding/misusing "jcr:contains", or would you expect it to return
the same as the query with "jcr:like"?

Can you give me some pointers on how to work around this?

regards,

Dave Gough


Re: Help with query performance required

Posted by Marcel Reutegger <ma...@gmx.net>.

Hi,

On Wed, Apr 1, 2009 at 09:10, daveg0 <ba...@googlemail.com> wrote:
> I am trying to do "wildcard" queries that should return multiple nodes such
> as:
>
> /jcr:root/portal/wap/images//element(*,
> atom:Entry)[jcr:like(@atom:titletext,'soccer%')]
>
> the performance has degraded over time with more entries to take nearly 8
> seconds which is unacceptable. I am aware that wildcard queries take longer,
> but shouldn't this type of query create a Lucene PrefixQuery which is much
> quicker. Most of our "wildcard" queries will be "prefix" queries as they
> will typically be searches for matching entries that start with a specific
> value eg "st*".
>
> I tried looking through the source code and I can't see any use of Lucene
> PrefixQuery only WildcardQuery, is this a design decision?

Yes, it is. There are basically two reasons:

- PrefixQuery is basically a boolean query that consists of optional
TermQueries (one for each term that matches the prefix). This design
has an inherent limit, because as soon as you have more than 1024
distinct terms that match the prefix the BooleanQuery will throw a
TooManyClauses exception.
- Jackrabbit supports prefix queries in combination with lower- and
upper-casing. This is not possible with the Lucene PrefixQuery.

In any case, the cost of a prefix query grows linearly with the number of
distinct terms in the index that match the prefix. Is it possible that your
prefix matches lots of distinct terms, i.e. the prefix is very short
or very common?
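
To make the two points above concrete, here is a minimal sketch in plain
Java. It is not Lucene or Jackrabbit code: the class and method names are
made up for illustration, and a TreeSet stands in for the index's sorted
term dictionary. It shows that a prefix is expanded into one clause per
distinct matching term, so the work grows with the number of matches, and
that a fixed clause limit (1024 here, as in the reply) is exceeded once too
many terms match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansionSketch {
    // Assumption: mirrors the 1024-clause limit described in the reply.
    public static final int MAX_CLAUSES = 1024;

    /** Collects every distinct indexed term that starts with the prefix. */
    public static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // tailSet skips to the first term >= prefix; each match is then visited once.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            if (clauses.size() == MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses for prefix: " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("soccer");
        terms.add("soccer ball");
        terms.add("softball");
        // Only the terms starting with "soccer" become clauses.
        System.out.println(expand(terms, "soccer"));
    }
}
```

A short or very common prefix such as "s" would walk through far more terms
than "soccer", which is the situation the question above is asking about.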

> Am I missing something or is it possible for Jackrabbit to perform a
> PrefixQuery for queries like this.
>
> I also tried to use "jcr:contains" e.g:
>
> /jcr:root/portal/wap/images//element(*,
> atom:Entry)[jcr:contains(@atom:titletext,'soccer')]

That's not exactly the same, because it matches only terms that were
indexed as soccer. You could use:

/jcr:root/portal/wap/images//element(*,
atom:Entry)[jcr:contains(@atom:titletext,'soccer*')]

but I'd say the performance is about the same.

> but this only returns the first matching entry. Am I
> misunderstanding/misusing "jcr:contains" in this way or would you expect it
> to return the same as the query with "jcr:like"

jcr:contains and jcr:like behave differently. See the specification for details.
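
As a rough illustration of that difference, the following plain-Java sketch
models the two functions on a single property value. It is a simplification
under stated assumptions, not the Jackrabbit implementation: jcr:like with a
trailing '%' is modelled as a case-sensitive match against the start of the
whole value, and jcr:contains as a match against individual lower-cased
tokens split on non-alphanumeric characters (the real tokenization depends on
the configured analyzer). All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LikeVsContainsSketch {
    /** Models jcr:like(@p, 'prefix%'): the whole value must start with the prefix. */
    public static boolean likePrefix(String value, String prefix) {
        return value.startsWith(prefix);
    }

    /** Simplified tokenizer (assumption): lower-case, split on non-alphanumerics. */
    public static List<String> tokens(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+"));
    }

    /** Models jcr:contains: a token equals the word, or starts with it if the word ends in '*'. */
    public static boolean contains(String value, String word) {
        String w = word.toLowerCase(Locale.ROOT);
        boolean isPrefix = w.endsWith("*");
        String stem = isPrefix ? w.substring(0, w.length() - 1) : w;
        for (String token : tokens(value)) {
            if (isPrefix ? token.startsWith(stem) : token.equals(stem)) {
                return true;
            }
        }
        return false;
    }
}
```

Under this model a title of "soccerball news" satisfies likePrefix with
"soccer" and contains with "soccer*", but not contains with "soccer", while
"World soccer" satisfies contains with "soccer" but not likePrefix. That is
one way the two queries can return different result sets.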

regards
 marcel