
Posted to dev@jackrabbit.apache.org by Ard Schrijvers <a....@hippo.nl> on 2007/08/08 16:47:39 UTC

improving the scalability in searching part 2

Problem 2:

2) The XPath jcr:like implementation, for example: //*[jcr:like(@mytext,'%foo bar qu%')]

The jcr:like implementation (the same holds for SQL) is translated to a Jackrabbit WildcardQuery, which in turn uses a WildcardTermEnum with a "protected boolean termCompare(Term term)" method (though I haven't yet sorted out exactly where the bottleneck is).

It boils down to this: when you search for nodes that have some string in some property, this implies scanning UN_TOKENIZED fields in Lucene, which is IMHO not the way to do it (though I don't yet have *the* solution for the wildcard parts). Without the wildcards, a PhraseQuery on the indexed TOKENIZED property field <X:FULL:myproperty> would obviously do.
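To make the PhraseQuery idea concrete, here is a minimal sketch in plain Java of a positional index over tokenized text. It is not Lucene or Jackrabbit code; the class PhraseLookupSketch and its methods add and phrase are hypothetical names invented for illustration. It assumes whitespace tokenization and lower-casing, which may differ from the analyzer actually configured.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a tokenized, positional index (not the Lucene API).
public class PhraseLookupSketch {
    // token -> (document id -> positions of that token in the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Tokenizes on whitespace (an assumption) and records each token position. */
    public void add(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns ids of documents that contain the given whole words consecutively. */
    public Set<Integer> phrase(String... words) {
        Set<Integer> hits = new TreeSet<>();
        Map<Integer, List<Integer>> first = postings.get(words[0]);
        if (first == null) {
            return hits;
        }
        for (Map.Entry<Integer, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (followsAt(e.getKey(), start, words)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // True if words[1..] occur at positions start+1, start+2, ... in the document.
    private boolean followsAt(int docId, int start, String[] words) {
        for (int i = 1; i < words.length; i++) {
            Map<Integer, List<Integer>> byDoc = postings.get(words[i]);
            if (byDoc == null) {
                return false;
            }
            List<Integer> positions = byDoc.get(docId);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```

The lookup only touches the postings of the words in the phrase, rather than comparing every stored value. It also shows the open problem: a partial word such as "qu" is not a token, so phrase("foo", "bar", "qu") finds nothing unless the last word is expanded to matching terms first.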

Anyway, the current jcr:like results in queries taking up to 10 seconds to complete for only 1000 nodes with one property, "mytext", which is on average 500 words long. A cached IndexReader won't make it any faster.

I think jcr:like is not usable in its current implementation. Perhaps somebody else knows how to use the PhraseQuery in a way that suits our needs (I have already asked on the Lucene list whether there is a best way to implement 'like' functionality).

Regards Ard

-- 

Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
--------------------------------------------------------------

Re: improving the scalability in searching part 2

Posted by Bertrand Delacretaz <bd...@apache.org>.

On 8/8/07, Ard Schrijvers <a....@hippo.nl> wrote:
> ...2) The XPath jcr:like implementation, for example : //*[jcr:like(@mytext,'%foo bar qu%')]
> ...the current jcr:like results in queries taking up to 10 seconds to complete for only
> 1000 nodes with one property, "mytext" which is on average 500 words long....

Just curious, is

  %foo bar qu%

much slower than

  foo bar qu%

?

I'd guess so, as Lucene-based indexes are usually inefficient with
leading wildcards. Do your tests confirm that?
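A rough sketch of why the leading wildcard is expected to hurt, assuming the term dictionary behaves like a sorted set of strings. This is plain Java with a TreeSet standing in for the index, not Lucene code; WildcardScanSketch, prefixMatches, infixMatches and the examined counter are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model: a sorted set of terms stands in for the term dictionary.
public class WildcardScanSketch {
    /** Number of terms examined by the most recent call; for illustration only. */
    public static int examined;

    /** Pattern "prefix%": seek to the first term >= prefix, stop at the first non-match. */
    public static List<String> prefixMatches(NavigableSet<String> terms, String prefix) {
        examined = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            examined++;
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    /** Pattern "%infix%": there is no seek point, so every term is compared. */
    public static List<String> infixMatches(NavigableSet<String> terms, String infix) {
        examined = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            examined++;
            if (term.contains(infix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            terms.add("value " + i);
        }
        terms.add("foo bar quux");
        terms.add("say foo bar quick");
        System.out.println(prefixMatches(terms, "foo bar qu") + " examined=" + examined);
        System.out.println(infixMatches(terms, "foo bar qu") + " examined=" + examined);
    }
}
```

In this model the prefix pattern examines only the matching terms plus one, while the infix pattern examines all of them. Whether the real index shows the same gap is exactly what the tests would need to confirm.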

-Bertrand