Posted to dev@jackrabbit.apache.org by Ard Schrijvers <a....@hippo.nl> on 2007/08/08 16:16:47 UTC
improving the scalability in searching
Hello,
As mentioned in https://issues.apache.org/jira/browse/JCR-1051, I think there is room for optimization in scalability and performance in some parts of the current Lucene implementation.
For now, I have two major performance/scalability concerns in the current indexing/searching implementation:
1) The XPath implementation for //*[@mytext] (the SQL equivalent has the same problem)
2) The XPath jcr:like implementation, for example //*[jcr:like(@mytext,'%foo bar qu%')]
Problem 1):
//*[@mytext] is transformed into org.apache.jackrabbit.core.query.lucene.MatchAllQuery, which through MatchAllWeight uses MatchAllScorer. MatchAllScorer contains a calculateDocFilter() that IMO does not scale. Suppose I have 100,000 nodes with a property 'title', and there are no (or few) duplicate titles.
Now suppose I run the XPath query /rootnode/articles/2007[@title]. The while loop in calculateDocFilter() then iterates 100,000 times (see the code below), each time executing:
terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))
docs.seek(terms);
docFilter.set(docs.doc());
This scales linearly AFAIU and becomes slow pretty fast. I can add a unit test that shows this; on my modest machine, searches over 100,000 nodes already take 400 ms with a cached reader, while they could easily take close to 0 ms IIULC ("if I understand Lucene correctly" :-) ).
Solution 1):
IMO, we should index more (derived) data about a document's properties if we want to be able to query fast (I'll return to this in a mail about IndexingConfiguration, to which I think we can add some features that might tackle this). For this specific problem, the solution would be very simple:
Besides the existing
/**
* Name of the field that contains all values of properties that are indexed
* as is without tokenizing. Terms are prefixed with the property name.
*/
public static final String PROPERTIES = "_:PROPERTIES".intern();
I suggest to add
/**
* Name of the field that contains the names of all properties present on a certain node
*/
public static final String PROPERTIES_SET = "_:PROPERTIES_SET".intern();
and when indexing a node, each property name of that node is added to its index (a few lines of code in NodeIndexer).
Then, searching for all nodes that have a given property takes one single docs.seek(terms) to set the docFilter. This approach scales easily to millions of documents, with times close to 0 ms. WDOT? Of course, I can implement this in the trunk.
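To make the contrast concrete, here is a minimal self-contained sketch of the two lookup strategies. Plain Java collections stand in for Lucene's term dictionary and document postings, and all names (scanNamedValues, lookupPropertiesSet, the "\u0000" separator) are illustrative assumptions, not the real Jackrabbit code:

```java
import java.util.*;

// In-memory model contrasting the current prefix scan over named-value
// terms with the proposed PROPERTIES_SET single-term lookup.
public class PropertiesSetSketch {

    // Current approach: the PROPERTIES field stores "<name>\u0000<value>"
    // terms, so finding all nodes that have a property means scanning
    // every distinct value term with that name prefix (one seek each).
    static BitSet scanNamedValues(TreeMap<String, BitSet> propertiesField, String prop) {
        BitSet filter = new BitSet();
        String prefix = prop + "\u0000"; // stand-in for createNamedValue(prop, "")
        for (Map.Entry<String, BitSet> e : propertiesField.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the prefix range, stop scanning
            }
            filter.or(e.getValue()); // one "seek" per distinct value
        }
        return filter;
    }

    // Proposed approach: PROPERTIES_SET maps a property name directly to
    // the documents that carry it, so the lookup is a single seek.
    static BitSet lookupPropertiesSet(Map<String, BitSet> propertiesSetField, String prop) {
        BitSet docs = propertiesSetField.get(prop);
        return docs == null ? new BitSet() : (BitSet) docs.clone();
    }

    public static void main(String[] args) {
        // 100 docs, each with a unique 'title' value: the scan touches
        // 100 distinct terms, the set lookup touches exactly one.
        TreeMap<String, BitSet> properties = new TreeMap<>();
        Map<String, BitSet> propertiesSet = new HashMap<>();
        for (int doc = 0; doc < 100; doc++) {
            BitSet b = new BitSet();
            b.set(doc);
            properties.put("title\u0000value" + doc, b);
            propertiesSet.computeIfAbsent("title", k -> new BitSet()).set(doc);
        }
        BitSet viaScan = scanNamedValues(properties, "title");
        BitSet viaSet = lookupPropertiesSet(propertiesSet, "title");
        System.out.println(viaScan.equals(viaSet)); // true: same result set
        System.out.println(viaScan.cardinality());  // 100
    }
}
```

Both strategies select the same documents; the difference is that the scan's cost grows with the number of distinct values of the property, while the set lookup stays constant.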
I will address problem (2) in a follow-up mail, because this mail is getting a little long.
Regards Ard
---------------------------------------------
TermEnum terms = reader.terms(new Term(FieldNames.PROPERTIES,
        FieldNames.createNamedValue(field, "")));
try {
    TermDocs docs = reader.termDocs();
    try {
        // one iteration per distinct value of the property
        while (terms.term() != null
                && terms.term().field() == FieldNames.PROPERTIES
                && terms.term().text().startsWith(FieldNames.createNamedValue(field, ""))) {
            docs.seek(terms);
            counter++;
            while (docs.next()) {
                docFilter.set(docs.doc());
            }
            terms.next();
        }
    } finally {
        docs.close();
    }
} finally {
    terms.close();
}
---------------------------------------------
--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
--------------------------------------------------------------
Re: improving the scalability in searching
Posted by Christoph Kiehl <ck...@sulu3000.de>.
Marcel Reutegger wrote:
> Ard Schrijvers wrote:
>> IMO, we should index more (derived) data about a documents properties
>> (I'll
>> return to this in a mail about IndexingConfiguration which I think we
>> can add
>> some features that might tackle this) if we want to be able to query
>> fast.
>> For this specific problem, the solution would be very simple:
>>
>> I suggest to add
>>
>> /** * Name of the field that contains all available properties that
>> present
>> for a certain node */ public static final String PROPERTIES_SET =
>> "_:PROPERTIES_SET".intern();
>>
>> and when indexing a node, each property name of that node is added to its
>> index (few lines of code in NodeIndexer):
>>
>> Then, when searching for all nodes that have a property, is one single
>> docs.seek(terms); and set the docFilter. This approach scales to
>> millions of
>> documents easily with times close to 0 ms. WDOT? Ofcourse, I can
>> implement
>> this in the trunk.
>
> I agree with you that the current implementation is not optimized for
> queries that check the existence of a property. Your proposed solution
> seems reasonable, I would implement it the same way. There's just one
> minor obstacle, how do we implement this change in a backward compatible
> way? An existing index without this additional field should still work.
We could use IndexReader.getFieldNames() at startup to check whether such a
field already exists, which would mean we have an index in the new format,
and then use this information in MatchAllScorer to decide which
implementation to use.
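That startup check could be sketched roughly like this. The Set<String> below stands in for the result of IndexReader.getFieldNames(); the class, enum, and method names are illustrative assumptions, not actual Jackrabbit code:

```java
import java.util.*;

// Hedged sketch of the backward-compatibility check: only use the fast
// PROPERTIES_SET lookup when the index was written with that field,
// otherwise fall back to the existing term-prefix scan.
public class MatchAllStrategy {
    static final String PROPERTIES_SET = "_:PROPERTIES_SET";

    enum Strategy { PROPERTIES_SET_LOOKUP, LEGACY_TERM_SCAN }

    // fieldNames stands in for IndexReader.getFieldNames() on the open index
    static Strategy choose(Set<String> fieldNames) {
        return fieldNames.contains(PROPERTIES_SET)
                ? Strategy.PROPERTIES_SET_LOOKUP  // index in the new format
                : Strategy.LEGACY_TERM_SCAN;      // old index still works
    }

    public static void main(String[] args) {
        // new-format index: the field is present
        System.out.println(choose(new HashSet<>(
                Arrays.asList("_:PROPERTIES", PROPERTIES_SET)))); // PROPERTIES_SET_LOOKUP
        // old index: fall back to the scan
        System.out.println(choose(
                Collections.singleton("_:PROPERTIES"))); // LEGACY_TERM_SCAN
    }
}
```

The decision is made once per index at startup, so an old index keeps working unchanged until it is re-indexed in the new format.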
Cheers,
Christoph
Re: improving the scalability in searching
Posted by Marcel Reutegger <ma...@gmx.net>.
Ard Schrijvers wrote:
> IMO, we should index more (derived) data about a documents properties (I'll
> return to this in a mail about IndexingConfiguration which I think we can add
> some features that might tackle this) if we want to be able to query fast.
> For this specific problem, the solution would be very simple:
>
> I suggest to add
>
> /** * Name of the field that contains all available properties that present
> for a certain node */ public static final String PROPERTIES_SET =
> "_:PROPERTIES_SET".intern();
>
> and when indexing a node, each property name of that node is added to its
> index (few lines of code in NodeIndexer):
>
> Then, when searching for all nodes that have a property, is one single
> docs.seek(terms); and set the docFilter. This approach scales to millions of
> documents easily with times close to 0 ms. WDOT? Ofcourse, I can implement
> this in the trunk.
I agree with you that the current implementation is not optimized for queries
that check the existence of a property. Your proposed solution seems reasonable;
I would implement it the same way. There's just one minor obstacle: how do we
implement this change in a backward-compatible way? An existing index without
this additional field should still work.
regards
marcel