You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Jörg Kiegeland <ki...@ikv.de> on 2007/11/21 21:09:20 UTC

Performance problems for OR-queries

I have N keywords and execute a query of the form

keyword1 OR keyword2 OR .. OR keywordN

The search result would be very large (some million), so I defined a 
result limit of 100.
However Solr seems now to calculates for every possible result document 
the number of matched keywords and to order them and to give the 100 
documents back with the highest number of matched keywords.
This seems to take linear time to the size of all possible matched 
documents. So my questions are:

1. Does Solr support this kind of index access with better performance ? 
Is there anything special to define in schema.xml?

2. Can one switch off this ordering and just return any 100 documents 
fullfilling the query (though  getting best-matching documents would be 
a nice feature if it would be fast)?

Thanks

RE: Performance problems for OR-queries

Posted by "Norskog, Lance" <la...@divvio.com>.

https://issues.apache.org/jira/browse/lucene-997 is a patch to limit the time used for a query. 

Google clearly estimates the total # of results, and over-estimates.

Lance 

-----Original Message-----
From: Mike Klaas [mailto:mike.klaas@gmail.com] 
Sent: Thursday, November 22, 2007 1:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance problems for OR-queries

On 22-Nov-07, at 6:02 AM, Jörg Kiegeland wrote:

>
>>> 1. Does Solr support this kind of index access with better 
>>> performance ?
>>> Is there anything special to define in schema.xml?
>>>
>>
>> No... Solr uses Lucene at it's core, and all matching documents for a 
>> query are scored.
>>
> So it is not possible to have a "google" like performance with Solr, 
> i.e. to search for a set of keywords and only the 10 best documents 
> are listed, without touching the other millions of (web) documents 
> matching less keywords.
> I infact would not know how to program such an index, however google 
> has done it somehow..

I can be fairly certain that google does not execute queries that match millions of documents on a single machine.  The default query operator is (mostly) AND, so the possible match sets is much smaller.  Also, I imagine they have relatively few documents per machine.

>>> 2. Can one switch off this ordering and just return any 100 
>>> documents fullfilling the query (though  getting best-matching 
>>> documents would be a nice feature if it would be fast)?
>>>
>>
>> a feature like this could be developed... but what is the usecase for 
>> this?  What are you tring to accomplish where either relevancy or 
>> complete matching doesn't matter?  There may be an easier workaround 
>> for your specific case.
>>
> This is not an actual Use-Case for my project, however I just wanted 
> to know if it would be possible.
>
> Because of the performance results, we designed a new type of query. I 
> would like to know how fast it would be before I implement the 
> following query:
>
> I have N keywords and execute a query of the form
>
> keyword1 AND keyword2 AND .. AND keywordN
>
> there may be again some millions of matching documents and I want to 
> get the first 100 documents.
> To have a ordering criteria, each Solr document has a field named 
> "REV" which has a natural number. The returned 100 documents shall be 
> those with the lowest numbers in the "REV" field.
>
> My questions now are:
>
> (1) Will the query perform in O(100) or in O(all possible matches)?

O(all possible matches)

> (2) If the answer to (1) is O(all possible matches), what will be the 
> performance if I dont order for the "REV" field? Will Solr order it 
> after the point of time where a document was created/ modified? What I 
> have to do to get O(100) complexity finally?

Ordering by natural document order in the index is sufficient to achieve O(100), but you'll have to insert code in Solr to stop after 100 docs (another alternative is to stop processing after a given amount of time).  Also, using O() in the case isn't quite accurate:  
there are costs that vary based on the number of docs in the index too.

-Mike

Re: Performance problems for OR-queries

Posted by Mike Klaas <mi...@gmail.com>.

On 22-Nov-07, at 6:02 AM, Jörg Kiegeland wrote:

>
>>> 1. Does Solr support this kind of index access with better  
>>> performance ?
>>> Is there anything special to define in schema.xml?
>>>
>>
>> No... Solr uses Lucene at it's core, and all matching documents for a
>> query are scored.
>>
> So it is not possible to have a "google" like performance with  
> Solr, i.e. to search for a set of keywords and only the 10 best  
> documents are listed, without touching the other millions of (web)  
> documents matching less keywords.
> I infact would not know how to program such an index, however  
> google has done it somehow..

I can be fairly certain that google does not execute queries that  
match millions of documents on a single machine.  The default query  
operator is (mostly) AND, so the possible match sets is much  
smaller.  Also, I imagine they have relatively few documents per  
machine.

>>> 2. Can one switch off this ordering and just return any 100  
>>> documents
>>> fullfilling the query (though  getting best-matching documents  
>>> would be
>>> a nice feature if it would be fast)?
>>>
>>
>> a feature like this could be developed... but what is the usecase for
>> this?  What are you tring to accomplish where either relevancy or
>> complete matching doesn't matter?  There may be an easier workaround
>> for your specific case.
>>
> This is not an actual Use-Case for my project, however I just  
> wanted to know if it would be possible.
>
> Because of the performance results, we designed a new type of  
> query. I would like to know how fast it would be before I implement  
> the following query:
>
> I have N keywords and execute a query of the form
>
> keyword1 AND keyword2 AND .. AND keywordN
>
> there may be again some millions of matching documents and I want  
> to get the first 100 documents.
> To have a ordering criteria, each Solr document has a field named  
> "REV" which has a natural number. The returned 100 documents shall  
> be those with
> the lowest numbers in the "REV" field.
>
> My questions now are:
>
> (1) Will the query perform in O(100) or in O(all possible matches)?

O(all possible matches)

> (2) If the answer to (1) is O(all possible matches), what will be  
> the performance if I dont order for the "REV" field? Will Solr  
> order it after the point of time where a document was created/ 
> modified? What I have to do to get O(100) complexity finally?

Ordering by natural document order in the index is sufficient to  
achieve O(100), but you'll have to insert code in Solr to stop after  
100 docs (another alternative is to stop processing after a given  
amount of time).  Also, using O() in the case isn't quite accurate:  
there are costs that vary based on the number of docs in the index too.

-Mike

Re: Performance problems for OR-queries

Posted by Jörg Kiegeland <ki...@ikv.de>.

>> 1. Does Solr support this kind of index access with better performance ?
>> Is there anything special to define in schema.xml?
>>     
>
> No... Solr uses Lucene at it's core, and all matching documents for a
> query are scored.
>   
So it is not possible to have a "google" like performance with Solr, 
i.e. to search for a set of keywords and only the 10 best documents are 
listed, without touching the other millions of (web) documents matching 
less keywords.
I infact would not know how to program such an index, however google has 
done it somehow..

>> 2. Can one switch off this ordering and just return any 100 documents
>> fullfilling the query (though  getting best-matching documents would be
>> a nice feature if it would be fast)?
>>     
>
> a feature like this could be developed... but what is the usecase for
> this?  What are you tring to accomplish where either relevancy or
> complete matching doesn't matter?  There may be an easier workaround
> for your specific case.
>   
This is not an actual Use-Case for my project, however I just wanted to 
know if it would be possible.

Because of the performance results, we designed a new type of query. I 
would like to know how fast it would be before I implement the following 
query:

I have N keywords and execute a query of the form

keyword1 AND keyword2 AND .. AND keywordN

there may be again some millions of matching documents and I want to get the first 100 documents.
To have a ordering criteria, each Solr document has a field named "REV" which has a natural number. The returned 100 documents shall be those with
the lowest numbers in the "REV" field.

My questions now are:

(1) Will the query perform in O(100) or in O(all possible matches)?

(2) If the answer to (1) is O(all possible matches), what will be the performance if I dont order for the "REV" field? Will Solr order it after the point of time where a document was created/modified? What I have to do to get O(100) complexity finally?


Thanks

Jörg

Re: Performance problems for OR-queries

Posted by Yonik Seeley <yo...@apache.org>.

On Nov 21, 2007 3:09 PM, Jörg Kiegeland <ki...@ikv.de> wrote:
> I have N keywords and execute a query of the form
>
> keyword1 OR keyword2 OR .. OR keywordN
[...]
> This seems to take linear time to the size of all possible matched
> documents.

Yes.

> 1. Does Solr support this kind of index access with better performance ?
> Is there anything special to define in schema.xml?

No... Solr uses Lucene at it's core, and all matching documents for a
query are scored.

> 2. Can one switch off this ordering and just return any 100 documents
> fullfilling the query (though  getting best-matching documents would be
> a nice feature if it would be fast)?

a feature like this could be developed... but what is the usecase for
this?  What are you tring to accomplish where either relevancy or
complete matching doesn't matter?  There may be an easier workaround
for your specific case.

-Yonik