You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Vince Taluskie <vg...@gmail.com> on 2006/02/08 22:10:17 UTC

scalability recommendations for large performance-intensive indexes

hello All,

I'm looking for some advice on how to improve scalability - we have a fairly
large lucene index of 35M documents, max 1k document size (most much
smaller) and 14 fields.   We combine descriptive text together into a
"contents" field and search on that and have been very pleased with handling
almost 100 queries/sec at about 8-12ms for the average search.

Prior to that we had a common attribute for which about 50% of the docs had
one value and the rest had the other value and the boolean query slowed
response times very significantly.  We handled this by breaking up our
indexes so that the index only contained one attribute or the other and
eliminated the need for the boolean - this was a 7-8x improvement.

Now we're back to wanting to add another attribute to the documents for
which most of the docs will have one value and much fewer will have the
other and although it sounds so simple - my limited testing with an 85/15
ratio is showing another big hit on performance with the boolean.    A two
term boolean search without the attribute is about 7-8ms, adding the
attribute to the boolean search increases the elapsed time to 4x and 2x of
original for the 85% and 15% frequencies respective.

I had some hope that a QueryFilter would really help out but it turns out to
be much much slower:  the 85% term ends up taking a whopping 336ms and the
15% term ends up around 65ms which is 40x and 8x slower than the original
8ms query speed without the additional attribute.

I have to ask if there's not a better way to handle the addition of an
common attribute with a few possible values across the index.  Any other
recommended approaches?

Thanks in advance,

Vince

Re: scalability recommendations for large performance-intensive indexes

Posted by markharw00d <ma...@yahoo.co.uk>.

Hi Vince, sounds like the same issue I highlighted recently on the 
java-dev list.

See here:
http://www.nabble.com/Preventing-%22killer%22-queries-t1077895.html

The problem lies in the underlying cost of reading TermDocs for very 
common terms (a problem for  both queries and filters)

For your issue, given the  problem field only has 2 values you can 
comfortably cache 2 Bitsets and use them as filters.
This would only require roughly (35m/8)*2 bytes or about 9 meg of RAM.

Cheers
Mark



Vince Taluskie wrote:

>hello All,
>
>I'm looking for some advice on how to improve scalability - we have a fairly
>large lucene index of 35M documents, max 1k document size (most much
>smaller) and 14 fields.   We combine descriptive text together into a
>"contents" field and search on that and have been very pleased with handling
>almost 100 queries/sec at about 8-12ms for the average search.
>
>Prior to that we had a common attribute for which about 50% of the docs had
>one value and the rest had the other value and the boolean query slowed
>response times very significantly.  We handled this by breaking up our
>indexes so that the index only contained one attribute or the other and
>eliminated the need for the boolean - this was a 7-8x improvement.
>
>Now we're back to wanting to add another attribute to the documents for
>which most of the docs will have one value and much fewer will have the
>other and although it sounds so simple - my limited testing with an 85/15
>ratio is showing another big hit on performance with the boolean.    A two
>term boolean search without the attribute is about 7-8ms, adding the
>attribute to the boolean search increases the elapsed time to 4x and 2x of
>original for the 85% and 15% frequencies respective.
>
>I had some hope that a QueryFilter would really help out but it turns out to
>be much much slower:  the 85% term ends up taking a whopping 336ms and the
>15% term ends up around 65ms which is 40x and 8x slower than the original
>8ms query speed without the additional attribute.
>
>I have to ask if there's not a better way to handle the addition of an
>common attribute with a few possible values across the index.  Any other
>recommended approaches?
>
>Thanks in advance,
>
>Vince
>
>  
>



		
___________________________________________________________ 
To help you stay safe and secure online, we've developed the all new Yahoo! Security Centre. http://uk.security.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org