Posted to solr-user@lucene.apache.org by Alex Benjamen <ab...@friendster.com> on 2008/01/03 01:28:05 UTC

Performance stats for indices with over 10MM documents

Hi,
 
I'm very interested in sharing performance stats with those who have indices that
contain more than 10MM documents. It seems that response times rise and QPS
drops drastically as the number of documents in the index grows. Overall this makes
sense, but it would be good to know what kind of QPS others are getting on their
Solr servers, for comparison.

Many thanks to anyone kind enough to share their stats. Here's what I'm getting on
mine (which I'm not very pleased with):
 
Hardware: AMD Dual-Core, 16 GB RAM, 2.2 GHz
Index size on disk: 1.3 GB
Index in RAM: Y
Documents: 8MM
QPS: 7
Mean response time: 800ms
queryResultCache hit%: 20
 
 
Another example - this is really bad :(
 
Hardware: AMD Dual-Core, 16 GB RAM, 2.2 GHz
Index size on disk: 3 GB
Index in RAM: Y
Documents: 21MM
QPS: 1.6
Mean response time: < 1300ms
queryResultCache hit%: 10
 
Is anyone able to get better numbers than this on large indices with over 10MM records?
(Of course, if your cache hit ratio is much higher than mine, that doesn't count.)
 
Thanks for sharing...
-Alex

Re: Performance stats for indices with over 10MM documents

Posted by John Stewart <ca...@gmail.com>.

Alex,

That's too slow.  Can you provide more details about your schema, queries etc?

jds


Re: Performance stats for indices with over 10MM documents

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Are you faceting?

Please provide the exact lines logged in Solr's console for the  
offending queries - that would show us _exactly_ what you're hitting  
Solr with, not just the q parameter as you seem to have provided.

	Erik



Re: Performance stats for indices with over 10MM documents

Posted by Mike Klaas <mi...@gmail.com>.

On 2-Jan-08, at 9:52 PM, Alex Benjamen wrote:

>
> Thanks for the input, it's really valuable. Several forum users  
> have suggested using fq to separate
> the caching of filters, and I can immediately see how this would  
> help. I'm changing the code right now
> and going to run some benchmarks, hopefully see a big gain just  
> from that

Sure.  Make sure you are using a realistic query distribution.  If
you are always picking random unique values for everything, there
might be less of a gain.  Also, even without profiling, it can be
quite valuable to track the time for each query and look at the
reverse-sorted list: it tends to quickly identify troublesome
inputs.  For instance, you might find that the slowest search
contains a 64-clause disjunction (age:22-85).
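
As a rough illustration of the timing idea, here is a minimal Python sketch. `run_query` is a hypothetical callable standing in for whatever client code actually sends a query to Solr; nothing here is part of Solr itself, and the `clock` parameter exists only to make the function easy to test.

```python
import time
from typing import Callable, Iterable, List, Tuple


def slowest_queries(
    queries: Iterable[str],
    run_query: Callable[[str], object],
    top: int = 10,
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[float, str]]:
    """Time each query and return the `top` slowest as (seconds, query)."""
    timings = []
    for q in queries:
        start = clock()
        run_query(q)  # hypothetical: sends q to Solr and waits for the reply
        timings.append((clock() - start, q))
    # Reverse sort so the most expensive inputs come first.
    return sorted(timings, reverse=True)[:top]
```

Feeding it a replay of real production queries, rather than randomly generated ones, keeps the distribution realistic.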

>
>> - use range queries when querying contiguous disjunctions (age:[28  
>> TO 33] rather than what you have above).
> I actually started with the above, using int type field, and it  
> somehow seemed slower than using explicit, but I will
> certainly try again.
>
>
>>  - convert the expensive, heap-based age filter disjunction into a  
>> bitset created directly from the term enum
> Can you pls. elaborate a little more? Are you advising to use  
> fq=age:[28 TO 33], or should that simply be part
> of the regular query? Also, what is the best "type" to use when  
> defining age? I'm currently using "text", should
> I use "int" instead... I didn't see any difference with using the  
> type "int".

It doesn't matter if you're just searching like this.  I was going to  
warn you about padding issues, but since this is for a dating app it  
is unlikely that you will have to worry about 1- or 3-digit ages.

> One of the issues is that the age ranges are not "pre-defined" -  
> they can be any combination, 22-23, 22-85, 45-49, etc.
> I realize that pre-defining age ranges would drastically improve  
> performance but then we're greatly reducing the value
> of this type of search

Yes, but you can compose age ranges to gain performance without
losing flexibility.  Imagine you index a field "age_mod_five" where
age_mod_five:25 means the person is between the ages of 25-29
inclusive.  Then you can transform a 64-clause disjunction (ages 22-85)
into a 4-clause disjunction and a 12-term range query:

fq=age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]

By the way, the OR's are implied, so they are not necessary in the  
above (nor in the other examples you posted).
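
The decomposition into whole five-year buckets plus leftover single ages can be sketched in Python as follows. This is only an illustration: the function name is made up, and it assumes two-digit ages so that the range on age_mod_five sorts correctly.

```python
def compose_age_filter(lo: int, hi: int, step: int = 5) -> str:
    """Cover lo..hi with whole age_mod_five buckets plus leftover single ages."""
    first = -(-lo // step) * step            # first bucket starting at or after lo
    last = ((hi + 1) // step) * step - step  # last bucket ending at or before hi
    parts = []
    if first <= last:
        # Ages before the first whole bucket and after the last one.
        singles = list(range(lo, first)) + list(range(last + step, hi + 1))
        bucket_clause = f"age_mod_five:[{first} TO {last}]"
    else:
        # The range is too narrow to contain a whole bucket.
        singles = list(range(lo, hi + 1))
        bucket_clause = ""
    if singles:
        parts.append("age:(" + " OR ".join(str(a) for a in singles) + ")")
    if bucket_clause:
        parts.append(bucket_clause)
    return " OR ".join(parts)


print(compose_age_filter(22, 85))
```

For 22-85 the leftover ages are 22, 23, 24 and 85, and the whole buckets run from 25 to 80.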

Perhaps a better option is to do an "inclusion-exclusion" trick.   
Take again the example of age:[22 TO 85].  It is really just the  
range 20-89, excluding a few years.   So convert it into:

fq=age_mod_five:[20 TO 85]
fq=-age:20
fq=-age:21
fq=-age:86
fq=-age:87
fq=-age:88
fq=-age:89

Hopefully the mod-5 filters will be reused enough to be performant,  
and so too will the individual ages.  '-' is the NOT operator, btw.
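
A small Python sketch of the inclusion-exclusion trick, assuming the same age and age_mod_five fields (the function name is hypothetical):

```python
from typing import List


def inclusion_exclusion_filters(lo: int, hi: int, step: int = 5) -> List[str]:
    """One bucket range covering lo..hi, plus a NOT filter per surplus age."""
    outer_first = (lo // step) * step  # bucket that contains lo
    outer_last = (hi // step) * step   # bucket that contains hi
    filters = [f"age_mod_five:[{outer_first} TO {outer_last}]"]
    # Ages covered by the outer buckets but outside the requested range.
    surplus = list(range(outer_first, lo)) + list(range(hi + 1, outer_last + step))
    filters += [f"-age:{age}" for age in surplus]
    return filters


for fq in inclusion_exclusion_filters(22, 85):
    print("fq=" + fq)
```

Each returned string is meant to go in its own fq parameter so that it is cached separately.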

If you go far down this route, one more idea for you is to use
open-ended ranges.  Imagine a field called "younger_than", in which you
index _every_ age greater than the age of the person, up to some
maximum (like 99), so that younger_than:86 matches everyone aged 85
or under.  You can then create any range given two
constraints.  age:22-85 becomes:

fq=younger_than:86
fq=-younger_than:22

There would be one bitset per valid age, and soon every possible
range will be served from cache with one bitset operation.  Your
index might be a tad large, depending on the median age of your
"documents", since younger people carry more values.  If your target
tends toward the young, it is better to index "older_than" and invert
the logic.
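
Here is a Python sketch of the open-ended range idea. It takes the field name literally: a person of age A is assumed to carry every value N with A < N, so that younger_than:N means "this person is younger than N". That reading, the MAX_AGE bound, and all function names are assumptions made for the illustration.

```python
from typing import List

MAX_AGE = 99  # hypothetical upper bound on indexed values


def younger_than_values(age: int, max_age: int = MAX_AGE) -> List[int]:
    """Values to index for one person: every N with age < N <= max_age."""
    return list(range(age + 1, max_age + 1))


def age_range_filters(lo: int, hi: int) -> List[str]:
    """Two fq values that together select lo <= age <= hi."""
    return [f"younger_than:{hi + 1}", f"-younger_than:{lo}"]


def matches(age: int, lo: int, hi: int) -> bool:
    """Simulate both filters against one person's indexed values."""
    values = set(younger_than_values(age))
    return (hi + 1) in values and lo not in values
```

The `matches` helper is only there to check the logic without a running index; the upper bound must stay below MAX_AGE for the first filter to match anyone.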

These gymnastics are somewhat silly, since (as others have mentioned)  
full-text search logic is a poor tool for this job.  Usually it  
doesn't make much of a difference, but when you're scaling to a huge  
level, you need to use the right tool for the job.

This doesn't necessarily mean that you shouldn't use Solr, though, just
that it would be best to solve this problem using a better data
structure.  (In this case, the most memory-efficient and probably
fastest method is to implement a filter using a FieldCache that looks
up the age of each doc under consideration and does a range check.
Coaxing Solr into using it correctly might be a little tricky.)
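
To show the shape of that last idea, here is a conceptual Python sketch. It is not Lucene's FieldCache API: it just models a per-document array of ages indexed by doc id, with a range check applied to each candidate. The data is made up.

```python
from array import array
from typing import Iterable, List


def filter_by_age(candidates: Iterable[int], ages: array, lo: int, hi: int) -> List[int]:
    """Keep the candidate doc ids whose age falls within lo..hi inclusive."""
    return [doc for doc in candidates if lo <= ages[doc] <= hi]


# One unsigned byte per document: doc id -> age (illustrative values only).
ages = array("B", [25, 40, 19, 33, 62])
print(filter_by_age(range(len(ages)), ages, 22, 40))
```

The appeal is that memory use is one small integer per document, independent of how many distinct ranges are queried.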

good luck,
-Mike

RE: Performance stats for indices with over 10MM documents

Posted by Alex Benjamen <ab...@friendster.com>.

We currently use a relational system, and it doesn't perform. Also, even though
a lot of our queries are structured, we do combine them with text search, so
for instance, there could be an additional clause which is a free-text search for
a favorite TV show.


Re: Performance stats for indices with over 10MM documents

Posted by Walter Underwood <wu...@netflix.com>.

I had exactly the same thought. That query is not an information
retrieval (text search) query. It is data retrieval and would
work great on a relational database.

wunder


Re: Performance stats for indices with over 10MM documents

Posted by John Stewart <ca...@gmail.com>.

Alex,

Not to be a pain, but the response I had when looking at the query
was, why not do this in a SQL database, which is designed precisely to
process this sort of request at speed?  I've noticed that people
sometimes try to get Solr to act as a generalized information store --
I'm not sure that's what you're doing, but be aware of this pitfall.

jds


RE: Performance stats for indices with over 10MM documents

Posted by Alex Benjamen <ab...@friendster.com>.

Mike,
 
Thanks for the input, it's really valuable. Several forum users have suggested using fq to separate
the caching of filters, and I can immediately see how this would help. I'm changing the code right now
and going to run some benchmarks; hopefully I'll see a big gain just from that.


> - use range queries when querying contiguous disjunctions (age:[28 TO 33] rather than what you have above).
I actually started with the above, using an int type field, and it somehow seemed slower than using explicit values, but I will
certainly try again.


>  - convert the expensive, heap-based age filter disjunction into a bitset created directly from the term enum
Can you please elaborate a little more? Are you advising to use fq=age:[28 TO 33], or should that simply be part
of the regular query? Also, what is the best "type" to use when defining age? I'm currently using "text"; should
I use "int" instead? I didn't see any difference when using the type "int".

One of the issues is that the age ranges are not "pre-defined": they can be any combination, 22-23, 22-85, 45-49, etc.
I realize that pre-defining age ranges would drastically improve performance, but then we're greatly reducing the value
of this type of search.
 
Thanks,
Alex

Re: Performance stats for indices with over 10MM documents

Posted by Mike Klaas <mi...@gmail.com>.

On 2-Jan-08, at 5:47 PM, Alex Benjamen wrote:
>
> gender:m AND status:(2 || 8 || 6 || 3) AND age:(26 || 27 || 28 ||  
> 29) AND orientation:3
> gender:f AND age:(27 || 28 || 29 || 30 || 31 || 32 || 33 || 34 ||  
> 35 || 36 ) AND orientation:2 AND photos:y
> gender:f AND (activity:y) AND age:(28 || 29 || 30 || 31 || 32 ||  
> 33 ) AND orientation:2

I think it is the nature of your queries.  Lucene/Solr is optimized  
for full-text search, not rather complicated boolean constraint queries.

The first step:
  - set q.alt = *:* as a default parameter
  - convert all disjunctions to separate fq parameters
  - use range queries when querying contiguous disjunctions (age:[28  
TO 33] rather than what you have above).

This should:
  - allow your filters to cache separately, improving reuse
  - make queries that are combinations of previously-cached filters  
be nothing but a few bitset intersections
  - convert the expensive, heap-based age filter disjunction into a  
bitset created directly from the term enum

There is a lot of lucenese in what I just wrote, but the gist is that  
your queries should be substantially faster.
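
A minimal Python sketch of what that first step could look like on the client side. It only builds request parameters and does not talk to Solr. The function names are made up, putting every constraint (not just the disjunctions) into its own fq is one reading of the advice, and q.alt is passed per request purely for illustration; whether it takes effect depends on the request handler and its configured defaults.

```python
from typing import Dict, List, Sequence, Tuple
from urllib.parse import urlencode


def to_filter(field: str, values: Sequence[str]) -> str:
    """Render one field constraint as a filter query string."""
    if len(values) == 1:
        return f"{field}:{values[0]}"
    if all(v.isdigit() for v in values):
        nums = sorted(int(v) for v in values)
        # Contiguous run of integers: use a range query, not a disjunction.
        if nums == list(range(nums[0], nums[-1] + 1)):
            return f"{field}:[{nums[0]} TO {nums[-1]}]"
    return f"{field}:(" + " OR ".join(values) + ")"


def build_params(constraints: Dict[str, Sequence[str]]) -> List[Tuple[str, str]]:
    """One fq per field so that each filter is cached separately."""
    params = [("q.alt", "*:*")]
    params += [("fq", to_filter(f, vals)) for f, vals in constraints.items()]
    return params


params = build_params({
    "gender": ["m"],
    "status": ["2", "8", "6", "3"],
    "age": ["26", "27", "28", "29"],
    "orientation": ["3"],
})
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL
```

With the first sample query as input, the age clause collapses to age:[26 TO 29] while the non-contiguous status values stay a disjunction.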

If that isn't sufficient, it is likely that a lot of performance  
could be gained by creating a means of caching filter disjuncts which  
would then be combined (essentially, each age would be a separate  
bitset and the union would be taken at query time).  These kinds of  
things don't arise all that often in the Solr world, though, so there  
isn't built-in capability for this.  It wouldn't be atrocious to  
implement, though.
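
To make the per-age bitset idea concrete, here is a toy Python sketch that uses plain integers as bitsets. It illustrates the data structure only; it is not an existing Solr feature, and the names and sample ages are invented.

```python
from typing import Dict, Iterable, List


def build_age_bitsets(doc_ages: Iterable[int]) -> Dict[int, int]:
    """One bitset per age; bit i is set when document i has that age."""
    bitsets: Dict[int, int] = {}
    for doc_id, age in enumerate(doc_ages):
        bitsets[age] = bitsets.get(age, 0) | (1 << doc_id)
    return bitsets


def union_for_ages(bitsets: Dict[int, int], ages: Iterable[int]) -> int:
    """Union the cached per-age bitsets at query time."""
    result = 0
    for age in ages:
        result |= bitsets.get(age, 0)
    return result


def doc_ids(bits: int) -> List[int]:
    """Expand a bitset back into the list of matching document ids."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]


bitsets = build_age_bitsets([25, 40, 19, 33, 25])
print(doc_ids(union_for_ages(bitsets, [25, 33])))
```

The per-age bitsets would be built once and cached, so any age disjunction costs only a handful of OR operations.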

-Mike

RE: Performance stats for indices with over 10MM documents

Posted by Alex Benjamen <ab...@friendster.com>.

JDS:

> That's too slow.  Can you provide more details about your schema, queries etc?

Of course. I'm using the standard config which comes with Solr, and I've added
the following fields:

   <field name="id" type="integer" indexed="true" stored="true" required="true" />
   <field name="status" type="text" indexed="true" stored="false"/>
   <field name="seeking" type="text" indexed="true" stored="false"/>
   <field name="gender" type="text" indexed="true" stored="false"/>
   <field name="orientation" type="text" indexed="true" stored="false"/>
   <field name="interests" type="text" indexed="true" stored="false"/>
   <field name="relationship" type="text" indexed="true" stored="false"/>
   <field name="dating" type="text" indexed="true" stored="false"/>
   <field name="friends" type="text" indexed="true" stored="false"/>
   <field name="activity" type="text" indexed="true" stored="false"/>
   <field name="age" type="sint" indexed="true" stored="true"/>
   <field name="country" type="text" indexed="true" stored="false"/>
   <field name="zipcode" type="text" indexed="true" stored="false"/>
   <field name="postalcode" type="text" indexed="true" stored="false"/>
   <field name="state" type="text" indexed="true" stored="false"/>
   <field name="province" type="text" indexed="true" stored="false"/>
   <field name="city" type="text" indexed="true" stored="false"/>
   <field name="fbooks" type="text" indexed="true" stored="false"/>
   <field name="fmovies" type="text" indexed="true" stored="false"/>
   <field name="fmusic" type="text" indexed="true" stored="false"/>
   <field name="ftv" type="text" indexed="true" stored="false"/>
   <field name="school" type="text" indexed="true" stored="false"/>
   <field name="hometown" type="text" indexed="true" stored="false"/>
   <field name="lastlogin_months" type="sint" indexed="true" stored="false"/>
   <field name="locationString" type="text" indexed="true" stored="false"/>
   <field name="affiliations" type="text" indexed="true" stored="false"/>
   <field name="companies" type="text" indexed="true" stored="false"/>
   <field name="photos" type="text" indexed="true" stored="false"/>


A typical query looks something like this:

gender:m AND status:(2 || 8 || 6 || 3) AND age:(26 || 27 || 28 || 29) AND orientation:3
gender:f AND age:(27 || 28 || 29 || 30 || 31 || 32 || 33 || 34 || 35 || 36 ) AND orientation:2 AND photos:y
gender:f AND (activity:y) AND age:(28 || 29 || 30 || 31 || 32 || 33 ) AND orientation:2

Thanks
-Alex