Posted to solr-user@lucene.apache.org by Alex Benjamen <ab...@friendster.com> on 2008/01/03 01:28:05 UTC
Performance stats for indices with over 10MM documents
Hi,
I'm very interested in sharing performance stats with those who have indices that
contain more than 10MM documents. It seems that response times and QPS
degrade drastically as the number of documents in the index grows. This overall makes
sense, but it would be good to know what kind of QPS others are getting on
their Solr servers for comparison.
Many thanks to anyone kind enough to share their stats. Here's what I'm getting on
mine (which I'm not very pleased with):
Hardware: AMD Dual-Core, 16 GB RAM, 2.2 GHz
Index size on disk: 1.3 GB
Index in RAM: Y
Documents: 8MM
QPS: 7
Mean response time: 800ms
queryResultCache hit%: 20
Another example - this is really bad :(
Hardware: AMD Dual-Core, 16 GB RAM, 2.2 GHz
Index size on disk: 3 GB
Index in RAM: Y
Documents: 21MM
QPS: 1.6
Mean response time: < 1300ms
queryResultCache hit%: 10
Is anyone able to get better numbers than this on large indices with over 10MM records?
(Of course, if your cache hit ratio is much higher than mine, that doesn't count.)
Thanks for sharing...
-Alex
Re: Performance stats for indices with over 10MM documents
Posted by John Stewart <ca...@gmail.com>.
Alex,
That's too slow. Can you provide more details about your schema, queries etc?
jds
Re: Performance stats for indices with over 10MM documents
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Are you faceting?
Please provide the exact lines logged in Solr's console for the
offending queries - that would show us _exactly_ what you're hitting
Solr with, not just the q parameter as you seem to have provided.
Erik
Re: Performance stats for indices with over 10MM documents
Posted by Mike Klaas <mi...@gmail.com>.
On 2-Jan-08, at 9:52 PM, Alex Benjamen wrote:
>
> Thanks for the input, it's really valuable. Several forum users
> have suggested using fq to separate
> the caching of filters, and I can immediately see how this would
> help. I'm changing the code right now
> and going to run some benchmarks, hopefully see a big gain just
> from that
Sure. Make sure you are using a realistic query distribution. If
you are always picking random unique values for everything, there
might be less of a gain. Also, even without profiling, it can be
quite valuable to track the time for each query and look at the
reverse sorted list: it tends to quickly identify troublesome
inputs. For instance, you might find that the slowest search
contains a 64-clause disjunction (age:22-85).
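A minimal sketch of that timing log, assuming you already have some client function that sends one query to Solr. The names `timed`, `slowest` and `fake_search` are made up for illustration; `fake_search` only stands in for the real request.

```python
import time

def timed(search_fn, query):
    """Run one query and return (query, elapsed seconds)."""
    start = time.perf_counter()
    search_fn(query)
    return query, time.perf_counter() - start

def slowest(timings, n=10):
    """Return the n slowest (query, seconds) pairs, slowest first."""
    return sorted(timings, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical stand-in for the real client call that hits Solr.
def fake_search(query):
    time.sleep(0.001 * len(query))

queries = ["gender:m", "age:(22 OR 23 OR 24)", "orientation:2"]
timings = [timed(fake_search, q) for q in queries]
for query, seconds in slowest(timings, n=2):
    print(f"{seconds * 1000:.1f} ms  {query}")
```

Reading the top of that list is usually enough to spot the inputs worth rewriting first.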
>
>> - use range queries when querying contiguous disjunctions (age:[28
>> TO 33] rather than what you have above).
> I actually started with the above, using int type field, and it
> somehow seemed slower than using explicit, but I will
> certainly try again.
>
>
>> - convert the expensive, heap-based age filter disjunction into a
>> bitset created directly from the term enum
> Can you please elaborate a little more? Are you advising to use
> fq=age:[28 TO 33], or should that simply be part
> of the regular query? Also, what is the best "type" to use when
> defining age? I'm currently using "text", should
> I use "int" instead... I didn't see any difference with using the
> type "int".
It doesn't matter if you're just searching like this. I was going to
warn you about padding issues, but since this is for a dating app it
is unlikely that you will have to worry about 1- or 3-digit ages.
> One of the issues is that the age ranges are not "pre-defined" -
> they can be any combination, 22-23, 22-85, 45-49, etc.
> I realize that pre-defining age ranges would drastically improve
> performance but then we're greatly reducing the value
> of this type of search
Yes, but you can compose age ranges to gain performance without
losing flexibility. Imagine you index a field "age_mod_five" where
age_mod_five:25 means the person is between the ages of 25-29
inclusive. Then you can transform a long disjunction such as age:22-85
into a 4-clause disjunction and a 12-term range query:
fq=age:(22 OR 23 OR 24 OR 85) OR age_mod_five:[25 TO 80]
By the way, the OR's are implied, so they are not necessary in the
above (nor in the other examples you posted).
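The bucket composition can be sketched as follows. This is only an illustration under the assumption that age_mod_five holds the start of each five-year bucket; `compose_age_filter` and `to_fq` are hypothetical helper names, not anything Solr provides. Note that for 22-85 the ages 22, 23 and 24 sit below the first whole bucket and 85 sits above the last one.

```python
def compose_age_filter(lo, hi, width=5):
    """Cover [lo, hi] with whole width-year buckets plus leftover single ages."""
    first_bucket = -(-lo // width) * width         # first bucket start >= lo
    covered_end = (hi + 1) // width * width        # exclusive end of whole buckets
    if first_bucket >= covered_end:
        # No whole bucket fits inside the range; fall back to single ages.
        return list(range(lo, hi + 1)), None
    singles = list(range(lo, first_bucket)) + list(range(covered_end, hi + 1))
    return singles, (first_bucket, covered_end - width)

def to_fq(lo, hi):
    """Render the composed filter as one fq value."""
    singles, buckets = compose_age_filter(lo, hi)
    clauses = []
    if singles:
        clauses.append("age:(" + " OR ".join(map(str, singles)) + ")")
    if buckets:
        clauses.append(f"age_mod_five:[{buckets[0]} TO {buckets[1]}]")
    return " OR ".join(clauses)

print(to_fq(22, 85))
print(to_fq(45, 49))
print(to_fq(22, 23))
```

Short ranges that do not contain a whole bucket simply stay as a plain disjunction, so nothing is lost for queries like 22-23.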
Perhaps a better option is to do an "inclusion-exclusion" trick.
Take again the example of age:[22 TO 85]. It is really just the
range 20-89, excluding a few years. So convert it into:
fq=age_mod_five:[20 TO 85]
fq=-age:20
fq=-age:21
fq=-age:86
fq=-age:87
fq=-age:88
fq=-age:89
Hopefully the mod-5 filters will be reused enough to be performant,
and so too with the individual ages. '-' is the NOT operator, btw.
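A rough sketch of generating those inclusion-exclusion filters for any range, again assuming age_mod_five stores the start of each five-year bucket. `inclusion_exclusion` is a made-up helper name.

```python
def inclusion_exclusion(lo, hi, width=5):
    """Return fq values: one enclosing bucket range, then one NOT per extra age."""
    first = lo // width * width     # bucket that contains lo
    last = hi // width * width      # bucket that contains hi
    fqs = [f"age_mod_five:[{first} TO {last}]"]
    # Ages the enclosing buckets cover but the requested range does not.
    excluded = list(range(first, lo)) + list(range(hi + 1, last + width))
    fqs += [f"-age:{age}" for age in excluded]
    return fqs

for fq in inclusion_exclusion(22, 85):
    print("fq=" + fq)
```

Each returned value is meant to be sent as its own fq parameter so that it can be cached and reused on its own.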
If you go far down this route, one more idea for you is to use open-ended
ranges. Imagine a field called "younger_than", in which you
index _every_ age greater than the age of the person, up to some
maximum (like 99), so that younger_than:N matches everyone younger
than N. You can then create any range given two
constraints. age:22-85 becomes:
fq=younger_than:86
fq=-younger_than:22
There would be one bitset per valid age, and soon every possible
range will be served from cache with one bitset operation. Your
index might be a tad large, depending on the median age of your
"documents". If your target tends toward the youthful, it is
better to index "older_than" (every age less than the person's age,
down to some minimum like 18) and invert the logic.
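A minimal sketch of the open-ended range idea, assuming younger_than:N is meant to match everyone younger than N. Under that assumption each document carries every age value strictly greater than the person's own age, up to a cap. `MAX_AGE`, `younger_than_terms` and `matches` are hypothetical names, and plain Python sets stand in for the indexed terms.

```python
MAX_AGE = 99  # hypothetical cap on the ages people search for

def younger_than_terms(age):
    """Every value N for which 'this person is younger than N' is true."""
    return set(range(age + 1, MAX_AGE + 2))

def matches(age, lo, hi):
    """Emulate fq=younger_than:(hi+1) combined with fq=-younger_than:lo."""
    terms = younger_than_terms(age)
    return (hi + 1) in terms and lo not in terms

hits = [age for age in range(18, MAX_AGE + 1) if matches(age, 22, 85)]
print(hits[0], hits[-1])
```

The number of terms per document is `MAX_AGE - age + 1` here, which is why the age distribution of the documents decides which direction is cheaper to index.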
These gymnastics are somewhat silly, since (as others have mentioned)
full-text search logic is a poor tool for this job. Usually it
doesn't make much of a difference, but when you're scaling to a huge
level, you need to use the right tool for the job.
This doesn't necessarily mean that you shouldn't use Solr, though, just
that it would be best to solve this problem using a better data
structure. (In this case, the most memory-efficient and probably
fastest method is to implement a filter using a FieldCache that looks
up the age of each doc under consideration and does a range check.
Coaxing Solr into using it correctly might be a little tricky).
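The per-document lookup can be pictured with a plain array. This is not Lucene's FieldCache API, only an illustration of the idea under the assumption that one age per document id is held in memory; `ages_by_doc` and `age_range_filter` are made-up names and the ages are sample values.

```python
from array import array

# Hypothetical in-memory column: position is the doc id, value is the age.
ages_by_doc = array("B", [31, 22, 47, 85, 19, 64])

def age_range_filter(candidate_docs, lo, hi):
    """Keep only the candidate doc ids whose age lies in [lo, hi]."""
    return [doc for doc in candidate_docs if lo <= ages_by_doc[doc] <= hi]

print(age_range_filter(range(len(ages_by_doc)), 22, 64))
```

One byte per document is enough for ages, so even 21MM documents would need roughly 21 MB for this column, and any range costs the same single comparison per candidate.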
good luck,
-Mike
RE: Performance stats for indices with over 10MM documents
Posted by Alex Benjamen <ab...@friendster.com>.
We currently use a relational system, and it doesn't perform. Also, even though
a lot of our queries are structured, we do combine them with text search, so,
for instance, there could be an additional clause which is a free-text search for
a favorite TV show.
Re: Performance stats for indices with over 10MM documents
Posted by Walter Underwood <wu...@netflix.com>.
I had exactly the same thought. That query is not an information
retrieval (text search) query. It is data retrieval and would
work great on a relational database.
wunder
Re: Performance stats for indices with over 10MM documents
Posted by John Stewart <ca...@gmail.com>.
Alex,
Not to be a pain, but the response I had when looking at the query
was, why not do this in a SQL database, which is designed precisely to
process this sort of request at speed? I've noticed that people
sometimes try to get Solr to act as a generalized information store --
I'm not sure that's what you're doing, but be aware of this pitfall.
jds
RE: Performance stats for indices with over 10MM documents
Posted by Alex Benjamen <ab...@friendster.com>.
Mike,
Thanks for the input, it's really valuable. Several forum users have suggested using fq to separate
the caching of filters, and I can immediately see how this would help. I'm changing the code right now
and going to run some benchmarks; hopefully I'll see a big gain just from that.
> - use range queries when querying contiguous disjunctions (age:[28 TO 33] rather than what you have above).
I actually started with the above, using an int type field, and it somehow seemed slower than listing the values explicitly, but I will
certainly try again.
> - convert the expensive, heap-based age filter disjunction into a bitset created directly from the term enum
Can you please elaborate a little more? Are you advising to use fq=age:[28 TO 33], or should that simply be part
of the regular query? Also, what is the best "type" to use when defining age? I'm currently using "text"; should
I use "int" instead? I didn't see any difference with using the type "int".
One of the issues is that the age ranges are not "pre-defined" - they can be any combination, 22-23, 22-85, 45-49, etc.
I realize that pre-defining age ranges would drastically improve performance, but then we're greatly reducing the value
of this type of search.
Thanks,
Alex
Re: Performance stats for indices with over 10MM documents
Posted by Mike Klaas <mi...@gmail.com>.
On 2-Jan-08, at 5:47 PM, Alex Benjamen wrote:
>
> gender:m AND status:(2 || 8 || 6 || 3) AND age:(26 || 27 || 28 ||
> 29) AND orientation:3
> gender:f AND age:(27 || 28 || 29 || 30 || 31 || 32 || 33 || 34 ||
> 35 || 36 ) AND orientation:2 AND photos:y
> gender:f AND (activity:y) AND age:(28 || 29 || 30 || 31 || 32 ||
> 33 ) AND orientation:2
I think it is the nature of your queries. Lucene/Solr is optimized
for full-text search, not for rather complicated boolean constraint queries.
The first step:
- set q.alt = *:* as a default parameter
- convert all disjunctions to separate fq parameters
- use range queries when querying contiguous disjunctions (age:[28
TO 33] rather than what you have above).
This should:
- allow your filters to cache separately, improving reuse
- make queries that are combinations of previously-cached filters
be nothing but a few bitset intersections
- convert the expensive, heap-based age filter disjunction into a
bitset created directly from the term enum
There is a lot of lucenese in what I just wrote, but the gist is that
your queries should be substantially faster.
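As a rough illustration of the first step, here is one way the first sample query could be rebuilt as separate fq parameters. The helper name `build_filter_params` is made up, and the sketch passes q=*:* explicitly on each request rather than relying on a configured q.alt default; that is an assumption about the setup, not a requirement.

```python
from urllib.parse import urlencode

def build_filter_params(gender, orientation, age_min, age_max, statuses=()):
    """Turn one profile search into separate, individually cacheable fq clauses."""
    params = [("q", "*:*")]  # match everything; the filters do the real work
    params.append(("fq", f"gender:{gender}"))
    params.append(("fq", f"orientation:{orientation}"))
    # A contiguous run of ages becomes a single range query.
    params.append(("fq", f"age:[{age_min} TO {age_max}]"))
    if statuses:
        joined = " OR ".join(str(s) for s in statuses)
        params.append(("fq", f"status:({joined})"))
    return params

params = build_filter_params("m", 3, 26, 29, statuses=(2, 8, 6, 3))
query_string = urlencode(params)  # repeated fq keys are kept as separate params
print(query_string)
```

Because gender:m or orientation:3 now appear as filters of their own, a later search that shares them can reuse the cached results instead of recomputing the whole boolean query.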
If that isn't sufficient, it is likely that a lot of performance
could be gained by creating a means of caching filter disjuncts which
would then be combined (essentially, each age would be a separate
bitset and the union would be taken at query time). These kinds of
things don't arise all that often in the Solr world, though, so there
isn't built-in capability for this. It wouldn't be atrocious to
implement, though.
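The per-age caching idea can be pictured like this. It is only a sketch of the concept, using Python integers as bitsets over document ids; it does not reflect any existing Solr capability, and `build_age_bitsets`, `union_for_ages` and `doc_ids` are hypothetical names with sample data.

```python
def build_age_bitsets(ages_by_doc):
    """Build one bitset per distinct age; bit i is set when doc i has that age."""
    bitsets = {}
    for doc_id, age in enumerate(ages_by_doc):
        bitsets[age] = bitsets.get(age, 0) | (1 << doc_id)
    return bitsets

def union_for_ages(bitsets, ages):
    """OR together the cached bitsets for the requested ages at query time."""
    result = 0
    for age in ages:
        result |= bitsets.get(age, 0)
    return result

def doc_ids(bits):
    """List the doc ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

cached = build_age_bitsets([28, 31, 28, 40, 33])
print(doc_ids(union_for_ages(cached, range(28, 34))))
```

With one cached bitset per age, any range or arbitrary set of ages costs only a handful of OR operations, regardless of how the ranges overlap between queries.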
-Mike
RE: Performance stats for indices with over 10MM documents
Posted by Alex Benjamen <ab...@friendster.com>.
JDS:
> That's too slow. Can you provide more details about your schema, queries etc?
Of course - I'm using the standard config which comes with Solr, and I've added
the following fields:
<field name="id" type="integer" indexed="true" stored="true" required="true" />
<field name="status" type="text" indexed="true" stored="false"/>
<field name="seeking" type="text" indexed="true" stored="false"/>
<field name="gender" type="text" indexed="true" stored="false"/>
<field name="orientation" type="text" indexed="true" stored="false"/>
<field name="interests" type="text" indexed="true" stored="false"/>
<field name="relationship" type="text" indexed="true" stored="false"/>
<field name="dating" type="text" indexed="true" stored="false"/>
<field name="friends" type="text" indexed="true" stored="false"/>
<field name="activity" type="text" indexed="true" stored="false"/>
<field name="age" type="sint" indexed="true" stored="true"/>
<field name="country" type="text" indexed="true" stored="false"/>
<field name="zipcode" type="text" indexed="true" stored="false"/>
<field name="postalcode" type="text" indexed="true" stored="false"/>
<field name="state" type="text" indexed="true" stored="false"/>
<field name="province" type="text" indexed="true" stored="false"/>
<field name="city" type="text" indexed="true" stored="false"/>
<field name="fbooks" type="text" indexed="true" stored="false"/>
<field name="fmovies" type="text" indexed="true" stored="false"/>
<field name="fmusic" type="text" indexed="true" stored="false"/>
<field name="ftv" type="text" indexed="true" stored="false"/>
<field name="school" type="text" indexed="true" stored="false"/>
<field name="hometown" type="text" indexed="true" stored="false"/>
<field name="lastlogin_months" type="sint" indexed="true" stored="false"/>
<field name="locationString" type="text" indexed="true" stored="false"/>
<field name="affiliations" type="text" indexed="true" stored="false"/>
<field name="companies" type="text" indexed="true" stored="false"/>
<field name="photos" type="text" indexed="true" stored="false"/>
A typical query looks something like this:
gender:m AND status:(2 || 8 || 6 || 3) AND age:(26 || 27 || 28 || 29) AND orientation:3
gender:f AND age:(27 || 28 || 29 || 30 || 31 || 32 || 33 || 34 || 35 || 36 ) AND orientation:2 AND photos:y
gender:f AND (activity:y) AND age:(28 || 29 || 30 || 31 || 32 || 33 ) AND orientation:2
Thanks
-Alex