Posted to solr-user@lucene.apache.org by Steve Conover <sc...@gmail.com> on 2009/03/27 06:50:48 UTC

optimization advice?

Hi,

I've looked over the public Solr perf docs and done some searching on
this mailing list.  Still, I'd like to seek some advice based on my
specific situation:

- 2-3 million documents / 5GB index
- each document has 40+ indexed fields, and many multivalue fields
- only primary keys are "stored"
- very low write frequency
- queries can be sorted by any combination of fields, and are always
sorted by at least one field
- query criteria vary from very simple to very complex
  (the point about queries being that they're not very amenable to being cached)

So far I've set my mergeFactor very low.  I haven't paid much
attention to caching beyond basic query result caching; I don't
think many of the cache features apply well to my problem.
Increasing the amount of RAM available to Java (by 1GB) has had no
effect that I can detect.

Ideally I'd like to get response times down to near-instantaneous
(< 50ms), which is where they were when the index was ~1 million
documents.  I'd love to hear suggestions; in particular, are there
obvious optimization options I've missed?

Regards,
Steve

Re: optimization advice?

Posted by Steve Conover <sc...@gmail.com>.
Otis,

That's an interesting suggestion.  I'm curious about the thought
process behind it though - we currently don't have memory problems,
and in fact our max memory setting is below where it could be.

Does your suggestion imply that something could be gained by throwing
more memory at the problem?  If so, could you explain a little bit
about why?

Regards,
Steve

On Sat, Mar 28, 2009 at 6:31 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>
> OK, how about this trick then.  Do you really need the full string for sorting?  Could you get by (cheat) by sorting only on the first N characters?  If so, you could create a separate field for that (copyField will come in handy) and that should consume a little less memory.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: optimization advice?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
OK, how about this trick then.  Do you really need the full string for sorting?  Could you get by (cheat) by sorting only on the first N characters?  If so, you could create a separate field for that (copyField will come in handy) and that should consume a little less memory.
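
A minimal sketch of that idea in schema.xml, assuming a Solr version whose copyField supports the maxChars attribute (as far as I know it arrived around Solr 1.4, so check your release). The field name name_sort and the limit of 20 characters are hypothetical; pick N based on how much precision the sort really needs.

```xml
<!-- schema.xml (sketch): name_sort and maxChars="20" are illustrative assumptions -->
<fields>
  <!-- existing full-length field, still used for searching -->
  <field name="name" type="string" indexed="true" stored="false"/>
  <!-- hypothetical truncated copy, used only for sorting -->
  <field name="name_sort" type="string" indexed="true" stored="false"/>
</fields>

<!-- copy at most the first 20 characters of name into name_sort -->
<copyField source="name" dest="name_sort" maxChars="20"/>
```

Queries would then use sort=name_sort asc instead of sort=name asc. Values that share the same first N characters are no longer ordered relative to each other by the full name, which is the "cheat" part.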


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Steve Conover <sc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Saturday, March 28, 2009 1:13:04 AM
> Subject: Re: optimization advice?
> 
> String ;-) - we only allow sorting on string fields.

Re: optimization advice?

Posted by Steve Conover <sc...@gmail.com>.
String ;-) - we only allow sorting on string fields.

On Fri, Mar 27, 2009 at 9:21 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>
> Steve,
>
> A field named "name" sounds like a free text field.  What is its type, string or text?  Fields you sort by should not be tokenized and should be indexed.  I have a hunch your name field is tokenized.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: optimization advice?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Steve,

A field named "name" sounds like a free text field.  What is its type, string or text?  Fields you sort by should not be tokenized and should be indexed.  I have a hunch your name field is tokenized.
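
For reference, a sketch of what a sortable, untokenized field tends to look like in schema.xml. The string type below follows the one in the stock example schema; the field name "name" comes from this thread, and stored="false" is an assumption based on the earlier note that only primary keys are stored.

```xml
<!-- schema.xml (sketch): attributes are assumptions, adjust to your own schema -->
<types>
  <!-- StrField is not analyzed: the whole value is indexed as a single term -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
</types>

<fields>
  <!-- a sort field needs indexed="true" and a single value per document -->
  <field name="name" type="string" indexed="true" stored="false" multiValued="false"/>
</fields>
```

If the field were declared with a tokenized type such as solr.TextField instead, each document could carry several terms for that field and sorting on it would be unreliable.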


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Steve Conover <sc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, March 27, 2009 11:59:52 PM
> Subject: Re: optimization advice?
> 
> We sort by default on "name", which varies quite a bit (we're never
> going to make sorting by field go away).
> 
> The thing is, Solr has been pretty amazing across 1 million records.
> Now that we've doubled the size of the dataset, things are definitely
> slower in a nonlinear way... I'm wondering what factors are involved
> here.
> 
> -Steve

Re: optimization advice?

Posted by Steve Conover <sc...@gmail.com>.
We sort by default on "name", which varies quite a bit (we're never
going to make sorting by field go away).

The thing is, Solr has been pretty amazing across 1 million records.
Now that we've doubled the size of the dataset, things are definitely
slower in a nonlinear way... I'm wondering what factors are involved
here.

-Steve

On Fri, Mar 27, 2009 at 6:58 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>
> OK, we are a step closer.  Sorting makes things slower.  What field(s) do you sort on, and what are their types?  If there is a date in there, are the dates very granular, and if so, do you really need them to be that precise?
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: optimization advice?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
OK, we are a step closer.  Sorting makes things slower.  What field(s) do you sort on, and what are their types?  If there is a date in there, are the dates very granular, and if so, do you really need them to be that precise?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Steve Conover <sc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, March 27, 2009 1:51:14 PM
> Subject: Re: optimization advice?
> 
> > Steve,
> >
> > Maybe you can tell us about:
> 
> sure
> 
> > - your hardware
> 
> 2.5GB RAM, pretty modern virtual servers
> 
> > - query rate
> 
> Let's say a few queries per second max... < 4
> 
> And in general the challenge is to get latency on any given query down
> to something very low - we don't have to worry about a huge amount of
> load at the moment.
> 
> > - document cache and query cache settings
> 
> <queryResultCache
>         class="solr.LRUCache"
>         size="512"
>         initialSize="512"
>         autowarmCount="256"/>
> 
> <documentCache
>         class="solr.LRUCache"
>         size="512"
>         initialSize="512"
>         autowarmCount="0"/>
> 
> > - your current response times
> 
> This depends on the query.  For queries that involve a total record
> count of < 1 million, we often see < 10ms response times, up to
> 400-500ms in the worst case.  When we do a page one, sorted query on our
> full record set of 2 million+ records, response times can get up into
> 2+ seconds.
> 
> > - any pain points, any slow query patterns
> 
> Something that can't be emphasized enough is that we can't predict
> what records people will want.  Almost every query is aimed at a
> different set of records.
> 
> -Steve


Re: optimization advice?

Posted by Steve Conover <sc...@gmail.com>.
> Steve,
>
> Maybe you can tell us about:

sure

> - your hardware

2.5GB RAM, pretty modern virtual servers

> - query rate

Let's say a few queries per second max... < 4

And in general the challenge is to get latency on any given query down
to something very low - we don't have to worry about a huge amount of
load at the moment.

> - document cache and query cache settings

<queryResultCache
        class="solr.LRUCache"
        size="512"
        initialSize="512"
        autowarmCount="256"/>

<documentCache
        class="solr.LRUCache"
        size="512"
        initialSize="512"
        autowarmCount="0"/>

> - your current response times

This depends on the query.  For queries that involve a total record
count of < 1 million, we often see < 10ms response times, up to
400-500ms in the worst case.  When we do a page one, sorted query on our
full record set of 2 million+ records, response times can get up into
2+ seconds.
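
If the multi-second responses turn out to be mostly the first sorted query after a commit or restart, a warming query might be worth trying, since the sort data for a field is loaded the first time that field is sorted on with a new searcher. A sketch for solrconfig.xml using the standard QuerySenderListener; the query string, row count, and the choice of "name" as the sort field are assumptions and should mirror the sorts that are actually common:

```xml
<!-- solrconfig.xml (sketch): q, sort and rows values are illustrative assumptions -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- match everything so the sort touches the whole index -->
      <str name="q">*:*</str>
      <!-- sorting here loads the sort data for "name" before real traffic arrives -->
      <str name="sort">name asc</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```

The same block with event="firstSearcher" would cover the searcher opened at startup. With a very low write frequency, new searchers should be rare, so this only helps if the slow queries really are the cold ones.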

> - any pain points, any slow query patterns

Something that can't be emphasized enough is that we can't predict
what records people will want.  Almost every query is aimed at a
different set of records.

-Steve

Re: optimization advice?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Steve,

Maybe you can tell us about:
- your hardware
- query rate
- document cache and query cache settings
- your current response times
- any pain points, any slow query patterns
- etc.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Steve Conover <sc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, March 27, 2009 1:50:48 AM
> Subject: optimization advice?
> 
> Hi,
> 
> I've looked over the public Solr perf docs and done some searching on
> this mailing list.  Still, I'd like to seek some advice based on my
> specific situation:
> 
> - 2-3 million documents / 5GB index
> - each document has 40+ indexed fields, and many multivalue fields
> - only primary keys are "stored"
> - very low write frequency
> - queries can be sorted by any combination of fields, and are always
> sorted by at least one field
> - query criteria vary from very simple to very complex
>   (the point about queries being that they're not very amenable to being cached)
> 
> So far I've set my mergeFactor very low.  I haven't paid much
> attention to caching beyond basic query result caching; I don't
> think many of the cache features apply well to my problem.
> Increasing the amount of RAM available to Java (by 1GB) has had no
> effect that I can detect.
> 
> Ideally I'd like to get response times down to near-instantaneous
> (< 50ms), which is where they were when the index was ~1 million
> documents.  I'd love to hear suggestions; in particular, are there
> obvious optimization options I've missed?
> 
> Regards,
> Steve