Posted to solr-user@lucene.apache.org by James Brady <ja...@gmail.com> on 2008/02/28 04:08:02 UTC

Strategy for handling large (and growing) index: horizontal partitioning?

Hi all,
Our current setup is a master and slave pair on a single machine,  
with an index size of ~50GB.

Query and update times are still respectable, but commits are taking
~20% of the time on the master, while our daily index optimise can
take up to 4 hours...
Here's the most relevant part of solrconfig.xml:
     <useCompoundFile>true</useCompoundFile>
     <mergeFactor>10</mergeFactor>
     <maxBufferedDocs>1000</maxBufferedDocs>
     <maxMergeDocs>10000</maxMergeDocs>
     <maxFieldLength>10000</maxFieldLength>

I've given both master and slave 2.5GB of RAM.
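(That's JVM heap, i.e. starting each instance with something like
java -Xmx2560m; whatever RAM is left over goes to the OS disk cache,
which matters a lot for an index this size.)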

Does an index optimise read and re-write the whole thing? If so,  
taking about 4 hours is pretty good! However, the documentation here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states "Optimizations can take nearly ten minutes to run..." which  
leads me to think that we've grossly misconfigured something...

Firstly, we would obviously love any way to reduce this optimise time
- I have yet to experiment extensively with the settings above or with
optimise frequency, but some general guidance would be great.
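
For concreteness, the kind of experiment I have in mind is along
these lines -- the values are illustrative guesses, not tested
settings:

     <useCompoundFile>false</useCompoundFile>  <!-- skip the extra copy into compound files -->
     <mergeFactor>25</mergeFactor>             <!-- merge less often while indexing -->
     <maxBufferedDocs>10000</maxBufferedDocs>  <!-- flush to disc less often -->
     <maxMergeDocs>2147483647</maxMergeDocs>   <!-- Lucene's default; our current 10000 cap may be leaving the index very fragmented -->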

Secondly, this index size is increasing monotonically over time and as
we acquire new users. We need to take action to ensure we can scale  
in the future. The approach we're favouring at the moment is  
horizontal partitioning of indices by user id as our data suits this  
scheme well. A given index would hold the indexed data for n users,  
where n would probably be between 1 and 100 users, and we will have  
multiple indices per search server.

Running a server per index is impractical, especially for a small n,
so is a single Solr instance capable of managing multiple searchers and
writers in this way? Following on from that, does anyone know of  
limiting factors in Solr or Lucene that would influence our decision  
on the value of n - the number of users per index?
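
To make the question concrete, the sort of layout I'm imagining is
the multi-core support I've seen discussed for Solr 1.3 -- one core
per partition inside a single webapp. The syntax below is my
assumption from what I've read, so treat it as a sketch:

     <solr persistent="true">
       <cores adminPath="/admin/cores">
         <core name="users_0000" instanceDir="users_0000" />
         <core name="users_0001" instanceDir="users_0001" />
       </cores>
     </solr>

Each core would then get its own search and update URLs
(/solr/users_0000/select, /solr/users_0000/update), and we'd route a
user's requests to the core holding their partition.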

Thanks!
James




Re: Strategy for handling large (and growing) index: horizontal partitioning?

Posted by Walter Underwood <wu...@netflix.com>.
We should probably work out a rule of thumb, like "10-20 minutes per
gigabyte". I'll send a separate message to collect that info.

wunder



Re: Strategy for handling large (and growing) index: horizontal partitioning?

Posted by James Brady <ja...@gmail.com>.
Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is  
definitely the bottleneck, you're right -- iostat was showing 100%  
utilisation for the 5 hours it took to optimise yesterday...

The master and slave are on the same disk, and it's definitely on my  
list to fix that, but the searcher is so lightly loaded compared to  
the indexer that I don't think it will win us too much.

As there has been another optimise time question on the list today,
could I request that the "10 minute" claim is taken off the
CollectionDistribution wiki page? It's extremely misleading for
newcomers who don't necessarily realise an optimise entails reading
and writing the whole index, and that optimise time is going to be at
least O(n) in index size.

James




Re: Strategy for handling large (and growing) index: horizontal partitioning?

Posted by Walter Underwood <wu...@netflix.com>.
Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck.

You could also look at disc access rates in a monitoring tool.

Is there read contention between the master and slave for the same disc?
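
Concretely -- commands assumed, adjust paths to taste -- something
like "time cp -r index /other/disc/index.copy" gives you the floor,
and watching "iostat -x 5" during an optimise will show whether the
disc is saturated.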

wunder
