Posted to solr-user@lucene.apache.org by Patrick Henry <pa...@gmail.com> on 2014/11/14 04:24:34 UTC

Handling growth

Hello everyone,

I am working with a Solr collection that is several terabytes in size,
spread over several hundred million documents.  Each document is very
rich, and over the past few years we have consistently quadrupled the
size of our collection annually.  Unfortunately, this sits on a single
node with only a few hundred megabytes of memory, so our performance is
less than ideal.

I am looking into implementing a SolrCloud cluster.  A few books (e.g.
Solr in Action), various internet blogs, and the reference guide all
advise building a cluster with room to grow.  I can probably provision
enough hardware for a year's worth of growth from today, but I would
like to have a plan beyond that.  Shard splitting seems pretty
straightforward.  We are continuously adding documents and never change
existing ones.  Based on that, one individual recommended that I
implement custom hashing and route the latest documents to the shard
with the fewest documents; when that shard fills up, add a new shard,
index on the new shard, rinse and repeat.
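
Roughly, I understand that recommendation to look something like the
sketch below (the host, collection and shard names are placeholders,
and picking the emptiest shard is left as a stub):

    # Sketch only: manual routing with Solr's "implicit" router, plain stdlib.
    # Host, collection and shard names are hypothetical.
    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr"

    def collections_api(params):
        """Call the Collections API and return the parsed JSON response."""
        url = SOLR + "/admin/collections?" + urllib.parse.urlencode(dict(params, wt="json"))
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # Create a collection whose shards are chosen by the client, not by hashing.
    collections_api({"action": "CREATE", "name": "docs", "router.name": "implicit",
                     "shards": "shard1,shard2", "maxShardsPerNode": "4",
                     "replicationFactor": "1"})

    # Route the newest documents to a named shard via the _route_ parameter.
    def index_to_shard(shard, docs):
        url = SOLR + "/docs/update?_route_=" + shard + "&commit=true&wt=json"
        req = urllib.request.Request(url, data=json.dumps(docs).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    index_to_shard("shard2", [{"id": "doc-1", "title_t": "newest document"}])

    # When the current shard fills up, add another and start indexing into it.
    collections_api({"action": "CREATESHARD", "collection": "docs", "shard": "shard3"})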

That last approach makes sense.  However, my concerns with it are that
I lose distributed indexing, and that it adds implementation and
maintainability overhead.  My question for the community is: what are
your thoughts on this, and do you have any suggestions and/or
recommendations on planning for future growth?

Look forward to your responses,
Patrick

Re: Handling growth

Posted by Erick Erickson <er...@gmail.com>.
Oversharding is another option that punts the ball further down the
road, but 5 years from now somebody _else_ will have to deal with it
;)...

You can host multiple shards on a single Solr. So say you think you'll
need 20 shards in 5 years (or whatever). Start with 20 shards on your
single machine in a single Solr. Then when you start getting pressure,
just move 10 of those shards to a new machine. No splitting involved.

A slick way to "move the shard" is to create a new replica of that
shard on a new box (see the Collections API ADDREPLICA command). Wait
and it'll automagically synchronize, then delete the original replica
(Collections API again, DELETEREPLICA). Presto, you've moved the shard.
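
In sketch form, with made-up hosts and names (and in real life you'd
wait for the new replica to go active before deleting the old one):

    # Sketch: "move" a shard by adding a replica on the new box, then deleting
    # the original, via the Collections API. Hosts and names are hypothetical.
    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://oldbox:8983/solr"

    def collections_api(params):
        url = SOLR + "/admin/collections?" + urllib.parse.urlencode(dict(params, wt="json"))
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # 1. Put a copy of shard1 on the new machine.
    collections_api({"action": "ADDREPLICA", "collection": "docs",
                     "shard": "shard1", "node": "newbox:8983_solr"})

    # 2. Poll CLUSTERSTATUS (or watch the admin UI) until the new replica
    #    reports itself "active"; omitted here.

    # 3. Drop the old copy; "core_node1" is whatever core-node name
    #    CLUSTERSTATUS shows for it.
    collections_api({"action": "DELETEREPLICA", "collection": "docs",
                     "shard": "shard1", "replica": "core_node1"})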

Be aware that if you take control of the sharding, you need to watch
out for your usage patterns. For illustration, say you have news data.
If you split the data into one shard per day, an overwhelming
percentage of your queries will hit the shards hosting the last few
days' data, while the older shards sit there doing nothing. This may
be a fine thing in your problem space; there's nothing intrinsically
wrong with it. It's just something to be aware of.

It all depends on your problem space, as you can see from the varied
approaches...

Best,
Erick


Re: Handling growth

Posted by Michael Della Bitta <mi...@appinions.com>.
The collections we index under this multi-collection alias do not use
real-time get, no. We have other collections behind single-collection
aliases where get calls seem to work, but I'm not clear whether the
calls are real time. It seems like it would be easy for you to test,
but just be aware that there are multiple things you'd have to prove:

1. Whether get calls are real-time
2. Whether they work against multi-collection aliases as opposed to 
single collection aliases.
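
A quick sketch of such a test (names are made up; /get also assumes the
update log is enabled in solrconfig.xml):

    # Sketch: index a doc with no commit, then try to fetch it back through
    # the real-time get handler, via the collection and via an alias.
    # Host, collection and alias names are hypothetical.
    import json
    import urllib.request

    SOLR = "http://localhost:8983/solr"

    def add_uncommitted(collection, doc):
        """Send one document without committing, so only real-time get can see it."""
        req = urllib.request.Request(
            SOLR + "/" + collection + "/update?wt=json",
            data=json.dumps([doc]).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

    def rtg(target, doc_id):
        """Real-time get through a collection or alias name; returns the doc or None."""
        with urllib.request.urlopen(SOLR + "/" + target + "/get?id=" + doc_id + "&wt=json") as resp:
            return json.load(resp).get("doc")

    add_uncommitted("coll_2014_47", {"id": "rtg-test-1"})
    print("via collection:", rtg("coll_2014_47", "rtg-test-1"))   # point 1
    print("via read alias:", rtg("read_alias", "rtg-test-1"))     # point 2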

Also be aware that there were some issues with alias visibility and
SolrJ clients prior to ~4.5 or so, and I believe there were early issues
with writing to aliases prior to then as well. I'd suggest using a
relatively modern release.

Michael



Re: Handling growth

Posted by Patrick Henry <pa...@gmail.com>.
Michael,

Interesting, I'm still unfamiliar with the limitations (if any) of aliasing.
Does your architecture utilize real-time get?

Re: Handling growth

Posted by Michael Della Bitta <mi...@appinions.com>.
We're achieving some success by treating aliases as collections and 
collections as shards.

More specifically, there's a read alias that spans all the collections, 
and a write alias that points at the 'latest' collection. Every week, I 
create a new collection, add it to the read alias, and point the write 
alias at it.
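
The weekly rollover is roughly this sketch (collection, alias and
config names are made up; re-issuing CREATEALIAS with the same name
just repoints it):

    # Sketch of the weekly rollover; names are hypothetical.
    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr"

    def collections_api(params):
        url = SOLR + "/admin/collections?" + urllib.parse.urlencode(dict(params, wt="json"))
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def weekly_rollover(existing, new_name):
        # 1. Create this week's collection.
        collections_api({"action": "CREATE", "name": new_name, "numShards": "1",
                         "collection.configName": "myconf"})
        # 2. Repoint the read alias at all collections, old and new.
        collections_api({"action": "CREATEALIAS", "name": "read_alias",
                         "collections": ",".join(existing + [new_name])})
        # 3. Repoint the write alias at the newest collection only.
        collections_api({"action": "CREATEALIAS", "name": "write_alias",
                         "collections": new_name})

    weekly_rollover(["docs_2014_46"], "docs_2014_47")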

Michael



Re: Handling growth

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2014-11-20 at 01:42 +0100, Patrick Henry wrote:
> Good eye, that should have been gigabytes.  When adding to the new shard,
> is the shard already part of the collection?  What mechanism have you
> found useful in accomplishing this (i.e. routing)?

Currently (and for the foreseeable future), the number of users of the
system is very limited, so a few minutes of downtime is acceptable.

On each search machine (currently 1, but 2 more are being added in
January), 25 single-shard Solr instances are running, some of them
empty. When we have produced a new shard, we shut down an instance
holding an empty shard, move the new shard in, and start it up.

Not very elegant, but simple to maintain.
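
In script form the swap is roughly this (paths and service names are
made up and depend entirely on how the Solrs are installed):

    # Sketch of the swap; paths and service names are hypothetical, and a real
    # script would also check exit codes and that the Solr came back healthy.
    import shutil
    import subprocess

    EMPTY_INSTANCE = "solr07"                      # a Solr currently holding an empty shard
    INDEX_DIR = "/data/solr07/collection1/data/index"
    NEW_SHARD = "/staging/shard_2014_11_20/index"  # the freshly built, optimized shard

    subprocess.run(["systemctl", "stop", EMPTY_INSTANCE], check=True)   # stop the empty Solr
    shutil.rmtree(INDEX_DIR)                                            # drop the empty index
    shutil.copytree(NEW_SHARD, INDEX_DIR)                               # move the new shard in
    subprocess.run(["systemctl", "start", EMPTY_INSTANCE], check=True)  # start it back up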

- Toke Eskildsen, State and University Library



RE: Handling growth

Posted by Patrick Henry <pa...@gmail.com>.
Good eye, that should have been gigabytes.  When adding to the new shard,
is the shard already part of the collection?  What mechanism have you
found useful in accomplishing this (i.e. routing)?

RE: Handling growth

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Patrick Henry [patricktheawesomegeek@gmail.com] wrote:

> I am working with a Solr collection that is several terabytes in size,
> spread over several hundred million documents.  Each document is very
> rich, and over the past few years we have consistently quadrupled the
> size of our collection annually.  Unfortunately, this sits on a single
> node with only a few hundred megabytes of memory, so our performance is
> less than ideal.

I assume you mean gigabytes of memory. If you have not already done so, switching to SSDs for storage should buy you some more time.

> [Going for SolrCloud]  We are continuously adding documents and never
> change existing ones.  Based on that, one individual recommended that I
> implement custom hashing and route the latest documents to the shard
> with the fewest documents; when that shard fills up, add a new shard,
> index on the new shard, rinse and repeat.

We have quite a similar setup, where we produce a never-changing shard once every 8 days and add it to our cloud. One could also combine this setup with a single live shard, to keep the full index constantly up to date. The memory overhead of running an immutable shard is smaller than that of a mutable one, and it is easier to fine-tune. It also allows you to optimize the index down to a single segment, which requires a bit less processing power and saves memory when faceting. There's a description of our setup at http://sbdevel.wordpress.com/net-archive-search/
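
The optimize itself is just a call to the update handler, roughly like
this (host and collection name made up):

    # Sketch: merge a finished, never-changing shard down to a single segment.
    import urllib.request

    url = ("http://localhost:8983/solr/shard_2014_11_20/update"
           "?optimize=true&maxSegments=1&waitSearcher=true")
    print(urllib.request.urlopen(url).read().decode("utf-8"))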

From an administrative point of view, we like having complete control over each shard. We keep track of what goes into it, and in case of schema or analysis chain changes we can re-build each shard one at a time and deploy them continuously, instead of having to re-build everything in one go on a parallel setup. Of course, fundamental changes to the schema would require a complete re-build before deploy, so we hope to avoid that.

- Toke Eskildsen