Posted to users@solr.apache.org by Stephen Lewis Bianamara <st...@gmail.com> on 2021/06/22 17:45:32 UTC

Is 2x drive size necessary for recovery on SOLR 8?

Hi SOLR Community,

I've been investigating SOLR 8 and recovery behavior. From what I can tell,
older SOLR versions (6 and before at least) required a solrdata drive with
at least 2x the space of the index; so an index which could be up to 100GB
on one shard would require a disk with at least 200GB of storage space,
since recovery would copy a brand new index over and then switch after the
fact.

However, SOLR 8 looks to have a different behavior wherein the index is
perhaps updated in place, and thus a 100GB / shard index might only need a
bit more headroom (call it 110GB say). Is this always the case with
recovery on SOLR 8+? Or are there some situations where you might need
200GB for the recovery?

Thanks in advance!
Stephen

Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Thanks Shawn!


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2021 11:29 PM, Stephen Lewis Bianamara wrote:
> This was a ton of great info I needed, and more than I initially knew to
> ask :) Your first response seems to imply an answer to my original
> question, but I wanted to follow up to be as sure as I can. In the recovery
> scenario, are there situations where a complete index will copy over next
> to the original index, thus requiring 2x the disk space? Or is that now
> outdated? I could imagine for example the replacement is now done on each
> segment at a smaller scale or something along those lines and so recovery
> requirements would expect to be on par with merge requirements, or perhaps
> there is some "bad enough" scenario where a full side-by-side copy is made
> during recovery. Can you comment on that?


SolrCloud recovery uses the replication handler, configuring it on the 
fly.  This works almost exactly like rsync, and I am pretty sure it 
mimics the -W option on rsync, copying whole files if the filename 
already exists but has a different size.

So if the index it's copying is substantially similar to the one you've 
already got -- at the file level -- the recovery will be fast and not 
take much space.  But if it's very different (again, at the file level, 
not at the Lucene level) it very well might copy the entire index over, 
and then delete the files that make up the existing index.

Thanks,
Shawn
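As a rough illustration of the file-level comparison Shawn describes, the sketch below estimates how much extra space a recovery might need before the old files can be deleted. It assumes a file is re-copied in full whenever it is missing on the replica or exists under the same name with a different size; the real replication handler may compare more than that (Shawn himself is only "pretty sure" of the details). The function names, file names, and sizes are hypothetical.

```python
from typing import Dict


def files_to_fetch(leader: Dict[str, int], replica: Dict[str, int]) -> Dict[str, int]:
    """Leader files that would be copied whole under the name+size assumption.

    A file is fetched if the replica lacks it, or has a file with the same
    name but a different size.
    """
    return {name: size for name, size in leader.items() if replica.get(name) != size}


def extra_space_needed(leader: Dict[str, int], replica: Dict[str, int]) -> int:
    """Space written during recovery before any old replica files are removed."""
    return sum(files_to_fetch(leader, replica).values())


# Hypothetical index file listings: file name -> size in GB.
leader = {"_a.cfs": 40, "_b.cfs": 35, "_c.cfs": 25}
similar_replica = {"_a.cfs": 40, "_b.cfs": 35, "_z.cfs": 20}
diverged_replica = {"_x.cfs": 50, "_y.cfs": 45}

print(extra_space_needed(leader, similar_replica))   # 25: only _c.cfs is copied
print(extra_space_needed(leader, diverged_replica))  # 100: the whole index is copied
```

In the diverged case the replica briefly holds its own 95GB plus the 100GB copy, which is the "close to 2x" situation the original question asks about.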


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Thanks Shawn! I will definitely be interested to explore this space
cautiously in that case.

This was a ton of great info I needed, and more than I initially knew to
ask :) Your first response seems to imply an answer to my original
question, but I wanted to follow up to be as sure as I can. In the recovery
scenario, are there situations where a complete index will copy over next
to the original index, thus requiring 2x the disk space? Or is that now
outdated? I could imagine for example the replacement is now done on each
segment at a smaller scale or something along those lines and so recovery
requirements would expect to be on par with merge requirements, or perhaps
there is some "bad enough" scenario where a full side-by-side copy is made
during recovery. Can you comment on that?

Thanks!


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2021 5:02 PM, Stephen Lewis Bianamara wrote:
> The merging considerations are certainly interesting and nuanced. Has there
> been any investigation into a "minimum number of segments" setting which
> could force a minimum number of segments (say 5 or 10) so that no one
> segment operation could involve the entire index?


Since Solr 7.5, the merge defaults are a lot better.  I think the "no 
segment larger than 5GB" setting even applies to optimize, but I'm not 
completely positive.  Erick Erickson is familiar with the nitty gritty 
details on that.

If you never run an optimize, you should be fine.  And since you go with 
the A/B option for reindexing, you might not ever run into the 3x 
requirement.  But if you're dealing with bare metal servers, disks are 
cheap, so it's a good idea to have LOTS of free space.  If you're going 
AWS or some other cloud solution, you'll probably want to be more aware 
of realistic requirements.

Thanks,
Shawn


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Stephen Lewis Bianamara <st...@gmail.com>.
In my experience, disks are not always cheap :) Running in AWS, I have found
several contexts that require local storage for cost-effective performance
of SOLR, but that does require scaling the instance as a whole to increase
capacity (hence the particular motivation for this question).

Generally the use cases I am considering don't re-index the whole index
in place; rather, I have used an A/B strategy to stand up a parallel
cluster, index to that, and then cut over using some other method (aliases
or DNS draining depending on the context). So as far as a re-indexing
operation is concerned, this seems controllable by favoring certain
methodologies.

The merging considerations are certainly interesting and nuanced. Has there
been any investigation into a "minimum number of segments" setting which
could force a minimum number of segments (say 5 or 10) so that no one
segment operation could involve the entire index?


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Dave <ha...@gmail.com>.
The 3x index size has been around for a long time. Usually it’s for a full
optimize. When this happens, the original index stays in place (1x) while it
is being reconstructed (2x) and then merged into the replacement (3x). Once
it’s all done you are back to less than 1x, but you need the space or the
optimize will fail. The new rule is that you never optimize, but you will
always want that extra space just in case, and disks are cheap.


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Shawn Heisey <el...@elyograg.org>.
On 6/22/2021 2:24 PM, Stephen Lewis Bianamara wrote:
> Thanks Shawn! That is really helpful to know. Can you say more about what
> circumstance might cause an index to triple in size? Is it connected with
> bulk operations like "optimize" which can be avoided, or is it inherent to
> situations like merging segments? And if so, can this requirement be
> adjusted by an appropriate setting of maxMergedSegmentMB or something
> similar?


Any merge, whether it's optimize (forcemerge) or normal merging, can 
involve the entire index.

Let's say you have an index that has a number of very large segments.  
Either you optimized it at some point or it's just been running for a 
long time and has reached that state naturally.

You begin a reindexing process.  This process hits almost all the 
documents in the index, but a few are left untouched.

Those few untouched documents mean that the segments containing them 
must stick around, even though they're comprised almost entirely of 
deleted documents.

At this point, without even doing an optimize, the index has doubled in 
size -- the original segments are still there because they contain a few 
not-deleted docs, and all the new data is in new segments.  In practice, 
some of those older segments probably got merged and shrank, but we're 
discussing worst-case scenarios here, so pretend for a moment that they 
have not been merged away.

Then either you do some more indexing that results in a super-large 
merge, or run an optimize.  At this point, with the index already 
doubled in size, that further merging could add the whole index again 
before it deletes the older segments and you're back to 1x.

Realistically, you probably need enough space for the index to reach 
2.5x when doing in-place reindexing, but if the planets all align just 
right, you could need 3x.  If you never reindex the whole thing in place 
(without either creating a new index or deleting the existing one) then 
you would only need 2x.  But because sometimes the planets do align just 
right, I tell people to have 3x just in case.

Thanks
Shawn
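The worst-case arithmetic in Shawn's walkthrough can be sketched as follows. This is only a back-of-the-envelope model, not anything Solr computes: it assumes the live index counts as 1x, that an in-place reindex leaves up to another 1x of mostly-deleted old segments on disk, and that a merge touching the whole index temporarily writes another 1x. The function names and the `stale_fraction` parameter are hypothetical.

```python
def peak_multiplier(reindex_in_place: bool, full_merge: bool,
                    stale_fraction: float = 1.0) -> float:
    """Peak on-disk size as a multiple of the live index size.

    stale_fraction is the share of the old, mostly-deleted segments still on
    disk after an in-place reindex (1.0 is the worst case, where none of them
    have been merged away yet).
    """
    live = 1.0
    stale = stale_fraction if reindex_in_place else 0.0
    merge_copy = 1.0 if full_merge else 0.0  # rewritten copy of the live data
    return live + stale + merge_copy


def required_disk_gb(index_gb: float, **scenario) -> float:
    """Disk space needed to survive the given scenario for an index of index_gb."""
    return index_gb * peak_multiplier(**scenario)


# 100GB shard, matching the example in the original question.
print(required_disk_gb(100, reindex_in_place=False, full_merge=True))   # 200.0
print(required_disk_gb(100, reindex_in_place=True, full_merge=True))    # 300.0
print(required_disk_gb(100, reindex_in_place=True, full_merge=True,
                       stale_fraction=0.5))                             # 250.0
```

The last call corresponds to the "realistically 2.5x" figure, under the assumption that about half of the old segments have already been merged away when the large merge starts.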


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Thanks Shawn! That is really helpful to know. Can you say more about what
circumstance might cause an index to triple in size? Is it connected with
bulk operations like "optimize" which can be avoided, or is it inherent to
situations like merging segments? And if so, can this requirement be
adjusted by an appropriate setting of maxMergedSegmentMB or something
similar?

I guess I'm wondering if there is any info or references I could look at to
determine what the limit should be for a given case even if the general
guidance is that 3x is needed.

Thanks!


Re: Is 2x drive size necessary for recovery on SOLR 8?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
> However, SOLR 8 looks to have a different behavior wherein the index is
> perhaps updated in place, and thus a 100GB / shard index might only need a
> bit more headroom (call it 110GB say). Is this always the case with
> recovery on SOLR 8+? Or are there some situations where you might need
> 200GB for the recovery?


The general recommendation, for normal operation and not just recovery, 
is to ensure you have enough space available so that the index can 
triple in size temporarily.  The 3x requirement only comes about with a 
very specific set of circumstances involving reindexing in-place on an 
existing index -- for MOST usage, you want enough space for the index to 
double in size temporarily. But because we cannot be sure how you are 
going to use Solr, we always err on the side of caution and tell people 
the index could triple in size before it goes back down.

Thanks,
Shawn
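For anyone who wants to check an existing node against the 2x or 3x guidance above, here is a minimal sketch using only the Python standard library. It measures the size of an index directory and the free space on its volume, then reports how many times its current size the index could grow to before the disk fills. The function names are hypothetical, the index path is whatever your installation uses, and other data sharing the same volume is not accounted for.

```python
import os
import shutil


def directory_size_bytes(path: str) -> int:
    """Total size of the regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def headroom_multiplier(index_bytes: int, free_bytes: int) -> float:
    """How many times its current size the index can reach before the disk is full."""
    if index_bytes <= 0:
        raise ValueError("index size must be positive")
    return (index_bytes + free_bytes) / index_bytes


def has_headroom(index_dir: str, required: float = 3.0) -> bool:
    """True if the volume holding index_dir lets the index grow to `required` x."""
    index_bytes = directory_size_bytes(index_dir)
    free_bytes = shutil.disk_usage(index_dir).free
    return headroom_multiplier(index_bytes, free_bytes) >= required
```

Calling `has_headroom(index_dir, required=2.0)` would correspond to the "double in size" guidance for most usage, and the default of 3.0 to the cautious recommendation.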