Posted to users@solr.apache.org by Stephen Lewis Bianamara <st...@gmail.com> on 2021/06/22 17:45:32 UTC
Is 2x drive size necessary for recovery on SOLR 8?
Hi SOLR Community,
I've been investigating SOLR 8 and recovery behavior. From what I can tell,
older SOLR versions (6 and before at least) required a solrdata drive with
at least 2x the space of the index; so an index which could be up to 100GB
on one shard would require a disk with at least 200GB of storage space,
since recovery would copy a brand new index over and then switch after the
fact.
However, SOLR 8 looks to have a different behavior wherein the index is
perhaps updated in place, and thus a 100GB / shard index might only need a
bit more headroom (call it 110GB say). Is this always the case with
recovery on SOLR 8+? Or are there some situations where you might need
200GB for the recovery?
Thanks in advance!
Stephen
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Thanks Shawn!
On Wed, Jun 23, 2021, 5:27 AM Shawn Heisey <ap...@elyograg.org> wrote:
> On 6/22/2021 11:29 PM, Stephen Lewis Bianamara wrote:
> > This was a ton of great info I needed, and more than I initially knew to
> > ask :) Your first response seems to imply an answer to my original
> > question, but I wanted to follow up to be as sure as I can. In the recovery
> > scenario, are there situations where a complete index will copy over next
> > to the original index, thus requiring 2x the disk space? Or is that now
> > outdated? I could imagine for example the replacement is now done on each
> > segment at a smaller scale or something along those lines, so that recovery
> > requirements would be expected to be on par with merge requirements, or
> > perhaps there is some "bad enough" scenario where a full side-by-side copy
> > is made during recovery. Can you comment on that?
>
>
> SolrCloud recovery uses the replication handler, configuring it on the
> fly. This works almost exactly like rsync, and I am pretty sure it
> mimics the -W option on rsync, copying whole files if the filename
> already exists but has a different size.
>
> So if the index it's copying is substantially similar to the one you've
> already got -- at the file level -- the recovery will be fast and not
> take much space. But if it's very different (again, at the file level,
> not at the Lucene level) it very well might copy the entire index over,
> and then delete the files that make up the existing index.
>
> Thanks,
> Shawn
>
>
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2021 11:29 PM, Stephen Lewis Bianamara wrote:
> This was a ton of great info I needed, and more than I initially knew to
> ask :) Your first response seems to imply an answer to my original
> question, but I wanted to follow up to be as sure as I can. In the recovery
> scenario, are there situations where a complete index will copy over next
> to the original index, thus requiring 2x the disk space? Or is that now
> outdated? I could imagine for example the replacement is now done on each
> segment at a smaller scale or something along those lines, so that recovery
> requirements would be expected to be on par with merge requirements, or perhaps
> there is some "bad enough" scenario where a full side-by-side copy is made
> during recovery. Can you comment on that?
SolrCloud recovery uses the replication handler, configuring it on the
fly. This works almost exactly like rsync, and I am pretty sure it
mimics the -W option on rsync, copying whole files if the filename
already exists but has a different size.
So if the index it's copying is substantially similar to the one you've
already got -- at the file level -- the recovery will be fast and not
take much space. But if it's very different (again, at the file level,
not at the Lucene level) it very well might copy the entire index over,
and then delete the files that make up the existing index.
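
To make the file-level comparison concrete, here is a rough sketch in
Python. It is not Solr's replication code; it only models the rule
described above (fetch a whole file when its name is missing locally or
its size differs) and adds up how much extra disk the fetch would need
before the old files are deleted. The function name and the sample file
names and sizes are made up for illustration, and the sketch assumes old
files are removed only after all new files have arrived.

```python
def extra_space_for_recovery(local_files, leader_files):
    """Estimate what a recovering replica would have to fetch.

    Both arguments map file name -> size in bytes. A file is fetched
    whole when it is missing locally or its size differs. This rule is
    an assumption modelled on the description above, not Solr's source.
    Returns (total bytes to fetch, sorted list of file names to fetch).
    """
    to_fetch = {
        name: size
        for name, size in leader_files.items()
        if local_files.get(name) != size
    }
    return sum(to_fetch.values()), sorted(to_fetch)


GB = 1024 ** 3

# Hypothetical replica that is only one small segment behind the leader.
local = {"_a.cfs": 40 * GB, "_b.cfs": 40 * GB}
leader = {"_a.cfs": 40 * GB, "_b.cfs": 40 * GB, "_c.cfs": 2 * GB}
similar_bytes, similar_names = extra_space_for_recovery(local, leader)

# Hypothetical leader whose segments were merged into different files.
leader_merged = {"_x.cfs": 82 * GB}
full_bytes, full_names = extra_space_for_recovery(local, leader_merged)

print(similar_bytes / GB, similar_names)
print(full_bytes / GB, full_names)
```

In the first case only the new 2GB file is fetched. In the second case
no file names match, so the whole 82GB index lands next to the existing
80GB before the old files go away, which is the side-by-side situation
asked about.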
Thanks,
Shawn
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Thanks Shawn! I will definitely be interested to explore this space
cautiously in that case.
This was a ton of great info I needed, and more than I initially knew to
ask :) Your first response seems to imply an answer to my original
question, but I wanted to follow up to be as sure as I can. In the recovery
scenario, are there situations where a complete index will copy over next
to the original index, thus requiring 2x the disk space? Or is that now
outdated? I could imagine for example the replacement is now done on each
segment at a smaller scale or something along those lines, so that recovery
requirements would be expected to be on par with merge requirements, or perhaps
there is some "bad enough" scenario where a full side-by-side copy is made
during recovery. Can you comment on that?
Thanks!
On Tue, Jun 22, 2021 at 5:33 PM Shawn Heisey <ap...@elyograg.org> wrote:
> On 6/22/2021 5:02 PM, Stephen Lewis Bianamara wrote:
> > The merging considerations are certainly interesting and nuanced. Has there
> > been any investigation into a "minimum number of segments" setting which
> > could force a minimum number of segments (say 5 or 10) so that no one
> > segment operation could involve the entire index?
>
>
> Since Solr 7.5, the merge defaults are a lot better. I think the "no
> segment larger than 5GB" setting even applies to optimize, but I'm not
> completely positive. Erick Erickson is familiar with the nitty gritty
> details on that.
>
> If you never run an optimize, you should be fine. And since you go with
> the A/B option for reindexing, you might not ever run into the 3x
> requirement. But if you're dealing with bare metal servers, disks are
> cheap, so it's a good idea to have LOTS of free space. If you're going
> AWS or some other cloud solution, you'll probably want to be more aware
> of realistic requirements.
>
> Thanks,
> Shawn
>
>
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2021 5:02 PM, Stephen Lewis Bianamara wrote:
> The merging considerations are certainly interesting and nuanced. Has there
> been any investigation into a "minimum number of segments" setting which
> could force a minimum number of segments (say 5 or 10) so that no one
> segment operation could involve the entire index?
Since Solr 7.5, the merge defaults are a lot better. I think the "no
segment larger than 5GB" setting even applies to optimize, but I'm not
completely positive. Erick Erickson is familiar with the nitty gritty
details on that.
If you never run an optimize, you should be fine. And since you go with
the A/B option for reindexing, you might not ever run into the 3x
requirement. But if you're dealing with bare metal servers, disks are
cheap, so it's a good idea to have LOTS of free space. If you're going
AWS or some other cloud solution, you'll probably want to be more aware
of realistic requirements.
Thanks,
Shawn
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Stephen Lewis Bianamara <st...@gmail.com>.
In my experience, disks are not always cheap :) Running in AWS I have found
several contexts which require local storage for cost effective performance
of SOLR, but that does require scaling the instance as a whole to increase
capacity (hence the particular motivation for this question).
Generally the use cases I am considering don't re-index the whole index
in place; rather, I have used an A/B strategy to stand up a parallel
cluster, index to that, and then cut over using some other method (aliases
or DNS draining depending on the context). So as far as a re-indexing
operation is concerned, this seems controllable by favoring certain
methodologies.
The merging considerations are certainly interesting and nuanced. Has there
been any investigation into a "minimum number of segments" setting which
could force a minimum number of segments (say 5 or 10) so that no one
segment operation could involve the entire index?
On Tue, Jun 22, 2021 at 1:37 PM Dave <ha...@gmail.com> wrote:
> The 3x index size has been around for a long time. Usually it’s for a full
> optimize. When this happens the original index stays in place (1x) while it
> is being reconstructed (2x) and then merged into the replacement (3x). Once
> it’s all done you are back to less than 1x, but you need the space or the
> optimize will fail. The new rule is that you never optimize, but you will
> always want that extra space just in case, and disks are cheap.
>
> > On Jun 22, 2021, at 4:24 PM, Stephen Lewis Bianamara <st...@gmail.com> wrote:
> >
> > Thanks Shawn! That is really helpful to know. Can you say more about what
> > circumstance might cause an index to triple in size? Is it connected with
> > bulk operations like "optimize" which can be avoided, or is it inherent to
> > situations like merging segments? And if so, can this requirement be
> > adjusted by an appropriate setting of maxMergedSegmentMB or something
> > similar?
> >
> > I guess I'm wondering if there is any info or references I could look at to
> > determine what the limit should be for a given case even if the general
> > guidance is that 3x is needed.
> >
> > Thanks!
> >
> >> On Tue, Jun 22, 2021 at 1:05 PM Shawn Heisey <ap...@elyograg.org> wrote:
> >>
> >>> On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
> >>> However, SOLR 8 looks to have a different behavior wherein the index is
> >>> perhaps updated in place, and thus a 100GB / shard index might only need a
> >>> bit more headroom (call it 110GB say). Is this always the case with
> >>> recovery on SOLR 8+? Or are there some situations where you might need
> >>> 200GB for the recovery?
> >>
> >>
> >> The general recommendation, for normal operation and not just recovery,
> >> is to ensure you have enough space available so that the index can
> >> triple in size temporarily. The 3x requirement only comes about with a
> >> very specific set of circumstances involving reindexing in-place on an
> >> existing index -- for MOST usage, you want enough space for the index to
> >> double in size temporarily. But because we cannot be sure how you are
> >> going to use Solr, we always err on the side of caution and tell people
> >> the index could triple in size before it goes back down.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Dave <ha...@gmail.com>.
The 3x index size has been around for a long time. Usually it’s for a full
optimize. When this happens the original index stays in place (1x) while it
is being reconstructed (2x) and then merged into the replacement (3x). Once
it’s all done you are back to less than 1x, but you need the space or the
optimize will fail. The new rule is that you never optimize, but you will
always want that extra space just in case, and disks are cheap.
> On Jun 22, 2021, at 4:24 PM, Stephen Lewis Bianamara <st...@gmail.com> wrote:
>
> Thanks Shawn! That is really helpful to know. Can you say more about what
> circumstance might cause an index to triple in size? Is it connected with
> bulk operations like "optimize" which can be avoided, or is it inherent to
> situations like merging segments? And if so, can this requirement be
> adjusted by an appropriate setting of maxMergedSegmentMB or something
> similar?
>
> I guess I'm wondering if there is any info or references I could look at to
> determine what the limit should be for a given case even if the general
> guidance is that 3x is needed.
>
> Thanks!
>
>> On Tue, Jun 22, 2021 at 1:05 PM Shawn Heisey <ap...@elyograg.org> wrote:
>>
>>> On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
>>> However, SOLR 8 looks to have a different behavior wherein the index is
>>> perhaps updated in place, and thus a 100GB / shard index might only need a
>>> bit more headroom (call it 110GB say). Is this always the case with
>>> recovery on SOLR 8+? Or are there some situations where you might need
>>> 200GB for the recovery?
>>
>>
>> The general recommendation, for normal operation and not just recovery,
>> is to ensure you have enough space available so that the index can
>> triple in size temporarily. The 3x requirement only comes about with a
>> very specific set of circumstances involving reindexing in-place on an
>> existing index -- for MOST usage, you want enough space for the index to
>> double in size temporarily. But because we cannot be sure how you are
>> going to use Solr, we always err on the side of caution and tell people
>> the index could triple in size before it goes back down.
>>
>> Thanks,
>> Shawn
>>
>>
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Shawn Heisey <el...@elyograg.org>.
On 6/22/2021 2:24 PM, Stephen Lewis Bianamara wrote:
> Thanks Shawn! That is really helpful to know. Can you say more about what
> circumstance might cause an index to triple in size? Is it connected with
> bulk operations like "optimize" which can be avoided, or is it inherent to
> situations like merging segments? And if so, can this requirement be
> adjusted by an appropriate setting of maxMergedSegmentMB or something
> similar?
Any merge, whether it's optimize (forcemerge) or normal merging, can
involve the entire index.
Let's say you have an index that has a number of very large segments.
Either you optimized it at some point or it's just been running for a
long time and has reached that state naturally.
You begin a reindexing process. This process hits almost all the
documents in the index, but a few are left untouched.
Those few untouched documents mean that the segments containing them
must stick around, even though they consist almost entirely of
deleted documents.
At this point, without even doing an optimize, the index has doubled in
size -- the original segments are still there because they contain a few
not-deleted docs, and all the new data is in new segments. In practice,
some of those older segments probably got merged and shrank, but we're
discussing worst-case scenarios here, so pretend for a moment that they
have not been merged away.
Then either you do some more indexing that results in a super-large
merge, or run an optimize. At this point, with the index already
doubled in size, that further merging could add the whole index again
before it deletes the older segments and you're back to 1x.
Realistically, you probably need enough space for the index to reach
2.5x when doing in-place reindexing, but if the planets all align just
right, you could need 3x. If you never reindex the whole thing in place
(without either creating a new index or deleting the existing one) then
you would only need 2x. But because sometimes the planets do align just
right, I tell people to have 3x just in case.
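
The arithmetic behind that worst case can be sketched in a few lines of
Python. This is only a toy model of the scenario described above, not a
description of how Lucene schedules merges. The two fractions are
illustrative knobs I made up, not Solr settings, and mapping the 2.5x
figure to "half of the old segments already merged away" is my own
assumption.

```python
def reindex_in_place_stages(index_gb, old_segments_retained=1.0,
                            merge_rewrites=1.0):
    """Return (stage, GB on disk) pairs for an in-place reindex.

    old_segments_retained: fraction of the original segments still on
        disk after reindexing (1.0 means none were merged away).
    merge_rewrites: fraction of the live index rewritten by the final
        large merge or optimize (1.0 means all of it).
    """
    start = index_gb
    # New segments hold the reindexed data; old ones linger.
    after_reindex = index_gb + index_gb * old_segments_retained
    # The merge writes its output before the inputs are deleted.
    during_merge = after_reindex + index_gb * merge_rewrites
    after_merge = index_gb
    return [
        ("start", start),
        ("after reindex", after_reindex),
        ("during merge", during_merge),
        ("after merge", after_merge),
    ]


def peak_gb(stages):
    """Largest on-disk size across all stages."""
    return max(gb for _stage, gb in stages)


worst = reindex_in_place_stages(100)
typical = reindex_in_place_stages(100, old_segments_retained=0.5)

for stage, gb in worst:
    print(stage, gb)
print("worst peak:", peak_gb(worst))
print("typical peak:", peak_gb(typical))
```

For a 100GB index the worst case peaks at 300GB, and the half-retained
case peaks at 250GB, which lines up with the 3x and 2.5x figures.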
Thanks
Shawn
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Stephen Lewis Bianamara <st...@gmail.com>.
Thanks Shawn! That is really helpful to know. Can you say more about what
circumstance might cause an index to triple in size? Is it connected with
bulk operations like "optimize" which can be avoided, or is it inherent to
situations like merging segments? And if so, can this requirement be
adjusted by an appropriate setting of maxMergedSegmentMB or something
similar?
I guess I'm wondering if there is any info or references I could look at to
determine what the limit should be for a given case even if the general
guidance is that 3x is needed.
Thanks!
On Tue, Jun 22, 2021 at 1:05 PM Shawn Heisey <ap...@elyograg.org> wrote:
> On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
> > However, SOLR 8 looks to have a different behavior wherein the index is
> > perhaps updated in place, and thus a 100GB / shard index might only need a
> > bit more headroom (call it 110GB say). Is this always the case with
> > recovery on SOLR 8+? Or are there some situations where you might need
> > 200GB for the recovery?
>
>
> The general recommendation, for normal operation and not just recovery,
> is to ensure you have enough space available so that the index can
> triple in size temporarily. The 3x requirement only comes about with a
> very specific set of circumstances involving reindexing in-place on an
> existing index -- for MOST usage, you want enough space for the index to
> double in size temporarily. But because we cannot be sure how you are
> going to use Solr, we always err on the side of caution and tell people
> the index could triple in size before it goes back down.
>
> Thanks,
> Shawn
>
>
Re: Is 2x drive size necessary for recovery on SOLR 8?
Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
> However, SOLR 8 looks to have a different behavior wherein the index is
> perhaps updated in place, and thus a 100GB / shard index might only need a
> bit more headroom (call it 110GB say). Is this always the case with
> recovery on SOLR 8+? Or are there some situations where you might need
> 200GB for the recovery?
The general recommendation, for normal operation and not just recovery,
is to ensure you have enough space available so that the index can
triple in size temporarily. The 3x requirement only comes about with a
very specific set of circumstances involving reindexing in-place on an
existing index -- for MOST usage, you want enough space for the index to
double in size temporarily. But because we cannot be sure how you are
going to use Solr, we always err on the side of caution and tell people
the index could triple in size before it goes back down.
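
If it helps, the 2x and 3x guidance can be turned into a quick headroom
check. The following Python sketch uses only the standard library; the
function names are made up, and pointing it at a core's data directory
is an assumption about your layout rather than anything Solr provides.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def has_headroom(index_bytes, free_bytes, multiple):
    """True if the index could grow to `multiple` times its size."""
    extra_needed = index_bytes * (multiple - 1)
    return free_bytes >= extra_needed


def check_index_dir(path):
    """Report whether 2x and 3x growth would fit for one index directory."""
    index_bytes = dir_size_bytes(path)
    free_bytes = shutil.disk_usage(path).free
    return {m: has_headroom(index_bytes, free_bytes, m) for m in (2, 3)}


GB = 1024 ** 3

# Numbers from the question: a 100GB index on a 110GB disk leaves
# roughly 10GB free, which is not enough for the index to double.
print(has_headroom(100 * GB, 10 * GB, 2))

# A 100GB index with 200GB free can triple temporarily.
print(has_headroom(100 * GB, 200 * GB, 3))
```

Calling check_index_dir() on the directory that holds a core's index
returns a small dict keyed by 2 and 3, so it is easy to wire into
whatever monitoring you already have.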
Thanks,
Shawn