Posted to solr-user@lucene.apache.org by Moulay Hicham <ma...@gmail.com> on 2020/10/23 16:35:37 UTC

TieredMergePolicyFactory question

Hi,

I am using Solr 8.1 in production. We have about 30%-50% deleted
documents in some old segments that were merged a year ago.

These segments are about 5GB each.

I was wondering why these segments have a high % of deleted docs and found
out that they are NOT being candidates for merging because the
default TieredMergePolicy maxMergedSegmentMB is 5G.

So I have modified the TieredMergePolicyFactory config as below to
lower the deleted-docs percentage:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <double name="maxMergedSegmentMB">12000</double>
  <double name="deletesPctAllowed">20</double>
</mergePolicyFactory>
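As a sanity check on whether a change like this is warranted, the overall deleted-doc ratio can be computed from the maxDoc and numDocs counts that Solr reports per core (a minimal sketch; the sample numbers are made up):

```python
def deleted_pct(max_doc: int, num_docs: int) -> float:
    """Percentage of the index that is deleted documents.

    maxDoc counts live + deleted docs; numDocs counts live docs only,
    as reported by Solr's core admin / Luke handlers.
    """
    if max_doc == 0:
        return 0.0
    return 100.0 * (max_doc - num_docs) / max_doc

# Example: a core with 10M total docs, of which 6.5M are live.
pct = deleted_pct(10_000_000, 6_500_000)
print(f"{pct:.1f}% deleted")  # 35.0% deleted
```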


Do you see any issues with increasing the max merged segment size to 12GB
and lowering deletesPctAllowed to 20%?

Thanks,

Moulay

Re: TieredMergePolicyFactory question

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/25/2020 11:22 PM, Moulay Hicham wrote:
> I am wondering about 3 other things:
> 
> 1 - You mentioned that I need free disk space. Just to make sure that we
> are talking about disk space here. Can RAM remain at the same size?
> My current RAM is: index size < RAM < 1.5 × index size

You must always have enough disk space available for your indexes to 
double in size.  We recommend having enough disk space for your indexes 
to *triple* in size, because there is a real-world scenario that will 
require that much disk space.
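Those multiples can be written down as a quick sizing check; a minimal sketch, assuming 2× total index size is the bare minimum and 3× the recommendation (so the free space needed beyond the index itself is 1× or 2× the index size):

```python
def required_free_disk_gb(index_size_gb: float, recommended: bool = True) -> float:
    """Free disk space needed alongside the index.

    A merge or optimize rewrites segments, so the index can temporarily
    double in size; the 3x recommendation covers the worst case where an
    optimize runs while normal merging is also rewriting segments.
    """
    factor = 3.0 if recommended else 2.0
    # The index itself occupies index_size_gb; the rest must be free.
    return index_size_gb * (factor - 1.0)

# Example: a 100 GB index.
print(required_free_disk_gb(100.0, recommended=False))  # 100.0 (bare minimum)
print(required_free_disk_gb(100.0))                     # 200.0 (recommended)
```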

> 2 - When the merge is happening, it happens on disk, and when it's
> completed, the data is synced with RAM. I am just guessing here ;-).
> I couldn't find a good explanation online about this.

If you have enough free memory, then the OS will make sure that the data 
is available in RAM.  All modern operating systems do this 
automatically.  Note that I am talking about memory that is not 
allocated to programs.  Any memory assigned to the Solr heap (or any 
other program) will NOT be available for caching index data.

If you want ideal performance in typical situations, you must have as 
much free memory as the space your indexes take up on disk.  For ideal 
performance in ALL situations, you'll want enough free memory to be able 
to hold both the original and optimized copies of your index data at the 
same time.  We have seen that good performance can be achieved without 
going to this extreme, but if you have little free memory, Solr 
performance will be terrible.

I wrote a wiki page that covers this in some detail:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

> 3 - Also I am wondering what recommendation you have for continuously
> purging deleted documents. optimize? expungeDeletes? Natural Merge?
> Here are more details about the need to purge documents.

The only way to guarantee that all deleted docs are purged is to 
optimize.   You could use the expungeDeletes action ... but this might 
not get rid of all the deleted documents, and depending on how those 
documents are distributed across the whole index, expungeDeletes might 
not do anything at all.  These operations are expensive (require a lot 
of time and system resources) and will temporarily increase the size of 
your index, up to double the starting size.

Before you go down the road of optimizing regularly, you should 
determine whether freeing up the disk space for deleted documents 
actually makes a substantial difference in performance.  In very old 
Solr versions, optimizing the index did produce major performance 
gains... but current versions have much better performance on indexes 
that have deleted documents.  Because performance is typically 
drastically reduced while the optimize is happening, the tradeoff may 
not be worthwhile.

Thanks,
Shawn

Re: TieredMergePolicyFactory question

Posted by Moulay Hicham <ma...@gmail.com>.
Thanks Shawn and Erick.

So far I haven't noticed any performance issues, either before or after the change.

My concern all along has been COST. We could have left the configuration as
is - keeping the deleted documents in the index - but then we would have to
scale up our Solr cluster. That would double our Solr cluster cost, and the
additional cost is what we are trying to avoid.

I will test the expungeDeletes and revert the max segment size back to 5G.

Thanks again,

Moulay

On Mon, Oct 26, 2020 at 5:49 AM Erick Erickson <er...@gmail.com>
wrote:

> "Some large segments were merged into 12GB segments and
> deleted documents were physically removed.”
> and
> “So with the current natural merge strategy, I need to update
> solrconfig.xml
> and increase the maxMergedSegmentMB often"
>
> I strongly recommend you do not continue down this path. You’re making a
> mountain out of a mole-hill. You have offered no proof that removing the
> deleted documents is noticeably improving performance. If you replace
> docs randomly, deleted docs will be removed eventually with the default
> merge policy without you doing _anything_ special at all.
>
> The fact that you think you need to continuously bump up the size of
> your segments indicates your understanding is incomplete. When
> you start changing settings basically at random in order to “fix” a
> problem,
> especially one that you haven’t demonstrated _is_ a problem, you
> invariably make the problem worse.
>
> By making segments larger, you’ve increased the work Solr (well Lucene) has
> to do in order to merge them since the merge process has to handle these
> larger segments. That’ll take longer. There are a fixed number of threads
> that do merging. If they’re all tied up, incoming updates will block until
> a thread frees up. I predict that if you continue down this path,
> eventually
> your updates will start to misbehave and you’ll spend a week trying to
> figure
> out why.
>
> If you insist on worrying about deleted documents, just expungeDeletes
> occasionally. I’d also set the segment size back to the default 5G. I
> can’t
> emphasize strongly enough that the way you’re approaching this will lead
> to problems, not to mention maintenance that is harder than it needs to
> be. If you do set the max segment size back to 5G, your 12G segments will
> _not_ merge until they have lots of deletes, making your problem worse.
> Then you’ll spend time trying to figure out why.
>
> Recovering from what you’ve done already has problems. Those large segments
> _will_ get rewritten (we call it “singleton merge”) when they’ve
> accumulated a
> lot of deletes, but meanwhile you’ll think that your problem is getting
> worse and worse.
>
> When those large segments have more than 10% deleted documents,
> expungeDeletes
> will singleton merge them and they’ll gradually shrink.
>
> So my prescription is:
>
> 1> set the max segment size back to 5G
>
> 2> monitor your segments. When you see your large segments  > 5G have
> more than 10% deleted documents, issue an expungeDeletes command (not
> optimize).
> This will recover your index from the changes you’ve already made.
>
> 3> eventually, all your segments will be under 5G. When that happens, stop
> issuing expungeDeletes.
>
> 4> gather some performance statistics and prove one way or another that as
> deleted
> docs accumulate over time, it impacts performance. NOTE: after your last
> expungeDeletes, deleted docs will accumulate over time until they reach a
> plateau and
> shouldn’t continue increasing after that. If you can _prove_ that
> accumulating deleted
> documents affects performance, institute a regular expungeDeletes. You could
> optimize instead, but expungeDeletes is less expensive, and on a changing
> index expungeDeletes is sufficient. Optimize is only really useful for a
> static index, so I’d avoid it in your situation.
>
> Best,
> Erick
>
> > On Oct 26, 2020, at 1:22 AM, Moulay Hicham <ma...@gmail.com>
> wrote:
> >
> > Some large segments were merged into 12GB segments and
> > deleted documents were physically removed.
>
>

Re: TieredMergePolicyFactory question

Posted by Erick Erickson <er...@gmail.com>.
"Some large segments were merged into 12GB segments and
deleted documents were physically removed.”
and
“So with the current natural merge strategy, I need to update solrconfig.xml
and increase the maxMergedSegmentMB often"

I strongly recommend you do not continue down this path. You’re making a
mountain out of a mole-hill. You have offered no proof that removing the
deleted documents is noticeably improving performance. If you replace
docs randomly, deleted docs will be removed eventually with the default
merge policy without you doing _anything_ special at all.

The fact that you think you need to continuously bump up the size of
your segments indicates your understanding is incomplete. When
you start changing settings basically at random in order to “fix” a problem,
especially one that you haven’t demonstrated _is_ a problem, you 
invariably make the problem worse.

By making segments larger, you’ve increased the work Solr (well Lucene) has
to do in order to merge them since the merge process has to handle these
larger segments. That’ll take longer. There are a fixed number of threads
that do merging. If they’re all tied up, incoming updates will block until
a thread frees up. I predict that if you continue down this path, eventually
your updates will start to misbehave and you’ll spend a week trying to figure
out why.

If you insist on worrying about deleted documents, just expungeDeletes
occasionally. I’d also set the segment size back to the default 5G. I can’t
emphasize strongly enough that the way you’re approaching this will lead
to problems, not to mention maintenance that is harder than it needs to
be. If you do set the max segment size back to 5G, your 12G segments will
_not_ merge until they have lots of deletes, making your problem worse. 
Then you’ll spend time trying to figure out why.

Recovering from what you’ve done already has problems. Those large segments
_will_ get rewritten (we call it “singleton merge”) when they’ve accumulated a
lot of deletes, but meanwhile you’ll think that your problem is getting worse and worse.

When those large segments have more than 10% deleted documents, expungeDeletes
will singleton merge them and they’ll gradually shrink.

So my prescription is:

1> set the max segment size back to 5G

2> monitor your segments. When you see your large segments  > 5G have 
more than 10% deleted documents, issue an expungeDeletes command (not optimize).
This will recover your index from the changes you’ve already made.

3> eventually, all your segments will be under 5G. When that happens, stop
issuing expungeDeletes.

4> gather some performance statistics and prove one way or another that as deleted
docs accumulate over time, it impacts performance. NOTE: after your last
expungeDeletes, deleted docs will accumulate over time until they reach a plateau and
shouldn’t continue increasing after that. If you can _prove_ that accumulating deleted
documents affects performance, institute a regular expungeDeletes. You could
optimize instead, but expungeDeletes is less expensive, and on a changing index
expungeDeletes is sufficient. Optimize is only really useful for a static
index, so I’d avoid it in your situation.
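The monitoring in step 2> of this prescription could be scripted; a sketch that assumes a response shaped like Solr's /admin/segments output (the field names `segments`, `size` for doc count, `delCount`, and `sizeInBytes` are taken from that handler, but treat the exact shape as an assumption and verify it against your Solr version):

```python
FIVE_GB = 5 * 1024**3

def segments_needing_expunge(segments_response: dict,
                             min_bytes: int = FIVE_GB,
                             min_deleted_pct: float = 10.0) -> list[str]:
    """Return names of large segments whose deleted-doc percentage exceeds
    the threshold, i.e. candidates for an expungeDeletes pass."""
    names = []
    for name, seg in segments_response["segments"].items():
        total = seg["size"] + seg["delCount"]  # live docs + deleted docs
        if total == 0 or seg["sizeInBytes"] < min_bytes:
            continue
        pct = 100.0 * seg["delCount"] / total
        if pct > min_deleted_pct:
            names.append(name)
    return names

# Made-up sample in the assumed response shape:
sample = {"segments": {
    "_a1": {"size": 9_000_000, "delCount": 3_000_000, "sizeInBytes": 12 * 1024**3},
    "_b2": {"size": 4_000_000, "delCount": 100_000, "sizeInBytes": 3 * 1024**3},
}}
print(segments_needing_expunge(sample))  # ['_a1']
```

When a segment shows up here, the prescription says to issue an expungeDeletes (a commit with expungeDeletes set to true through the update handler), not an optimize.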

Best,
Erick

> On Oct 26, 2020, at 1:22 AM, Moulay Hicham <ma...@gmail.com> wrote:
> 
> Some large segments were merged into 12GB segments and
> deleted documents were physically removed.


Re: TieredMergePolicyFactory question

Posted by Moulay Hicham <ma...@gmail.com>.
Thanks so much for clarifying. I have deployed the change to prod and it seems
to be working. Some large segments were merged into 12GB segments and
deleted documents were physically removed.

I am wondering about 3 other things:

1 - You mentioned that I need free disk space. Just to make sure that we
are talking about disk space here. Can RAM remain at the same size?
My current RAM is: index size < RAM < 1.5 × index size

2 - When the merge is happening, it happens on disk, and when it's
completed, the data is synced with RAM. I am just guessing here ;-).
I couldn't find a good explanation online about this.

3 - Also, I am wondering what you recommend for continuously
purging deleted documents: optimize? expungeDeletes? natural merge?
Here are more details about the need to purge documents.
My Solr cluster is very expensive, so we would like to contain the cost
and avoid scaling up if possible.
The Solr index is being written to at a rate > 100 TPS.
We also have a requirement to delete old data, so we are
continuously trimming millions of documents daily that are older than X
years.
So with the current natural merge strategy, I need to update solrconfig.xml
and increase maxMergedSegmentMB often so that I can reclaim physical
disk space.

I am wondering whether a feature that rewrites a single large merged segment
into another segment - purging deleted documents in the process - would be
useful for use cases like mine. It would help purge deleted documents
without the need to continuously increase maxMergedSegmentMB.

Thanks,
Moulay

On Fri, Oct 23, 2020 at 11:10 AM Erick Erickson <er...@gmail.com>
wrote:

> Well, you mentioned that the segments you’re concerned about were merged a year
> ago.
> If segments aren’t being merged, they’re pretty static.
>
> There’s no real harm in optimizing _occasionally_, even in an NRT index.
> If you have
> segments that were merged that long ago, you may be indexing continually
> but it
> sounds like it’s a situation where you update more recent docs rather than
> random
> ones over the entire corpus.
>
> That caution is more for indexes where you essentially replace docs in your
> corpus randomly, and it’s really about wasting a lot of cycles rather than
> bad stuff happening. When you randomly update documents (or delete them),
> the extra work isn’t worth it.
>
> Either operation will involve a lot of CPU cycles and can require that you
> have
> at least as much free space on your disk as the indexes occupy, so do be
> aware
> of that.
>
> All that said, what evidence do you have that this is worth any effort at
> all?
> Depending on the environment, you may not even be able to measure
> performance changes so this all may be irrelevant anyway.
>
> But to your question: yes, you can cause regular merging to merge
> segments with deleted docs more aggressively by setting
> deletesPctAllowed in solrconfig.xml. The default value is 33, and you
> can set it as low as 20 or as high as 50. We put a floor of 20% because
> the cost starts to rise quickly below that, and
> expungeDeletes is a better alternative at that point.
>
> This is not a hard number, and in practice the percentage of your index
> that consists
> of deleted documents tends to be lower than this number, depending of
> course
> on your particular environment.
>
> Best,
> Erick
>
> > On Oct 23, 2020, at 12:59 PM, Moulay Hicham <ma...@gmail.com>
> wrote:
> >
> > Thanks Eric.
> >
> > My index is near real time and frequently updated.
> > I checked this page
> >
> https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
> > and using forceMerge/expungeDeletes are NOT recommended.
> >
> > So I was hoping that the change in mergePolicyFactory will affect the
> > segments with high percent of deletes as part of the REGULAR segment
> > merging cycles. Is my understanding correct?
> >
> >
> >
> >
> > On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
> >> segment. Or you can expungeDeletes, that will rewrite all segments with
> >> more than 10% deleted docs. As of Solr 7.5, these operations respect
> the 5G
> >> limit.
> >>
> >> See:
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, Oct 23, 2020, 12:36 Moulay Hicham <ma...@gmail.com>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am using solr 8.1 in production. We have about 30%-50% of deleted
> >>> documents in some old segments that were merged a year ago.
> >>>
> >>> These segments size is about 5GB.
> >>>
> >>> I was wondering why these segments have a high % of deleted docs and
> >> found
> >>> out that they are NOT being candidates for merging because the
> >>> default TieredMergePolicy maxMergedSegmentMB is 5G.
> >>>
> >>> So I have modified the TieredMergePolicyFactory config as below to
> >>> lower the delete docs %
> >>>
> >>> <mergePolicyFactory
> >> class="org.apache.solr.index.TieredMergePolicyFactory">
> >>>  <int name="maxMergeAtOnce">10</int>
> >>>  <int name="segmentsPerTier">10</int>
> >>>  <double name="maxMergedSegmentMB">12000</double>
> >>>  <double name="deletesPctAllowed">20</double>
> >>> </mergePolicyFactory>
> >>>
> >>>
> >>> Do you see any issues with increasing the max merged segment to 12GB
> and
> >>> lowered the deletedPctAllowed to 20%?
> >>>
> >>> Thanks,
> >>>
> >>> Moulay
> >>>
> >>
>
>

Re: TieredMergePolicyFactory question

Posted by Erick Erickson <er...@gmail.com>.
Well, you mentioned that the segments you’re concerned about were merged a year ago.
If segments aren’t being merged, they’re pretty static.

There’s no real harm in optimizing _occasionally_, even in an NRT index. If you have
segments that were merged that long ago, you may be indexing continually but it
sounds like it’s a situation where you update more recent docs rather than random
ones over the entire corpus.

That caution is more for indexes where you essentially replace docs in your
corpus randomly, and it’s really about wasting a lot of cycles rather than
bad stuff happening. When you randomly update documents (or delete them),
the extra work isn’t worth it.

Either operation will involve a lot of CPU cycles and can require that you have
at least as much free space on your disk as the indexes occupy, so do be aware
of that.

All that said, what evidence do you have that this is worth any effort at all?
Depending on the environment, you may not even be able to measure
performance changes so this all may be irrelevant anyway.

But to your question: yes, you can cause regular merging to merge segments
with deleted docs more aggressively by setting deletesPctAllowed in
solrconfig.xml. The default value is 33, and you can set it as low as 20 or as
high as 50. We put a floor of 20% because the cost starts to rise quickly
below that, and expungeDeletes is a better alternative at that point.

This is not a hard number, and in practice the percentage of your index that consists
of deleted documents tends to be lower than this number, depending of course
on your particular environment.

Best,
Erick

> On Oct 23, 2020, at 12:59 PM, Moulay Hicham <ma...@gmail.com> wrote:
> 
> Thanks Eric.
> 
> My index is near real time and frequently updated.
> I checked this page
> https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
> and using forceMerge/expungeDeletes are NOT recommended.
> 
> So I was hoping that the change in mergePolicyFactory will affect the
> segments with high percent of deletes as part of the REGULAR segment
> merging cycles. Is my understanding correct?
> 
> 
> 
> 
> On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson <er...@gmail.com>
> wrote:
> 
>> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
>> segment. Or you can expungeDeletes, that will rewrite all segments with
>> more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
>> limit.
>> 
>> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>> 
>> Best
>> Erick
>> 
>> On Fri, Oct 23, 2020, 12:36 Moulay Hicham <ma...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I am using solr 8.1 in production. We have about 30%-50% of deleted
>>> documents in some old segments that were merged a year ago.
>>> 
>>> These segments size is about 5GB.
>>> 
>>> I was wondering why these segments have a high % of deleted docs and
>> found
>>> out that they are NOT being candidates for merging because the
>>> default TieredMergePolicy maxMergedSegmentMB is 5G.
>>> 
>>> So I have modified the TieredMergePolicyFactory config as below to
>>> lower the delete docs %
>>> 
>>> <mergePolicyFactory
>> class="org.apache.solr.index.TieredMergePolicyFactory">
>>>  <int name="maxMergeAtOnce">10</int>
>>>  <int name="segmentsPerTier">10</int>
>>>  <double name="maxMergedSegmentMB">12000</double>
>>>  <double name="deletesPctAllowed">20</double>
>>> </mergePolicyFactory>
>>> 
>>> 
>>> Do you see any issues with increasing the max merged segment to 12GB and
>>> lowered the deletedPctAllowed to 20%?
>>> 
>>> Thanks,
>>> 
>>> Moulay
>>> 
>> 


Re: TieredMergePolicyFactory question

Posted by Moulay Hicham <ma...@gmail.com>.
Thanks Eric.

My index is near real time and frequently updated.
I checked this page
https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
and using forceMerge/expungeDeletes are NOT recommended.

So I was hoping that the change in mergePolicyFactory will affect the
segments with high percent of deletes as part of the REGULAR segment
merging cycles. Is my understanding correct?




On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson <er...@gmail.com>
wrote:

> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
> segment. Or you can expungeDeletes, that will rewrite all segments with
> more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
> limit.
>
> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>
> Best
> Erick
>
> On Fri, Oct 23, 2020, 12:36 Moulay Hicham <ma...@gmail.com> wrote:
>
> > Hi,
> >
> > I am using solr 8.1 in production. We have about 30%-50% of deleted
> > documents in some old segments that were merged a year ago.
> >
> > These segments size is about 5GB.
> >
> > I was wondering why these segments have a high % of deleted docs and
> found
> > out that they are NOT being candidates for merging because the
> > default TieredMergePolicy maxMergedSegmentMB is 5G.
> >
> > So I have modified the TieredMergePolicyFactory config as below to
> > lower the delete docs %
> >
> > <mergePolicyFactory
> class="org.apache.solr.index.TieredMergePolicyFactory">
> >   <int name="maxMergeAtOnce">10</int>
> >   <int name="segmentsPerTier">10</int>
> >   <double name="maxMergedSegmentMB">12000</double>
> >   <double name="deletesPctAllowed">20</double>
> > </mergePolicyFactory>
> >
> >
> > Do you see any issues with increasing the max merged segment to 12GB and
> > lowered the deletedPctAllowed to 20%?
> >
> > Thanks,
> >
> > Moulay
> >
>

Re: TieredMergePolicyFactory question

Posted by Erick Erickson <er...@gmail.com>.
Just go ahead and optimize/forceMerge, but do _not_ optimize to one
segment. Or you can expungeDeletes, that will rewrite all segments with
more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
limit.

See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

Best
Erick

On Fri, Oct 23, 2020, 12:36 Moulay Hicham <ma...@gmail.com> wrote:

> Hi,
>
> I am using solr 8.1 in production. We have about 30%-50% of deleted
> documents in some old segments that were merged a year ago.
>
> These segments size is about 5GB.
>
> I was wondering why these segments have a high % of deleted docs and found
> out that they are NOT being candidates for merging because the
> default TieredMergePolicy maxMergedSegmentMB is 5G.
>
> So I have modified the TieredMergePolicyFactory config as below to
> lower the delete docs %
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
>   <double name="maxMergedSegmentMB">12000</double>
>   <double name="deletesPctAllowed">20</double>
> </mergePolicyFactory>
>
>
> Do you see any issues with increasing the max merged segment to 12GB and
> lowered the deletedPctAllowed to 20%?
>
> Thanks,
>
> Moulay
>