Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2017/12/28 20:30:01 UTC

[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents

    [ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305721#comment-16305721 ] 

Erick Erickson commented on LUCENE-7976:
----------------------------------------

OK, let's see if I can summarize where we are on this:

1> Make TMP respect maxMergedSegmentMB even during forceMerge, unless maxSegments is specified (see <3>).

2> Add documentation and guidance on how reclaimDeletesWeight can be used to tune the percentage of deleted documents that remain in the index. Exactly how this should be set is rather opaque. It defaults to 2.0, and the comment in the code reads: "but be careful not to go so high that way too much merging takes place; a value of 3.0 is probably nearly too high". We need to keep people from setting it to 1000. Should we establish an upper bound, perhaps with a warning if it's exceeded? (See the sketch after <3> below for one way this could look.)

3> If people want the old behavior, they have two choices (also illustrated in the sketch after this list):
3a> Set maxMergedSegmentMB very high. The drawback is that this also kicks in during normal merging, which I think is sub-optimal for the pattern where I index docs once a day and then want to optimize at the end.
3b> Specify maxSegments = 1 during forceMerge. This will override any maxMergedSegmentMB setting.
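
To make <2>, <3a>, and <3b> concrete, here's a minimal sketch against the existing TieredMergePolicy setters. The 10.0 clamp on reclaimDeletesWeight is only an illustration of the "keep people from setting it to 1000" idea, not an agreed-on bound, and the index path is a placeholder:

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TmpForceMergeSketch {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();

    // <2> reclaimDeletesWeight defaults to 2.0; ~3.0 is already "probably nearly too high".
    // The clamp below only illustrates the proposed sanity check; it is not current behavior.
    double requestedWeight = 3.0;
    tmp.setReclaimDeletesWeight(Math.min(requestedWeight, 10.0));

    // <3a> old behavior via a huge max merged segment size; note this also
    // affects normal (non-forced) merging, not just forceMerge.
    // tmp.setMaxMergedSegmentMB(1024 * 1024); // ~1 TB, effectively unbounded

    Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(tmp);
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      // <3b> an explicit maxSegments = 1 would override maxMergedSegmentMB under this proposal.
      writer.forceMerge(1);
    }
  }
}
{code}

The point of <3b> is that only the explicit forceMerge(1) call opts back into a single huge segment; normal merging keeps respecting maxMergedSegmentMB.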

<3b> is my attempt to reconcile _wanting_ one huge segment, but only when doing a forceMerge. Yes, people can back themselves into the same corner they get into now by doing this, but that is acceptable IMO. We're not trying to make it _impossible_ to get into a bad state, just trying to make it so users don't do it by accident.

Is this at least good enough to go on with until we see how it behaves?

Meanwhile, I'll check in SOLR-7733.

> Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>         Attachments: LUCENE-7976.patch
>
>
> We're seeing situations "in the wild" where very large indexes (on disk) are handled quite easily by a single Lucene index. This is particularly true as features like docValues move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted documents, as per a dev list conversation with Mike McCandless (and his blog here: https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large aggregate indexes (think many TB), solutions like "you need to distribute your collection over more shards" become very costly. Additionally, the tempting "optimize" button exacerbates the issue: once you form, say, a 100G segment (by optimizing/forceMerging), it is not eligible for merging until 97.5G of the docs in it are deleted (with the current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, suggestions welcome), which would default to 100 (i.e. the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO MATTER HOW LARGE. There are two cases:
> >> the segment has < 5G "live" docs. In that case it would be merged with smaller segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just be rewritten.
> >> the segment has > 5G "live" docs (the result of a forceMerge or optimize). It would be rewritten into a single segment removing all deleted docs, no matter how big it is to start. The 100G example above would be rewritten to an 80G segment, for instance.
> Of course this could lead to much more I/O, which is why the default would be the same behavior we see now. As it stands, though, there's no way to recover from an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene indexes "in the wild" at this point, with 10s of shards replicated 3 or more times. And that doesn't even include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A new merge policy is certainly an alternative.
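
To make the quoted proposal above concrete, here is a hedged sketch of the selection rule it describes. All names and thresholds are placeholders taken from the prose; this is not existing TieredMergePolicy code or API:

{code:java}
// Illustrative only -- not TieredMergePolicy code; all names here are placeholders.
class DeletePctSelectionSketch {
  static final double MAX_SEGMENT_GB = 5.0;            // current default max merged segment size
  static final double MAX_ALLOWED_PCT_DELETED = 20.0;  // proposed knob; 100 == today's behavior

  enum Action { LEAVE_ALONE, MERGE_WITH_SMALLER, REWRITE_ALONE }

  static Action decide(double segmentSizeGB, double pctDeleted) {
    if (pctDeleted <= MAX_ALLOWED_PCT_DELETED) {
      return Action.LEAVE_ALONE;          // under the threshold: normal TMP rules apply
    }
    double liveGB = segmentSizeGB * (1.0 - pctDeleted / 100.0);
    if (liveGB < MAX_SEGMENT_GB) {
      return Action.MERGE_WITH_SMALLER;   // pack with smaller segments up to ~5G
    }
    return Action.REWRITE_ALONE;          // e.g. 100G at 20% deleted -> rewritten to 80G
  }
}
{code}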



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
