You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/09 23:25:30 UTC

[GitHub] [lucene] mdmarshmallow opened a new issue, #11761: Expand TieredMergePolicy deletePctAllowed limits

mdmarshmallow opened a new issue, #11761:
URL: https://github.com/apache/lucene/issues/11761

   ### Description
   
   I'm an engineer at Amazon Search and we have been experimenting with more aggressively getting rid of deleted documents. We use TieredMergePolicy and we would like to set `TieredMergePolicy#deletesPctAllowed` to be [lower than the current limit of 20%](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L157). 
   
   I was wondering why this limit was set in place. I'm sure I could be missing some context here. Maybe we could keep the limits in place but allow users to explicitly remove the checks? Any information would be much appreciated, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1248288919

   Historically this was not configurable and Lucene would allow up to 50% deleted documents. When we introduced an option, we made sure to introduce a lower bound on the value because a value of zero would essentially require Lucene to rewrite every segment that has a deletion after every update operation, which is certainly undesirable. Allowing users to go from 50% to 20% felt like a significant improvement already.
   
   We could discuss lowering the limit if we feel like this could lead to merging patterns that are still acceptable. E.g. I used `BaseMergePolicyTestCase#doTestSimulateUpdates` in the past to get a sense of how this option would influence write amplification.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1251965881

   Thanks for taking the time to look into this! I think 5% would be a good start, it would be near the threshold we want to test (we were thinking 2% but looking at your initial write amplification numbers, this may not be a great idea). I'm also planning on making an issue to create a `FilterDirectory` that tracks the write amplification factor. This should let us (Amazon Search) get some write amplification numbers that I can hopefully share here as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1269018833

   I updated the PR to change the default as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1264018168

   I can include that in the above PR as well if you all agree that it's a good idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz closed issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
jpountz closed issue #11761: Expand TieredMergePolicy deletePctAllowed limits
URL: https://github.com/apache/lucene/issues/11761


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1277704569

   Closing: https://github.com/apache/lucene/pull/11831 has been merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1251264246

   I got some numbers for write amplification for the case tested in `TestTieredMergePolicy#testSimulateUpdates`:
   
   | Allowed percentage of deletes | Write amplification |
   | ------------- | ------------- |
   | 50 (max) | 4.34 |
   | 33 (default) | 4.34 |
   | 20 (min) | 4.68 |
   | 10 | 6.13 |
   | 5 | 8.76 |
   | 4 | 10.31 |
   | 3 | 12.97 |
   | 2 | 18.76 |
   | 1 | 37.89 |
   | 0 | 10779.78 |
   
   Assuming these numbers are representative, maybe we could allow users to configure 5% as the allowed percentage of deletes that their indexes may have, which translates to ~2x more write amplification compared to the default of 33% according to the above numbers.
   
   For reference, the algorithm that `TieredMergePolicy` uses to keep the number of deletes under the threshold consists of running the most balanced merge (with a small bias towards merges that reclaim more deletes) until the number of deletes of the index is under the threshold.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1252993085

   Here is the issue I created with a PR attached in case you were interested: https://github.com/apache/lucene/issues/11795


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1248348122

   Hi, thanks for the response! Your explanation of 0% not being allowed makes complete sense. For some context though, using our own forked version of `TieredMergePolicy`, we have tested with down to 2% allowable deletion and still see that behavior is desirable for us (more specifically much lower index sizes than at 20% deletes).
   
   Maybe if we want to maintain those limits, we could create a `setDeletesPctAllowedUnsafe()` or something like that which has no limits on it (or at least drops the lower bound to being > 0 instead of > 20)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
uschindler commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1261531792

   I was also doing consulting for an huge Elasticsearch user and they also had this problem of keeping deletes as low as possible and the 20% limit was way too high. 20% looks like an arbitrary limitation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dan2097 commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
dan2097 commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1255309927

   I have also ran into this on our patent search system. In our index the problem is exagerrated by the larger documents tending to be more frequently reindexed so the 20% deleted documents can translate to 40% of the overall index size!
   For my use case 5% would be a massive improvement.
   
   I ccan definitely imagine that for a system where indexing is light and infrequent 2% may make sense to ensure optimal perfomance/disk usage, without requiring the explicit use expungeDeletes. Having said that 5% is definitely low enough for my use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1262741708

   > I got some numbers for write amplification for the case tested in TestTieredMergePolicy#testSimulateUpdates:
   
   @jpountz based on these numbers wouldn't it also make sense to consider changing the default from `33%` to `20%` as well? Just an idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1262350445

   If someone opens a PR to decrease the limit from 20% to 5% I'll happily approve the change given the results I shared above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1262618406

   Here is a PR: https://github.com/apache/lucene/pull/11831


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org