You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/18 12:42:54 UTC
[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

mikemccand commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875839832


##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
         // segments, and already pre-excluded the too-large segments:
         assert candidate.size() > 0;
 
+        SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0));

Review Comment:
   The incoming (sorted) infos are sorted by decreasing size, right? So the `candidate.get(0)` is indeed the max.
   
   Maybe rename to `maxCadidateSegmentSize`?



##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
         // segments, and already pre-excluded the too-large segments:
         assert candidate.size() > 0;
 
+        SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0));
+        if (hitTooLarge == false
+            && mergeType == MERGE_TYPE.NATURAL
+            && bytesThisMerge < maxSegmentSize.sizeInBytes * 1.5) {
+          // Ignore any merge where the resulting segment is not at least 50% larger than the

Review Comment:
   Hmm so this new logic applies to all merges, not just the "under floor" ones?  I wonder if there is some risk here that this change will block "pathological" merges that we intentionally do today under heavy deletions count cases?  Maybe we should pro-rate by deletion percent?  Oh!  I think `SegmentSizeAndDocs` already does so (well the `size` method in `MergePolicy`).  Maybe we should add a comment / javadoc in this confusing class heh.



##########
lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java:
##########
@@ -582,23 +589,29 @@ public MergeSpecification findMerges(
         if (anyMerging) {
           // skip
         } else if (!anyTooLarge) {
-          if (spec == null) spec = new MergeSpecification();
-          final List<SegmentCommitInfo> mergeInfos = new ArrayList<>(end - start);
-          for (int i = start; i < end; i++) {
-            mergeInfos.add(levels.get(i).info);
-            assert infos.contains(levels.get(i).info);
-          }
-          if (verbose(mergeContext)) {
-            message(
-                "  add merge="
-                    + segString(mergeContext, mergeInfos)
-                    + " start="
-                    + start
-                    + " end="
-                    + end,
-                mergeContext);
-          }
-          spec.add(new OneMerge(mergeInfos));
+          if (mergeSize >= maxSegmentSize * 1.5) {
+            // Ignore any merge where the resulting segment is not at least 50% larger than the
+            // biggest input segment.
+            // Otherwise we could run into pathological O(N^2) merging where merges keep rewriting
+            // again and again the biggest input segment into a segment that is barely bigger.
+            if (spec == null) spec = new MergeSpecification();

Review Comment:
   Hmm, split this into multiple lines with { and }?  Does spotless/tidy do that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org