Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/04/23 02:47:54 UTC

[GitHub] [lucene] MichelLiu commented on pull request #92: Expunge big segment with oversize deletePct caused by continuously updating a batch of data

MichelLiu commented on pull request #92:
URL: https://github.com/apache/lucene/pull/92#issuecomment-825347763


   I ran into a problem with TieredMergePolicy. As I kept updating the same batch of documents over and over, I ended up with many ~4.9GB segments whose segDelPct was already greater than deletesPctAllowed, yet TieredMergePolicy never selected them for merging.
   Then I found this code and figured out the reason:
   ```java
   if (segSizeDocs.sizeInBytes > maxMergedSegmentBytes / 2 && (totalDelPct <= deletesPctAllowed || segDelPct <= deletesPctAllowed)) {
     iter.remove();
     tooBigCount++; // Just for reporting purposes.
     totIndexBytes -= segSizeDocs.sizeInBytes;
     allowedDelCount -= segSizeDocs.delCount;
   }
   ```
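   Since totalDelPct is computed over the whole index, a single ~4.9GB segment can sit well above deletesPctAllowed and still be dropped from the merge candidates here, as long as the index-wide delete percentage stays below the threshold. For reference, a minimal sketch (not part of this PR, values illustrative only) of the TieredMergePolicy knobs that interact with this check, assuming Lucene 8.x, where maxMergedSegmentMB defaults to 5120MB (so anything above ~2.5GB trips the `sizeInBytes > maxMergedSegmentBytes / 2` branch) and deletesPctAllowed defaults to 33:
   ```java
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.TieredMergePolicy;

   public class MergePolicyTuning {
     // Returns a config whose merge policy is more willing to merge
     // big segments that carry a lot of deletes.
     static IndexWriterConfig lenientDeletesConfig() {
       TieredMergePolicy tmp = new TieredMergePolicy();
       tmp.setDeletesPctAllowed(20.0);       // lowest accepted value; default is 33
       tmp.setMaxMergedSegmentMB(10 * 1024); // raise the cap (default 5120MB) so a ~4.9GB segment stays below max/2
       return new IndexWriterConfig().setMergePolicy(tmp);
     }
   }
   ```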
   
   Here are the segments I was seeing at the time (columns: index, shard, prirep, ip, segment, generation, docs.count, docs.deleted, size, size.memory, committed, searchable, version, compound):
   
   1613741580098 0     p      10.10.112.123 _2h             89    1224440       569330    4.9gb     4905832 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _4v            175    2383463       425919    4.9gb     5636245 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _6n            239    2891298       380212    4.9gb     5617940 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _1lwc        75036     468350       364104    4.3gb     3718611 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _1xh2        90038     678187       252779    3.6gb     3453739 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _25u8       100880     482795       237275    4.1gb     3370799 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _2fld       113521     721503       225160    4.1gb     3776954 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _2m9h       122165     831574       127572    4.2gb     3812013 true      true       8.4.0   false
   1613741580098 0     p      10.10.112.123 _2n01       123121      34000        27437  345.3mb      543426 true      true       8.4.0   true
   1613741580098 0     p      10.10.112.123 _2nq6       124062      36985        19838  319.2mb      515882 true      true       8.4.0   true
   1613741580098 0     p      10.10.112.123 _2o7d       124681      52725        40581  556.3mb      632128 true      true       8.4.0   true
   1613741580098 0     p      10.10.112.123 _2ouj       125515      11158         6330    114mb      235396 true      true       8.4.0   true
   
   
   I also had a 564GB index that grew to 1400GB after a month of bulk updates. That wasted a significant amount of disk space and pushed search latency up to 450ms, so for now we have to reindex every month.
   
   My approach is to merge these large segments, but as infrequently as possible; a stop-gap way to reclaim their deletes on demand is sketched below.
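   Until something along these lines lands, one workaround (again just a sketch, not what this PR implements; the index path is a placeholder) is IndexWriter.forceMergeDeletes(), which rewrites segments whose delete percentage exceeds forceMergeDeletesPctAllowed (10% by default) and can be scheduled as rarely as needed:
   ```java
   import java.nio.file.Paths;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.TieredMergePolicy;
   import org.apache.lucene.store.FSDirectory;

   public class ExpungeDeletes {
     public static void main(String[] args) throws Exception {
       TieredMergePolicy tmp = new TieredMergePolicy();
       tmp.setForceMergeDeletesPctAllowed(10.0); // default; segments above this delete pct get rewritten

       IndexWriterConfig iwc = new IndexWriterConfig().setMergePolicy(tmp);
       // "/path/to/index" is a placeholder for the real index directory.
       try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
         writer.forceMergeDeletes(); // I/O heavy; best run during off-peak hours
       }
     }
   }
   ```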




