Posted to issues@lucene.apache.org by "Suhan Mao (Jira)" <ji...@apache.org> on 2021/08/05 10:19:00 UTC

[jira] [Commented] (LUCENE-10025) SoftDeletesRetentionMergePolicy#numDeletesToMerge caused indexing backlogged

    [ https://issues.apache.org/jira/browse/LUCENE-10025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393782#comment-17393782 ] 

Suhan Mao commented on LUCENE-10025:
------------------------------------

[~dnhatn] I think [~zhangchao.es]'s question probably refers to this code:

 
{code:java}
@Override
public int numDeletesToMerge(SegmentCommitInfo info, int delCount, IOSupplier<CodecReader> readerSupplier) throws IOException {
  final int numDeletesToMerge = super.numDeletesToMerge(info, delCount, readerSupplier);
  if (numDeletesToMerge != 0 && info.getSoftDelCount() > 0) {
    final CodecReader reader = readerSupplier.get();
    if (reader.getLiveDocs() != null) {
      BooleanQuery.Builder builder = new BooleanQuery.Builder();
      builder.add(new DocValuesFieldExistsQuery(field), BooleanClause.Occur.FILTER);
      builder.add(retentionQuerySupplier.get(), BooleanClause.Occur.FILTER);
      Scorer scorer = getScorer(builder.build(), FilterCodecReader.wrapLiveDocs(reader, null, reader.maxDoc()));
      if (scorer != null) {
        DocIdSetIterator iterator = scorer.iterator();
        Bits liveDocs = reader.getLiveDocs();
        int numDeletedDocs = reader.numDeletedDocs();
        while (iterator.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          if (liveDocs.get(iterator.docID()) == false) {
            numDeletedDocs--;
          }
        }
        return numDeletedDocs;
      }
    }
  }
  return numDeletesToMerge;
}
{code}
 

Why do we have to iterate the scorer and check whether each doc ID is absent from liveDocs?

Since every doc ID returned by the scorer must have the soft-deletes field, it should not be in liveDocs. So why do we need the *_liveDocs.get(iterator.docID()) == false_* check?
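
To make the question concrete, here is a toy model of the counting loop, a minimal sketch only. It uses java.util.BitSet in place of Lucene's Bits and Scorer; the class and method names are hypothetical and none of this is Lucene API. It compares the count with and without the liveDocs check. In this model the two counts differ only if some doc matched by the retention query is still live; whether that can actually happen in Lucene is exactly what I am asking.

```java
import java.util.BitSet;

// Hypothetical toy model, not Lucene code: BitSets stand in for liveDocs
// and for the set of docs matched by the (field exists AND retention) query.
class NumDeletesToMergeSketch {

    // Mirrors the loop in the snippet: start from all deleted docs and
    // subtract every matched doc that is actually marked as deleted.
    static int countWithLiveDocsCheck(BitSet liveDocs, BitSet retained, int maxDoc) {
        int numDeletedDocs = maxDoc - liveDocs.cardinality();
        for (int doc = retained.nextSetBit(0); doc >= 0; doc = retained.nextSetBit(doc + 1)) {
            if (!liveDocs.get(doc)) {
                numDeletedDocs--;
            }
        }
        return numDeletedDocs;
    }

    // The shortcut the question implies: assume every matched doc is deleted,
    // so the match count can be subtracted without consulting liveDocs.
    static int countWithoutLiveDocsCheck(BitSet liveDocs, BitSet retained, int maxDoc) {
        return maxDoc - liveDocs.cardinality() - retained.cardinality();
    }

    public static void main(String[] args) {
        int maxDoc = 6;
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, 3);              // docs 0..2 are live, docs 3..5 are deleted

        BitSet retained = new BitSet(maxDoc);
        retained.set(3);
        retained.set(4);                 // two deleted docs must be retained

        // Every matched doc is deleted: both variants agree.
        System.out.println(countWithLiveDocsCheck(liveDocs, retained, maxDoc));
        System.out.println(countWithoutLiveDocsCheck(liveDocs, retained, maxDoc));

        // Assumed scenario: a matched doc (doc 2) that is still live.
        retained.set(2);
        System.out.println(countWithLiveDocsCheck(liveDocs, retained, maxDoc));
        System.out.println(countWithoutLiveDocsCheck(liveDocs, retained, maxDoc));
    }
}
```

If matched docs can never be live, the shortcut would let the policy use a plain match count instead of testing liveDocs for every hit.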

 

 

> SoftDeletesRetentionMergePolicy#numDeletesToMerge caused indexing backlogged
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-10025
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10025
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 8.4
>            Reporter: zhangchao.es
>            Priority: Major
>              Labels: indexing, soft-delete
>         Attachments: flamegraph.html, image-2021-07-14-16-52-34-740.png
>
>
> In LUCENE-8246, numDeletesToMerge was added to SoftDeletesRetentionMergePolicy.
> If there are very many soft-deleted docs and they are also retained by a retention lease, the numDeletesToMerge function has a performance issue.
> For instance, while an update-heavy indexing workload is writing to Elasticsearch, we move a shard to another node. If the move continues for a long time, the old shard becomes very large, because soft-deleted operations have to be held by the retention lease. The more soft-deleted documents there are, the slower indexing becomes. With a shard size of about 20GB, we get the flamegraph below:
>  
> !image-2021-07-14-16-52-34-740.png!
>  


