Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/04/06 18:19:00 UTC

[jira] [Commented] (LUCENE-9827) Small segments are slower to merge due to stored fields since 8.7

    [ https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315763#comment-17315763 ] 

ASF subversion and git services commented on LUCENE-9827:
---------------------------------------------------------

Commit be94a667f2091b2c0fad570f9e726197d466767b in lucene's branch refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=be94a66 ]

LUCENE-9827: avoid wasteful recompression for small segments (#28)

Require that the segment has enough dirty documents to create a clean
chunk before recompressing during merge: there must be at least maxChunkSize.

This prevents wasteful recompression with small flushes (e.g. every
document): we ensure recompression achieves some "permanent" progress.

Expose maxDocsPerChunk as a parameter for term vectors too, matching the
stored fields format. This allows for easy testing.

Increment numDirtyDocs for partially optimized merges:
If segment N needs recompression, we have to flush any buffered docs
before bulk-copying segment N+1. Don't just increment numDirtyChunks,
also make sure numDirtyDocs is incremented, too.
This doesn't have a performance impact, and is unrelated to tooDirty()
improvements, but it is easier to reason about things with correct
statistics in the index.

Further tuning of how dirtiness is measured: for simplification just use percentage
of dirty chunks.

Co-authored-by: Adrien Grand <jp...@gmail.com>
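
The merge-time check described in the commit message above can be sketched
roughly as follows. This is a minimal illustration, not the actual Lucene
source: the class name, method signature, and the value 128 are hypothetical,
"maxChunkSize" is read here as a number of documents per chunk, and the 1%
threshold is taken from the issue description (whether the comparison is
strict is an assumption).

```java
// Hypothetical sketch of the dirtiness check; not the actual Lucene code.
final class DirtySegmentSketch {

    /**
     * Returns true if a segment's stored fields should be recompressed during
     * merge instead of being bulk-copied.
     *
     * @param numDirtyDocs    documents that live in incomplete (dirty) chunks
     * @param numDirtyChunks  number of incomplete chunks in the segment
     * @param numChunks       total number of chunks in the segment
     * @param maxDocsPerChunk documents needed to fill one clean chunk (assumed meaning)
     */
    static boolean tooDirty(long numDirtyDocs, long numDirtyChunks,
                            long numChunks, int maxDocsPerChunk) {
        // Recompression must be able to produce at least one full, clean chunk,
        // otherwise it makes no "permanent" progress.
        boolean enoughDirtyDocs = numDirtyDocs >= maxDocsPerChunk;
        // Dirtiness is measured as the percentage of dirty chunks (assumed: more than 1%).
        boolean manyDirtyChunks = numDirtyChunks * 100 > numChunks;
        return enoughDirtyDocs && manyDirtyChunks;
    }

    public static void main(String[] args) {
        int maxDocsPerChunk = 128; // illustrative value only

        // Ten single-document flushes: every chunk is dirty, but there are too
        // few dirty docs to build one clean chunk, so bulk-copy.
        System.out.println(tooDirty(10, 10, 10, maxDocsPerChunk));

        // A thousand single-document flushes: enough dirty docs to make real
        // progress, so recompress.
        System.out.println(tooDirty(1000, 1000, 1000, maxDocsPerChunk));

        // Large segment with a single incomplete trailing chunk: bulk-copy.
        System.out.println(tooDirty(50, 1, 1000, maxDocsPerChunk));
    }
}
```

With these illustrative inputs, only the second call asks for recompression.
The first case is the one the commit targets: a small segment whose chunks are
all dirty but which cannot yet fill a single clean chunk.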

> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
>                 Key: LUCENE-9827
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9827
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: Indexer.java, log-and-lucene-9827.patch, merge-count-by-num-docs.png, merge-type-by-version.png, total-merge-time-by-num-docs-on-small-segments.png, total-merge-time-by-num-docs.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed down after upgrading to 8.7. After digging we identified that this was due to the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that are then split into sub-blocks and compressed using shared dictionaries (one dictionary per top-level block). As the top-level blocks are larger than they were before, segments are more likely to be considered "dirty" by the merging logic. Dirty segments are segments where 1% of the data or more consists of incomplete blocks. For large segments, the size of blocks doesn't really affect the dirtiness of segments: if you flush a segment that has 100 blocks or more, it will never be considered dirty as only the last block may be incomplete. But for small segments it does: for instance, if your segment has only 10 blocks, it is very likely to be considered dirty given that the last block is always incomplete. And the fact that we increased the top-level block size means that segments that used to be considered clean might now be considered dirty.
> And indeed benchmarks reported that while large stored fields merges became slightly faster after upgrading to 8.7, the smaller merges actually became slower. See attached chart, which gives the total merge time as a function of the number of documents in the segment.
> I don't know how we can address this: it is a natural consequence of the larger block size, which is needed to achieve better compression ratios. But I wanted to open an issue about it in case someone has a bright idea for how we could make things better.


