Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2020/02/01 15:02:00 UTC
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

    [ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028113#comment-17028113 ] 

David Smiley commented on LUCENE-8962:
--------------------------------------

Indeed, I meant "small bus factor" :)  I'm looking deeper at Michael Froh's PR this weekend, and by necessity the pertinent parts of IndexWriter.

>  Well, using {{SerialMergeScheduler}} allows the test to pass, since the merges kicked off due to new segments after the commit will run, synchronously (using the main thread in your test) to completion.

Yes, it's not "realistic", but my objective in this code snippet was merely to demonstrate that the _combination_ of a merge policy and a merge scheduler has the ability to affect the searchable segments on commit when using the NRT Reader/Searcher.  Apparently it doesn't work if a normal (non-NRT) Reader/Searcher is opened; I can see that.  Maybe this is a shortcoming of IndexWriter; why shouldn't IW be consistent on this matter?

It's not apparent to me that we need a new method on the MergePolicy when the MergeTrigger parameter is able to differentiate the types of merges so that a MP is able to behave differently depending on the circumstance.  Am I unclear on this?
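
To make the trigger-based idea concrete, here is a minimal, self-contained sketch. It deliberately does not use the Lucene API: the {{Trigger}} enum is a hypothetical stand-in loosely modeled on {{MergeTrigger}}, segments are reduced to a list of sizes, and {{selectForMerge}} and {{smallThreshold}} are names invented for illustration. The assumption is that a full flush (which, as far as I understand, is what a commit or an NRT reopen causes) is the moment at which a policy might want to coalesce small segments, while other triggers are left to normal background merging.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are Lucene classes.
class TriggerAwareMergeSketch {

    // Hypothetical stand-in for Lucene's MergeTrigger (a subset of its names).
    enum Trigger { SEGMENT_FLUSH, FULL_FLUSH, EXPLICIT }

    /**
     * Returns the indexes of the segments this toy policy would merge.
     * Only a FULL_FLUSH trigger selects small segments; every other
     * trigger returns an empty selection.
     */
    static List<Integer> selectForMerge(Trigger trigger,
                                        List<Long> segmentSizes,
                                        long smallThreshold) {
        List<Integer> selected = new ArrayList<>();
        if (trigger != Trigger.FULL_FLUSH) {
            return selected; // leave these cases to regular merging
        }
        for (int i = 0; i < segmentSizes.size(); i++) {
            if (segmentSizes.get(i) < smallThreshold) {
                selected.add(i);
            }
        }
        // A merge needs at least two input segments to be worthwhile.
        if (selected.size() < 2) {
            selected.clear();
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(1L, 2L, 500L, 3L);
        System.out.println(selectForMerge(Trigger.FULL_FLUSH, sizes, 10L));    // [0, 1, 3]
        System.out.println(selectForMerge(Trigger.SEGMENT_FLUSH, sizes, 10L)); // []
    }
}
```

The point of the sketch is only that one policy method can branch on the trigger it is handed; whether the existing trigger values are fine-grained enough for the refresh/commit case is exactly the open question.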

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory segments to disk and open an {{IndexReader}} to search them, and this is typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} will write many small segments during {{refresh}} and this then adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve {{IndexWriter}}'s refresh to optionally kick off merge policy to merge segments below some threshold before opening the near-real-time reader?  It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
