Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2021/01/25 18:38:01 UTC
[jira] [Updated] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

     [ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-8962:
---------------------------------
    Fix Version/s: 8.7
      Description: 
Two improvements were added: 8.6 has merge-on-commit (by Froh et al.), 8.7 has merge-on-refresh (by Simon).  See {{MergePolicy.findFullFlushMerges}}.

The original description follows:
----
With near-real-time search we ask {{IndexWriter}} to write all in-memory segments to disk and open an {{IndexReader}} to search them, and this is typically a quick operation.

However, when you use many threads for concurrent indexing, {{IndexWriter}} will write many small segments during {{refresh}}, and this then adds search-time cost as searching must visit all of these tiny segments.

The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve {{IndexWriter}}'s refresh to optionally kick off the merge policy to merge segments below some threshold before opening the near-real-time reader?  It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...
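
To make the threshold idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: {{SmallSegmentSelector}}, {{Segment}}, and {{selectSmall}} are hypothetical names invented for illustration, and the 1 MB threshold is an arbitrary assumption. It only shows the selection step: pick merge candidates from the segments captured at refresh time, so anything flushed later is never considered.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: these types are hypothetical and are not Lucene classes. */
final class SmallSegmentSelector {

    /** A flushed segment, identified by name, with its size on disk in bytes. */
    static final class Segment {
        final String name;
        final long sizeBytes;

        Segment(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Returns the segments in the refresh snapshot whose size is below the threshold.
     * Only the snapshot is inspected, so segments flushed later are never candidates.
     */
    static List<Segment> selectSmall(List<Segment> snapshot, long thresholdBytes) {
        List<Segment> candidates = new ArrayList<>();
        for (Segment s : snapshot) {
            if (s.sizeBytes < thresholdBytes) {
                candidates.add(s);
            }
        }
        // Merging a single segment with nothing else gains nothing, so require at least two.
        return candidates.size() >= 2 ? candidates : new ArrayList<>();
    }

    public static void main(String[] args) {
        List<Segment> snapshot = new ArrayList<>();
        snapshot.add(new Segment("_0", 500L * 1024 * 1024)); // large, left alone
        snapshot.add(new Segment("_1", 40L * 1024));         // small
        snapshot.add(new Segment("_2", 12L * 1024));         // small
        for (Segment s : selectSmall(snapshot, 1024L * 1024)) {
            System.out.println(s.name);
        }
    }
}
```

In the real change, the equivalent decision is made by the merge policy; see {{MergePolicy.findFullFlushMerges}} for the actual hook.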

One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
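
The tricky part described above, building a point-in-time view that includes the newly merged segment but excludes newly flushed ones, can be sketched as follows. Again this is a toy model in plain Java under stated assumptions: {{PointInTimeView}} and {{resolve}} are hypothetical names, segments are represented only by their names, and a single finished merge is assumed. It is not how {{IndexWriter}} implements this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model only: names and types here are hypothetical, not Lucene API. */
final class PointInTimeView {

    /**
     * Builds the segment list a near-real-time reader would open.
     *
     * @param snapshot   segment names captured when the refresh flushed
     * @param mergedAway segments that a finished merge consumed
     * @param mergedName name of the segment that merge produced
     */
    static List<String> resolve(List<String> snapshot, Set<String> mergedAway, String mergedName) {
        // The merge must only have consumed snapshot segments; otherwise the merged
        // segment would leak documents flushed after the refresh point.
        if (!snapshot.containsAll(mergedAway)) {
            return new ArrayList<>(snapshot); // fall back to the unmerged view
        }
        List<String> view = new ArrayList<>();
        boolean mergedAdded = false;
        for (String name : snapshot) {
            if (mergedAway.contains(name)) {
                if (!mergedAdded) {
                    view.add(mergedName); // replaces its source segments, once
                    mergedAdded = true;
                }
            } else {
                view.add(name);
            }
        }
        // Segments flushed after the snapshot never appear: only the snapshot is iterated.
        return view;
    }
}
```

For example, with a snapshot of {{_0, _1, _2}} and a merge of {{_1}} and {{_2}} into {{_4}}, the resolved view is {{_0, _4}}, even if a segment {{_3}} was flushed while the merge ran.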

I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: master (9.0), 8.6, 8.7
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt, test.diff
>
>          Time Spent: 31h
>  Remaining Estimate: 0h
>
> Two improvements were added: 8.6 has merge-on-commit (by Froh et al.), 8.7 has merge-on-refresh (by Simon).  See {{MergePolicy.findFullFlushMerges}}.
> The original description follows:
> ----
> With near-real-time search we ask {{IndexWriter}} to write all in-memory segments to disk and open an {{IndexReader}} to search them, and this is typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} will write many small segments during {{refresh}}, and this then adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve {{IndexWriter}}'s refresh to optionally kick off the merge policy to merge segments below some threshold before opening the near-real-time reader?  It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!


