Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2021/07/19 14:38:00 UTC
[jira] [Created] (LUCENE-10029) Can we make refreshes cheaper via two-phase refresh?

Adrien Grand created LUCENE-10029:
-------------------------------------

             Summary: Can we make refreshes cheaper via two-phase refresh?
                 Key: LUCENE-10029
                 URL: https://issues.apache.org/jira/browse/LUCENE-10029
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Currently, our recommendation is to use something like {{SearcherManager}} to periodically refresh the current {{DirectoryReader}} asynchronously from searches.

Under the hood, refreshes call {{DirectoryReader#openIfChanged}}, which flushes all current {{DocumentsWriterPerThread}} (DWPT) instances so that pending changes become visible. For instance, a user who wants the view of the index to be at most 10s old could refresh every 10 seconds.

But refreshes incur an indexing penalty because they may cause arbitrarily small segments to be written, which in turn means that more merging will need to happen later to turn these small segments into larger ones. For data structures that are expensive to merge, such as n-dimensional points, which need to recompute the entire BKD tree, or stored fields, which might need to re-compress blocks of documents, this cost may be non-negligible.

I wonder if we could make this a bit better by making refreshes a two-phase operation. The first phase would get the list of all current DWPTs, and the second phase would flush those that haven't been flushed already.
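
To make the idea concrete, here is a minimal sketch of the two phases. It deliberately avoids Lucene's internals: {{Buffer}} is a hypothetical stand-in for a DWPT, and {{TwoPhaseRefresher}}, {{prepare}} and {{complete}} are invented names for illustration, not existing Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseRefreshSketch {

    /** Hypothetical stand-in for a DocumentsWriterPerThread (DWPT). */
    static final class Buffer {
        final String name;
        private boolean flushed;

        Buffer(String name) { this.name = name; }

        /** Flushes this buffer; returns true only if this call did the flush. */
        synchronized boolean flushIfNeeded() {
            if (flushed) return false;
            flushed = true;
            return true;
        }
    }

    /** Hypothetical two-phase refresher; not a Lucene API. */
    static final class TwoPhaseRefresher {
        private final List<Buffer> active = new ArrayList<>();
        private List<Buffer> snapshot = new ArrayList<>();

        synchronized Buffer newBuffer(String name) {
            Buffer b = new Buffer(name);
            active.add(b);
            return b;
        }

        /** Phase 1: record the buffers that exist right now. */
        synchronized void prepare() {
            snapshot = new ArrayList<>(active);
        }

        /** Phase 2: flush the recorded buffers that are still unflushed; returns their names. */
        synchronized List<String> complete() {
            List<String> flushedNow = new ArrayList<>();
            for (Buffer b : snapshot) {
                if (b.flushIfNeeded()) flushedNow.add(b.name);
                active.remove(b);
            }
            snapshot = new ArrayList<>();
            return flushedNow;
        }
    }

    public static void main(String[] args) {
        TwoPhaseRefresher r = new TwoPhaseRefresher();
        Buffer a = r.newBuffer("a");
        r.newBuffer("b");
        r.prepare();                      // phase 1 records a and b
        r.newBuffer("c");                 // created after phase 1, left alone
        a.flushIfNeeded();                // a fills up and flushes on its own
        System.out.println(r.complete()); // prints [b]
    }
}
```

In this toy run only {{b}} is flushed by the refresh: {{a}} had already been flushed on its own, and {{c}} is too young to be touched.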

For instance, taking again the example of a user who wants the current point-in-time view of the data to be at most 10s old, {{SearcherManager}} could be configured so that every 5 seconds it flushes all DWPTs that already existed 5 seconds earlier. This would give the same guarantee that the current point-in-time view of the data is at most 10s old, while also ensuring that we never flush a DWPT that was created less than 5 seconds ago.
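
The timing argument can be checked with a small calculation. This is only a sketch under simplifying assumptions: time is in whole milliseconds, refresh ticks fall on exact multiples of the interval, flushes are instantaneous, and no buffer is flushed early because of memory pressure. The class and method names are made up for illustration.

```java
public class TwoPhaseTiming {

    /** First refresh tick at or after time t, with ticks at multiples of interval. */
    static long firstTickAtOrAfter(long t, long interval) {
        return ((t + interval - 1) / interval) * interval;
    }

    /**
     * Time at which a buffer created at createdAt gets flushed: it is recorded at
     * the first tick at or after its creation and flushed one interval later.
     */
    static long flushTime(long createdAt, long interval) {
        return firstTickAtOrAfter(createdAt, interval) + interval;
    }

    public static void main(String[] args) {
        long interval = 5_000; // refresh tick every 5 seconds
        long maxStaleness = 0;
        long minAgeAtFlush = Long.MAX_VALUE;
        for (long createdAt = 0; createdAt < 20_000; createdAt++) {
            // The oldest document in a buffer is as old as the buffer itself, so the
            // buffer's age at flush is also the worst-case staleness it can cause.
            long ageAtFlush = flushTime(createdAt, interval) - createdAt;
            maxStaleness = Math.max(maxStaleness, ageAtFlush);
            minAgeAtFlush = Math.min(minAgeAtFlush, ageAtFlush);
        }
        System.out.println("max staleness = " + maxStaleness
            + " ms, min buffer age at flush = " + minAgeAtFlush + " ms");
        // prints: max staleness = 9999 ms, min buffer age at flush = 5000 ms
    }
}
```

Under these assumptions every buffer is between 5s and just under 10s old when the refresh flushes it, which matches both halves of the claim above.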

At this point this is only theoretical: I haven't checked whether this would actually help in practice. It would likely only help when indexing is fast enough, or the indexing buffer small enough, that DWPTs naturally get flushed because of memory usage between consecutive refreshes.


