Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/07/30 13:16:00 UTC

[jira] [Commented] (LUCENE-9507) Custom order for leaves in DirectoryReader, IndexWriter and searcher

    [ https://issues.apache.org/jira/browse/LUCENE-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390571#comment-17390571 ] 

ASF subversion and git services commented on LUCENE-9507:
---------------------------------------------------------

Commit 1daf7e7c742cf53cb62a55bc3993a76d878e3223 in lucene's branch refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1daf7e7 ]

LUCENE-10027 provide leaf sorter from commit (#214)

Provide leaf sorter for directory readers opened from IndexCommit

LUCENE-9507 made it possible to provide a leaf sorter for directory readers.
One API that was missed is the one for directory readers opened
from an index commit.
This patch addresses this by adding an extra parameter, a custom
comparator for sorting leaf readers, to the DirectoryReader open API
that takes an indexCommit and minSupportedMajorVersion.
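
To illustrate the kind of comparator such a parameter accepts, here is a minimal sketch that runs without Lucene on the classpath. The Leaf class, the NEWEST_FIRST constant and the orderLeaves helper are hypothetical stand-ins introduced for this example; in Lucene the comparator would be defined over leaf readers, and the reader itself would apply it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LeafSorterSketch {

    /** Hypothetical stand-in for a leaf reader; it only carries what the comparator needs. */
    static final class Leaf {
        final String name;
        final long maxTimestamp;

        Leaf(String name, long maxTimestamp) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /** Orders leaves so that the one holding the highest timestamp comes first. */
    static final Comparator<Leaf> NEWEST_FIRST =
            Comparator.comparingLong((Leaf leaf) -> leaf.maxTimestamp).reversed();

    /** Returns a sorted copy of the leaves, mimicking what a reader could do with the sorter. */
    static List<Leaf> orderLeaves(List<Leaf> leaves, Comparator<Leaf> leafSorter) {
        List<Leaf> sorted = new ArrayList<>(leaves);
        sorted.sort(leafSorter);
        return sorted;
    }
}
```

Passing the same comparator to every place a reader is opened, including readers opened from a commit, keeps the leaf order consistent across all of them.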

Relates to PR #32

> Custom order for leaves in DirectoryReader, IndexWriter and searcher
> --------------------------------------------------------------------
>
>                 Key: LUCENE-9507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9507
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Jim Ferenczi
>            Priority: Minor
>             Fix For: main (9.0), 8.9
>
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> Now that we're able [to skip documents efficiently when sorting by a numeric field|https://issues.apache.org/jira/browse/LUCENE-9280], I was wondering if we could optimize sorted queries further by also sorting the leaf readers based on the primary sort.
> For time-based indices in Elasticsearch, we've implemented an optimization that does that at query time. If the query is sorted by a numeric docvalue field, we sort the leaves according to the query sort before searching. When sorting by timestamp, this small optimization can have a big impact, since early termination can be reached much faster if the sort values in the segments don't overlap too much. Applying this optimization at query time is challenging: it has the benefit of working with any numeric sort field and sort order, but it requires a multi-reader that reorganizes the segments. It can also be disappointing that sorted queries may become slower after a force merge to a single segment, since there is nothing left to sort.
> So, another option that I am looking at is to add the ability to provide a leaf order directly in the IndexWriter and DirectoryReader. That could be similar to an index sort, or even complementary to it, since sorting segments based on the index sort could also help at query time. For time-based indices that cannot afford index sorting but have lots of sorted queries on timestamp, forcing the order of segments could speed up sorted queries significantly.
> The advantage of forcing a single leaf sort in the writer/reader is that we can also use it to influence the merges by putting the segments with the highest values first. That would help indices that are merged to a single segment but still need fast sorted queries, and it would also help the multi-segment case, since big segments would be more likely to have the highest values first too.
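
The early-termination effect described in the quoted issue can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions: Segment and segmentsScanned are hypothetical names, each segment is reduced to an array of timestamps plus its maximum, and a leaf is skipped whenever its maximum cannot beat the weakest hit already collected. Lucene's actual collectors are more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class EarlyTerminationSketch {

    /** Hypothetical stand-in for a segment: the timestamps it holds and their maximum. */
    static final class Segment {
        final long[] timestamps;
        final long max;

        Segment(long... timestamps) {
            this.timestamps = timestamps;
            this.max = Arrays.stream(timestamps).max().orElse(Long.MIN_VALUE);
        }
    }

    /**
     * Collects the k largest timestamps (k must be positive), visiting leaves in the
     * given order, and returns how many leaves actually had to be scanned.
     */
    static int segmentsScanned(List<Segment> leaves, int k) {
        // Min-heap: the head is the weakest hit currently kept in the top k.
        PriorityQueue<Long> topK = new PriorityQueue<>();
        int scanned = 0;
        for (Segment leaf : leaves) {
            // Skip the leaf when the queue is full and nothing in the leaf can beat the weakest hit.
            if (topK.size() == k && leaf.max <= topK.peek()) {
                continue;
            }
            scanned++;
            for (long ts : leaf.timestamps) {
                if (topK.size() < k) {
                    topK.add(ts);
                } else if (ts > topK.peek()) {
                    topK.poll();
                    topK.add(ts);
                }
            }
        }
        return scanned;
    }
}
```

With three non-overlapping segments holding timestamps 1 to 3, 4 to 6 and 7 to 9, asking for the top 2 scans all three segments when they are visited oldest first, but only one when the newest segment is visited first.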



--
This message was sent by Atlassian Jira
(v8.3.4#803005)