You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2022/02/01 11:12:00 UTC
[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

    [ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485178#comment-17485178 ] 

Robert Muir commented on LUCENE-10396:
--------------------------------------

an entire new file format abstraction for this one slow range query? is there any other use-case for this data? I'm concerned about complexity, this seems VERY specialized.

> Automatically create sparse indexes for sort fields
> ---------------------------------------------------
>
>                 Key: LUCENE-10396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10396
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to be able to early terminate sorted queries, but as a way to cluster doc IDs that share similar properties so that queries can take advantage of it. For instance imagine you're maintaining a catalog of cars for sale, by sorting by car type, then fuel type then price. Then all cars with the same type, fuel type and similar prices will be stored in a contiguous range of doc IDs. Without index sorting, conjunctions across these 3 fields would be almost a worst-case scenario as every clause might match lots of documents while their intersection might not. With index sorting enabled however, there's only a very small number of calls to advance() that would lead to doc IDs that do not match, because these advance() calls that do not lead to a match would always jump over a large number of doc IDs. I had created that example for ApacheCon last year that demonstrates the benefits of index sorting on conjunctions. In both cases, the index is storing the same data, it just gets different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there is a lot more we can do by taking advantage of the index sort explicitly. For instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range queries on fields that are primary sort fields by performing a binary search to identify the first and last documents that match the range query. I would like to introduce [sparse indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields that are used for index sorting, with the goal of improving the runtime of {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive and making it easier and more efficient to leverage index sorting to filter on subsequent sort fields. A simple form of a sparse index could consist of storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we wouldn't need to make it configurable and could enable it automatically as soon as index sorting is enabled. And it would get its own file format abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org