You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2022/02/01 07:32:00 UTC
[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields

    [ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484830#comment-17484830 ] 

Adrien Grand edited comment on LUCENE-10396 at 2/1/22, 7:31 AM:
----------------------------------------------------------------

One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {

  /**
   * Get an iterator over ranges of doc IDs that may contain values that fall in the given range of values.
   */
  DocIdRangeIterator getNumeric(FieldInfo field, long rangeMin, long rangeMax);

  DocIdRangeIterator getSorted(FieldInfo field, long rangeMinOrd, long rangeMaxOrd);

  DocIdRangeIterator getBinary(FieldInfo field, BytesRef rangeMin, BytesRef rangeMax);
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return the next range of doc IDs so that range.min is greater than or equal to the target.
   */
  DocIdRange advance(int target);

}
{code}


was (Author: jpountz):
One interesting question about this will be the read API. What I have in mind for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator()
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return a range of doc IDs that may contain {@code value} as the value of the sortPos-th sort field.
   * All documents between target included and {@code range.first} excluded are guaranteed to not
   * contain {@code value}. {@code range.first} equal to NO_MORE_DOCS is used to signal that no more
   * documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /**
   * Likewise for SORTED doc values.
   */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /**
   * Likewise for BINARY doc values.
   */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);

}
{code}

> Automatically create sparse indexes for sort fields
> ---------------------------------------------------
>
>                 Key: LUCENE-10396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10396
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to be able to early terminate sorted queries, but as a way to cluster doc IDs that share similar properties so that queries can take advantage of it. For instance imagine you're maintaining a catalog of cars for sale, by sorting by car type, then fuel type then price. Then all cars with the same type, fuel type and similar prices will be stored in a contiguous range of doc IDs. Without index sorting, conjunctions across these 3 fields would be almost a worst-case scenario as every clause might match lots of documents while their intersection might not. With index sorting enabled however, there's only a very small number of calls to advance() that would lead to doc IDs that do not match, because these advance() calls that do not lead to a match would always jump over a large number of doc IDs. I had created that example for ApacheCon last year that demonstrates the benefits of index sorting on conjunctions. In both cases, the index is storing the same data, it just gets different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there is a lot more we can do by taking advantage of the index sort explicitly. For instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range queries on fields that are primary sort fields by performing a binary search to identify the first and last documents that match the range query. I would like to introduce [sparse indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields that are used for index sorting, with the goal of improving the runtime of {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive and making it easier and more efficient to leverage index sorting to filter on subsequent sort fields. A simple form of a sparse index could consist of storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we wouldn't need to make it configurable and could enable it automatically as soon as index sorting is enabled. And it would get its own file format abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org