You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Ignacio Vera (Jira)" <ji...@apache.org> on 2022/06/23 15:02:00 UTC

[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields

    [ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558103#comment-17558103 ] 

Ignacio Vera edited comment on LUCENE-10396 at 6/23/22 3:01 PM:
----------------------------------------------------------------

I have been thinking on the ability if visiting a single documents per field value as explained for Adrien above and I think we can implement it in a not intrusive way for SortedDocValues that have low cardinality. The idea is to add the following method to the SortedDocValues api with a default implementation:

{code}
  /**
   * Advances the iterator the the next document whose ordinal is distinct to the current ordinal
   * and returns the document number itself. It returns {@link
   * #NO_MORE_DOCS} if there is no more ordinals distinct to the current one left.
   *
   * <p>The behaviour of this method is <b>undefined</b> when called after the iterator has exhausted and may result in unpredicted behaviour.
   *  
   * The default implementation just iterates over the documents and manually checks if the ordinal has changed but
   *  but some implementations are  considerably more efficient than that.
   *
   */
  public int advanceOrd() throws IOException {
    int doc = docID();
    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
      return doc;
    }
    final long ord = ordValue();
    do  {
      doc = nextDoc();
    } while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
    assert doc == docID();
    return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of he index and the cardinality is low (average of documents per field >64?), then we will use a DirectMonotonicWriter to write the offset for each ord which should not use too much disk space. When producing the doc value, we will override this method with a faster DirectMonotonicReader implementation.





was (Author: ivera):
I have been thinking on the ability if visiting a single documents per field value as explained for Adrien above and I think we can implement it in a not intrusive way for SortedDocValues that have low cardinality. The idea is to add the following method to the SortedDocValues api with a default implementation:

{code}
  /**
   * Advances the iterator the the next document whose ordinal is distinct to the current ordinal
   * and returns the document number itself. Exhausts the iterator and returns {@link
   * #NO_MORE_DOCS} if there is no more ordinals distinct to the current one.
   *
   * <p>The behaviour of this method is <b>undefined</b> when called with <code> target &le; current
   * </code>, or after the iterator has exhausted. Both cases may result in unpredicted behaviour.
   *  
   * The default implementation just iterates over the documents and manually checks if the ordinal has changed but
   *  but some implementations are  considerably more efficient than that.
   *
   */
  public int advanceOrd() throws IOException {
    int doc = docID();
    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
      return doc;
    }
    final long ord = ordValue();
    do  {
      doc = nextDoc();
    } while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
    assert doc == docID();
    return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of he index and the cardinality is low (average of documents per field >64?), then we will use a DirectMonotonicWriter to write the offset for each ord which should not use too much disk space. When producing the doc value, we will override this method with a faster DirectMonotonicReader implementation.




> Automatically create sparse indexes for sort fields
> ---------------------------------------------------
>
>                 Key: LUCENE-10396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10396
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to be able to early terminate sorted queries, but as a way to cluster doc IDs that share similar properties so that queries can take advantage of it. For instance imagine you're maintaining a catalog of cars for sale, by sorting by car type, then fuel type then price. Then all cars with the same type, fuel type and similar prices will be stored in a contiguous range of doc IDs. Without index sorting, conjunctions across these 3 fields would be almost a worst-case scenario as every clause might match lots of documents while their intersection might not. With index sorting enabled however, there's only a very small number of calls to advance() that would lead to doc IDs that do not match, because these advance() calls that do not lead to a match would always jump over a large number of doc IDs. I had created that example for ApacheCon last year that demonstrates the benefits of index sorting on conjunctions. In both cases, the index is storing the same data, it just gets different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there is a lot more we can do by taking advantage of the index sort explicitly. For instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range queries on fields that are primary sort fields by performing a binary search to identify the first and last documents that match the range query. I would like to introduce [sparse indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields that are used for index sorting, with the goal of improving the runtime of {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive and making it easier and more efficient to leverage index sorting to filter on subsequent sort fields. A simple form of a sparse index could consist of storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we wouldn't need to make it configurable and could enable it automatically as soon as index sorting is enabled. And it would get its own file format abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org