You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/24 12:44:57 UTC

[GitHub] [lucene] iverase opened a new pull request, #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues

iverase opened a new pull request, #979:
URL: https://github.com/apache/lucene/pull/979

   This PR proposes to add a new method to SortedDocValues that helps users to advance an iterator to the next document that contains a different term that the current document, which can be specially useful when the index is sorted by this field. 
   
   The method contains a default implementation but this PR produces as well a fast implementation when the index is sorted by this field and it has low cardinality. In this case we write to disk a jump table that allows to quickly skip documents instead of manually iterating through the docs. 
   
   In https://issues.apache.org/jira/browse/LUCENE-10396 it is discussed some of the use cases where this method can be used, for example computing the number of unique values for documents that match a query. On the other hand, it diverges from the sparse index approach but as this ids less intrusive, it seems appealing.
   
   Note that in order to handle backwards compatibility, I have increase the version of the codec instead of creating a new one.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] iverase commented on pull request #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues

Posted by "iverase (via GitHub)" <gi...@apache.org>.
iverase commented on PR #979:
URL: https://github.com/apache/lucene/pull/979#issuecomment-1490608990

   Sure!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] iverase commented on pull request #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues

Posted by GitBox <gi...@apache.org>.
iverase commented on PR #979:
URL: https://github.com/apache/lucene/pull/979#issuecomment-1168539694

   I make a quick check if this patch by indexing 50 million documents in a sorted index. The documents just contain a SortedDocValues with a 10 bytes term. I checked the index size and the speed of retrieving the first document per term with different cardinalities and the results looks like:
   
   Cardinality ~1000
   ```
                          |  without patch       | with patch           
   Index Size (MB)        |  2.800084114074707   |  2.8039379119873047 
   average advanceOrd (ms)|  0.39255053534999995 |  0.0011012437999999999
   ```
   
   Cardinality ~10000
   ```
                          |  without patch       | with patch                             
   Index Size (MB)        |  16.125946044921875  |  16.164132118225098
   average advanceOrd (ms)|  0.52939177705       |  0.01008831655
   ```
   
   Cardinality ~10000
   ```
                          |  without patch       | with patch           
   Index Size (MB)        | 49.320682525634766   |  49.57721138000488
   average advanceOrd (ms)|  0.5479114709999999  |  0.03804306865
   ```
   Cardinality ~50000
   ```
                          |  without patch       | with patch           
   Index Size (MB)        |   52.81498718261719  |  53.66002082824707 
   average advanceOrd (ms)|   0.6515335270999999 |  0.06898821255000001
   ```
   
   The new jump table is tiny compared to the size of the doc value while this new way of navigation os at least one order of magnitude faster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #979:
URL: https://github.com/apache/lucene/pull/979#issuecomment-1490487671

   @iverase I was playing with this idea a little bit for a use-case I'm working on. It didn't pan out unfortunately, but in the process, I did take the time to rebase this change on the tip of `main`. Do you mind if I push the rebase to your PR branch since I did the work to rebase?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org