Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/24 12:44:57 UTC
[GitHub] [lucene] iverase opened a new pull request, #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues
iverase opened a new pull request, #979:
URL: https://github.com/apache/lucene/pull/979
This PR proposes adding a new method to SortedDocValues that lets users advance an iterator to the next document whose term differs from that of the current document, which can be especially useful when the index is sorted by this field.
The method has a default implementation, but this PR also provides a fast implementation for the case where the index is sorted by this field and the field has low cardinality. In that case we write a jump table to disk that allows skipping documents quickly instead of iterating through them one by one.
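To illustrate the difference between the two strategies, here is a minimal, self-contained sketch. It is not the code from this PR and does not use Lucene classes: the doc values are modelled as a plain `int[]` of ords indexed by docID, every document is assumed to have a value, the index is assumed to be sorted by the field (so ords are non-decreasing), and every ord is assumed to be used by at least one document. The names `AdvanceOrdSketch`, `advanceOrdLinear`, `buildJumpTable` and `advanceOrdJump` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch, not the PR's implementation: ords[doc] is the ord of document `doc`.
final class AdvanceOrdSketch {
  // Sentinel meaning "no further document", chosen here for illustration.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // Fallback strategy: scan forward until the ord changes.
  static int advanceOrdLinear(int[] ords, int doc) {
    int current = ords[doc];
    for (int next = doc + 1; next < ords.length; next++) {
      if (ords[next] != current) {
        return next;
      }
    }
    return NO_MORE_DOCS;
  }

  // Jump table: firstDoc[ord] is the smallest docID carrying that ord.
  // Assumes the index is sorted by the field, so each ord forms one contiguous run.
  static int[] buildJumpTable(int[] ords, int valueCount) {
    int[] firstDoc = new int[valueCount];
    Arrays.fill(firstDoc, NO_MORE_DOCS);
    // Iterating backwards means the last write for an ord is its smallest docID.
    for (int doc = ords.length - 1; doc >= 0; doc--) {
      firstDoc[ords[doc]] = doc;
    }
    return firstDoc;
  }

  // Fast strategy: one table lookup instead of a scan.
  static int advanceOrdJump(int[] ords, int[] firstDoc, int doc) {
    int nextOrd = ords[doc] + 1;
    return nextOrd < firstDoc.length ? firstDoc[nextOrd] : NO_MORE_DOCS;
  }

  public static void main(String[] args) {
    int[] ords = {0, 0, 0, 1, 1, 2, 2, 2, 2};
    int[] table = buildJumpTable(ords, 3);
    // Visit the first document of every ord using the jump table.
    for (int doc = 0; doc != NO_MORE_DOCS; doc = advanceOrdJump(ords, table, doc)) {
      System.out.println("first doc for ord " + ords[doc] + ": " + doc);
    }
  }
}
```

In this model the table holds one entry per distinct value, which is why it stays small when the cardinality is low.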
Some of the use cases for this method are discussed in https://issues.apache.org/jira/browse/LUCENE-10396, for example computing the number of unique values for documents that match a query. On the other hand, it diverges from the sparse index approach, but as this is less intrusive, it seems appealing.
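As a rough illustration of the unique-values use case, the sketch below counts distinct ords among matching documents by alternating between "next matching doc" and "next doc with a different ord". It is a simplified model rather than Lucene code: query matches are represented by a `java.util.BitSet` of docIDs, the index is assumed to be sorted by the field, and the advance operation is passed in as a hypothetical `IntUnaryOperator` (here backed by a linear scan over an `int[]` of ords). All names are made up for the example.

```java
import java.util.BitSet;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: count distinct ords among documents matching a query.
final class UniqueOrdCountSketch {
  // Sentinel meaning "no further document", chosen here for illustration.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // advanceOrd maps a docID to the next docID whose ord differs, or NO_MORE_DOCS.
  // Correct only if the index is sorted by the field, so each ord is one contiguous run.
  static int countUnique(BitSet matches, IntUnaryOperator advanceOrd) {
    int count = 0;
    int doc = matches.nextSetBit(0);
    while (doc >= 0) {
      count++; // first matching doc seen for this ord
      int next = advanceOrd.applyAsInt(doc);
      if (next == NO_MORE_DOCS) {
        break;
      }
      // Skip the rest of the current run, then find the next matching doc.
      doc = matches.nextSetBit(next);
    }
    return count;
  }

  // Stand-in for the advance operation: a linear scan over ords[doc].
  static IntUnaryOperator linearAdvance(int[] ords) {
    return doc -> {
      for (int next = doc + 1; next < ords.length; next++) {
        if (ords[next] != ords[doc]) {
          return next;
        }
      }
      return NO_MORE_DOCS;
    };
  }

  public static void main(String[] args) {
    int[] ords = {0, 0, 0, 1, 1, 2, 2, 2, 2};
    BitSet matches = new BitSet();
    matches.set(1);
    matches.set(2);
    matches.set(6);
    matches.set(7);
    System.out.println(countUnique(matches, linearAdvance(ords)));
  }
}
```

The loop touches at most one matching document per distinct value, so its cost depends on how cheap the advance operation is rather than on the number of matching documents.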
Note that in order to handle backwards compatibility, I have increased the version of the codec instead of creating a new one.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] iverase commented on pull request #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues
Posted by "iverase (via GitHub)" <gi...@apache.org>.
iverase commented on PR #979:
URL: https://github.com/apache/lucene/pull/979#issuecomment-1490608990
Sure!
[GitHub] [lucene] iverase commented on pull request #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues
Posted by GitBox <gi...@apache.org>.
iverase commented on PR #979:
URL: https://github.com/apache/lucene/pull/979#issuecomment-1168539694
I ran a quick check of this patch by indexing 50 million documents in a sorted index. Each document contains just a SortedDocValues field with a 10-byte term. I measured the index size and the speed of retrieving the first document per term at different cardinalities, and the results look like this:
Cardinality ~1000
```
                        | without patch       | with patch
Index Size (MB)         | 2.800084114074707   | 2.8039379119873047
average advanceOrd (ms) | 0.39255053534999995 | 0.0011012437999999999
```
Cardinality ~10000
```
                        | without patch      | with patch
Index Size (MB)         | 16.125946044921875 | 16.164132118225098
average advanceOrd (ms) | 0.52939177705      | 0.01008831655
```
Cardinality ~10000
```
                        | without patch      | with patch
Index Size (MB)         | 49.320682525634766 | 49.57721138000488
average advanceOrd (ms) | 0.5479114709999999 | 0.03804306865
```
Cardinality ~50000
```
                        | without patch      | with patch
Index Size (MB)         | 52.81498718261719  | 53.66002082824707
average advanceOrd (ms) | 0.6515335270999999 | 0.06898821255000001
```
The new jump table is tiny compared to the size of the doc values (the index grows by less than 2% in every run above), while this new way of navigating is roughly 9x (cardinality ~50000) to roughly 350x (cardinality ~1000) faster.
[GitHub] [lucene] gsmiller commented on pull request #979: LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #979:
URL: https://github.com/apache/lucene/pull/979#issuecomment-1490487671
@iverase I was playing with this idea a little bit for a use-case I'm working on. It didn't pan out unfortunately, but in the process, I did take the time to rebase this change on the tip of `main`. Do you mind if I push the rebase to your PR branch since I did the work to rebase?