You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/09 12:18:34 UTC

[GitHub] [lucene] gf2121 opened a new pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

gf2121 opened a new pull request #666:
URL: https://github.com/apache/lucene/pull/666


   I recently improved DocIdsWriter for the case when doc IDs are dense and come in the same order as values via the CONTINUOUS_IDS and BITSET_IDS encodings.
   
   We could do the same for the case when doc IDs come in the opposite order to values. This would be used whenever searching on a field that is used for index sorting in the descending order. This would be a frequent case for Elasticsearch users as we're planning on using index sorting more and more on time-based data with a descending sort on the timestamp as the last sort field.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] iverase commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
iverase commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802644513



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       I am confused here, I think emitting docs in the same order as they were written is important?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] iverase commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
iverase commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802747917



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       Looking into the contract of IntersectVisitor#visit(int docID), it does not have any mention on the order so we are not breaking the contract, still this might be surprising depending of you r index sort. Let's see what other people thinks I am not sure of the right answer here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gf2121 commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
gf2121 commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802715789



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       Thanks @iverase for feedback!
   
   There are two methods to decode docIDs in DocIdsWriter: 
   
   * `#readInts(IndexInput, int, int[])` for cases that point values are **different** in one leaf node.
   * `#readInts(IndexInput, int, IntersectVisitor)`  for cases that point values are **same** in one leaf node.
   
   This is changing the test of `#readInts(IndexInput, int, IntersectVisitor)`, which means all docs in the leaf node has the same value. So the order of ids could probably be ignored? As the method we added before in is `#visit(DocIdSetIterator iter)` and docIDs in `DocIdSetIterator` must be in increasing order. So i think we need to either change the `IntersectVisitor` interface (a bit more invasive) or ignore order to introduce this optimization. 
   
   What do you think?
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gf2121 commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
gf2121 commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802715789



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       Thanks @iverase for feedback!
   
   There are two methods to decode docIDs in DocIdsWriter: 
   
   * `#readInts(IndexInput, int, int[])` for cases that docs are **partial** required in one leaf node.
   * `#readInts(IndexInput, int, IntersectVisitor)`  for cases that docs are **all** required in one leaf node.
   
   This is changing the test of `#readInts(IndexInput, int, IntersectVisitor)`, which means all docs in the leaf node are required. So the order of ids could probably be ignored? 
   
   I'm changing this because the method we added before is `#visit(DocIdSetIterator iter)`, and docIDs in `DocIdSetIterator` must be in increasing order. So i think we need to either change the `IntersectVisitor` interface (a bit more invasive) or ignore order to introduce this optimization. 
   
   What do you think?
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gf2121 commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
gf2121 commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802715789



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       Thanks @iverase for feedback!
   
   There are two methods to decode docIDs in DocIdsWriter: 
   
   * `#readInts(IndexInput, int, int[])` for cases that point values are **different** in one leaf node.
   * `#readInts(IndexInput, int, IntersectVisitor)`  for cases that point values are **same** in one leaf node.
   
   This is changing the test of `#readInts(IndexInput, int, IntersectVisitor)`, which means all docs in the leaf node have the same value. So the order of ids could probably be ignored? As the method we added before in is `#visit(DocIdSetIterator iter)` and docIDs in `DocIdSetIterator` must be in increasing order. So i think we need to either change the `IntersectVisitor` interface (a bit more invasive) or ignore order to introduce this optimization. 
   
   What do you think?
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gf2121 commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
gf2121 commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802715789



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       Thanks @iverase for feedback!
   
   There are two methods to decode docIDs in DocIdsWriter: 
   
   * `#readInts(IndexInput, int, int[])` for cases that docs are **partial** required in one leaf node.
   * `#readInts(IndexInput, int, IntersectVisitor)`  for cases that docs are **all** required in one leaf node.
   
   This is changing the test of `#readInts(IndexInput, int, IntersectVisitor)`, which means all docs in the leaf node are required. So the order of ids could probably be ignored? As the method we added before is `#visit(DocIdSetIterator iter)`, and docIDs in `DocIdSetIterator` must be in increasing order. So i think we need to either change the `IntersectVisitor` interface (a bit more invasive) or ignore order to introduce this optimization. 
   
   What do you think?
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gf2121 commented on a change in pull request #666: LUCENE-10409: Improve BKDWriter's DocIdsWriter to better encode decreasing sequences of doc IDs

Posted by GitBox <gi...@apache.org>.
gf2121 commented on a change in pull request #666:
URL: https://github.com/apache/lucene/pull/666#discussion_r802715789



##########
File path: lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java
##########
@@ -166,6 +184,9 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
               throw new UnsupportedOperationException();
             }
           });
+      // ignore order
+      Arrays.sort(ints);

Review comment:
       Thanks @iverase for feedback!
   
   There are two methods to decode docIDs in DocIdsWriter: 
   
   * `#readInts(IndexInput, int, int[])` for cases that point values are **different** in one leaf node.
   * `#readInts(IndexInput, int, IntersectVisitor)`  for cases that point values are **same** in one leaf node.
   
   This is changing the test of `#readInts(IndexInput, int, IntersectVisitor)`, which means all docs in the leaf node have the same value. So the order of ids could probably be ignored? As the method we added before is `#visit(DocIdSetIterator iter)`, and docIDs in `DocIdSetIterator` must be in increasing order. So i think we need to either change the `IntersectVisitor` interface (a bit more invasive) or ignore order to introduce this optimization. 
   
   What do you think?
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org