You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/11 00:37:39 UTC

[GitHub] [lucene] goankur opened a new pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

goankur opened a new pull request #293:
URL: https://github.com/apache/lucene/pull/293


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   Following `Facets` implementations are changed to ignore deleted documents for `count all` queries that don't use `FacetsCollector` instance.
     - SortedSetDocValueFacetCounts
     - ConcurrentSortedSetDocValueFacetCounts
     - LongValueFacetCounts
     - StringValueFacetCounts 
   
   # Solution
   
   Ignore deleted documents during docValues iteration to calculate facet label counts
   
   # Tests
   
   Following test methods have been copied from pull request
    https://github.com/apache/lucene/pull/263/files (Thanks Greg Miller)
     - TestLongValueFacetCounts.testCountAll() 
     - TestStringValueFacetCount.testCountAll()
     - TestSortedSetDocValuesFacets.testCountAll()
     
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request title.
   - [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `main` branch.
   - [X] I have run `./gradlew check`.
   - [X] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #293:
URL: https://github.com/apache/lucene/pull/293#discussion_r708632754



##########
File path: lucene/core/src/java/org/apache/lucene/search/ConjunctionUtils.java
##########
@@ -97,4 +99,46 @@ public static void addIterator(
       List<TwoPhaseIterator> twoPhaseIterators) {
     ConjunctionDISI.addIterator(disi, allIterators, twoPhaseIterators);
   }
+
+  /**
+   * Wrap the given DocIdSetIterator and liveDocs into another DocIdSetIterator that returns
+   * non-deleted documents during iteration. This is useful for match-all style queries that need to
+   * exclude deleted documents.
+   *
+   * @param it {@link DocIdSetIterator} being wrapped
+   * @param liveDocs {@link Bits} containing set bits for non-deleted docs
+   * @return wrapped iterator
+   */
+  public static DocIdSetIterator liveDocsDISI(DocIdSetIterator it, Bits liveDocs) {

Review comment:
       Hmm, a couple thoughts here.
   1. I really like this approach of creating a DISI that accounts for deleted docs! Clean way of solving this.
   2. I wonder if this is general-purpose enough to expose here, or if we should keep it just for faceting for now (e.g., in a pkg-private class within the faceting module). Might be worth waiting to expose it in a public API until there are more specific use-cases to justify it.
   3. Could you take a look at the way `ConjunctionDISI` implements next/advance? I think it might be a good pattern to follow in this implementation, specifically the [doNext](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L166) pattern. I'm not 100% sure, but I _think_ this might explode if you called advance for a doc that's not present, that sends you to NO_MORE_DOCS. (Generalizing this in a public API will require rigorous testing and handling use-cases that may not come up in faceting).

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -152,7 +153,8 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc
   }
 
   private void countOneSegment(
-      OrdinalMap ordinalMap, LeafReader reader, int segOrd, MatchingDocs hits) throws IOException {

Review comment:
       I like how you were able to extend this functionality to work for both the "count all" and "typical" counting cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #293:
URL: https://github.com/apache/lucene/pull/293#discussion_r709691554



##########
File path: lucene/core/src/java/org/apache/lucene/search/ConjunctionUtils.java
##########
@@ -97,4 +99,46 @@ public static void addIterator(
       List<TwoPhaseIterator> twoPhaseIterators) {
     ConjunctionDISI.addIterator(disi, allIterators, twoPhaseIterators);
   }
+
+  /**
+   * Wrap the given DocIdSetIterator and liveDocs into another DocIdSetIterator that returns
+   * non-deleted documents during iteration. This is useful for match-all style queries that need to
+   * exclude deleted documents.
+   *
+   * @param it {@link DocIdSetIterator} being wrapped
+   * @param liveDocs {@link Bits} containing set bits for non-deleted docs
+   * @return wrapped iterator
+   */
+  public static DocIdSetIterator liveDocsDISI(DocIdSetIterator it, Bits liveDocs) {

Review comment:
       I created a new class - `org.apache.lucene.facet.FacetUtils` and moved the `liveDocsDISI(...)` method to  it. 
   I modified the implementation to model the  [doNext](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L166) pattern as suggested. 
   
   Both `advance(int docId)` and `nextDoc(int docId)` method now delegate to the corresponding methods of the wrapped DISI before invoking `doNext(int docId)`  which skips the docId until it finds one that is in the liveDocs or the wrapped DISI runs out of documents (i.e NO_MORE_DOCS).
   
   Unfortunately that class cannot be package private and that limits its visibility to other classes in a sub-package. Also added a new test case - `org.apache.lucene.facet.TestFacetUtils`
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #293:
URL: https://github.com/apache/lucene/pull/293#discussion_r709691554



##########
File path: lucene/core/src/java/org/apache/lucene/search/ConjunctionUtils.java
##########
@@ -97,4 +99,46 @@ public static void addIterator(
       List<TwoPhaseIterator> twoPhaseIterators) {
     ConjunctionDISI.addIterator(disi, allIterators, twoPhaseIterators);
   }
+
+  /**
+   * Wrap the given DocIdSetIterator and liveDocs into another DocIdSetIterator that returns
+   * non-deleted documents during iteration. This is useful for match-all style queries that need to
+   * exclude deleted documents.
+   *
+   * @param it {@link DocIdSetIterator} being wrapped
+   * @param liveDocs {@link Bits} containing set bits for non-deleted docs
+   * @return wrapped iterator
+   */
+  public static DocIdSetIterator liveDocsDISI(DocIdSetIterator it, Bits liveDocs) {

Review comment:
       I created a new class - `org.apache.lucene.facet.FacetUtils` and moved the `liveDocsDISI(...)` method to  it. 
   I modified the implementation to model the  [doNext](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L166) pattern as suggested. 
   
   Both `advance(int docId)` and `nextDoc(int docId)` method now delegate to the corresponding methods of the wrapped DISI before invoking `doNext(int docId)`  which skips the docId until it finds one that is in the liveDocs or the wrapped DISI runs out of documents (i.e NO_MORE_DOCS).
   
   Unfortunately that class cannot be package private as that limits its visibility to other classes in a sub-package. Also added a new test case - `org.apache.lucene.facet.TestFacetUtils`
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #293:
URL: https://github.com/apache/lucene/pull/293#discussion_r710560496



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetUtils.java
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/** Utility class with a single method for getting a DocIdSetIterator that skips deleted docs */
+public final class FacetUtils {

Review comment:
       Thanks for going to all the work to break this out and add tests!
   
   Could you add `@lucene.experimental` to the javadoc for this class? Ideally we could make this pkg-private but I understand why that's not possible with the facet sub-packages. I'd like to avoid a situation where this becomes part of our public API though and we have to start worrying about backwards compatibility and such if we want to tweak or change it, since it's really an internal helper for faceting. By marking it "experimental", I think that gives us some flexibility to worry less about backwards compatibility here (see: https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #293:
URL: https://github.com/apache/lucene/pull/293#discussion_r710606591



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetUtils.java
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.facet;
+
+import java.io.IOException;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/** Utility class with a single method for getting a DocIdSetIterator that skips deleted docs */
+public final class FacetUtils {

Review comment:
       Thanks for reviewing this @gsmiller. I agree on marking this as experimental giving us flexibility to worry less about backwards compatibility.  Added `@lucene.experimental` to the javadoc for this class as suggested.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on pull request #293:
URL: https://github.com/apache/lucene/pull/293#issuecomment-921966795


   Merged. Thanks @goankur !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #293:
URL: https://github.com/apache/lucene/pull/293#discussion_r709691554



##########
File path: lucene/core/src/java/org/apache/lucene/search/ConjunctionUtils.java
##########
@@ -97,4 +99,46 @@ public static void addIterator(
       List<TwoPhaseIterator> twoPhaseIterators) {
     ConjunctionDISI.addIterator(disi, allIterators, twoPhaseIterators);
   }
+
+  /**
+   * Wrap the given DocIdSetIterator and liveDocs into another DocIdSetIterator that returns
+   * non-deleted documents during iteration. This is useful for match-all style queries that need to
+   * exclude deleted documents.
+   *
+   * @param it {@link DocIdSetIterator} being wrapped
+   * @param liveDocs {@link Bits} containing set bits for non-deleted docs
+   * @return wrapped iterator
+   */
+  public static DocIdSetIterator liveDocsDISI(DocIdSetIterator it, Bits liveDocs) {

Review comment:
       I created a new class - `org.apache.lucene.facet.FacetUtils` and moved the `liveDocsDISI(...)` method to that class.
   I modified the implementation to model the  [doNext](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L166) pattern as suggested. Both `advance(int docId)` and `nextDoc(int docId)` method now delegate to the corresponding methods of the wrapped DISI before invoking `doNext(int docId)`  which skips the docId until it finds one that is in the liveDocs or the wrapped DISI runs out of documents (i.e NO_MORE_DOCS).
   
   Unfortunately that class cannot be package private and that limits its visibility to other classes in a sub-package. Also added a new test case - `org.apache.lucene.facet.TestFacetUtils`
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on pull request #293:
URL: https://github.com/apache/lucene/pull/293#issuecomment-921326616


   Looks good. One very minor comment and I think this is good-to-go. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller merged pull request #293: Lucene-10070: Skip deleted documents during facet counting for all do…

Posted by GitBox <gi...@apache.org>.
gsmiller merged pull request #293:
URL: https://github.com/apache/lucene/pull/293


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org