You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/12 16:23:58 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

gsmiller opened a new pull request, #11764:
URL: https://github.com/apache/lucene/pull/11764

   ### Description
   
   See: GITHUB#11742
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #11764:
URL: https://github.com/apache/lucene/pull/11764#discussion_r968941406


##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {
+          @Override
+          protected boolean lessThan(Entry a, Entry b) {
+            int cmp = Integer.compare(a.count, b.count);
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    int childCount = 0;
+    Entry reuse = null;
+    for (int i = 0; i < counts.length; i++) {
+      int count = counts[i];
+      if (count > 0) {
+        childCount++;
+        if (reuse == null) {
+          reuse = new Entry();
+        }
+        reuse.label = facetSetMatchers[i].label;
+        reuse.count = count;
+        reuse = pq.insertWithOverflow(reuse);
+      }
+    }
+
+    LabelAndValue[] labelValues = new LabelAndValue[topN];
+    for (int i = topN - 1; i >= 0; i--) {
+      Entry e = pq.pop();

Review Comment:
   Ah, never mind on the NPE concern. Since we rely on PQ#size as a bound (and since we don't use sentinel values in the other implementations), we're safe there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #11764:
URL: https://github.com/apache/lucene/pull/11764#discussion_r968955759


##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {
+          @Override
+          protected boolean lessThan(Entry a, Entry b) {
+            int cmp = Integer.compare(a.count, b.count);
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    int childCount = 0;
+    Entry reuse = null;
+    for (int i = 0; i < counts.length; i++) {
+      int count = counts[i];
+      if (count > 0) {
+        childCount++;
+        if (reuse == null) {
+          reuse = new Entry();
+        }
+        reuse.label = facetSetMatchers[i].label;
+        reuse.count = count;
+        reuse = pq.insertWithOverflow(reuse);
+      }
+    }
+
+    LabelAndValue[] labelValues = new LabelAndValue[topN];
+    for (int i = topN - 1; i >= 0; i--) {
+      Entry e = pq.pop();

Review Comment:
   Hmm... and I think we can actually do better at dealing with sentinels than `continuing` our loop and truncating the resulting array. We can just `pop` them all off initially. I'll update the PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #11764:
URL: https://github.com/apache/lucene/pull/11764#discussion_r968921672


##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {

Review Comment:
   Sure, I think this is a good suggestion. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #11764:
URL: https://github.com/apache/lucene/pull/11764#discussion_r968922810


##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {
+          @Override
+          protected boolean lessThan(Entry a, Entry b) {
+            int cmp = Integer.compare(a.count, b.count);
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    int childCount = 0;
+    Entry reuse = null;
+    for (int i = 0; i < counts.length; i++) {
+      int count = counts[i];
+      if (count > 0) {
+        childCount++;
+        if (reuse == null) {
+          reuse = new Entry();
+        }
+        reuse.label = facetSetMatchers[i].label;
+        reuse.count = count;
+        reuse = pq.insertWithOverflow(reuse);
+      }
+    }
+
+    LabelAndValue[] labelValues = new LabelAndValue[topN];
+    for (int i = topN - 1; i >= 0; i--) {
+      Entry e = pq.pop();

Review Comment:
   I think we'd actually need to `continue` the loop while we have sentinel values on top right? Not `break`? Then we'd actually need to truncate those sentinel values off the final array? This actually makes me wonder if we don't have a pre-existing bug in our other faceting implementations in the case that we have fewer non-zero counts than requested top-n (but in that case I think we'd NPE). Hmm...



##########
lucene/facet/src/test/org/apache/lucene/facet/facetset/TestExactFacetSetMatcher.java:
##########
@@ -46,6 +49,105 @@ public class TestExactFacetSetMatcher extends FacetTestCase {
   private static final int[] MANUFACTURER_ORDS = {FORD_ORD, TOYOTA_ORD, CHEVY_ORD, NISSAN_ORD};
   private static final int[] YEARS = {2010, 2011, 2012};
 
+  public void testTopChildren() throws Exception {

Review Comment:
   Sure, makes sense to me. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] shaie commented on a diff in pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
shaie commented on code in PR #11764:
URL: https://github.com/apache/lucene/pull/11764#discussion_r969152978


##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {
+          @Override
+          protected boolean lessThan(Entry a, Entry b) {
+            int cmp = Integer.compare(a.count, b.count);
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    int childCount = 0;
+    Entry reuse = null;
+    for (int i = 0; i < counts.length; i++) {
+      int count = counts[i];
+      if (count > 0) {
+        childCount++;
+        if (reuse == null) {
+          reuse = new Entry();
+        }
+        reuse.label = facetSetMatchers[i].label;
+        reuse.count = count;
+        reuse = pq.insertWithOverflow(reuse);
+      }
+    }
+
+    LabelAndValue[] labelValues = new LabelAndValue[topN];
+    for (int i = topN - 1; i >= 0; i--) {
+      Entry e = pq.pop();

Review Comment:
   Right, I meant `continue` :). I forgot that the sentinel values are "on top". And yes, `pop()` all the sentinel values before we populate the array is the way to get rid of them.



##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,46 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN, () -> new Entry("", 0)) {
+          @Override
+          protected boolean lessThan(Entry a, Entry b) {
+            return compare(a.count, b.count, a.label, b.label) < 0;
+          }
+        };
+
+    int childCount = 0;
+    Entry reuse = pq.top();
+    for (int i = 0; i < counts.length; i++) {
+      int count = counts[i];
+      if (count > 0) {
+        childCount++;
+        String label = facetSetMatchers[i].label;
+        if (compare(reuse.count, count, reuse.label, label) < 0) {
+          reuse.label = label;
+          reuse.count = count;
+          reuse = pq.updateTop();
+        }
+      }
+    }
+
+    // Pop off any sentinel values in the case that we had fewer child labels with non-zero
+    // counts than the requested top-n:
+    while (childCount < pq.size()) {
+      pq.pop();
+    }
+
+    LabelAndValue[] labelValues = new LabelAndValue[Math.min(topN, childCount)];
+    for (int i = pq.size() - 1; i >= 0; i--) {
+      Entry e = pq.pop();
+      assert e != null;

Review Comment:
   nit: I don't think this assertion contributes much because if it's `null` we'll fail in the next line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller merged pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
gsmiller merged PR #11764:
URL: https://github.com/apache/lucene/pull/11764


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] shaie commented on a diff in pull request #11764: GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children

Posted by GitBox <gi...@apache.org>.
shaie commented on code in PR #11764:
URL: https://github.com/apache/lucene/pull/11764#discussion_r968832961


##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {

Review Comment:
   Would you like to use the variant which takes a supplier of sentinel objects? It will allow you to avoid the `null` checks below and in general improve the performance of the algorithm below (unless we don't expect `topN` to be much smaller than `counts`. The usage pattern then becomes comparing the current count+label to the `top()` and if it's better you update the `top()` and call `updateTop()`.



##########
lucene/facet/src/test/org/apache/lucene/facet/facetset/TestExactFacetSetMatcher.java:
##########
@@ -46,6 +49,105 @@ public class TestExactFacetSetMatcher extends FacetTestCase {
   private static final int[] MANUFACTURER_ORDS = {FORD_ORD, TOYOTA_ORD, CHEVY_ORD, NISSAN_ORD};
   private static final int[] YEARS = {2010, 2011, 2012};
 
+  public void testTopChildren() throws Exception {

Review Comment:
   Since the change is only in `MatchingFacetSetsCounts` and has nothing to do with the matchers, I think the test should be in `TestMatchingFacetSetsCounts` and remove the duplication in the "Range" test class. WDYT?



##########
lucene/facet/src/java/org/apache/lucene/facet/facetset/MatchingFacetSetsCounts.java:
##########
@@ -156,7 +157,43 @@ public FacetResult getAllChildren(String dim, String... path) throws IOException
   @Override
   public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
     validateTopN(topN);
-    return getAllChildren(dim, path);
+
+    topN = Math.min(topN, counts.length);
+
+    PriorityQueue<Entry> pq =
+        new PriorityQueue<>(topN) {
+          @Override
+          protected boolean lessThan(Entry a, Entry b) {
+            int cmp = Integer.compare(a.count, b.count);
+            if (cmp == 0) {
+              cmp = b.label.compareTo(a.label);
+            }
+            return cmp < 0;
+          }
+        };
+
+    int childCount = 0;
+    Entry reuse = null;
+    for (int i = 0; i < counts.length; i++) {
+      int count = counts[i];
+      if (count > 0) {
+        childCount++;
+        if (reuse == null) {
+          reuse = new Entry();
+        }
+        reuse.label = facetSetMatchers[i].label;
+        reuse.count = count;
+        reuse = pq.insertWithOverflow(reuse);
+      }
+    }
+
+    LabelAndValue[] labelValues = new LabelAndValue[topN];
+    for (int i = topN - 1; i >= 0; i--) {
+      Entry e = pq.pop();

Review Comment:
   If you accept my proposal above, you will need to break the loop when hitting a sentinel value



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org