You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/03/19 07:40:24 UTC

[GitHub] [lucene] mdmarshmallow commented on a change in pull request #747: LUCENE-10325: Add getTopDims functionality to Facets

mdmarshmallow commented on a change in pull request #747:
URL: https://github.com/apache/lucene/pull/747#discussion_r830444294



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -114,7 +116,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
         return null;
       }
       DimTree dimTree = state.getDimTree(dim);
-      return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd), topN);
+      return getPathResult(dimConfig, dim, path, pathOrd, dimTree.iterator(pathOrd), topN, null);

Review comment:
       Small nit, but could we have an overloaded method signature here? I think it will make the code a bit cleaner than just passing `null`.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -133,7 +135,7 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
         // child:
         childIt.next();
       }
-      return getPathResult(dimConfig, dim, null, dimOrd, childIt, topN);
+      return getPathResult(dimConfig, dim, null, dimOrd, childIt, topN, null);

Review comment:
       Same thing here. Actually there is a few places where `null` is just passed into various methods. I think new signatures excluding the `cacheChildOrdsResult` with comments explaining what they are would make the code clearer.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -411,7 +484,109 @@ public int compare(FacetResult a, FacetResult b) {
             }
           }
         });
-
     return results;
   }
+
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws IOException {
+    // Creates priority queue to store top dimensions and sort by their aggregated values/hits and
+    // string values.
+    PriorityQueue<DimValueResult> pq =
+        new PriorityQueue<>(topNDims) {
+          @Override
+          protected boolean lessThan(DimValueResult a, DimValueResult b) {
+            if (a.value > b.value) {
+              return false;
+            } else if (a.value < b.value) {
+              return true;
+            } else {
+              return a.dim.compareTo(b.dim) > 0;
+            }
+          }
+        };
+
+    HashMap<String, ChildOrdsResult> cacheChildOrdsResult = new HashMap<>();
+    int dimCount;
+
+    for (String dim : state.getDims()) {
+      FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+      if (dimConfig.hierarchical) {
+        DimTree dimTree = state.getDimTree(dim);
+        int dimOrd = dimTree.dimStartOrd;
+        // get dim value
+        dimCount =
+            getDimValue(
+                dimConfig, dim, dimOrd, dimTree.iterator(), topNChildren, cacheChildOrdsResult);
+      } else {
+        OrdRange ordRange = state.getOrdRange(dim);
+        int dimOrd = ordRange.start;
+        PrimitiveIterator.OfInt childIt = ordRange.iterator();
+        if (dimConfig.multiValued && dimConfig.requireDimCount) {
+          // If the dim is multi-valued and requires dim counts, we know we've explicitly indexed
+          // the dimension and we need to skip past it so the iterator is positioned on the first
+          // child:
+          childIt.next();
+        }
+        dimCount = getDimValue(dimConfig, dim, dimOrd, childIt, topNChildren, cacheChildOrdsResult);
+      }
+
+      if (dimCount != 0) {
+        // use priority queue to store DimValueResult for topNDims
+        if (pq.size() < topNDims) {
+          pq.insertWithOverflow(new DimValueResult(dim, dimCount));
+        } else {
+          if (dimCount > pq.top().value
+              || (dimCount == pq.top().value && dim.compareTo(pq.top().dim) < 0)) {
+            DimValueResult bottomDim = pq.pop();
+            bottomDim.dim = dim;
+            bottomDim.value = dimCount;
+            pq.insertWithOverflow(bottomDim);
+          }
+        }
+      }
+    }
+
+    // get FacetResult for topNDims
+    int resultSize = pq.size();
+    FacetResult[] results = new FacetResult[resultSize];
+
+    while (pq.size() > 0) {
+      DimValueResult dimValueResult = pq.pop();
+      FacetResult facetResult =
+          getFacetResultForDim(
+              dimValueResult.dim, topNChildren, cacheChildOrdsResult.get(dimValueResult.dim));
+      results[--resultSize] = facetResult;
+    }
+    return Arrays.asList(results);
+  }
+
+  /**
+   * Creates ChildOrdsResult to store dimCount, childCount, and TopOrdAndIntQueue q for
+   * getPathResult.

Review comment:
       The `TopOrdAndIntQueue` is a queue for the dimension's top children right? Maybe just include that in the comments to make it a bit more clear about what it is.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -411,7 +484,109 @@ public int compare(FacetResult a, FacetResult b) {
             }
           }
         });
-
     return results;
   }
+
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws IOException {
+    // Creates priority queue to store top dimensions and sort by their aggregated values/hits and
+    // string values.
+    PriorityQueue<DimValueResult> pq =
+        new PriorityQueue<>(topNDims) {
+          @Override
+          protected boolean lessThan(DimValueResult a, DimValueResult b) {
+            if (a.value > b.value) {
+              return false;
+            } else if (a.value < b.value) {
+              return true;
+            } else {
+              return a.dim.compareTo(b.dim) > 0;
+            }
+          }
+        };
+
+    HashMap<String, ChildOrdsResult> cacheChildOrdsResult = new HashMap<>();
+    int dimCount;
+
+    for (String dim : state.getDims()) {
+      FacetsConfig.DimConfig dimConfig = stateConfig.getDimConfig(dim);
+      if (dimConfig.hierarchical) {
+        DimTree dimTree = state.getDimTree(dim);
+        int dimOrd = dimTree.dimStartOrd;
+        // get dim value
+        dimCount =
+            getDimValue(
+                dimConfig, dim, dimOrd, dimTree.iterator(), topNChildren, cacheChildOrdsResult);
+      } else {
+        OrdRange ordRange = state.getOrdRange(dim);
+        int dimOrd = ordRange.start;
+        PrimitiveIterator.OfInt childIt = ordRange.iterator();
+        if (dimConfig.multiValued && dimConfig.requireDimCount) {
+          // If the dim is multi-valued and requires dim counts, we know we've explicitly indexed
+          // the dimension and we need to skip past it so the iterator is positioned on the first
+          // child:
+          childIt.next();
+        }
+        dimCount = getDimValue(dimConfig, dim, dimOrd, childIt, topNChildren, cacheChildOrdsResult);
+      }
+
+      if (dimCount != 0) {
+        // use priority queue to store DimValueResult for topNDims
+        if (pq.size() < topNDims) {
+          pq.insertWithOverflow(new DimValueResult(dim, dimCount));
+        } else {
+          if (dimCount > pq.top().value
+              || (dimCount == pq.top().value && dim.compareTo(pq.top().dim) < 0)) {
+            DimValueResult bottomDim = pq.pop();
+            bottomDim.dim = dim;
+            bottomDim.value = dimCount;
+            pq.insertWithOverflow(bottomDim);
+          }
+        }
+      }
+    }
+
+    // get FacetResult for topNDims
+    int resultSize = pq.size();
+    FacetResult[] results = new FacetResult[resultSize];
+
+    while (pq.size() > 0) {
+      DimValueResult dimValueResult = pq.pop();
+      FacetResult facetResult =
+          getFacetResultForDim(
+              dimValueResult.dim, topNChildren, cacheChildOrdsResult.get(dimValueResult.dim));
+      results[--resultSize] = facetResult;

Review comment:
       Can we put `resultSize--` in the previous line and then just pass in `resultSize` here. I've gotten some comments before about inline pre-increments like this not being the most readable, so figured I'd just let you know.

##########
File path: lucene/facet/src/test/org/apache/lucene/facet/sortedset/TestSortedSetDocValuesFacets.java
##########
@@ -1221,6 +1326,12 @@ public void testRandomHierarchicalFlatMix() throws Exception {
 
               assertEquals(expectedAllDims, actualAllDims);
 
+              // test getTopDims(1, 10)
+              if (actualAllDims.size() > 0) {
+                List<FacetResult> topDimsResults1 = facets.getTopDims(1, 10);

Review comment:
       Since this unit test is pretty thorough with what it checks, I think you should just `facets.getTopDims(n, 10)` where `n` goes from 1 to `actualAllDims.size()`.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -411,7 +484,109 @@ public int compare(FacetResult a, FacetResult b) {
             }
           }
         });
-
     return results;
   }
+
+  @Override
+  public List<FacetResult> getTopDims(int topNDims, int topNChildren) throws IOException {
+    // Creates priority queue to store top dimensions and sort by their aggregated values/hits and
+    // string values.
+    PriorityQueue<DimValueResult> pq =
+        new PriorityQueue<>(topNDims) {
+          @Override
+          protected boolean lessThan(DimValueResult a, DimValueResult b) {
+            if (a.value > b.value) {
+              return false;
+            } else if (a.value < b.value) {
+              return true;
+            } else {
+              return a.dim.compareTo(b.dim) > 0;
+            }
+          }
+        };
+
+    HashMap<String, ChildOrdsResult> cacheChildOrdsResult = new HashMap<>();

Review comment:
       Another nit: Maybe change this name to `dimToChildOrdsResult` or something like that? Calling it a cache was a bit confusing to me. Ignore if you think the name makes sense though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org