You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by GitBox <gi...@apache.org> on 2019/07/16 04:44:15 UTC

[GitHub] [lucene-solr] dsmiley commented on a change in pull request #780: SOLR-11866: Support efficient subset matching in query elevation rules

dsmiley commented on a change in pull request #780: SOLR-11866: Support efficient subset matching in query elevation rules
URL: https://github.com/apache/lucene-solr/pull/780#discussion_r303651927
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/handler/component/QueryElevationComponent.java
 ##########
 @@ -1131,4 +1227,333 @@ public int compareTop(int doc) {
       };
     }
   }
+
+  /**
+   * Matches a collection of subsets of type &lt;E&gt;.
+   * <p>
+   * Given a collection of subsets <code>N</code>, finds all the subsets that are contained (ignoring duplicate elements)
+   * by a provided set <code>s</code>.
+   * That is, finds all subsets <code>n</code> in <code>N</code> for which <code>s.containsAll(n)</code>
+   * (<code>s</code> contains all the elements of <code>n</code>, in any order).
+   * <p>
+   * Associates a match value of type &lt;M&gt; to each subset and provides it each time the subset matches (i.e. is
+   * contained by the provided set).
+   *
+   * @param <E> Subset element type.
+   * @param <M> Subset match value type.
+   */
+  protected interface SubsetMatcher<E, M> {
+
+    /**
+     * Returns an iterator over all the subsets that are contained by the provided set.
+     * The returned iterator does not support removal.
+     *
+     * @param set The provided set. It can be any {@link Collection} structure, see the implementation for more details.
+     */
+    Iterator<M> findSubsetsMatching(Collection<E> set);
+
+    /**
+     * Gets the number of subsets in this matcher.
+     */
+    int getSubsetCount();
+
+    interface Builder<E, M> {
+
+      Builder<E, M> addSubset(Collection<E> subset, M matchValue);
+
+      SubsetMatcher<E, M> build();
+    }
+  }
+
+  /**
+   * Matches a potentially large collection of subsets with a trie implementation.
+   * <p>
+   * This matcher imposes the elements are {@link Comparable}.
+   * It does not keep the subset insertion order.
+   * Duplicate subsets stack their match values.
+   * <p>
+   * The time complexity of adding a subset is <code>O(n.log(n))</code>, where <code>n</code> is the size of the subset.
+   * <p>
+   * The worst case time complexity of the subset matching is <code>O(2^s)</code>, however a more typical case time
+   * complexity is <code>O(s^3)</code> where s is the size of the set to partially match.
+   * Note it does not depend on <code>N</code>, the size of the collection of subsets, nor on <code>n</code>, the size of
+   * a subset.
+   */
+  protected static class TrieSubsetMatcher<E extends Comparable<? super E>, M> implements SubsetMatcher<E, M> {
+
+      /*
+      Trie structure:
+      ---------------
+      - A subset element on each edge.
+      - Each node may contain zero or more match values.
+
+      Sample construction:
+      --------------------
+      - given the subsets "B A C", "A B", "A B A", "B", "D B".
+      - remove duplicates and sort each subset => "A B C", "A B", "A B", "B", "B D".
+      - N() means a node with no match value.
+      - N(x, y) means a node with 2 match values x and y.
+
+        root
+          --A--> N()
+                   --B--> N("A B", "A B A")
+                            --C--> N("B A C")
+          --B--> N("B")
+                   --D--> N("D B")
+
+      Subset matching algorithm:
+      --------------------------
+      - given a set s
+
+      In the above sample, with s="A B C B", then the matching subsets are "B A C", "A B", "A B A", "B"
+
+      remove duplicates in s
+      sort s
+      keep a queue Q of current nodes
+      Add root node to Q
+      Another queue Q' will hold the child nodes (initially empty)
+      for each element e in s {
+        for each current node in Q {
+          if current node has a child for edge e {
+            add the child to Q'
+            record the child match values
+          }
+          if e is greater than or equal to current node greatest edge {
+            remove current node from Q (as we are sure this current node children cannot match anymore)
+          }
+        }
+        Move all child nodes from Q' to Q
+      }
+
+      Time complexity:
+      ----------------
+      s = size of the set to partially
 
 Review comment:
   size of the set to partially match?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org