You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2020/06/01 00:58:27 UTC

[GitHub] [incubator-datasketches-java] leerho opened a new pull request #319: Tuple theta extension

leerho opened a new pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319


   This makes some significant API changes in the Tuple package hierarchy that will require a bump to a 2.0.0 release.
   
   1. The first task was that the root tuple directory was cluttered with two different groups of classes that made it difficult for anyone to figure out what is going on.  One group of classes form the base generic classes of the tuple sketch on which the concrete extensions "adouble" (a single double), "aninteger" (a single integer), and "strings" (array of strings) depend.  These three concrete extensions are already in their own sub directories. 
   
   The second, largest group of classes were a dedicated non-generic implementation of the tuple sketch, which implemented an array of doubles.  All of these classes had "ArrayOfDoubles" in their name.  These classes shared no code with the root generic tuple classes except for a few methods in the SerializerDeserializer and the Util classes.  By making a few methods public, I was able to move all of the "ArrayOfDoubles" classes into their own subdirectory.  This creates an incompatible API break, which will force us to move to a 2.0.0 for the next version.   Now the tuple root directory is much cleaner and easier to navigate and understand.  There are several reasons for this separate dedicated implementation. First, we felt that a configurable array of doubles would be a relatively common use case.  Second, we wanted a full concrete example of the tuple sketch as an example of what it would look like including both on-heap and off-heap variants.   It is this ArrayOfDoubles implementation that has been integrated into Druid, for example. 
   
   2.  The next task was to update the set operations to allow integration with the Theta sketches. It turns out that modifying the generic Union and Intersection classes only required adding one method to each.  I did some minor code cleanup and code documentation at the same time.
   
   The AnotB operator is another story.  We have never been really happy with how this was implemented the first time.  The current API is clumsy.  So I have taken the opportunity to redesign the API for this class.  It still has the current API methods but deprecated.  With the new modified class the user has several ways of performing AnotB.
   
   As stateless operations:
   
       - With Tuple: resultSk = aNotB(skTupleA, skTupleB);
       - With Theta: resultSk = aNotB(skTupleA, skThetaB);
   
   As stateful, sequential operations:
   
       - void setA(skTupleA);
       - void notB(skTupleB);   or   void notB(skThetaB);   //These are interchangable.
       - ...
       - void notB(skTupleB);   or   void notB(skThetaB);   //These are interchangable.
       - resultSk = getResult(reset = false);  // This allows getting an intermediate result
       - void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
       - resultSK = getResult(reset = true); //This returns the result and clears the internal state to empty.
   
   
   The test class for AnotB was also completely rewritten.
   
   
   
   3. The Intersection code required a major rewrite.  There were too many problems to list but this rewrite should clear up some major discrepancies and errors caused by improper null and empty handling.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-637035665


   This is an excellent review! Thank you!  I think the protocol is for you to "Resolve" the issues you found. 
   
   One issue you didn't bring up, is the policy on how _nulls_ should be interpreted with the different set operations.  So far, our policy has been to treat nulls as equivalent to the Empty Sketch, i.e., Theta = 1.0 and count = 0.  With this policy nulls and empties have no effect on a Union.  However, submitting a null or empty to an Intersection or as the first argument to an AnotB operation has major impact on the result.  
   
   My colleagues and I have had many discussions on this issue and I think where have left the issue for now is that consistency in implementing a policy is very important.  Currently, our policy has been: We try really hard to not return nulls as they can propagate in other calculations and become bugs that are difficult to trace.  We also prefer to ignore nulls on input, if possible, because real data is dirty and has lots of nulls and missing values.  Also we prefer not to throw exceptions on receiving a null, if possible, because throwing exceptions in a system environment is a big headache for system operators.  
   
   However, the intersection and difference operators are a quandry.   If a user does not like our treatment of nulls, they can catch nulls before entering the set operator, If we returned nulls the user would have to catch these null cases on output.  There is no perfect answer here.  
   
   I would be interested in your view.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433391927



##########
File path: src/main/java/org/apache/datasketches/tuple/Intersection.java
##########
@@ -19,98 +19,176 @@
 
 package org.apache.datasketches.tuple;
 
+import static java.lang.Math.ceil;
+import static java.lang.Math.max;
 import static java.lang.Math.min;
+import static org.apache.datasketches.HashOperations.hashInsertOnly;
+import static org.apache.datasketches.HashOperations.hashSearch;
+import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
+import static org.apache.datasketches.Util.ceilingPowerOf2;
 
 import java.lang.reflect.Array;
 
-import org.apache.datasketches.ResizeFactor;
 import org.apache.datasketches.SketchesStateException;
 
+
 /**
  * Computes an intersection of two or more generic tuple sketches.
- * A new instance represents the Universal Set.
- * Every update() computes an intersection with the internal set
- * and can only reduce the internal set.
+ * A new instance represents the Universal Set. Because the Universal Set
+ * cannot be realized a <i>getResult()</i> on a new instance will produce an error.
+ * Every update() computes an intersection with the internal state, which will never
+ * grow larger and may be reduced to zero.
+ *
  * @param <S> Type of Summary
  */
+@SuppressWarnings("unchecked")
 public class Intersection<S extends Summary> {
-
   private final SummarySetOperations<S> summarySetOps_;
-  private QuickSelectSketch<S> sketch_;
-  private boolean isEmpty_;
-  private long theta_;
-  private boolean isFirstCall_;
+  //private QuickSelectSketch<S> sketch_;

Review comment:
       Accident. Will remove.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] davecromberge commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
davecromberge commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433478030



##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,327 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if ((skA == null) || skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta

Review comment:
       I hadn't considered the case where the contents of the sketch itself are subject to corruption.  I agree that it is valid to preserve the check in this case. 
   It's very useful background information, and it's interesting to see the actual probability explained in those terms and contrasted to the actual RSE of the sketch.  Thanks for the explanation!  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433375023



##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,327 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if ((skA == null) || skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta

Review comment:
       Thank you for these comments and bring it on!  
   
   Hashes are 64 bits. The likelihood that a zero hash would occur is extremely unlikely. Assuming the hash function is perfect in its uniform random properties (which it is not!), a zero hash would occur about once in about 10^19 events.  There is no way to create a hash table of hashes where the empty slot = 0, if zero hashes were significant.  Currently, if a zero hash actually did occur it is effectively ignored, since inserting a zero into an empty slot results in the slot still being empty.   
   
   So how significant is that?  Remember that these are probabilistic algorithms where the RSE on even the largest sketch (with the smallest error) is many orders of magnitude larger than the error that would be produced by ignoring hashes that happen to be zero because they are so rare.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] davecromberge commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
davecromberge commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-637094410


   > However, the intersection and difference operators are a quandry. If a user does not like our treatment of nulls, they can catch nulls before entering the set operator, If we returned nulls the user would have to catch these null cases on output. There is no perfect answer here.
   
   > I would be interested in your view.
   
   I would agree that there is no ideal way to handle nulls in these circumstances, and that there is probably a lot more harm to be done in allowing them to taint calculations undetected.
   My viewpoint is really to try and avoid implicit contracts between the caller and callee as much as possible, and to make the behaviour of the method explicit with regard to input arguments.  I would argue for raising an exception on invalid arguments, and to avoid interpreting nulls as an empty sketch, which is not clear from the signature.  
   
   I realise there is a lot of merit to the argument regarding dirty data, but I think that this is the data owner's concern.  The disclaimer here is that I'm not a library author and don't have to contend with all these edge cases :)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho merged pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho merged pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] davecromberge edited a comment on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
davecromberge edited a comment on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-639373396


   > I discussed it with my colleagues and we agree with your suggestion to throw exceptions on null inputs when they could change the state of the sketch, or no meaningful result can be returned. If a null has no impact then it will just be ignored.
   > 
   > Thank you, your feedback helped us resolve what to do here!
   
   I'm glad I could provide feedback, it sounds like a sensible approach in the end.  
   
   How do you know when a null would have an effect or not?  Does it depend on the current state of the operation (i.e. `IsEmpty` or `IsFirstCall`) or is it dependent on the method that is being called, such as the stateless ANotB?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433405011



##########
File path: src/main/java/org/apache/datasketches/tuple/Intersection.java
##########
@@ -19,98 +19,176 @@
 
 package org.apache.datasketches.tuple;
 
+import static java.lang.Math.ceil;
+import static java.lang.Math.max;
 import static java.lang.Math.min;
+import static org.apache.datasketches.HashOperations.hashInsertOnly;
+import static org.apache.datasketches.HashOperations.hashSearch;
+import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
+import static org.apache.datasketches.Util.ceilingPowerOf2;
 
 import java.lang.reflect.Array;
 
-import org.apache.datasketches.ResizeFactor;
 import org.apache.datasketches.SketchesStateException;
 
+
 /**
  * Computes an intersection of two or more generic tuple sketches.
- * A new instance represents the Universal Set.
- * Every update() computes an intersection with the internal set
- * and can only reduce the internal set.
+ * A new instance represents the Universal Set. Because the Universal Set
+ * cannot be realized a <i>getResult()</i> on a new instance will produce an error.
+ * Every update() computes an intersection with the internal state, which will never
+ * grow larger and may be reduced to zero.
+ *
  * @param <S> Type of Summary
  */
+@SuppressWarnings("unchecked")
 public class Intersection<S extends Summary> {
-
   private final SummarySetOperations<S> summarySetOps_;
-  private QuickSelectSketch<S> sketch_;
-  private boolean isEmpty_;
-  private long theta_;
-  private boolean isFirstCall_;
+  //private QuickSelectSketch<S> sketch_;
+  private boolean empty_;
+  private long thetaLong_;
+  private HashTables hashTables_;
+  private boolean firstCall_;
 
   /**
    * Creates new instance
    * @param summarySetOps instance of SummarySetOperations
    */
   public Intersection(final SummarySetOperations<S> summarySetOps) {
     summarySetOps_ = summarySetOps;
-    isEmpty_ = false; // universal set at the start
-    theta_ = Long.MAX_VALUE;
-    isFirstCall_ = true;
+    empty_ = false; // universal set at the start
+    thetaLong_ = Long.MAX_VALUE;
+    hashTables_ = new HashTables();
+    firstCall_ = true;
   }
 
   /**
-   * Updates the internal set by intersecting it with the given sketch
-   * @param sketchIn input sketch to intersect with the internal set
+   * Updates the internal state by intersecting it with the given sketch
+   * @param sketchIn input sketch to intersect with the internal state
    */
-  @SuppressWarnings({ "unchecked", "null" })
   public void update(final Sketch<S> sketchIn) {
-    final boolean isFirstCall = isFirstCall_;
-    isFirstCall_ = false;
+    final boolean firstCall = firstCall_;
+    firstCall_ = false;
     if (sketchIn == null) {
-      isEmpty_ = true;
-      sketch_ = null;
+      empty_ = true;
+      thetaLong_ = Long.MAX_VALUE;
+      hashTables_.clear();
       return;
     }
-    theta_ = min(theta_, sketchIn.getThetaLong());
-    isEmpty_ |= sketchIn.isEmpty();
-    if (isEmpty_ || (sketchIn.getRetainedEntries() == 0)) {
-      sketch_ = null;
+    // input sketch is not null, could be first or next call
+    final long thetaLongIn = sketchIn.getThetaLong();
+    final int countIn = sketchIn.getRetainedEntries();
+    thetaLong_ = min(thetaLong_, thetaLongIn); //Theta rule
+    // Empty rule extended in case incoming sketch does not have empty bit properly set
+    empty_ |= (countIn == 0) && (thetaLongIn == Long.MAX_VALUE);
+    if (countIn == 0) {
+      hashTables_.clear();
       return;
     }
-    // assumes that constructor of QuickSelectSketch bumps the requested size up to the nearest power of 2
-    if (isFirstCall) {
-      sketch_ = new QuickSelectSketch<>(sketchIn.getRetainedEntries(), ResizeFactor.X1.lg(), null);
-      final SketchIterator<S> it = sketchIn.iterator();
-      while (it.next()) {
-        final S summary = (S)it.getSummary().copy();
-        sketch_.insert(it.getKey(), summary);
-      }
-    } else {
-      if (sketch_ == null) {
+    // input sketch will have valid entries > 0
+
+    if (firstCall) {
+      final Sketch<S> firstSketch = sketchIn;
+      //Copy firstSketch data into local instance hashTables_
+      hashTables_.fromSketch(firstSketch);
+    }
+
+    //Next Call
+    else {
+      if (hashTables_.count_ == 0) {
         return;
       }
-      final int matchSize = min(sketch_.getRetainedEntries(), sketchIn.getRetainedEntries());
-      final long[] matchKeys = new long[matchSize];
+      final Sketch<S> nextSketch = sketchIn;
+      //Match nextSketch data with local instance data, filtering by theta
+      final int maxMatchSize = min(hashTables_.count_, nextSketch.getRetainedEntries());
+
+      final long[] matchKeys = new long[maxMatchSize];
       S[] matchSummaries = null;
       int matchCount = 0;
-      final SketchIterator<S> it = sketchIn.iterator();
+
+      final SketchIterator<S> it = nextSketch.iterator();
       while (it.next()) {
-        final S summary = sketch_.find(it.getKey());
-        if (summary != null) {
-          matchKeys[matchCount] = it.getKey();
-          if (matchSummaries == null) {
-            matchSummaries = (S[]) Array.newInstance(summary.getClass(), matchSize);
-          }
-          matchSummaries[matchCount] =
-              summarySetOps_.intersection(summary, it.getSummary());
-          matchCount++;
+        final long key = it.getKey();
+        if (key >= thetaLong_) { continue; }
+        final int index = hashSearch(hashTables_.keys_, hashTables_.lgTableSize_, key);
+        if (index < 0) { continue; }
+        //Copy the intersecting items from local hashTables_
+        // sequentially into local matchKeys_ and matchSummaries_
+        final S mySummary = hashTables_.summaries_[index];
+
+        if (matchSummaries == null) {
+          matchSummaries = (S[]) Array.newInstance(mySummary.getClass(), maxMatchSize);
         }
+        matchKeys[matchCount] = key;
+        matchSummaries[matchCount] = summarySetOps_.intersection(mySummary, it.getSummary());
+        matchCount++;
+      }
+      hashTables_.fromArrays(matchKeys, matchSummaries, matchCount);
+    }
+  }
+
+  /**
+   * Updates the internal set by intersecting it with the given Theta sketch
+   * @param sketchIn input Theta Sketch to intersect with the internal state.
+   * @param summary the given proxy summary for the Theta Sketch, which doesn't have one.
+   */
+  public void update(final org.apache.datasketches.theta.Sketch sketchIn, final S summary) {

Review comment:
       This is a good catch!  Thank you. Will Fix.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433391046



##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,327 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if ((skA == null) || skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+
+  /**
+   * Returns the A-and-not-B set operation on the two given sketches.
+   * A null sketch argument is interpreted as an empty sketch.
+   * This is not an accumulating update.
+   *
+   * @param skA The incoming sketch for the first argument
+   * @param skB The incoming sketch for the second argument
+   * @param <S> Type of Summary
+   * @return the result as a compact sketch
+   */
+  @SuppressWarnings("unchecked")
+  public static <S extends Summary>
+        CompactSketch<S> aNotB(final Sketch<S> skA, final Sketch<S> skB) {
+    if ((skA == null) || skA.isEmpty()) {
+      return new CompactSketch<>(null, null, Long.MAX_VALUE, true);
+    }
+    //skA is not empty
+    final boolean empty = false;
+    final long thetaLongA = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    final long[] keysA = cskA.keys_;
+    final S[] summariesA = cskA.summaries_;
+    final int countA = cskA.getRetainedEntries();
+
+    if ((skB == null) || skB.isEmpty()) {
+      return new CompactSketch<>(keysA, summariesA, thetaLongA, empty);
+    }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    final long thetaLong = Math.min(thetaLongA, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(skB.keys_, countB, thetaLong);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[countA];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summariesA.getClass().getComponentType(), countA);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < countA; i++) {
+      final long key = keysA[i];
+      if ((key != 0) && (key < thetaLong)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summariesA[i];
+          nonMatches++;
+        }
+      }
+    }
+    final long[] keys = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    final S[] summaries = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    final CompactSketch<S> result =
+        new CompactSketch<>(keys, summaries, thetaLong, empty);
+    return result;
+  }
+
+  /**
+   * Returns the A-and-not-B set operation on a Tuple sketch and a Theta sketch.
+   * A null sketch argument is interpreted as an empty sketch.
+   * This is not an accumulating update.
+   *
+   * @param skA The incoming Tuple sketch for the first argument
+   * @param skB The incoming Theta sketch for the second argument
+   * @param <S> Type of Summary
+   * @return the result as a compact sketch
+   */
+  @SuppressWarnings("unchecked")
+  public static <S extends Summary>
+        CompactSketch<S> aNotB(final Sketch<S> skA, final org.apache.datasketches.theta.Sketch skB) {
+    if ((skA == null) || skA.isEmpty()) {
+      return new CompactSketch<>(null, null, Long.MAX_VALUE, true);
+    }
+    //skA is not empty
+    final boolean empty = false;
+    final long thetaLongA = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    final long[] keysA = cskA.keys_;
+    final S[] summariesA = cskA.summaries_;
+    final int countA = cskA.getRetainedEntries();
+
+    if ((skB == null) || skB.isEmpty()) {
+      return new CompactSketch<>(keysA, summariesA, thetaLongA, empty);
+    }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    final long thetaLong = Math.min(thetaLongA, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[countA];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summariesA.getClass().getComponentType(), countA);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < countA; i++) {
+      final long key = keysA[i];
+      if ((key != 0) && (key < thetaLong)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summariesA[i];
+          nonMatches++;
+        }
+      }
+    }
+    final long[] keys = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    final S[] summaries = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    final CompactSketch<S> result =
+        new CompactSketch<>(keys, summaries, thetaLong, empty);
+    return result;
+  }
+
+  /**
+   * Gets the result of this operation.
+   * @param reset if true, clears this operator to the empty state after result is returned.
+   * @return the result of this operation as a CompactSketch.
+   */
+  public CompactSketch<S> getResult(final boolean reset) {
+    if (count_ == 0) {
+      return new CompactSketch<>(null, null, thetaLong_, empty_);
+    }
+    final CompactSketch<S> result =
+        new CompactSketch<>(Arrays.copyOfRange(keys_, 0, count_),
+            Arrays.copyOfRange(summaries_, 0, count_), thetaLong_, empty_);
+    if (reset) { reset(); }
+    return result;
+  }
+
+  private static long[] extractThetaHashArray(
+      final org.apache.datasketches.theta.Sketch sketch,
+      final int count) {
+    final HashIterator itr = sketch.iterator();
+    final long[] hashArr = new long[count];
+    int ctr = 0;
+    while (itr.next()) {
+      hashArr[ctr++] = itr.get();
+    }
+    assert ctr == count;

Review comment:
       This is another "peace-of-mind" sanity check and its performance cost is negligible.  The assert statement disappears at run time unless the user specifically enables them with "-ea".  What is the other case?  Corruption or bugs in the source sketch.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] AlexanderSaydakov commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
AlexanderSaydakov commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r435680993



##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,351 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.SketchesArgumentException;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   *
+   * <p>An input argument of null will throw an exception.</p>
+   *
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if (skA == null) {
+      throw new SketchesArgumentException("The input argument may not be null");
+    }
+    if (skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   *
+   * <p>An input argument of null or empty is ignored.</p>
+   *
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_ || (skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.

Review comment:
       AnotB

##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,351 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.SketchesArgumentException;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   *
+   * <p>An input argument of null will throw an exception.</p>
+   *
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if (skA == null) {
+      throw new SketchesArgumentException("The input argument may not be null");
+    }
+    if (skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   *
+   * <p>An input argument of null or empty is ignored.</p>
+   *
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_ || (skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta

Review comment:
       removes -> remove

##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,351 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.SketchesArgumentException;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   *
+   * <p>An input argument of null will throw an exception.</p>
+   *
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if (skA == null) {
+      throw new SketchesArgumentException("The input argument may not be null");
+    }
+    if (skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.

Review comment:
       AnotB

##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,351 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.SketchesArgumentException;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   *
+   * <p>An input argument of null will throw an exception.</p>
+   *
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if (skA == null) {
+      throw new SketchesArgumentException("The input argument may not be null");
+    }
+    if (skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   *
+   * <p>An input argument of null or empty is ignored.</p>
+   *
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_ || (skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   *
+   * <p>An input argument of null or empty is ignored.</p>
+   *
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_ || (skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key > 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Returns the A-and-not-B set operation on the two given Tuple sketches.

Review comment:
       the operation is named inconsistently: AnotB, A-AND-NOT-B, A-and-not-B




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-639063968


   WRT the AOD sketch.  I think it is better to move this discussion to dev@.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433427280



##########
File path: src/main/java/org/apache/datasketches/tuple/Intersection.java
##########
@@ -19,98 +19,176 @@
 
 package org.apache.datasketches.tuple;
 
+import static java.lang.Math.ceil;
+import static java.lang.Math.max;
 import static java.lang.Math.min;
+import static org.apache.datasketches.HashOperations.hashInsertOnly;
+import static org.apache.datasketches.HashOperations.hashSearch;
+import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
+import static org.apache.datasketches.Util.ceilingPowerOf2;
 
 import java.lang.reflect.Array;
 
-import org.apache.datasketches.ResizeFactor;
 import org.apache.datasketches.SketchesStateException;
 
+
 /**
  * Computes an intersection of two or more generic tuple sketches.
- * A new instance represents the Universal Set.
- * Every update() computes an intersection with the internal set
- * and can only reduce the internal set.
+ * A new instance represents the Universal Set. Because the Universal Set
+ * cannot be realized a <i>getResult()</i> on a new instance will produce an error.
+ * Every update() computes an intersection with the internal state, which will never
+ * grow larger and may be reduced to zero.
+ *
  * @param <S> Type of Summary
  */
+@SuppressWarnings("unchecked")
 public class Intersection<S extends Summary> {
-
   private final SummarySetOperations<S> summarySetOps_;
-  private QuickSelectSketch<S> sketch_;
-  private boolean isEmpty_;
-  private long theta_;
-  private boolean isFirstCall_;
+  //private QuickSelectSketch<S> sketch_;
+  private boolean empty_;
+  private long thetaLong_;
+  private HashTables hashTables_;
+  private boolean firstCall_;
 
   /**
    * Creates new instance
    * @param summarySetOps instance of SummarySetOperations
    */
   public Intersection(final SummarySetOperations<S> summarySetOps) {
     summarySetOps_ = summarySetOps;
-    isEmpty_ = false; // universal set at the start
-    theta_ = Long.MAX_VALUE;
-    isFirstCall_ = true;
+    empty_ = false; // universal set at the start
+    thetaLong_ = Long.MAX_VALUE;
+    hashTables_ = new HashTables();
+    firstCall_ = true;
   }
 
   /**
-   * Updates the internal set by intersecting it with the given sketch
-   * @param sketchIn input sketch to intersect with the internal set
+   * Updates the internal state by intersecting it with the given sketch
+   * @param sketchIn input sketch to intersect with the internal state
    */
-  @SuppressWarnings({ "unchecked", "null" })
   public void update(final Sketch<S> sketchIn) {
-    final boolean isFirstCall = isFirstCall_;
-    isFirstCall_ = false;
+    final boolean firstCall = firstCall_;
+    firstCall_ = false;
     if (sketchIn == null) {
-      isEmpty_ = true;
-      sketch_ = null;
+      empty_ = true;
+      thetaLong_ = Long.MAX_VALUE;
+      hashTables_.clear();
       return;
     }
-    theta_ = min(theta_, sketchIn.getThetaLong());
-    isEmpty_ |= sketchIn.isEmpty();
-    if (isEmpty_ || (sketchIn.getRetainedEntries() == 0)) {
-      sketch_ = null;
+    // input sketch is not null, could be first or next call
+    final long thetaLongIn = sketchIn.getThetaLong();
+    final int countIn = sketchIn.getRetainedEntries();
+    thetaLong_ = min(thetaLong_, thetaLongIn); //Theta rule
+    // Empty rule extended in case incoming sketch does not have empty bit properly set
+    empty_ |= (countIn == 0) && (thetaLongIn == Long.MAX_VALUE);
+    if (countIn == 0) {
+      hashTables_.clear();
       return;
     }
-    // assumes that constructor of QuickSelectSketch bumps the requested size up to the nearest power of 2
-    if (isFirstCall) {
-      sketch_ = new QuickSelectSketch<>(sketchIn.getRetainedEntries(), ResizeFactor.X1.lg(), null);
-      final SketchIterator<S> it = sketchIn.iterator();
-      while (it.next()) {
-        final S summary = (S)it.getSummary().copy();
-        sketch_.insert(it.getKey(), summary);
-      }
-    } else {
-      if (sketch_ == null) {
+    // input sketch will have valid entries > 0
+
+    if (firstCall) {
+      final Sketch<S> firstSketch = sketchIn;
+      //Copy firstSketch data into local instance hashTables_
+      hashTables_.fromSketch(firstSketch);
+    }
+
+    //Next Call
+    else {
+      if (hashTables_.count_ == 0) {
         return;
       }
-      final int matchSize = min(sketch_.getRetainedEntries(), sketchIn.getRetainedEntries());
-      final long[] matchKeys = new long[matchSize];
+      final Sketch<S> nextSketch = sketchIn;
+      //Match nextSketch data with local instance data, filtering by theta
+      final int maxMatchSize = min(hashTables_.count_, nextSketch.getRetainedEntries());
+
+      final long[] matchKeys = new long[maxMatchSize];
       S[] matchSummaries = null;
       int matchCount = 0;
-      final SketchIterator<S> it = sketchIn.iterator();
+
+      final SketchIterator<S> it = nextSketch.iterator();
       while (it.next()) {
-        final S summary = sketch_.find(it.getKey());
-        if (summary != null) {
-          matchKeys[matchCount] = it.getKey();
-          if (matchSummaries == null) {
-            matchSummaries = (S[]) Array.newInstance(summary.getClass(), matchSize);
-          }
-          matchSummaries[matchCount] =
-              summarySetOps_.intersection(summary, it.getSummary());
-          matchCount++;
+        final long key = it.getKey();
+        if (key >= thetaLong_) { continue; }
+        final int index = hashSearch(hashTables_.keys_, hashTables_.lgTableSize_, key);
+        if (index < 0) { continue; }
+        //Copy the intersecting items from local hashTables_
+        // sequentially into local matchKeys_ and matchSummaries_
+        final S mySummary = hashTables_.summaries_[index];
+
+        if (matchSummaries == null) {
+          matchSummaries = (S[]) Array.newInstance(mySummary.getClass(), maxMatchSize);
         }
+        matchKeys[matchCount] = key;
+        matchSummaries[matchCount] = summarySetOps_.intersection(mySummary, it.getSummary());
+        matchCount++;
+      }
+      hashTables_.fromArrays(matchKeys, matchSummaries, matchCount);
+    }
+  }
+
+  /**
+   * Updates the internal set by intersecting it with the given Theta sketch
+   * @param sketchIn input Theta Sketch to intersect with the internal state.
+   * @param summary the given proxy summary for the Theta Sketch, which doesn't have one.
+   */
+  public void update(final org.apache.datasketches.theta.Sketch sketchIn, final S summary) {

Review comment:
       BTW, I chose to throw an exception here, because I can't figure out what else to do.  I can't assume a default summary since I have no idea what the user would consider to be a default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] davecromberge commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
davecromberge commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433281646



##########
File path: src/test/java/org/apache/datasketches/tuple/adouble/AdoubleAnotBTest.java
##########
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.datasketches.tuple.adouble;
+
+import static org.testng.Assert.assertEquals;
+import static org.testng.Assert.assertTrue;
+
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.datasketches.theta.UpdateSketchBuilder;
+import org.apache.datasketches.tuple.AnotB;
+import org.apache.datasketches.tuple.CompactSketch;
+import org.apache.datasketches.tuple.Sketch;
+import org.apache.datasketches.tuple.SketchIterator;
+import org.apache.datasketches.tuple.UpdatableSketch;
+import org.apache.datasketches.tuple.UpdatableSketchBuilder;
+import org.apache.datasketches.tuple.adouble.DoubleSummary.Mode;
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+/**
+ * @author Lee Rhodes
+ */
+@SuppressWarnings("javadoc")
+public class AdoubleAnotBTest {
+  private final DoubleSummary.Mode mode = Mode.Sum;
+  private final Results results = new Results();
+
+  private static void threeMethodsWithTheta(
+      final AnotB<DoubleSummary> aNotB,
+      final Sketch<DoubleSummary> sketchA,
+      final Sketch<DoubleSummary> sketchB,
+      final org.apache.datasketches.theta.Sketch skThetaB,
+      final Results results) {
+    //deprecated
+    aNotB.update(sketchA, sketchB);
+    CompactSketch<DoubleSummary> result = aNotB.getResult();
+    results.check(result);
+    //Stateless
+    result = AnotB.aNotB(sketchA, sketchB);
+    results.check(result);
+    result = AnotB.aNotB(sketchA, skThetaB);
+    //Stateful w B = Tuple
+    aNotB.setA(sketchA);
+    aNotB.notB(sketchB);
+    result = aNotB.getResult(true);
+    results.check(result);
+    //Stateful w B = Theta
+    aNotB.setA(sketchA);
+    aNotB.notB(skThetaB);
+    result = aNotB.getResult(false);
+    results.check(result);
+    result = aNotB.getResult(true);
+    results.check(result);
+  }

Review comment:
       This is a very neat way to ensure consistency between all the variants on ANotB 👍 

##########
File path: src/test/java/org/apache/datasketches/tuple/adouble/AdoubleAnotBTest.java
##########
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.datasketches.tuple.adouble;
+
+import static org.testng.Assert.assertEquals;
+import static org.testng.Assert.assertTrue;
+
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.datasketches.theta.UpdateSketchBuilder;
+import org.apache.datasketches.tuple.AnotB;
+import org.apache.datasketches.tuple.CompactSketch;
+import org.apache.datasketches.tuple.Sketch;
+import org.apache.datasketches.tuple.SketchIterator;
+import org.apache.datasketches.tuple.UpdatableSketch;
+import org.apache.datasketches.tuple.UpdatableSketchBuilder;
+import org.apache.datasketches.tuple.adouble.DoubleSummary.Mode;
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+/**
+ * @author Lee Rhodes
+ */
+@SuppressWarnings("javadoc")
+public class AdoubleAnotBTest {
+  private final DoubleSummary.Mode mode = Mode.Sum;
+  private final Results results = new Results();
+
+  private static void threeMethodsWithTheta(
+      final AnotB<DoubleSummary> aNotB,
+      final Sketch<DoubleSummary> sketchA,
+      final Sketch<DoubleSummary> sketchB,
+      final org.apache.datasketches.theta.Sketch skThetaB,
+      final Results results) {
+    //deprecated
+    aNotB.update(sketchA, sketchB);
+    CompactSketch<DoubleSummary> result = aNotB.getResult();
+    results.check(result);
+    //Stateless
+    result = AnotB.aNotB(sketchA, sketchB);
+    results.check(result);
+    result = AnotB.aNotB(sketchA, skThetaB);
+    //Stateful w B = Tuple
+    aNotB.setA(sketchA);
+    aNotB.notB(sketchB);
+    result = aNotB.getResult(true);
+    results.check(result);
+    //Stateful w B = Theta
+    aNotB.setA(sketchA);
+    aNotB.notB(skThetaB);
+    result = aNotB.getResult(false);
+    results.check(result);
+    result = aNotB.getResult(true);
+    results.check(result);
+  }
+
+  private static class Results {
+    private int retEnt = 0;
+    private boolean empty = true;
+    private double expect = 0.0;
+    private double toll = 0.0;

Review comment:
       Is `toll` shorthand for `tolerance`, as in error tolerance?

##########
File path: src/main/java/org/apache/datasketches/tuple/Intersection.java
##########
@@ -19,98 +19,176 @@
 
 package org.apache.datasketches.tuple;
 
+import static java.lang.Math.ceil;
+import static java.lang.Math.max;
 import static java.lang.Math.min;
+import static org.apache.datasketches.HashOperations.hashInsertOnly;
+import static org.apache.datasketches.HashOperations.hashSearch;
+import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
+import static org.apache.datasketches.Util.ceilingPowerOf2;
 
 import java.lang.reflect.Array;
 
-import org.apache.datasketches.ResizeFactor;
 import org.apache.datasketches.SketchesStateException;
 
+
 /**
  * Computes an intersection of two or more generic tuple sketches.
- * A new instance represents the Universal Set.
- * Every update() computes an intersection with the internal set
- * and can only reduce the internal set.
+ * A new instance represents the Universal Set. Because the Universal Set
+ * cannot be realized a <i>getResult()</i> on a new instance will produce an error.
+ * Every update() computes an intersection with the internal state, which will never
+ * grow larger and may be reduced to zero.
+ *
  * @param <S> Type of Summary
  */
+@SuppressWarnings("unchecked")
 public class Intersection<S extends Summary> {
-
   private final SummarySetOperations<S> summarySetOps_;
-  private QuickSelectSketch<S> sketch_;
-  private boolean isEmpty_;
-  private long theta_;
-  private boolean isFirstCall_;
+  //private QuickSelectSketch<S> sketch_;
+  private boolean empty_;
+  private long thetaLong_;
+  private HashTables hashTables_;
+  private boolean firstCall_;
 
   /**
    * Creates new instance
    * @param summarySetOps instance of SummarySetOperations
    */
   public Intersection(final SummarySetOperations<S> summarySetOps) {
     summarySetOps_ = summarySetOps;
-    isEmpty_ = false; // universal set at the start
-    theta_ = Long.MAX_VALUE;
-    isFirstCall_ = true;
+    empty_ = false; // universal set at the start
+    thetaLong_ = Long.MAX_VALUE;
+    hashTables_ = new HashTables();
+    firstCall_ = true;
   }
 
   /**
-   * Updates the internal set by intersecting it with the given sketch
-   * @param sketchIn input sketch to intersect with the internal set
+   * Updates the internal state by intersecting it with the given sketch
+   * @param sketchIn input sketch to intersect with the internal state
    */
-  @SuppressWarnings({ "unchecked", "null" })
   public void update(final Sketch<S> sketchIn) {
-    final boolean isFirstCall = isFirstCall_;
-    isFirstCall_ = false;
+    final boolean firstCall = firstCall_;
+    firstCall_ = false;
     if (sketchIn == null) {
-      isEmpty_ = true;
-      sketch_ = null;
+      empty_ = true;
+      thetaLong_ = Long.MAX_VALUE;
+      hashTables_.clear();
       return;
     }
-    theta_ = min(theta_, sketchIn.getThetaLong());
-    isEmpty_ |= sketchIn.isEmpty();
-    if (isEmpty_ || (sketchIn.getRetainedEntries() == 0)) {
-      sketch_ = null;
+    // input sketch is not null, could be first or next call
+    final long thetaLongIn = sketchIn.getThetaLong();
+    final int countIn = sketchIn.getRetainedEntries();
+    thetaLong_ = min(thetaLong_, thetaLongIn); //Theta rule
+    // Empty rule extended in case incoming sketch does not have empty bit properly set
+    empty_ |= (countIn == 0) && (thetaLongIn == Long.MAX_VALUE);
+    if (countIn == 0) {
+      hashTables_.clear();
       return;
     }
-    // assumes that constructor of QuickSelectSketch bumps the requested size up to the nearest power of 2
-    if (isFirstCall) {
-      sketch_ = new QuickSelectSketch<>(sketchIn.getRetainedEntries(), ResizeFactor.X1.lg(), null);
-      final SketchIterator<S> it = sketchIn.iterator();
-      while (it.next()) {
-        final S summary = (S)it.getSummary().copy();
-        sketch_.insert(it.getKey(), summary);
-      }
-    } else {
-      if (sketch_ == null) {
+    // input sketch will have valid entries > 0
+
+    if (firstCall) {
+      final Sketch<S> firstSketch = sketchIn;
+      //Copy firstSketch data into local instance hashTables_
+      hashTables_.fromSketch(firstSketch);
+    }
+
+    //Next Call
+    else {
+      if (hashTables_.count_ == 0) {
         return;
       }
-      final int matchSize = min(sketch_.getRetainedEntries(), sketchIn.getRetainedEntries());
-      final long[] matchKeys = new long[matchSize];
+      final Sketch<S> nextSketch = sketchIn;
+      //Match nextSketch data with local instance data, filtering by theta
+      final int maxMatchSize = min(hashTables_.count_, nextSketch.getRetainedEntries());
+
+      final long[] matchKeys = new long[maxMatchSize];
       S[] matchSummaries = null;
       int matchCount = 0;
-      final SketchIterator<S> it = sketchIn.iterator();
+
+      final SketchIterator<S> it = nextSketch.iterator();
       while (it.next()) {
-        final S summary = sketch_.find(it.getKey());
-        if (summary != null) {
-          matchKeys[matchCount] = it.getKey();
-          if (matchSummaries == null) {
-            matchSummaries = (S[]) Array.newInstance(summary.getClass(), matchSize);
-          }
-          matchSummaries[matchCount] =
-              summarySetOps_.intersection(summary, it.getSummary());
-          matchCount++;
+        final long key = it.getKey();
+        if (key >= thetaLong_) { continue; }
+        final int index = hashSearch(hashTables_.keys_, hashTables_.lgTableSize_, key);
+        if (index < 0) { continue; }
+        //Copy the intersecting items from local hashTables_
+        // sequentially into local matchKeys_ and matchSummaries_
+        final S mySummary = hashTables_.summaries_[index];
+
+        if (matchSummaries == null) {
+          matchSummaries = (S[]) Array.newInstance(mySummary.getClass(), maxMatchSize);
         }
+        matchKeys[matchCount] = key;
+        matchSummaries[matchCount] = summarySetOps_.intersection(mySummary, it.getSummary());
+        matchCount++;
+      }
+      hashTables_.fromArrays(matchKeys, matchSummaries, matchCount);
+    }
+  }
+
+  /**
+   * Updates the internal set by intersecting it with the given Theta sketch
+   * @param sketchIn input Theta Sketch to intersect with the internal state.
+   * @param summary the given proxy summary for the Theta Sketch, which doesn't have one.
+   */
+  public void update(final org.apache.datasketches.theta.Sketch sketchIn, final S summary) {

Review comment:
       For unions, the summary is checked for null.  Is the same check necessary here, or does the internal intersection on summaries already verify this?

##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,327 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if ((skA == null) || skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+
+  /**
+   * Returns the A-and-not-B set operation on the two given sketches.
+   * A null sketch argument is interpreted as an empty sketch.
+   * This is not an accumulating update.
+   *
+   * @param skA The incoming sketch for the first argument
+   * @param skB The incoming sketch for the second argument
+   * @param <S> Type of Summary
+   * @return the result as a compact sketch
+   */
+  @SuppressWarnings("unchecked")
+  public static <S extends Summary>
+        CompactSketch<S> aNotB(final Sketch<S> skA, final Sketch<S> skB) {
+    if ((skA == null) || skA.isEmpty()) {
+      return new CompactSketch<>(null, null, Long.MAX_VALUE, true);
+    }
+    //skA is not empty
+    final boolean empty = false;
+    final long thetaLongA = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    final long[] keysA = cskA.keys_;
+    final S[] summariesA = cskA.summaries_;
+    final int countA = cskA.getRetainedEntries();
+
+    if ((skB == null) || skB.isEmpty()) {
+      return new CompactSketch<>(keysA, summariesA, thetaLongA, empty);
+    }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    final long thetaLong = Math.min(thetaLongA, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(skB.keys_, countB, thetaLong);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[countA];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summariesA.getClass().getComponentType(), countA);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < countA; i++) {
+      final long key = keysA[i];
+      if ((key != 0) && (key < thetaLong)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summariesA[i];
+          nonMatches++;
+        }
+      }
+    }
+    final long[] keys = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    final S[] summaries = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    final CompactSketch<S> result =
+        new CompactSketch<>(keys, summaries, thetaLong, empty);
+    return result;
+  }
+
+  /**
+   * Returns the A-and-not-B set operation on a Tuple sketch and a Theta sketch.
+   * A null sketch argument is interpreted as an empty sketch.
+   * This is not an accumulating update.
+   *
+   * @param skA The incoming Tuple sketch for the first argument
+   * @param skB The incoming Theta sketch for the second argument
+   * @param <S> Type of Summary
+   * @return the result as a compact sketch
+   */
+  @SuppressWarnings("unchecked")
+  public static <S extends Summary>
+        CompactSketch<S> aNotB(final Sketch<S> skA, final org.apache.datasketches.theta.Sketch skB) {
+    if ((skA == null) || skA.isEmpty()) {
+      return new CompactSketch<>(null, null, Long.MAX_VALUE, true);
+    }
+    //skA is not empty
+    final boolean empty = false;
+    final long thetaLongA = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    final long[] keysA = cskA.keys_;
+    final S[] summariesA = cskA.summaries_;
+    final int countA = cskA.getRetainedEntries();
+
+    if ((skB == null) || skB.isEmpty()) {
+      return new CompactSketch<>(keysA, summariesA, thetaLongA, empty);
+    }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    final long thetaLong = Math.min(thetaLongA, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[countA];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summariesA.getClass().getComponentType(), countA);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < countA; i++) {
+      final long key = keysA[i];
+      if ((key != 0) && (key < thetaLong)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summariesA[i];
+          nonMatches++;
+        }
+      }
+    }
+    final long[] keys = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    final S[] summaries = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    final CompactSketch<S> result =
+        new CompactSketch<>(keys, summaries, thetaLong, empty);
+    return result;
+  }
+
+  /**
+   * Gets the result of this operation.
+   * @param reset if true, clears this operator to the empty state after result is returned.
+   * @return the result of this operation as a CompactSketch.
+   */
+  public CompactSketch<S> getResult(final boolean reset) {
+    if (count_ == 0) {
+      return new CompactSketch<>(null, null, thetaLong_, empty_);
+    }
+    final CompactSketch<S> result =
+        new CompactSketch<>(Arrays.copyOfRange(keys_, 0, count_),
+            Arrays.copyOfRange(summaries_, 0, count_), thetaLong_, empty_);
+    if (reset) { reset(); }
+    return result;
+  }
+
+  private static long[] extractThetaHashArray(
+      final org.apache.datasketches.theta.Sketch sketch,
+      final int count) {
+    final HashIterator itr = sketch.iterator();
+    final long[] hashArr = new long[count];
+    int ctr = 0;
+    while (itr.next()) {
+      hashArr[ctr++] = itr.get();
+    }
+    assert ctr == count;

Review comment:
       Since the count is often sourced from the sketch retained entries, is there a case for another count?  Is it a performance concern to cache the count in the caller?

##########
File path: src/main/java/org/apache/datasketches/tuple/Intersection.java
##########
@@ -19,98 +19,176 @@
 
 package org.apache.datasketches.tuple;
 
+import static java.lang.Math.ceil;
+import static java.lang.Math.max;
 import static java.lang.Math.min;
+import static org.apache.datasketches.HashOperations.hashInsertOnly;
+import static org.apache.datasketches.HashOperations.hashSearch;
+import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
+import static org.apache.datasketches.Util.ceilingPowerOf2;
 
 import java.lang.reflect.Array;
 
-import org.apache.datasketches.ResizeFactor;
 import org.apache.datasketches.SketchesStateException;
 
+
 /**
  * Computes an intersection of two or more generic tuple sketches.
- * A new instance represents the Universal Set.
- * Every update() computes an intersection with the internal set
- * and can only reduce the internal set.
+ * A new instance represents the Universal Set. Because the Universal Set
+ * cannot be realized a <i>getResult()</i> on a new instance will produce an error.
+ * Every update() computes an intersection with the internal state, which will never
+ * grow larger and may be reduced to zero.
+ *
  * @param <S> Type of Summary
  */
+@SuppressWarnings("unchecked")
 public class Intersection<S extends Summary> {
-
   private final SummarySetOperations<S> summarySetOps_;
-  private QuickSelectSketch<S> sketch_;
-  private boolean isEmpty_;
-  private long theta_;
-  private boolean isFirstCall_;
+  //private QuickSelectSketch<S> sketch_;

Review comment:
       Is it intentional to leave this local state commented?

##########
File path: src/main/java/org/apache/datasketches/tuple/Intersection.java
##########
@@ -19,98 +19,176 @@
 
 package org.apache.datasketches.tuple;
 
+import static java.lang.Math.ceil;
+import static java.lang.Math.max;
 import static java.lang.Math.min;
+import static org.apache.datasketches.HashOperations.hashInsertOnly;
+import static org.apache.datasketches.HashOperations.hashSearch;
+import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
+import static org.apache.datasketches.Util.ceilingPowerOf2;
 
 import java.lang.reflect.Array;
 
-import org.apache.datasketches.ResizeFactor;
 import org.apache.datasketches.SketchesStateException;
 
+
 /**
  * Computes an intersection of two or more generic tuple sketches.
- * A new instance represents the Universal Set.
- * Every update() computes an intersection with the internal set
- * and can only reduce the internal set.
+ * A new instance represents the Universal Set. Because the Universal Set
+ * cannot be realized a <i>getResult()</i> on a new instance will produce an error.
+ * Every update() computes an intersection with the internal state, which will never
+ * grow larger and may be reduced to zero.
+ *

Review comment:
       This additional documentation is great!

##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,327 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if ((skA == null) || skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta

Review comment:
       How do you interpret a hashed key that equals 0?

##########
File path: src/test/java/org/apache/datasketches/tuple/adouble/AdoubleIntersectionTest.java
##########
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.datasketches.tuple.adouble;
+
+import static org.testng.Assert.fail;
+
+import org.apache.datasketches.SketchesStateException;
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.datasketches.theta.UpdateSketchBuilder;
+import org.apache.datasketches.tuple.CompactSketch;
+import org.apache.datasketches.tuple.Intersection;
+import org.apache.datasketches.tuple.Sketch;
+import org.apache.datasketches.tuple.SketchIterator;
+import org.apache.datasketches.tuple.Sketches;
+import org.apache.datasketches.tuple.UpdatableSketch;
+import org.apache.datasketches.tuple.UpdatableSketchBuilder;
+import org.apache.datasketches.tuple.adouble.DoubleSummary.Mode;
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+/**
+ * @author Lee Rhodes
+ */
+@SuppressWarnings("javadoc")
+public class AdoubleIntersectionTest {
+  private final DoubleSummary.Mode mode = Mode.Sum;
+
+  @Test
+  public void intersectionNotEmptyNoEntries() {
+    UpdatableSketch<Double, DoubleSummary> sketch1 =
+        new UpdatableSketchBuilder<>
+          (new DoubleSummaryFactory(mode)).setSamplingProbability(0.01f).build();
+    sketch1.update("a", 1.0); // this happens to get rejected because of sampling with low probability
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    intersection.update(sketch1);
+    CompactSketch<DoubleSummary> result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    Assert.assertFalse(result.isEmpty());
+    Assert.assertEquals(result.getEstimate(), 0.0);
+    Assert.assertEquals(result.getLowerBound(1), 0.0, 0.0001);
+    Assert.assertTrue(result.getUpperBound(1) > 0);
+  }
+
+  @Test
+  public void intersectionExactWithNull() {
+    UpdatableSketch<Double, DoubleSummary> sketch1 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    sketch1.update(1, 1.0);
+    sketch1.update(2, 1.0);
+    sketch1.update(3, 1.0);
+
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    intersection.update(sketch1);
+    intersection.update(null);
+    CompactSketch<DoubleSummary> result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    Assert.assertTrue(result.isEmpty());
+    Assert.assertEquals(result.getEstimate(), 0.0);
+    Assert.assertEquals(result.getLowerBound(1), 0.0);
+    Assert.assertEquals(result.getUpperBound(1), 0.0);
+  }
+
+  @Test
+  public void intersectionExactWithEmpty() {
+    UpdatableSketch<Double, DoubleSummary> sketch1 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    sketch1.update(1, 1.0);
+    sketch1.update(2, 1.0);
+    sketch1.update(3, 1.0);
+
+    Sketch<DoubleSummary> sketch2 = Sketches.createEmptySketch();
+
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    intersection.update(sketch1);
+    intersection.update(sketch2);
+    CompactSketch<DoubleSummary> result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    Assert.assertTrue(result.isEmpty());
+    Assert.assertEquals(result.getEstimate(), 0.0);
+    Assert.assertEquals(result.getLowerBound(1), 0.0);
+    Assert.assertEquals(result.getUpperBound(1), 0.0);
+  }
+
+  @Test
+  public void intersectionExactMode() {
+    UpdatableSketch<Double, DoubleSummary> sketch1 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    sketch1.update(1, 1.0);
+    sketch1.update(1, 1.0);
+    sketch1.update(2, 1.0);
+    sketch1.update(2, 1.0);
+
+    UpdatableSketch<Double, DoubleSummary> sketch2 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    sketch2.update(2, 1.0);
+    sketch2.update(2, 1.0);
+    sketch2.update(3, 1.0);
+    sketch2.update(3, 1.0);
+
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    intersection.update(sketch1);
+    intersection.update(sketch2);
+    CompactSketch<DoubleSummary> result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 1);
+    Assert.assertFalse(result.isEmpty());
+    Assert.assertEquals(result.getEstimate(), 1.0);
+    Assert.assertEquals(result.getLowerBound(1), 1.0);
+    Assert.assertEquals(result.getUpperBound(1), 1.0);
+    SketchIterator<DoubleSummary> it = result.iterator();
+    Assert.assertTrue(it.next());
+    Assert.assertEquals(it.getSummary().getValue(), 4.0);
+    Assert.assertFalse(it.next());
+
+    intersection.reset();
+    intersection.update(null);
+    result = intersection.getResult();
+    Assert.assertTrue(result.isEmpty());
+    Assert.assertFalse(result.isEstimationMode());
+    Assert.assertEquals(result.getEstimate(), 0.0);
+    Assert.assertEquals(result.getUpperBound(1), 0.0);
+    Assert.assertEquals(result.getLowerBound(1), 0.0);
+    Assert.assertEquals(result.getTheta(), 1.0);
+}
+
+  @Test
+  public void intersectionDisjointEstimationMode() {
+    int key = 0;
+    UpdatableSketch<Double, DoubleSummary> sketch1 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    for (int i = 0; i < 8192; i++) {
+      sketch1.update(key++, 1.0);
+    }
+
+    UpdatableSketch<Double, DoubleSummary> sketch2 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    for (int i = 0; i < 8192; i++) {
+      sketch2.update(key++, 1.0);
+    }
+
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    intersection.update(sketch1);
+    intersection.update(sketch2);
+    CompactSketch<DoubleSummary> result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    Assert.assertFalse(result.isEmpty());
+    Assert.assertEquals(result.getEstimate(), 0.0);
+    Assert.assertEquals(result.getLowerBound(1), 0.0);
+    Assert.assertTrue(result.getUpperBound(1) > 0);
+
+    // an intersection with no entries must survive more updates
+    intersection.update(sketch1);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    Assert.assertFalse(result.isEmpty());
+    Assert.assertEquals(result.getEstimate(), 0.0);
+    Assert.assertEquals(result.getLowerBound(1), 0.0);
+    Assert.assertTrue(result.getUpperBound(1) > 0);
+  }
+
+  @Test
+  public void intersectionEstimationMode() {
+    int key = 0;
+    UpdatableSketch<Double, DoubleSummary> sketch1 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    for (int i = 0; i < 8192; i++) {
+      sketch1.update(key++, 1.0);
+    }
+
+    key -= 4096; // overlap half of the entries
+    UpdatableSketch<Double, DoubleSummary> sketch2 =
+        new UpdatableSketchBuilder<>(new DoubleSummaryFactory(mode)).build();
+    for (int i = 0; i < 8192; i++) {
+      sketch2.update(key++, 1.0);
+    }
+
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    intersection.update(sketch1);
+    intersection.update(sketch2);
+    CompactSketch<DoubleSummary> result = intersection.getResult();
+    Assert.assertFalse(result.isEmpty());
+    // crude estimate of RSE(95%) = 2 / sqrt(result.getRetainedEntries())
+    Assert.assertEquals(result.getEstimate(), 4096.0, 4096 * 0.03);
+    Assert.assertTrue(result.getLowerBound(1) <= result.getEstimate());
+    Assert.assertTrue(result.getUpperBound(1) > result.getEstimate());
+    SketchIterator<DoubleSummary> it = result.iterator();
+    while (it.next()) {
+      Assert.assertEquals(it.getSummary().getValue(), 2.0);
+    }
+  }
+
+  @Test
+  public void checkExactIntersectionWithTheta() {
+    UpdateSketch thSkNull = null;
+    UpdateSketch thSkEmpty = new UpdateSketchBuilder().build();
+    UpdateSketch thSk10 = new UpdateSketchBuilder().build();
+    UpdateSketch thSk15 = new UpdateSketchBuilder().build();
+    for (int i = 0; i < 10; i++) { thSk10.update(i); }
+    for (int i = 0; i < 10; i++) { thSk15.update(i + 5); } //overlap = 5
+
+    DoubleSummary dsum = new DoubleSummaryFactory(mode).newSummary();
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    CompactSketch<DoubleSummary> result;
+
+    try { intersection.getResult(); fail(); }
+    catch (SketchesStateException e ) { } //OK.
+
+    intersection.update(thSkNull, dsum);
+    result = intersection.getResult();
+    Assert.assertTrue(result.isEmpty()); //Empty after null first call
+    intersection.reset();
+
+    intersection.update(thSkEmpty, dsum);
+    result = intersection.getResult();
+    Assert.assertTrue(result.isEmpty()); //Empty after empty first call
+    intersection.reset();
+
+    intersection.update(thSk10, dsum);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getEstimate(), 10.0); //Returns valid first call
+    intersection.reset();
+
+    intersection.update(thSk10, dsum);  // Valid first call
+    intersection.update(thSkNull, dsum);
+    result = intersection.getResult();
+    Assert.assertTrue(result.isEmpty()); //Returns Empty after null second call
+    intersection.reset();
+
+    intersection.update(thSk10, dsum);  // Valid first call
+    intersection.update(thSkEmpty, dsum);
+    result = intersection.getResult();
+    Assert.assertTrue(result.isEmpty()); //Returns Empty after empty second call
+    intersection.reset();
+
+    intersection.update(thSk10, dsum);
+    intersection.update(thSk15, dsum);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getEstimate(), 5.0); //Returns intersection
+    intersection.reset();
+  }
+
+  @Test
+  public void checkExactIntersectionWithThetaDisjoint() {
+    UpdateSketch thSkA = new UpdateSketchBuilder().setLogNominalEntries(10).build();
+    UpdateSketch thSkB = new UpdateSketchBuilder().setLogNominalEntries(10).build();
+    int key = 0;
+    for (int i = 0; i < 32;  i++) { thSkA.update(key++); }
+    for (int i = 0; i < 32; i++) { thSkB.update(key++); }
+
+    DoubleSummary dsum = new DoubleSummaryFactory(mode).newSummary();
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    CompactSketch<DoubleSummary> result;
+
+    intersection.update(thSkA, dsum);
+    intersection.update(thSkB, dsum);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+
+    // an intersection with no entries must survive more updates
+    intersection.update(thSkA, dsum);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    intersection.reset();
+  }
+
+  @Test
+  public void checkEstimatingIntersectionWithThetaOverlapping() {
+    UpdateSketch thSkA = new UpdateSketchBuilder().setLogNominalEntries(4).build();
+    UpdateSketch thSkB = new UpdateSketchBuilder().setLogNominalEntries(10).build();
+    for (int i = 0; i < 64;  i++) { thSkA.update(i); } //dense mode, low theta
+    for (int i = 32; i < 96; i++) { thSkB.update(i); } //exact overlapping
+
+    DoubleSummary dsum = new DoubleSummaryFactory(mode).newSummary();
+    Intersection<DoubleSummary> intersection =
+        new Intersection<>(new DoubleSummarySetOperations(mode, mode));
+    CompactSketch<DoubleSummary> result;
+
+    intersection.update(thSkA, dsum);
+    intersection.update(thSkB, dsum);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 14);
+
+    thSkB.reset();
+    for (int i = 100; i < 164; i++) { thSkB.update(i); } //exact, disjoint
+    intersection.update(thSkB, dsum); //remove existing entries
+    result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);
+    intersection.update(thSkB, dsum);
+    result = intersection.getResult();
+    Assert.assertEquals(result.getRetainedEntries(), 0);

Review comment:
       When I attempted adding the set intersection with theta, I struggled with how many of the existing test cases to duplicate for theta.  I think that you have struck a good balance here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-639171298


   Similar null related changes need to be made in Theta, but that will be another PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433402959



##########
File path: src/test/java/org/apache/datasketches/tuple/adouble/AdoubleAnotBTest.java
##########
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.datasketches.tuple.adouble;
+
+import static org.testng.Assert.assertEquals;
+import static org.testng.Assert.assertTrue;
+
+import org.apache.datasketches.theta.UpdateSketch;
+import org.apache.datasketches.theta.UpdateSketchBuilder;
+import org.apache.datasketches.tuple.AnotB;
+import org.apache.datasketches.tuple.CompactSketch;
+import org.apache.datasketches.tuple.Sketch;
+import org.apache.datasketches.tuple.SketchIterator;
+import org.apache.datasketches.tuple.UpdatableSketch;
+import org.apache.datasketches.tuple.UpdatableSketchBuilder;
+import org.apache.datasketches.tuple.adouble.DoubleSummary.Mode;
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+/**
+ * @author Lee Rhodes
+ */
+@SuppressWarnings("javadoc")
+public class AdoubleAnotBTest {
+  private final DoubleSummary.Mode mode = Mode.Sum;
+  private final Results results = new Results();
+
+  private static void threeMethodsWithTheta(
+      final AnotB<DoubleSummary> aNotB,
+      final Sketch<DoubleSummary> sketchA,
+      final Sketch<DoubleSummary> sketchB,
+      final org.apache.datasketches.theta.Sketch skThetaB,
+      final Results results) {
+    //deprecated
+    aNotB.update(sketchA, sketchB);
+    CompactSketch<DoubleSummary> result = aNotB.getResult();
+    results.check(result);
+    //Stateless
+    result = AnotB.aNotB(sketchA, sketchB);
+    results.check(result);
+    result = AnotB.aNotB(sketchA, skThetaB);
+    //Stateful w B = Tuple
+    aNotB.setA(sketchA);
+    aNotB.notB(sketchB);
+    result = aNotB.getResult(true);
+    results.check(result);
+    //Stateful w B = Theta
+    aNotB.setA(sketchA);
+    aNotB.notB(skThetaB);
+    result = aNotB.getResult(false);
+    results.check(result);
+    result = aNotB.getResult(true);
+    results.check(result);
+  }
+
+  private static class Results {
+    private int retEnt = 0;
+    private boolean empty = true;
+    private double expect = 0.0;
+    private double toll = 0.0;

Review comment:
       Yes. You can see that it is used in the _TestNG Assert.assertEquals(double, double double)_ call where the 3rd argument is an error tolerance on the comparison.  But my abbreviation is misspelled!  It should be _tol_!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-639065892


   I discussed it with my colleagues and we agree with your suggestion to throw exceptions on null inputs when they could change the state of the sketch, or no meaningful result can be returned.  If a null has no impact then it will just be ignored.
   
   Thank you, your feedback helped us resolve what to do here!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-640098235


   This has been reviewed and approved, so I'm going to merge this.  However, there is more work to be done, which will be separate PRs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] davecromberge commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
davecromberge commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-637097701


   > This is an excellent review! Thank you! I think the protocol is for you to "Resolve" the issues you found.
   
   Thanks for taking the time to respond to my comments, and for your helpful explanations!  Unfortunately, I do not have the appropriate permissions to resolve my comments, as the button is not available to me on this repository.  I know what you are referring to, but just don't have access in this case.
   I have instead opted to mark them all with a +1, as they have been addressed.
   
   One comment that I also did not raise is the Array of Doubles sketch - please let me know if there is any way in which I can help with this.  I realise that this has been discussed in the dev forum and is probably beyond the scope of this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on a change in pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
leerho commented on a change in pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#discussion_r433386568



##########
File path: src/main/java/org/apache/datasketches/tuple/AnotB.java
##########
@@ -22,45 +22,327 @@
 import static org.apache.datasketches.Util.MIN_LG_NOM_LONGS;
 import static org.apache.datasketches.Util.REBUILD_THRESHOLD;
 import static org.apache.datasketches.Util.ceilingPowerOf2;
+import static org.apache.datasketches.Util.simpleIntLog2;
 
 import java.lang.reflect.Array;
 import java.util.Arrays;
 
 import org.apache.datasketches.HashOperations;
+import org.apache.datasketches.theta.HashIterator;
 
 /**
- * Computes a set difference of two generic tuple sketches
+ * Computes a set difference, A-AND-NOT-B, of two generic tuple sketches
  * @param <S> Type of Summary
  */
 public final class AnotB<S extends Summary> {
-  private boolean isEmpty_ = true;
-  private long theta_ = Long.MAX_VALUE;
-  private long[] keys_;
-  private S[] summaries_;
-  private int count_;
+  private boolean empty_ = true;
+  private long thetaLong_ = Long.MAX_VALUE;
+  private long[] keys_ = null;   //always in compact form, not necessarily sorted
+  private S[] summaries_ = null; //always in compact form, not necessarily sorted
+  private int count_ = 0;
+
+  /**
+   * Sets the given Tuple sketch as the first argument <i>A</i>. This overwrites the internal state of
+   * this AnotB operator with the contents of the given sketch. This sets the stage for multiple
+   * following <i>notB</i> operations.
+   * @param skA The incoming sketch for the first argument, <i>A</i>.
+   */
+  public void setA(final Sketch<S> skA) {
+    if ((skA == null) || skA.isEmpty()) { return; }
+    //skA is not empty
+    empty_ = false;
+    thetaLong_ = skA.getThetaLong();
+    final CompactSketch<S> cskA = (skA instanceof CompactSketch)
+        ? (CompactSketch<S>)skA
+        : ((QuickSelectSketch<S>)skA).compact();
+    keys_ = cskA.keys_;
+    summaries_ = cskA.summaries_;
+    count_ = cskA.getRetainedEntries();
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Tuple sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final Sketch<S> skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB = convertToHashTable(skB.keys_, countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta
+        final int index =
+            HashOperations.hashSearch(hashTableKeysB, simpleIntLog2(hashTableKeysB.length), key);
+        if (index == -1) {
+          tmpKeysA[nonMatches] = key;
+          tmpSummariesA[nonMatches] = summaries_[i];
+          nonMatches++;
+        }
+      }
+    }
+    keys_ = Arrays.copyOfRange(tmpKeysA, 0, nonMatches);
+    summaries_ = Arrays.copyOfRange(tmpSummariesA, 0, nonMatches);
+    count_ = nonMatches;
+  }
+
+  /**
+   * Performs an <i>AND NOT</i> operation with the existing internal state of this AnoB operator.
+   * @param skB The incoming Theta sketch for the second (or following) argument <i>B</i>.
+   */
+  @SuppressWarnings("unchecked")
+  public void notB(final org.apache.datasketches.theta.Sketch skB) {
+    if (empty_) { return; }
+    if ((skB == null) || skB.isEmpty()) { return; }
+    //skB is not empty
+    final long thetaLongB = skB.getThetaLong();
+    thetaLong_ = Math.min(thetaLong_, thetaLongB);
+    //Build hashtable and removes keys of skB >= theta
+    final int countB = skB.getRetainedEntries();
+    final long[] hashTableKeysB =
+        convertToHashTable(extractThetaHashArray(skB, countB), countB, thetaLong_);
+
+    //build temporary arrays of skA
+    final long[] tmpKeysA = new long[count_];
+    final S[] tmpSummariesA =
+        (S[]) Array.newInstance(summaries_.getClass().getComponentType(), count_);
+
+    //search for non matches and build temp arrays
+    int nonMatches = 0;
+    for (int i = 0; i < count_; i++) {
+      final long key = keys_[i];
+      if ((key != 0) && (key < thetaLong_)) { //skips keys of A >= theta

Review comment:
       This calls into question whether the check (key != 0) is even necessary.  In theory, probably not.  By the way, hash keys in the theta family are forced to be positive or zero by shifting the output of the hash function right by one bit (effectively making them 63 bits of entropy).  Actually this check, if it exists at all, should probably be (key > 0).
   
   So why do I put in checks like this?  Because I am conservative and worried about the other source of non-valid hashes, which is corruption of the sketch binaries as they are moved around in a system. Which, by the way, is far more likely than a valid hash being <= 0.  So this little check just insures that invalid hashes don't get further propagated.  
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] davecromberge commented on pull request #319: Tuple theta extension

Posted by GitBox <gi...@apache.org>.
davecromberge commented on pull request #319:
URL: https://github.com/apache/incubator-datasketches-java/pull/319#issuecomment-639373396


   > I discussed it with my colleagues and we agree with your suggestion to throw exceptions on null inputs when they could change the state of the sketch, or no meaningful result can be returned. If a null has no impact then it will just be ignored.
   > 
   > Thank you, your feedback helped us resolve what to do here!
   
   I'm glad I could provide feedback, it sounds like a sensible approach.  How do you know when a null would have an effect or not?  Does it depend on the current state of the operation (i.e. `IsEmpty` or `IsFirstCall`) or is it dependent on the method that is being called, such as the stateless ANotB?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org