
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/06/15 01:11:57 UTC

[GitHub] [pinot] Jackie-Jiang opened a new pull request, #8891: Optimize dictionary lookup for IN clause

Jackie-Jiang opened a new pull request, #8891:
URL: https://github.com/apache/pinot/pull/8891

   - Cache the parsed values in the IN/NOT_IN predicate to avoid re-parsing the string values for every segment
   - Add on-heap dictionaries for the BYTES and BIG_DECIMAL data types
   - For an IN predicate with many values, cap the initial size of the dict id set at 1000 to avoid over-allocating when many of the values are not in the dictionary
   - Implement `Dictionary.indexOf()` for all data types to avoid the unnecessary string conversion
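
The first and third points above can be illustrated with a minimal sketch. It is not the code from this PR: `CachedInPredicate`, `parseCount` and `initialDictIdSetSize` are hypothetical names, and the sketch is single-threaded. It only shows the idea of parsing the string values once per query instead of once per segment, and of capping the initial set size at 1000.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an IN/NOT_IN predicate; this is not Pinot's actual class.
final class CachedInPredicate {
  // Cap taken from the PR description for the initial dict id set.
  static final int MAX_INITIAL_DICT_ID_SET_SIZE = 1000;

  private final List<String> values;
  private int[] intValues; // parsed once, on first request
  int parseCount;          // exposed only to make the caching observable

  CachedInPredicate(List<String> values) {
    this.values = values;
  }

  // Returns the values parsed as ints, parsing the strings at most once.
  int[] getIntValues() {
    if (intValues == null) {
      int[] parsed = new int[values.size()];
      for (int i = 0; i < parsed.length; i++) {
        parsed[i] = Integer.parseInt(values.get(i));
      }
      intValues = parsed;
      parseCount++;
    }
    return intValues;
  }

  // Initial size for the per-segment set of matching dict ids.
  static int initialDictIdSetSize(int numValues) {
    return Math.min(numValues, MAX_INITIAL_DICT_ID_SET_SIZE);
  }

  public static void main(String[] args) {
    CachedInPredicate predicate = new CachedInPredicate(Arrays.asList("1", "2", "3"));
    // Simulate five segments asking for the same parsed values.
    for (int segment = 0; segment < 5; segment++) {
      predicate.getIntValues();
    }
    System.out.println("parse passes: " + predicate.parseCount);
    System.out.println("initial set size for 50000 values: " + initialDictIdSetSize(50_000));
  }
}
```

Every segment-level evaluator can then call `getIntValues()` and share the same array, so the parsing cost no longer grows with the number of segments.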


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

richardstartin commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r898947064


##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java:
##########
@@ -70,67 +68,80 @@ public static BaseDictionaryBasedPredicateEvaluator newDictionaryBasedEvaluator(
    */
   public static BaseRawValueBasedPredicateEvaluator newRawValueBasedEvaluator(NotInPredicate notInPredicate,
       DataType dataType) {
-    List<String> values = notInPredicate.getValues();
-    int hashSetSize = HashUtil.getMinHashSetSize(values.size());
     switch (dataType) {
       case INT: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Integer.parseInt(value));
+        int[] intValues = notInPredicate.getIntValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(intValues.length));
+        for (int value : intValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case LONG: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Long.parseLong(value));
+        long[] longValues = notInPredicate.getLongValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(longValues.length));
+        for (long value : longValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case FLOAT: {
-        FloatSet nonMatchingValues = new FloatOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Float.parseFloat(value));
+        float[] floatValues = notInPredicate.getFloatValues();
+        FloatSet nonMatchingValues = new FloatOpenHashSet(HashUtil.getMinHashSetSize(floatValues.length));
+        for (float value : floatValues) {
+          nonMatchingValues.add(value);
         }
         return new FloatRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case DOUBLE: {
-        DoubleSet nonMatchingValues = new DoubleOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Double.parseDouble(value));
+        double[] doubleValues = notInPredicate.getDoubleValues();
+        DoubleSet nonMatchingValues = new DoubleOpenHashSet(HashUtil.getMinHashSetSize(doubleValues.length));
+        for (double value : doubleValues) {
+          nonMatchingValues.add(value);
         }
         return new DoubleRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BIG_DECIMAL: {
-        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>();
-        for (String value : values) {
-          nonMatchingValues.add(new BigDecimal(value));
-        }
+        BigDecimal[] bigDecimalValues = notInPredicate.getBigDecimalValues();
+        // NOTE: Use TreeSet because BigDecimal's compareTo() is not consistent with equals()
+        //       E.g. compareTo(3.0, 3) returns 0 but equals(3.0, 3) returns false
+        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>(Arrays.asList(bigDecimalValues));
         return new BigDecimalRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BOOLEAN: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BooleanUtils.toInt(value));
+        int[] booleanValues = notInPredicate.getBooleanValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(booleanValues.length));
+        for (int value : booleanValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case TIMESTAMP: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(TimestampUtils.toMillisSinceEpoch(value));
+        long[] timestampValues = notInPredicate.getTimestampValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(timestampValues.length));
+        for (long value : timestampValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case STRING: {
-        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        nonMatchingValues.addAll(values);
+        List<String> stringValues = notInPredicate.getValues();
+        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(stringValues.size()));
+        // NOTE: Add value-by-value to avoid overhead
+        for (String value : stringValues) {
+          //noinspection UseBulkOperation
+          nonMatchingValues.add(value);
+        }
         return new StringRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BYTES: {
-        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BytesUtils.toByteArray(value));
+        ByteArray[] bytesValues = notInPredicate.getBytesValues();
+        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(bytesValues.length));
+        // NOTE: Add value-by-value to avoid overhead
+        //noinspection ManualArrayToCollectionCopy

Review Comment:
   I’m not going to test it but would be amazed if this were beneficial. Shouldn’t hash map size be controlled by the load factor anyway?
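
For reference, the relationship between expected size and load factor can be sketched with `java.util.HashSet`, which stays in the JDK and needs no extra dependency. This is an illustration of the general point only, not a measurement of the fastutil sets used in the hunk above; `HashSetSizing`, `capacityFor` and `newSet` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; illustrates sizing a JDK HashSet from an expected element count.
final class HashSetSizing {
  // Smallest initial capacity for which `expected` elements stay at or below
  // the resize threshold (capacity * loadFactor).
  static int capacityFor(int expected, float loadFactor) {
    return (int) Math.ceil(expected / (double) loadFactor);
  }

  // Creates a set that should not need to rehash while `expected` elements are added.
  static Set<Integer> newSet(int expected, float loadFactor) {
    return new HashSet<>(capacityFor(expected, loadFactor), loadFactor);
  }

  public static void main(String[] args) {
    Set<Integer> values = newSet(100, 0.75f);
    for (int i = 0; i < 100; i++) {
      values.add(i);
    }
    System.out.println("capacity requested: " + capacityFor(100, 0.75f));
    System.out.println("elements stored: " + values.size());
  }
}
```

A lower load factor trades memory for fewer collisions, which is the knob the comment above refers to; whether it beats a fixed minimum size would need a benchmark.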




[GitHub] [pinot] xiangfu0 commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

xiangfu0 commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r898907993


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/ByteArray.java:
##########
@@ -54,6 +53,9 @@ public class ByteArray implements Comparable<ByteArray>, Serializable {
 
   private final byte[] _bytes;
 
+  // Hash for empty ByteArray is 1
+  private int _hash = 1;
+
   public ByteArray(byte[] bytes) {

Review Comment:
   Is this ByteArray reusable? If so, we should reset `_hash = 1` here.
   Alternatively, add a boolean flag that records whether the hash has already been computed in method `hash()`.
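
The boolean-flag variant suggested above could look roughly like the sketch below. `CachedHashBytes` is a hypothetical class, not Pinot's `ByteArray`, and using `Arrays.hashCode` is an assumption chosen because it also yields 1 for an empty array, matching the comment in the hunk.

```java
import java.util.Arrays;

// Hypothetical wrapper; shows lazy hash caching guarded by an explicit flag.
final class CachedHashBytes {
  private final byte[] bytes;
  private int hash;             // valid only when hashComputed is true
  private boolean hashComputed; // avoids reserving a sentinel hash value

  CachedHashBytes(byte[] bytes) {
    this.bytes = bytes;
  }

  @Override
  public int hashCode() {
    if (!hashComputed) {
      hash = Arrays.hashCode(bytes); // returns 1 for an empty array
      hashComputed = true;
    }
    return hash;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    return other instanceof CachedHashBytes && Arrays.equals(bytes, ((CachedHashBytes) other).bytes);
  }

  public static void main(String[] args) {
    CachedHashBytes value = new CachedHashBytes(new byte[] {1, 2, 3});
    // The second call returns the cached value without rescanning the bytes.
    System.out.println(value.hashCode() == value.hashCode());
  }
}
```

The flag costs one extra field per instance, but it removes the need to treat any particular hash value as "not yet computed".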




[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r899392596


##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java:
##########
@@ -70,67 +68,80 @@ public static BaseDictionaryBasedPredicateEvaluator newDictionaryBasedEvaluator(
    */
   public static BaseRawValueBasedPredicateEvaluator newRawValueBasedEvaluator(NotInPredicate notInPredicate,
       DataType dataType) {
-    List<String> values = notInPredicate.getValues();
-    int hashSetSize = HashUtil.getMinHashSetSize(values.size());
     switch (dataType) {
       case INT: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Integer.parseInt(value));
+        int[] intValues = notInPredicate.getIntValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(intValues.length));
+        for (int value : intValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case LONG: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Long.parseLong(value));
+        long[] longValues = notInPredicate.getLongValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(longValues.length));
+        for (long value : longValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case FLOAT: {
-        FloatSet nonMatchingValues = new FloatOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Float.parseFloat(value));
+        float[] floatValues = notInPredicate.getFloatValues();
+        FloatSet nonMatchingValues = new FloatOpenHashSet(HashUtil.getMinHashSetSize(floatValues.length));
+        for (float value : floatValues) {
+          nonMatchingValues.add(value);
         }
         return new FloatRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case DOUBLE: {
-        DoubleSet nonMatchingValues = new DoubleOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Double.parseDouble(value));
+        double[] doubleValues = notInPredicate.getDoubleValues();
+        DoubleSet nonMatchingValues = new DoubleOpenHashSet(HashUtil.getMinHashSetSize(doubleValues.length));
+        for (double value : doubleValues) {
+          nonMatchingValues.add(value);
         }
         return new DoubleRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BIG_DECIMAL: {
-        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>();
-        for (String value : values) {
-          nonMatchingValues.add(new BigDecimal(value));
-        }
+        BigDecimal[] bigDecimalValues = notInPredicate.getBigDecimalValues();
+        // NOTE: Use TreeSet because BigDecimal's compareTo() is not consistent with equals()
+        //       E.g. compareTo(3.0, 3) returns 0 but equals(3.0, 3) returns false
+        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>(Arrays.asList(bigDecimalValues));
         return new BigDecimalRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BOOLEAN: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BooleanUtils.toInt(value));
+        int[] booleanValues = notInPredicate.getBooleanValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(booleanValues.length));
+        for (int value : booleanValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case TIMESTAMP: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(TimestampUtils.toMillisSinceEpoch(value));
+        long[] timestampValues = notInPredicate.getTimestampValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(timestampValues.length));
+        for (long value : timestampValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case STRING: {
-        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        nonMatchingValues.addAll(values);
+        List<String> stringValues = notInPredicate.getValues();
+        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(stringValues.size()));
+        // NOTE: Add value-by-value to avoid overhead
+        for (String value : stringValues) {
+          //noinspection UseBulkOperation
+          nonMatchingValues.add(value);
+        }
         return new StringRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BYTES: {
-        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BytesUtils.toByteArray(value));
+        ByteArray[] bytesValues = notInPredicate.getBytesValues();
+        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(bytesValues.length));
+        // NOTE: Add value-by-value to avoid overhead
+        //noinspection ManualArrayToCollectionCopy

Review Comment:
   The min hash set size was introduced in #3009, with the claim that it reduced the latency of a query from 580ms to 430ms. We might want to revisit that number at some point.




[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r899403330


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/ByteArray.java:
##########
@@ -54,6 +53,9 @@ public class ByteArray implements Comparable<ByteArray>, Serializable {
 
   private final byte[] _bytes;
 
+  // Hash for empty ByteArray is 1
+  private int _hash = 1;
+
   public ByteArray(byte[] bytes) {

Review Comment:
   @xiangfu0 Good point. We don't reuse the `byte[]` in the `ByteArray` right now, but there is no way to enforce that without cloning the byte array during construction, which would add overhead.
   Added some comments to the Javadoc.
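
The trade-off described above can be made concrete with a small sketch. `SharedBytes`, `wrap` and `copyOf` are hypothetical names, not part of Pinot: wrapping avoids a copy but lets a later write to the caller's array leave the cached hash stale, while cloning pays for a copy to rule that out.

```java
import java.util.Arrays;

// Hypothetical wrapper contrasting a shared backing array with a defensive copy.
final class SharedBytes {
  private final byte[] bytes;
  private int hash;
  private boolean hashComputed;

  private SharedBytes(byte[] bytes) {
    this.bytes = bytes;
  }

  // No copy: cheap, but the caller must never modify the array afterwards.
  static SharedBytes wrap(byte[] bytes) {
    return new SharedBytes(bytes);
  }

  // Defensive copy: costs an allocation, but later caller writes cannot affect it.
  static SharedBytes copyOf(byte[] bytes) {
    return new SharedBytes(bytes.clone());
  }

  @Override
  public int hashCode() {
    if (!hashComputed) {
      hash = Arrays.hashCode(bytes);
      hashComputed = true;
    }
    return hash;
  }

  public static void main(String[] args) {
    byte[] raw = {1, 2, 3};
    SharedBytes wrapped = wrap(raw);
    SharedBytes copied = copyOf(raw);
    int wrappedHash = wrapped.hashCode();
    int copiedHash = copied.hashCode();
    raw[0] = 9; // caller reuses the array after handing it over
    System.out.println("wrapped hash is stale: " + (wrappedHash != Arrays.hashCode(raw)));
    System.out.println("copied hash unchanged: " + (copiedHash == copied.hashCode()));
  }
}
```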




[GitHub] [pinot] richardstartin commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

richardstartin commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r899399714


##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java:
##########
@@ -70,67 +68,80 @@ public static BaseDictionaryBasedPredicateEvaluator newDictionaryBasedEvaluator(
    */
   public static BaseRawValueBasedPredicateEvaluator newRawValueBasedEvaluator(NotInPredicate notInPredicate,
       DataType dataType) {
-    List<String> values = notInPredicate.getValues();
-    int hashSetSize = HashUtil.getMinHashSetSize(values.size());
     switch (dataType) {
       case INT: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Integer.parseInt(value));
+        int[] intValues = notInPredicate.getIntValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(intValues.length));
+        for (int value : intValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case LONG: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Long.parseLong(value));
+        long[] longValues = notInPredicate.getLongValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(longValues.length));
+        for (long value : longValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case FLOAT: {
-        FloatSet nonMatchingValues = new FloatOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Float.parseFloat(value));
+        float[] floatValues = notInPredicate.getFloatValues();
+        FloatSet nonMatchingValues = new FloatOpenHashSet(HashUtil.getMinHashSetSize(floatValues.length));
+        for (float value : floatValues) {
+          nonMatchingValues.add(value);
         }
         return new FloatRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case DOUBLE: {
-        DoubleSet nonMatchingValues = new DoubleOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Double.parseDouble(value));
+        double[] doubleValues = notInPredicate.getDoubleValues();
+        DoubleSet nonMatchingValues = new DoubleOpenHashSet(HashUtil.getMinHashSetSize(doubleValues.length));
+        for (double value : doubleValues) {
+          nonMatchingValues.add(value);
         }
         return new DoubleRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BIG_DECIMAL: {
-        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>();
-        for (String value : values) {
-          nonMatchingValues.add(new BigDecimal(value));
-        }
+        BigDecimal[] bigDecimalValues = notInPredicate.getBigDecimalValues();
+        // NOTE: Use TreeSet because BigDecimal's compareTo() is not consistent with equals()
+        //       E.g. compareTo(3.0, 3) returns 0 but equals(3.0, 3) returns false
+        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>(Arrays.asList(bigDecimalValues));
         return new BigDecimalRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BOOLEAN: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BooleanUtils.toInt(value));
+        int[] booleanValues = notInPredicate.getBooleanValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(booleanValues.length));
+        for (int value : booleanValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case TIMESTAMP: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(TimestampUtils.toMillisSinceEpoch(value));
+        long[] timestampValues = notInPredicate.getTimestampValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(timestampValues.length));
+        for (long value : timestampValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case STRING: {
-        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        nonMatchingValues.addAll(values);
+        List<String> stringValues = notInPredicate.getValues();
+        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(stringValues.size()));
+        // NOTE: Add value-by-value to avoid overhead
+        for (String value : stringValues) {
+          //noinspection UseBulkOperation
+          nonMatchingValues.add(value);
+        }
         return new StringRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BYTES: {
-        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BytesUtils.toByteArray(value));
+        ByteArray[] bytesValues = notInPredicate.getBytesValues();
+        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(bytesValues.length));
+        // NOTE: Add value-by-value to avoid overhead
+        //noinspection ManualArrayToCollectionCopy

Review Comment:
   I’m sure the hash sets were too small by default but this could have been resolved via the load factor, and would probably have made this even faster (it would have been nice if profiles before and after were captured for posterity).




[GitHub] [pinot] codecov-commenter commented on pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #8891:
URL: https://github.com/apache/pinot/pull/8891#issuecomment-1156019344

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/8891?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#8891](https://codecov.io/gh/apache/pinot/pull/8891?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8c8bb67) into [master](https://codecov.io/gh/apache/pinot/commit/c802786ea95cff67b83ff4d24f796b965e565854?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c802786) will **decrease** coverage by `3.41%`.
   > The diff coverage is `59.52%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #8891      +/-   ##
   ============================================
   - Coverage     69.78%   66.37%   -3.42%     
   - Complexity     4679     4693      +14     
   ============================================
     Files          1808     1358     -450     
     Lines         94235    68612   -25623     
     Branches      14052    10709    -3343     
   ============================================
   - Hits          65765    45541   -20224     
   + Misses        23908    19829    -4079     
   + Partials       4562     3242    -1320     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `66.37% <59.52%> (-0.03%)` | :arrow_down: |
   | unittests2 | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/8891?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...dictionary/BigDecimalOffHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvQmlnRGVjaW1hbE9mZkhlYXBNdXRhYmxlRGljdGlvbmFyeS5qYXZh) | `38.38% <0.00%> (-0.40%)` | :arrow_down: |
   | [.../dictionary/BigDecimalOnHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvQmlnRGVjaW1hbE9uSGVhcE11dGFibGVEaWN0aW9uYXJ5LmphdmE=) | `34.83% <0.00%> (-0.40%)` | :arrow_down: |
   | [...impl/dictionary/BytesOffHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvQnl0ZXNPZmZIZWFwTXV0YWJsZURpY3Rpb25hcnkuamF2YQ==) | `54.79% <0.00%> (-0.77%)` | :arrow_down: |
   | [.../impl/dictionary/BytesOnHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvQnl0ZXNPbkhlYXBNdXRhYmxlRGljdGlvbmFyeS5qYXZh) | `51.56% <0.00%> (-0.82%)` | :arrow_down: |
   | [...mpl/dictionary/DoubleOffHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvRG91YmxlT2ZmSGVhcE11dGFibGVEaWN0aW9uYXJ5LmphdmE=) | `36.45% <0.00%> (-0.39%)` | :arrow_down: |
   | [...impl/dictionary/DoubleOnHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvRG91YmxlT25IZWFwTXV0YWJsZURpY3Rpb25hcnkuamF2YQ==) | `32.94% <0.00%> (-0.40%)` | :arrow_down: |
   | [...impl/dictionary/FloatOffHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvRmxvYXRPZmZIZWFwTXV0YWJsZURpY3Rpb25hcnkuamF2YQ==) | `38.54% <0.00%> (-0.41%)` | :arrow_down: |
   | [.../impl/dictionary/FloatOnHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvRmxvYXRPbkhlYXBNdXRhYmxlRGljdGlvbmFyeS5qYXZh) | `34.11% <0.00%> (-0.41%)` | :arrow_down: |
   | [...e/impl/dictionary/IntOffHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvSW50T2ZmSGVhcE11dGFibGVEaWN0aW9uYXJ5LmphdmE=) | `47.91% <0.00%> (-0.51%)` | :arrow_down: |
   | [...me/impl/dictionary/IntOnHeapMutableDictionary.java](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2RpY3Rpb25hcnkvSW50T25IZWFwTXV0YWJsZURpY3Rpb25hcnkuamF2YQ==) | `42.35% <0.00%> (-0.51%)` | :arrow_down: |
   | ... and [735 more](https://codecov.io/gh/apache/pinot/pull/8891/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/8891?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/8891?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c802786...8c8bb67](https://codecov.io/gh/apache/pinot/pull/8891?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

[GitHub] [pinot] richardstartin commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

richardstartin commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r898958120


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/ByteArray.java:
##########
@@ -54,6 +53,9 @@ public class ByteArray implements Comparable<ByteArray>, Serializable {
 
   private final byte[] _bytes;
 
+  // Hash for empty ByteArray is 1
+  private int _hash = 1;
+
   public ByteArray(byte[] bytes) {

Review Comment:
   Java's `String` added a flag recording whether the hash has been computed, to avoid recomputing it on every call when the hash code happens to be 0. The same should be done here in case the hash happens to be 1 (finding {x_i} such that sum(x_i * 31^(n-i)) = 1 gives the byte arrays which collide; it's easy to construct examples, and they do occur in reality).
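
As a quick illustration of the collision described above: assuming `ByteArray.hashCode()` delegates to `java.util.Arrays.hashCode(byte[])` (the "hash for empty ByteArray is 1" comment suggests so, but the diff does not show it), the single-byte array `{-30}` already hashes to 1, because 31 * 1 + (-30) = 1. The sketch below uses a hypothetical `SentinelCachedBytes` class, not Pinot's `ByteArray`, to show that a cache using 1 as the "not computed" sentinel recomputes the hash on every call for such a value.

```java
import java.util.Arrays;

// Hypothetical stand-in for a byte[] wrapper that caches its hash
// and treats 1 as "not computed yet". Not Pinot's ByteArray.
final class SentinelCachedBytes {
  private final byte[] _bytes;
  // Sentinel value: also the real hash of an empty array
  private int _hash = 1;
  // Counts full hash computations, for demonstration only
  int _computations = 0;

  SentinelCachedBytes(byte[] bytes) {
    _bytes = bytes;
  }

  @Override
  public int hashCode() {
    int h = _hash;
    if (h == 1 && _bytes.length > 0) {
      // Assumption: the real hash is Arrays.hashCode(byte[])
      _computations++;
      _hash = h = Arrays.hashCode(_bytes);
    }
    return h;
  }

  public static void main(String[] args) {
    SentinelCachedBytes colliding = new SentinelCachedBytes(new byte[]{-30});
    SentinelCachedBytes ordinary = new SentinelCachedBytes(new byte[]{1, 2, 3});
    for (int i = 0; i < 3; i++) {
      colliding.hashCode();
      ordinary.hashCode();
    }
    System.out.println("hash of {-30}: " + Arrays.hashCode(new byte[]{-30}));
    System.out.println("computations (colliding): " + colliding._computations);
    System.out.println("computations (ordinary): " + ordinary._computations);
  }
}
```

For short arrays the repeated computation is cheap; the cost only matters for long values that are hashed many times.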




[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r899406054


##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/ByteArray.java:
##########
@@ -54,6 +53,9 @@ public class ByteArray implements Comparable<ByteArray>, Serializable {
 
   private final byte[] _bytes;
 
+  // Hash for empty ByteArray is 1
+  private int _hash = 1;
+
   public ByteArray(byte[] bytes) {

Review Comment:
   @richardstartin I'm following the `String` implementation in `adopt-openjdk-11`, which has the following check:
   ```
       public int hashCode() {
           int h = hash;
           if (h == 0 && value.length > 0) {
               hash = h = isLatin1() ? StringLatin1.hashCode(value)
                                     : StringUTF16.hashCode(value);
           }
           return h;
       }
   ```
   I assume the collision will be super rare and not worth the overhead of storing an extra boolean field. Do you know if this implementation has changed in newer JDK versions?
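
For comparison, a flag-based cache along the lines richardstartin suggests could look like the sketch below. To my knowledge, newer JDKs (13 and later) added a similar `hashIsZero` boolean to `String`; treat that as something to verify against the JDK source you target. `FlagCachedBytes` and its field names are hypothetical, and delegating to `Arrays.hashCode(byte[])` is an assumption about how `ByteArray` computes its hash.

```java
import java.util.Arrays;

// Hypothetical sketch of a hash cache that stays correct when the real hash
// equals the sentinel. Not Pinot's ByteArray.
final class FlagCachedBytes {
  private final byte[] _bytes;
  // 1 doubles as the "not computed" sentinel, as in the PR
  private int _hash = 1;
  // Set only when the computed hash really is 1
  private boolean _hashIsOne;
  // Counts full hash computations, for demonstration only
  int _computations = 0;

  FlagCachedBytes(byte[] bytes) {
    _bytes = bytes;
  }

  @Override
  public int hashCode() {
    int h = _hash;
    if (h == 1 && !_hashIsOne) {
      _computations++;
      // Assumption: the real hash is Arrays.hashCode(byte[])
      h = Arrays.hashCode(_bytes);
      // Write exactly one field per computation, so a racing reader never
      // observes a flag that disagrees with the cached value.
      if (h == 1) {
        _hashIsOne = true;
      } else {
        _hash = h;
      }
    }
    return h;
  }

  public static void main(String[] args) {
    FlagCachedBytes colliding = new FlagCachedBytes(new byte[]{-30});
    for (int i = 0; i < 3; i++) {
      colliding.hashCode();
    }
    System.out.println("hash: " + colliding.hashCode());
    System.out.println("computations: " + colliding._computations);
  }
}
```

Whether the extra boolean increases the object's footprint depends on the JVM's field layout and alignment, so it is worth measuring rather than assuming either way.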




[GitHub] [pinot] Jackie-Jiang merged pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang merged PR #8891:
URL: https://github.com/apache/pinot/pull/8891



[GitHub] [pinot] richardstartin commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

richardstartin commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r897801870


##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java:
##########
@@ -70,67 +68,80 @@ public static BaseDictionaryBasedPredicateEvaluator newDictionaryBasedEvaluator(
    */
   public static BaseRawValueBasedPredicateEvaluator newRawValueBasedEvaluator(NotInPredicate notInPredicate,
       DataType dataType) {
-    List<String> values = notInPredicate.getValues();
-    int hashSetSize = HashUtil.getMinHashSetSize(values.size());
     switch (dataType) {
       case INT: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Integer.parseInt(value));
+        int[] intValues = notInPredicate.getIntValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(intValues.length));
+        for (int value : intValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case LONG: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Long.parseLong(value));
+        long[] longValues = notInPredicate.getLongValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(longValues.length));
+        for (long value : longValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case FLOAT: {
-        FloatSet nonMatchingValues = new FloatOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Float.parseFloat(value));
+        float[] floatValues = notInPredicate.getFloatValues();
+        FloatSet nonMatchingValues = new FloatOpenHashSet(HashUtil.getMinHashSetSize(floatValues.length));
+        for (float value : floatValues) {
+          nonMatchingValues.add(value);
         }
         return new FloatRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case DOUBLE: {
-        DoubleSet nonMatchingValues = new DoubleOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(Double.parseDouble(value));
+        double[] doubleValues = notInPredicate.getDoubleValues();
+        DoubleSet nonMatchingValues = new DoubleOpenHashSet(HashUtil.getMinHashSetSize(doubleValues.length));
+        for (double value : doubleValues) {
+          nonMatchingValues.add(value);
         }
         return new DoubleRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BIG_DECIMAL: {
-        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>();
-        for (String value : values) {
-          nonMatchingValues.add(new BigDecimal(value));
-        }
+        BigDecimal[] bigDecimalValues = notInPredicate.getBigDecimalValues();
+        // NOTE: Use TreeSet because BigDecimal's compareTo() is not consistent with equals()
+        //       E.g. compareTo(3.0, 3) returns 0 but equals(3.0, 3) returns false
+        TreeSet<BigDecimal> nonMatchingValues = new TreeSet<>(Arrays.asList(bigDecimalValues));
         return new BigDecimalRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BOOLEAN: {
-        IntSet nonMatchingValues = new IntOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BooleanUtils.toInt(value));
+        int[] booleanValues = notInPredicate.getBooleanValues();
+        IntSet nonMatchingValues = new IntOpenHashSet(HashUtil.getMinHashSetSize(booleanValues.length));
+        for (int value : booleanValues) {
+          nonMatchingValues.add(value);
         }
         return new IntRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case TIMESTAMP: {
-        LongSet nonMatchingValues = new LongOpenHashSet(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(TimestampUtils.toMillisSinceEpoch(value));
+        long[] timestampValues = notInPredicate.getTimestampValues();
+        LongSet nonMatchingValues = new LongOpenHashSet(HashUtil.getMinHashSetSize(timestampValues.length));
+        for (long value : timestampValues) {
+          nonMatchingValues.add(value);
         }
         return new LongRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case STRING: {
-        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        nonMatchingValues.addAll(values);
+        List<String> stringValues = notInPredicate.getValues();
+        Set<String> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(stringValues.size()));
+        // NOTE: Add value-by-value to avoid overhead
+        for (String value : stringValues) {
+          //noinspection UseBulkOperation
+          nonMatchingValues.add(value);
+        }
         return new StringRawValueBasedNotInPredicateEvaluator(notInPredicate, nonMatchingValues);
       }
       case BYTES: {
-        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(hashSetSize);
-        for (String value : values) {
-          nonMatchingValues.add(BytesUtils.toByteArray(value));
+        ByteArray[] bytesValues = notInPredicate.getBytesValues();
+        Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(HashUtil.getMinHashSetSize(bytesValues.length));
+        // NOTE: Add value-by-value to avoid overhead
+        //noinspection ManualArrayToCollectionCopy

Review Comment:
   What is the overhead being avoided here? Have you compared with 
   
   ```java
           Set<ByteArray> nonMatchingValues = new ObjectOpenHashSet<>(Arrays.asList(notInPredicate.getBytesValues()));
   ```
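
Separately, the `BIG_DECIMAL` branch in the hunk above relies on `TreeSet` because `BigDecimal.equals()` takes scale into account while `compareTo()` does not. A minimal stdlib-only sketch of the difference (the class name `BigDecimalSetDemo` and its helper methods are just for illustration):

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class BigDecimalSetDemo {
  // Membership via hashCode()/equals(): 3.0 and 3 are different elements
  static boolean hashSetContains(BigDecimal[] values, BigDecimal probe) {
    Set<BigDecimal> set = new HashSet<>(Arrays.asList(values));
    return set.contains(probe);
  }

  // Membership via compareTo(): 3.0 and 3 are the same element
  static boolean treeSetContains(BigDecimal[] values, BigDecimal probe) {
    Set<BigDecimal> set = new TreeSet<>(Arrays.asList(values));
    return set.contains(probe);
  }

  public static void main(String[] args) {
    BigDecimal[] values = {new BigDecimal("3.0")};
    BigDecimal probe = new BigDecimal("3");
    System.out.println("equals:    " + values[0].equals(probe));
    System.out.println("compareTo: " + values[0].compareTo(probe));
    System.out.println("HashSet contains: " + hashSetContains(values, probe));
    System.out.println("TreeSet contains: " + treeSetContains(values, probe));
  }
}
```

With a hash-based set, a predicate value of `3` would fail to exclude a stored `3.0`, which is the behavior the `TreeSet` avoids.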




[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8891: Optimize dictionary lookup for IN clause

Posted by GitBox <gi...@apache.org>.

Jackie-Jiang commented on code in PR #8891:
URL: https://github.com/apache/pinot/pull/8891#discussion_r898206930



Review Comment:
   Directly constructing the set from a list won't honor the min hash set size (not sure how much it helps, but I don't want to couple that change with this one).
   
   I decided to keep the value-by-value add to skip the redundant capacity check in `ObjectOpenHashSet.addAll()`, because we already set the proper capacity up front. I also want to keep the behavior the same for all data types so that it is easier to track.
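
To make the trade-off concrete, here is a rough stdlib analogue using `java.util.HashSet` rather than fastutil's `ObjectOpenHashSet`, so the exact sizing rules differ. `minHashSetSize` is a hypothetical stand-in for `HashUtil.getMinHashSetSize()`, whose real formula is not shown in this thread; the sketch only illustrates pre-sizing the set and then adding value by value, versus letting a copy constructor pick the capacity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class PresizedSetDemo {
  // Hypothetical helper: a capacity large enough to hold `expected` entries
  // at java.util.HashSet's default load factor of 0.75 without a rehash.
  static int minHashSetSize(int expected) {
    return (int) Math.ceil(expected / 0.75);
  }

  // Pre-size once, then add value by value (the approach kept in the PR).
  static Set<String> presized(String[] values) {
    Set<String> set = new HashSet<>(minHashSetSize(values.length));
    for (String value : values) {
      set.add(value);
    }
    return set;
  }

  // Let the copy constructor choose the capacity (the suggested alternative).
  static Set<String> copyConstructed(String[] values) {
    return new HashSet<>(Arrays.asList(values));
  }

  public static void main(String[] args) {
    String[] values = {"a", "b", "c", "a"};
    System.out.println(presized(values).equals(copyConstructed(values)));
  }
}
```

Both variants produce the same set contents; they differ only in who decides the initial capacity and whether a bulk-add capacity check runs, so any performance difference would need a benchmark to confirm.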


