You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/10/05 20:12:07 UTC

[GitHub] [pinot] KKcorps opened a new pull request, #9538: Improve primary key serialization performance

KKcorps opened a new pull request, #9538:
URL: https://github.com/apache/pinot/pull/9538

   Replacing java serializable based serde with bytebuffer based serialization for faster speed.
   
   ```
   Benchmark                          Mode  Cnt        Score        Error  Units
   BenchmarkPrimaryKey.benchNewSerde  avgt    5   301831.959 ±  15822.247  ns/op
   BenchmarkPrimaryKey.benchOldSerde  avgt    5  7976890.809 ± 747122.722  ns/op
   ```
   
   Can be optimized further by passing the data types in the PrimaryKey constructor thus eliminating the need for so many if else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #9538:
URL: https://github.com/apache/pinot/pull/9538#issuecomment-1270598894

   > Nice - this is an insane improvement! Can you add tests to make sure that this serialization behavior is identical to `SerializationUtils#serialize`? (Does that matter?)
   
   It won't be the same, but we are not persisting it anywhere so it is okay to be not backward compatible


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] agavra commented on pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
agavra commented on PR #9538:
URL: https://github.com/apache/pinot/pull/9538#issuecomment-1270643833

   Hmm, also thinking about it more - if we're changing the bytes, then the hash would be different as well, wouldn't that make anything that relies on that for partitioning incorrect?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] agavra commented on pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
agavra commented on PR #9538:
URL: https://github.com/apache/pinot/pull/9538#issuecomment-1270706617

   Discussed offline with @Jackie-Jiang - makes sense to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] agavra commented on pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
agavra commented on PR #9538:
URL: https://github.com/apache/pinot/pull/9538#issuecomment-1270667756

   > Yes, it will be much faster. We already support hashing for primary keys. But users want to avoid the hash due to collisions and hence serialize the key itself.
   
   I'm still not sure I understand. It looks like there is only one place `asBytes` is used: and thats in `HashUtils#hashPrimaryKey`. Why not change the API to: `PrimaryKey#hash( HashFunction hashFunction)` and compute the hash directly?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #9538:
URL: https://github.com/apache/pinot/pull/9538#discussion_r989416345


##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +40,56 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int arraySize = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value instanceof Integer) {
+        arraySize += Integer.BYTES;
+      } else if (value instanceof Double) {
+        arraySize += Double.BYTES;
+      } else if (value instanceof Float) {
+        arraySize += Float.BYTES;
+      } else if (value instanceof Long) {
+        arraySize += Long.BYTES;
+      } else if (value instanceof String) {
+        cache[i] = ((String) value).getBytes(StandardCharsets.UTF_8);
+        arraySize += cache[i].length + Integer.BYTES;
+      } else if (value instanceof BigDecimal) {
+        cache[i] = BigDecimalUtils.serialize((BigDecimal) value);
+        arraySize += cache[i].length + Integer.BYTES;
+      } else if (value instanceof byte[]) {
+        arraySize += ((byte[]) value).length + Integer.BYTES;
+      } else {
+        throw new IllegalStateException("Data type not supported for serializing Primary Key");
+      }
+    }
+
+    ByteBuffer byteBuffer = ByteBuffer.allocate(arraySize);
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value instanceof Integer) {
+        byteBuffer.putInt((Integer) value);
+      } else if (value instanceof Double) {
+        byteBuffer.putDouble((Double) value);
+      } else if (value instanceof Float) {
+        byteBuffer.putFloat((Float) value);
+      } else if (value instanceof Long) {
+        byteBuffer.putLong((Long) value);
+      } else if (value instanceof String) {
+        byteBuffer.putInt(cache[i].length);
+        byteBuffer.put(cache[i]);
+      } else if (value instanceof BigDecimal) {
+        byteBuffer.putInt(cache[i].length);
+        byteBuffer.put(cache[i]);
+      } else if (value instanceof byte[]) {
+        byteBuffer.putInt(cache[i].length);
+        byteBuffer.put(cache[i]);
+      } else {
+        throw new IllegalStateException("Data type not supported for serializing Primary Key");
+      }

Review Comment:
   This part can be simplified
   ```suggestion
         } else {
           byteBuffer.putInt(cache[i].length);
           byteBuffer.put(cache[i]);
         }
   ```



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +40,56 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int arraySize = 0;

Review Comment:
   (optional) `sizeInBytes`



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +40,56 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int arraySize = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value instanceof Integer) {
+        arraySize += Integer.BYTES;
+      } else if (value instanceof Double) {
+        arraySize += Double.BYTES;
+      } else if (value instanceof Float) {
+        arraySize += Float.BYTES;
+      } else if (value instanceof Long) {
+        arraySize += Long.BYTES;
+      } else if (value instanceof String) {
+        cache[i] = ((String) value).getBytes(StandardCharsets.UTF_8);
+        arraySize += cache[i].length + Integer.BYTES;
+      } else if (value instanceof BigDecimal) {
+        cache[i] = BigDecimalUtils.serialize((BigDecimal) value);
+        arraySize += cache[i].length + Integer.BYTES;
+      } else if (value instanceof byte[]) {

Review Comment:
   (MAJOR) For `BYTES` type, value will be `ByteArray`



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +40,56 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int arraySize = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value instanceof Integer) {

Review Comment:
   (minor) Consider reordering based on the how common it is used as primary key: INT, LONG, STRING, BYTES, FLOAT, DOUBLE, BIG_DECIMAL



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +40,56 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int arraySize = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value instanceof Integer) {
+        arraySize += Integer.BYTES;
+      } else if (value instanceof Double) {
+        arraySize += Double.BYTES;
+      } else if (value instanceof Float) {
+        arraySize += Float.BYTES;
+      } else if (value instanceof Long) {
+        arraySize += Long.BYTES;
+      } else if (value instanceof String) {
+        cache[i] = ((String) value).getBytes(StandardCharsets.UTF_8);
+        arraySize += cache[i].length + Integer.BYTES;
+      } else if (value instanceof BigDecimal) {
+        cache[i] = BigDecimalUtils.serialize((BigDecimal) value);
+        arraySize += cache[i].length + Integer.BYTES;
+      } else if (value instanceof byte[]) {
+        arraySize += ((byte[]) value).length + Integer.BYTES;
+      } else {
+        throw new IllegalStateException("Data type not supported for serializing Primary Key");

Review Comment:
   Put the value class in the exception message (also handle `null` properly)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
KKcorps commented on PR #9538:
URL: https://github.com/apache/pinot/pull/9538#issuecomment-1270644247

   > > It won't be the same, but we are not persisting it anywhere so it is okay to be not backward compatible
   > 
   > Interesting, why are we converting it to bytes then? Where's the deserialization logic? If it's just for hashing (quickly scanning the code, it looks like that's what it's used for in some places) won't it be even faster to just compute the hash instead of serializing it to bytes and then computing the hash?
   
   Yes, it will be. We already support hashing for primary keys. But users want to avoid the hash due to collisions and hence serialize the key itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] agavra commented on pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
agavra commented on PR #9538:
URL: https://github.com/apache/pinot/pull/9538#issuecomment-1270641097

   > It won't be the same, but we are not persisting it anywhere so it is okay to be not backward compatible
   
   Interesting, why are we converting it to bytes then? Where's the deserialization logic? If it's just for hashing (quickly scanning the code, it looks like that's what it's used for in some places) won't it be even faster to just compute the hash instead of serializing it to bytes and then computing the hash?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps merged pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
KKcorps merged PR #9538:
URL: https://github.com/apache/pinot/pull/9538


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #9538: Improve primary key serialization performance

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #9538:
URL: https://github.com/apache/pinot/pull/9538#discussion_r989694085


##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +41,57 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int sizeInBytes = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value == null) {
+        continue;
+      }
+
+      if (value instanceof Integer) {
+        sizeInBytes += Integer.BYTES;
+      } else if (value instanceof Long) {
+        sizeInBytes += Long.BYTES;
+      } else if (value instanceof String) {
+        cache[i] = ((String) value).getBytes(StandardCharsets.UTF_8);
+        sizeInBytes += cache[i].length + Integer.BYTES;
+      } else if (value instanceof ByteArray) {
+        cache[i] = ((ByteArray) value).getBytes();
+        sizeInBytes += ((ByteArray) value).length() + Integer.BYTES;
+      } else if (value instanceof Float) {
+        sizeInBytes += Float.BYTES;
+      } else if (value instanceof Double) {
+        sizeInBytes += Double.BYTES;
+      } else if (value instanceof BigDecimal) {
+        cache[i] = BigDecimalUtils.serialize((BigDecimal) value);
+        sizeInBytes += cache[i].length + Integer.BYTES;
+      } else {
+        throw new IllegalStateException("Data type not supported for serializing Primary Key with value: " + value);
+      }
+    }
+
+    ByteBuffer byteBuffer = ByteBuffer.allocate(sizeInBytes);
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value == null) {

Review Comment:
   No need to check null here if checked above



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +41,57 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int sizeInBytes = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value == null) {
+        continue;
+      }
+
+      if (value instanceof Integer) {
+        sizeInBytes += Integer.BYTES;
+      } else if (value instanceof Long) {
+        sizeInBytes += Long.BYTES;
+      } else if (value instanceof String) {
+        cache[i] = ((String) value).getBytes(StandardCharsets.UTF_8);
+        sizeInBytes += cache[i].length + Integer.BYTES;
+      } else if (value instanceof ByteArray) {
+        cache[i] = ((ByteArray) value).getBytes();
+        sizeInBytes += ((ByteArray) value).length() + Integer.BYTES;

Review Comment:
   (nit)
   ```suggestion
           sizeInBytes += cache[i].length + Integer.BYTES;
   ```



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +41,57 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int sizeInBytes = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value == null) {
+        continue;
+      }
+
+      if (value instanceof Integer) {
+        sizeInBytes += Integer.BYTES;
+      } else if (value instanceof Long) {
+        sizeInBytes += Long.BYTES;
+      } else if (value instanceof String) {
+        cache[i] = ((String) value).getBytes(StandardCharsets.UTF_8);
+        sizeInBytes += cache[i].length + Integer.BYTES;
+      } else if (value instanceof ByteArray) {
+        cache[i] = ((ByteArray) value).getBytes();
+        sizeInBytes += ((ByteArray) value).length() + Integer.BYTES;
+      } else if (value instanceof Float) {
+        sizeInBytes += Float.BYTES;
+      } else if (value instanceof Double) {
+        sizeInBytes += Double.BYTES;
+      } else if (value instanceof BigDecimal) {
+        cache[i] = BigDecimalUtils.serialize((BigDecimal) value);
+        sizeInBytes += cache[i].length + Integer.BYTES;
+      } else {
+        throw new IllegalStateException("Data type not supported for serializing Primary Key with value: " + value);

Review Comment:
   We can handle `null` here:
   ```suggestion
           throw new IllegalStateException(String.format("Unsupported value: %s of type: %s", value, value != null ? value.getClass() : null));
   ```



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +41,57 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int sizeInBytes = 0;
+    byte[][] cache = new byte[_values.length][];

Review Comment:
   In majority of the use cases, primary key has only one value, and we can optimize that case by doing one pass



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/PrimaryKey.java:
##########
@@ -37,7 +41,57 @@ public Object[] getValues() {
   }
 
   public byte[] asBytes() {
-    return SerializationUtils.serialize(_values);
+    int sizeInBytes = 0;
+    byte[][] cache = new byte[_values.length][];
+    for (int i = 0; i < _values.length; i++) {
+      Object value = _values[i];
+      if (value == null) {

Review Comment:
   It should not be `null` or we cannot guarantee different primary key generate different bytes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org