You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/17 00:37:32 UTC

[GitHub] [spark] srowen commented on a change in pull request #34267: [SPARK-36992][SQL] Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray

srowen commented on a change in pull request #34267:
URL: https://github.com/apache/spark/pull/34267#discussion_r729953775



##########
File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -42,15 +45,48 @@ public static void writeToMemory(byte[] src, Object target, long targetOffset) {
   public static long getPrefix(byte[] bytes) {
     if (bytes == null) {
       return 0L;
+    }
+    return getPrefix(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length);
+  }
+
+  public static long getPrefix(Object base, long offset, int numBytes) {
+    // Since JVMs are either 4-byte aligned or 8-byte aligned, we check the size of the bytes.
+    // If size is 0, just return 0.
+    // If size is between 0 and 4 (inclusive), assume data is 4-byte aligned under the hood and
+    // use a getInt to fetch the prefix.
+    // If size is greater than 4, assume we have at least 8 bytes of data to fetch.
+    // After getting the data, we use a mask to mask out data that is not part of the bytes.
+    long p;
+    long mask = 0;
+    if (IS_LITTLE_ENDIAN) {
+      if (numBytes >= 8) {
+        p = Platform.getLong(base, offset);
+      } else if (numBytes > 4) {
+        p = Platform.getLong(base, offset);
+        mask = (1L << (8 - numBytes) * 8) - 1;
+      } else if (numBytes > 0) {
+        p = (long) Platform.getInt(base, offset);
+        mask = (1L << (8 - numBytes) * 8) - 1;
+      } else {
+        p = 0;
+      }
+      p = java.lang.Long.reverseBytes(p);
     } else {
-      final int minLen = Math.min(bytes.length, 8);
-      long p = 0;
-      for (int i = 0; i < minLen; ++i) {
-        p |= ((long) Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i) & 0xff)

Review comment:
       Ah right, nevermind, I'm not thinking this through. Endianness matters when reading ints or longs and masking

##########
File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -42,15 +45,34 @@ public static void writeToMemory(byte[] src, Object target, long targetOffset) {
   public static long getPrefix(byte[] bytes) {
     if (bytes == null) {
       return 0L;
+    }
+    return getPrefix(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length);
+  }
+
+  protected static long getPrefix(Object base, long offset, int numBytes) {

Review comment:
       This can be package private (remove protected) but it doesn't matter - it's a final class anyway




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org