Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/19 04:25:41 UTC

[GitHub] [spark] ulysses-you commented on a change in pull request #34310: [SPARK-37037][SQL] Improve byte array sort by unify compareTo function of UTF8String and ByteArray

ulysses-you commented on a change in pull request #34310:
URL: https://github.com/apache/spark/pull/34310#discussion_r731483712



##########
File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -75,6 +75,42 @@ static long getPrefix(Object base, long offset, int numBytes) {
     return (IS_LITTLE_ENDIAN ? java.lang.Long.reverseBytes(p) : p) & ~mask;
   }
 
+  public static int compareBinary(byte[] leftBase, byte[] rightBase) {
+    return compareBinary(leftBase, Platform.BYTE_ARRAY_OFFSET, leftBase.length,

Review comment:
       Thank you @srowen and @JoshRosen for pointing out the difference. I followed the linked benchmark but added a new "512 byte slow" case, in which the first 511 bytes of the two arrays are the same. The results show no regression after this PR, and a big benefit when the byte arrays share a long common prefix.
   
   Before this PR:
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_271-b09 on Mac OS X 10.16
   Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
   Byte Array compareTo:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   2-7 byte                                            800            861          70         81.9          12.2       1.0X
   8-16 byte                                           810            878          59         80.9          12.4       1.0X
   16-32 byte                                          804            887          40         81.5          12.3       1.0X
   512-1024 byte                                      1050           1181          43         62.4          16.0       0.8X
   512 byte slow                                     23593          23698         311          2.8         360.0       0.0X
   2-7 byte                                            778            784           5         84.2          11.9       1.0X
   ```
   
   After this PR:
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_271-b09 on Mac OS X 10.16
   Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
   Byte Array compareTo:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   2-7 byte                                            425            471          24        154.2           6.5       1.0X
   8-16 byte                                           751            814          40         87.2          11.5       0.5X
   16-32 byte                                          789            842          42         83.1          12.0       0.5X
   512-1024 byte                                      1038           1175         193         63.1          15.8       0.4X
   512 byte slow                                      3419           3924         NaN         19.2          52.2       0.1X
   2-7 byte                                            421            424           2        155.6           6.4       1.0X
   ```
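
For readers who want to reproduce the shape of the "512 byte slow" case, here is a minimal sketch. It assumes the gain on long shared prefixes comes from comparing eight bytes at a time instead of one; the class and method names (`PrefixCompareSketch`, `compareBytewise`, `compareWordwise`, `slowPair`) are hypothetical and are not the `ByteArray.compareBinary` code in this PR, which reads memory through `Platform` rather than `ByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration only; not the Spark implementation.
class PrefixCompareSketch {

  // Baseline: unsigned lexicographic comparison, one byte per step.
  static int compareBytewise(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int c = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (c != 0) return c;
    }
    return a.length - b.length;
  }

  // Same ordering, but compares 8 bytes per step while both arrays have a full word left.
  static int compareWordwise(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    // ByteBuffer is big-endian by default, so an unsigned comparison of two
    // longs agrees with the unsigned lexicographic order of their 8 bytes.
    ByteBuffer left = ByteBuffer.wrap(a);
    ByteBuffer right = ByteBuffer.wrap(b);
    int i = 0;
    for (; i + 8 <= len; i += 8) {
      long l = left.getLong(i);
      long r = right.getLong(i);
      if (l != r) return Long.compareUnsigned(l, r);
    }
    // Fewer than 8 bytes remain: finish one byte at a time.
    for (; i < len; i++) {
      int c = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (c != 0) return c;
    }
    return a.length - b.length;
  }

  // Input shaped like the "512 byte slow" case: identical first 511 bytes,
  // different last byte. The fill values are arbitrary.
  static byte[][] slowPair() {
    byte[] left = new byte[512];
    Arrays.fill(left, (byte) 0x41);
    byte[] right = left.clone();
    left[511] = 1;
    right[511] = 2;
    return new byte[][] {left, right};
  }

  public static void main(String[] args) {
    byte[][] pair = slowPair();
    // Both variants must agree on the sign of the result.
    System.out.println(Integer.signum(compareBytewise(pair[0], pair[1])));
    System.out.println(Integer.signum(compareWordwise(pair[0], pair[1])));
  }
}
```

On this input the byte-wise loop walks 512 positions before finding the difference, while the word-wise loop walks 64. Only the sign of the return value is meaningful; the two methods can return different magnitudes.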




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


