Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/19 04:25:41 UTC
[GitHub] [spark] ulysses-you commented on a change in pull request #34310: [SPARK-37037][SQL] Improve byte array sort by unify compareTo function of UTF8String and ByteArray
URL: https://github.com/apache/spark/pull/34310#discussion_r731483712
##########
File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/ByteArray.java
##########
@@ -75,6 +75,42 @@ static long getPrefix(Object base, long offset, int numBytes) {
return (IS_LITTLE_ENDIAN ? java.lang.Long.reverseBytes(p) : p) & ~mask;
}
+ public static int compareBinary(byte[] leftBase, byte[] rightBase) {
+ return compareBinary(leftBase, Platform.BYTE_ARRAY_OFFSET, leftBase.length,
Review comment:
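Thank you @srowen and @JoshRosen for pointing out the difference. I followed the linked benchmark and added a new "512 byte slow" case, in which the first 511 bytes of the two arrays are identical. The results show no regression after this PR, and a big benefit when the byte arrays share a long common prefix: the "512 byte slow" case drops from 360.0 ns to 52.2 ns per row. Note that the Relative column is computed against the first row of each run, so the lower Relative values after this PR reflect the faster 2-7 byte baseline, not a slowdown.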
Before this PR:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_271-b09 on Mac OS X 10.16
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Byte Array compareTo:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------
2-7 byte                            800           861         70       81.9         12.2      1.0X
8-16 byte                           810           878         59       80.9         12.4      1.0X
16-32 byte                          804           887         40       81.5         12.3      1.0X
512-1024 byte                      1050          1181         43       62.4         16.0      0.8X
512 byte slow                     23593         23698        311        2.8        360.0      0.0X
2-7 byte                            778           784          5       84.2         11.9      1.0X
```
After this PR:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_271-b09 on Mac OS X 10.16
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Byte Array compareTo:     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------
2-7 byte                            425           471         24      154.2          6.5      1.0X
8-16 byte                           751           814         40       87.2         11.5      0.5X
16-32 byte                          789           842         42       83.1         12.0      0.5X
512-1024 byte                      1038          1175        193       63.1         15.8      0.4X
512 byte slow                      3419          3924        NaN       19.2         52.2      0.1X
2-7 byte                            421           424          2      155.6          6.4      1.0X
```
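For readers who want to see why a long shared prefix matters so much, here is a minimal sketch of the general idea: compare eight bytes at a time as unsigned big-endian words, then finish the tail byte by byte. This is not the code from the PR. The class name `ByteCompareSketch` and both method bodies are hypothetical, and the sketch uses `java.nio.ByteBuffer` instead of Spark's `Platform` accessors so that it runs on its own.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration only; names and structure are assumptions, not Spark's code.
final class ByteCompareSketch {

    // Word-at-a-time unsigned lexicographic comparison.
    static int compareBinary(byte[] left, byte[] right) {
        int len = Math.min(left.length, right.length);
        int wordMax = len & ~7; // largest multiple of 8 that is <= len
        // Big-endian reads keep the first byte in the most significant position,
        // so an unsigned long comparison matches byte-by-byte ordering.
        ByteBuffer l = ByteBuffer.wrap(left).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer r = ByteBuffer.wrap(right).order(ByteOrder.BIG_ENDIAN);
        for (int i = 0; i < wordMax; i += 8) {
            long a = l.getLong(i);
            long b = r.getLong(i);
            if (a != b) {
                return Long.compareUnsigned(a, b);
            }
        }
        // Remaining 0..7 bytes, compared as unsigned values.
        for (int i = wordMax; i < len; i++) {
            int a = left[i] & 0xFF;
            int b = right[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        // Equal on the common prefix: the shorter array sorts first.
        return left.length - right.length;
    }

    // Byte-at-a-time baseline, for contrast.
    static int compareByteWise(byte[] left, byte[] right) {
        int len = Math.min(left.length, right.length);
        for (int i = 0; i < len; i++) {
            int a = left[i] & 0xFF;
            int b = right[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return left.length - right.length;
    }

    public static void main(String[] args) {
        // Input shaped like the "512 byte slow" case: the first 511 bytes are identical.
        byte[] a = new byte[512];
        byte[] b = new byte[512];
        Arrays.fill(a, (byte) 7);
        Arrays.fill(b, (byte) 7);
        b[511] = (byte) 0x80; // 128 when read as unsigned, so b sorts after a
        // Both comparisons should agree in sign.
        System.out.println(Integer.signum(compareBinary(a, b)));
        System.out.println(Integer.signum(compareByteWise(a, b)));
    }
}
```

With inputs like these, the byte-wise loop runs 512 iterations before finding the difference, while the word-wise loop runs 64. That is the kind of gap the "512 byte slow" row is meant to expose, though the measured speedup depends on the JVM and hardware.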