Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/13 04:00:27 UTC
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1244867478
> The idea is still valid: we can write a while loop manually to build the map, instead of `zip(...).toMap`, if this code path is proven to be performance critical.
@cloud-fan Do you mean something like the following?
```scala
// Variant 1: accumulate the entries in an immutable Map builder.
private def zipToMapUseMapBuilder[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
  import scala.collection.immutable
  val builder = immutable.Map.newBuilder[K, V]
  val keyIter = keys.iterator
  val valueIter = values.iterator
  // Stop at the end of the shorter input, as `zip` does.
  while (keyIter.hasNext && valueIter.hasNext) {
    builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
  }
  builder.result()
}

// Variant 2: add each entry to an immutable Map held in a var.
private def zipToMapUseMap[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
  var elems: Map[K, V] = Map.empty[K, V]
  val keyIter = keys.iterator
  val valueIter = values.iterator
  // Stop at the end of the shorter input, as `zip` does.
  while (keyIter.hasNext && valueIter.hasNext) {
    elems += (keyIter.next().asInstanceOf[K] -> valueIter.next().asInstanceOf[V])
  }
  elems
}
```
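Before timing anything, it may be worth confirming that a manual loop returns the same map as `zip(...).toMap`. The sketch below does that for inputs of unequal length with a repeated key. `ZipToMapCheck` and `zipToMapWithBuilder` are hypothetical names, and the method is a simplified, fully typed variant of the builder version above (no `asInstanceOf` casts), not the code from the PR.

```scala
object ZipToMapCheck {
  // Hypothetical, fully typed variant of the builder-based loop (no casts).
  def zipToMapWithBuilder[K, V](keys: Seq[K], values: Seq[V]): Map[K, V] = {
    val builder = Map.newBuilder[K, V]
    val keyIter = keys.iterator
    val valueIter = values.iterator
    // Stop as soon as either input is exhausted.
    while (keyIter.hasNext && valueIter.hasNext) {
      builder += ((keyIter.next(), valueIter.next()))
    }
    builder.result()
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "b", "a", "c") // "a" is repeated, "c" has no partner
    val values = Seq(1, 2, 3)
    val expected = keys.zip(values).toMap
    val actual = zipToMapWithBuilder(keys, values)
    // Both approaches truncate to the shorter input and keep the last value
    // seen for a repeated key.
    assert(actual == expected)
    assert(actual == Map("a" -> 3, "b" -> 2))
    println(actual)
  }
}
```

Keeping the key and value types equal to the element types avoids the unchecked casts, at the cost of not covering the case where the caller needs a different static type.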
I wrote a microbenchmark to compare `data.zip(data).toMap`, `data.zip(data)(collection.breakOut)` and the two methods above. The results are as follows:
```
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 1:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                          22             22           1          4.6         217.6       1.0X
Use zip + collection.breakOut                             8              9           1         11.9          84.4       2.6X
Use Manual builder                                        3              3           0         32.1          31.2       7.0X
Use Manual map                                            3              3           0         36.5          27.4       7.9X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 5:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                         100            100           1          1.0         998.8       1.0X
Use zip + collection.breakOut                            11             11           1          9.1         110.5       9.0X
Use Manual builder                                       76             76           1          1.3         755.6       1.3X
Use Manual map                                           47             47           1          2.1         468.1       2.1X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 10:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                         123            123           1          0.8        1226.3       1.0X
Use zip + collection.breakOut                            16             16           1          6.2         160.9       7.6X
Use Manual builder                                       95             95           1          1.1         947.2       1.3X
Use Manual map                                           92             94           1          1.1         922.5       1.3X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 20:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                         162            162           1          0.6        1615.7       1.0X
Use zip + collection.breakOut                            26             27           1          3.8         261.3       6.2X
Use Manual builder                                      132            133           1          0.8        1321.4       1.2X
Use Manual map                                          185            186           1          0.5        1846.3       0.9X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 50:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                         604            606           2          0.2        6042.7       1.0X
Use zip + collection.breakOut                            76             77           2          1.3         759.9       8.0X
Use Manual builder                                      534            537           2          0.2        5335.6       1.1X
Use Manual map                                          510            513           2          0.2        5102.1       1.2X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 100:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                        1087           1087           0          0.1       10865.5       1.0X
Use zip + collection.breakOut                           134            135           1          0.7        1336.2       8.1X
Use Manual builder                                     1000           1002           3          0.1        9996.8       1.1X
Use Manual map                                         1081           1083           2          0.1       10813.0       1.0X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 500:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                        4536           4544          10          0.0       45364.9       1.0X
Use zip + collection.breakOut                           778            784           5          0.1        7783.8       5.8X
Use Manual builder                                     4347           4347           0          0.0       43470.2       1.0X
Use Manual map                                         6775           6785          15          0.0       67745.2       0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 1000:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                       11813          11822          13          0.0      118125.2       1.0X
Use zip + collection.breakOut                          1590           1601          15          0.1       15898.3       7.4X
Use Manual builder                                    11431          11450          27          0.0      114312.3       1.0X
Use Manual map                                        14801          14812          16          0.0      148005.2       0.8X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 5000:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                       64917          65007         127          0.0      649172.0       1.0X
Use zip + collection.breakOut                          8127           8130           5          0.0       81265.8       8.0X
Use Manual builder                                    63836          63959         174          0.0      638356.4       1.0X
Use Manual map                                        88139          88308         239          0.0      881392.4       0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 10000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                      130985         131331         489          0.0     1309847.7       1.0X
Use zip + collection.breakOut                         16133          16142          13          0.0      161325.8       8.1X
Use Manual builder                                   136655         136916         369          0.0     1366553.8       1.0X
Use Manual map                                       190525         190794         380          0.0     1905252.4       0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 20000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Use zip + toMap                                      306207         306628         595          0.0     3062071.9       1.0X
Use zip + collection.breakOut                         32482          32498          23          0.0      324818.3       9.4X
Use Manual builder                                   336547         337705        1637          0.0     3365473.0       0.9X
Use Manual map                                       410734         411271         758          0.0     4107344.5       0.7X
```
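Based on these results, building the map manually with a `while` loop is not fast enough: beyond `collectionSize = 1`, both manual versions are far slower than `zip` + `collection.breakOut` and mostly close to `zip` + `toMap`. Is there a problem with my test code?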
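One way to double-check a comparison like this is to confirm that every variant really produces the same `Map` before timing it, and to consume each result so the JIT cannot discard the work. The following is only a rough sketch under those assumptions: `ZipToMapTiming`, `time`, `viaZip` and `viaBuilder` are hypothetical names, the collection size and iteration counts are arbitrary, and a plain `System.nanoTime` loop is much less reliable than a proper benchmark harness.

```scala
object ZipToMapTiming {
  // Hypothetical helper: evaluates `body` `iters` times and returns
  // (elapsed nanoseconds, sum of the result sizes). Returning the sum keeps
  // every result observable, so the work cannot be optimized away.
  def time(iters: Int)(body: => Map[Int, Int]): (Long, Long) = {
    var sink = 0L
    var i = 0
    val start = System.nanoTime()
    while (i < iters) {
      sink += body.size
      i += 1
    }
    (System.nanoTime() - start, sink)
  }

  def main(args: Array[String]): Unit = {
    val data = (0 until 100).toVector

    // Each variant declares its result type explicitly, so it must build a Map.
    def viaZip: Map[Int, Int] = data.zip(data).toMap
    def viaBuilder: Map[Int, Int] = {
      val builder = Map.newBuilder[Int, Int]
      val iter = data.iterator
      while (iter.hasNext) {
        val x = iter.next()
        builder += ((x, x))
      }
      builder.result()
    }

    // A timing comparison is only meaningful if the variants agree.
    assert(viaZip == viaBuilder)

    // Warm-up runs, results ignored.
    time(10000)(viaZip)
    time(10000)(viaBuilder)

    val (zipNs, _) = time(10000)(viaZip)
    val (builderNs, _) = time(10000)(viaBuilder)
    println(s"zip + toMap: $zipNs ns, manual builder: $builderNs ns")
  }
}
```

The explicit `Map[Int, Int]` result types are deliberate: in Scala 2.12, `collection.breakOut` picks its builder from the expected result type, so a variant using it is only comparable to the others when that expected type is a `Map`.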