You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/22 07:37:52 UTC
[GitHub] [spark] caican00 opened a new pull request, #37609: 3.3 master optimize to map
caican00 opened a new pull request, #37609:
URL: https://github.com/apache/spark/pull/37609
### What changes were proposed in this pull request?
`Traversable.toMap` changed to `collections.breakOut`, that eliminates intermediate tuple collection creation.
I optimized it with reference to this pr:https://github.com/apache/spark/pull/18693
An introduction to `Collections. BreakOut` can be found at [Stack Overflow article](https://stackoverflow.com/questions/1715681/scala-2-8-breakout).
### Why are the changes needed?
When `DeserializeToObject` is executed, converting Tuple2 to Scala Map via `. ToMap` takes a lot of cpu time.
![image](https://user-images.githubusercontent.com/94670132/185860416-f147ddd7-65b3-4dcb-b9d6-9a872015e003.png)
![image](https://user-images.githubusercontent.com/94670132/185860432-2aec4c48-898a-4d66-8d34-2221ab7e9408.png)
### How was this patch tested?
Unit tests run.
No performance tests performed yet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1246533235
start this work : https://github.com/apache/spark/pull/37876
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1245153126
To check `val map: Map[K, V] = data.zip(data)(collection.breakOut)`, update bench result
```
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 1: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 16 18 2 6.4 155.5 1.0X
Use zip + collection.breakOut 3 4 1 29.3 34.1 4.6X
Use Manual builder 3 3 1 33.3 30.0 5.2X
Use Manual map 3 3 1 39.1 25.6 6.1X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 5: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 91 98 6 1.1 910.1 1.0X
Use zip + collection.breakOut 74 77 3 1.4 737.3 1.2X
Use Manual builder 72 77 4 1.4 722.1 1.3X
Use Manual map 43 46 1 2.3 426.1 2.1X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 10: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 114 120 4 0.9 1138.3 1.0X
Use zip + collection.breakOut 95 100 3 1.0 954.1 1.2X
Use Manual builder 94 101 4 1.1 942.4 1.2X
Use Manual map 85 91 4 1.2 851.7 1.3X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 156 162 4 0.6 1560.9 1.0X
Use zip + collection.breakOut 136 140 3 0.7 1356.4 1.2X
Use Manual builder 132 143 8 0.8 1317.6 1.2X
Use Manual map 166 170 3 0.6 1657.2 0.9X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 50: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 628 636 10 0.2 6277.4 1.0X
Use zip + collection.breakOut 572 583 10 0.2 5720.1 1.1X
Use Manual builder 574 584 16 0.2 5741.6 1.1X
Use Manual map 466 477 12 0.2 4660.8 1.3X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 1127 1138 15 0.1 11269.1 1.0X
Use zip + collection.breakOut 1060 1073 18 0.1 10600.8 1.1X
Use Manual builder 1050 1073 32 0.1 10500.7 1.1X
Use Manual map 1004 1017 19 0.1 10039.2 1.1X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 500: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 4634 4665 44 0.0 46338.7 1.0X
Use zip + collection.breakOut 4772 4792 28 0.0 47723.3 1.0X
Use Manual builder 4517 4597 112 0.0 45173.1 1.0X
Use Manual map 6473 6487 20 0.0 64726.3 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 1000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 12355 12366 16 0.0 123550.7 1.0X
Use zip + collection.breakOut 12585 12593 11 0.0 125846.1 1.0X
Use Manual builder 12076 12101 35 0.0 120764.3 1.0X
Use Manual map 14641 14664 34 0.0 146406.5 0.8X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 5000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 68054 68354 425 0.0 680539.3 1.0X
Use zip + collection.breakOut 73307 73316 13 0.0 733073.3 0.9X
Use Manual builder 70887 71129 342 0.0 708867.4 1.0X
Use Manual map 91936 91980 63 0.0 919357.2 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 10000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 142306 142993 971 0.0 1423062.3 1.0X
Use zip + collection.breakOut 148545 148635 127 0.0 1485454.3 1.0X
Use Manual builder 143287 144215 1313 0.0 1432866.1 1.0X
Use Manual map 198459 198995 758 0.0 1984586.7 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test zip to map with collectionSize = 20000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 318022 318528 716 0.0 3180223.5 1.0X
Use zip + collection.breakOut 333891 337354 2352 0.0 3338910.7 1.0X
Use Manual builder 319468 320649 1670 0.0 3194676.5 1.0X
Use Manual map 423019 423164 204 0.0 4230194.5 0.8X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222261405
breakOut was removed or something in scala 2.13, I think? https://www.scala-lang.org/blog/2017/02/28/collections-rework.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222298867
The idea is still valid: we can write a while loop manually to build the map, instead of `zip(...).toMap`, if this code path is proven to be performance critical.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222292851
> breakOut was removed or something in scala 2.13, I think? https://www.scala-lang.org/blog/2017/02/28/collections-rework.html
Yes, Scala 2.13 already build failed seems as follows:
<img width="966" alt="image" src="https://user-images.githubusercontent.com/1475305/185921636-cfeb91bf-a170-43a6-b62c-450c96c21b6e.png">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen closed pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
srowen closed pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
URL: https://github.com/apache/spark/pull/37609
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1244888743
interesting, we need to understand why `zip + collection.breakOut` is so fast.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1250083017
OK to continue the work here after a rebase? #37876 is merged
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1244867478
> The idea is still valid: we can write a while loop manually to build the map, instead of `zip(...).toMap`, if this code path is proven to be performance critical.
@cloud-fan Do you mean like follows?
```scala
private def zipToMapUseMapBuilder[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
import scala.collection.immutable
val builder = immutable.Map.newBuilder[K, V]
val keyIter = keys.iterator
val valueIter = values.iterator
while (keyIter.hasNext && valueIter.hasNext) {
builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
}
builder.result()
}
private def zipToMapUseMap[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
var elems: Map[K, V] = Map.empty[K, V]
val keyIter = keys.iterator
val valueIter = values.iterator
while (keyIter.hasNext && valueIter.hasNext) {
elems += (keyIter.next().asInstanceOf[K] -> valueIter.next().asInstanceOf[V])
}
elems
}
```
I write a microben to compare `data.zip(data).toMap`, `data.zip(data)(collection.breakOut)` and above methods, the result as follows:
```
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 1: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 22 22 1 4.6 217.6 1.0X
Use zip + collection.breakOut 8 9 1 11.9 84.4 2.6X
Use Manual builder 3 3 0 32.1 31.2 7.0X
Use Manual map 3 3 0 36.5 27.4 7.9X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 5: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 100 100 1 1.0 998.8 1.0X
Use zip + collection.breakOut 11 11 1 9.1 110.5 9.0X
Use Manual builder 76 76 1 1.3 755.6 1.3X
Use Manual map 47 47 1 2.1 468.1 2.1X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 10: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 123 123 1 0.8 1226.3 1.0X
Use zip + collection.breakOut 16 16 1 6.2 160.9 7.6X
Use Manual builder 95 95 1 1.1 947.2 1.3X
Use Manual map 92 94 1 1.1 922.5 1.3X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 162 162 1 0.6 1615.7 1.0X
Use zip + collection.breakOut 26 27 1 3.8 261.3 6.2X
Use Manual builder 132 133 1 0.8 1321.4 1.2X
Use Manual map 185 186 1 0.5 1846.3 0.9X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 50: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 604 606 2 0.2 6042.7 1.0X
Use zip + collection.breakOut 76 77 2 1.3 759.9 8.0X
Use Manual builder 534 537 2 0.2 5335.6 1.1X
Use Manual map 510 513 2 0.2 5102.1 1.2X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 1087 1087 0 0.1 10865.5 1.0X
Use zip + collection.breakOut 134 135 1 0.7 1336.2 8.1X
Use Manual builder 1000 1002 3 0.1 9996.8 1.1X
Use Manual map 1081 1083 2 0.1 10813.0 1.0X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 500: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 4536 4544 10 0.0 45364.9 1.0X
Use zip + collection.breakOut 778 784 5 0.1 7783.8 5.8X
Use Manual builder 4347 4347 0 0.0 43470.2 1.0X
Use Manual map 6775 6785 15 0.0 67745.2 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 1000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 11813 11822 13 0.0 118125.2 1.0X
Use zip + collection.breakOut 1590 1601 15 0.1 15898.3 7.4X
Use Manual builder 11431 11450 27 0.0 114312.3 1.0X
Use Manual map 14801 14812 16 0.0 148005.2 0.8X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 5000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 64917 65007 127 0.0 649172.0 1.0X
Use zip + collection.breakOut 8127 8130 5 0.0 81265.8 8.0X
Use Manual builder 63836 63959 174 0.0 638356.4 1.0X
Use Manual map 88139 88308 239 0.0 881392.4 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 10000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 130985 131331 489 0.0 1309847.7 1.0X
Use zip + collection.breakOut 16133 16142 13 0.0 161325.8 8.1X
Use Manual builder 136655 136916 369 0.0 1366553.8 1.0X
Use Manual map 190525 190794 380 0.0 1905252.4 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 20000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 306207 306628 595 0.0 3062071.9 1.0X
Use zip + collection.breakOut 32482 32498 23 0.0 324818.3 9.4X
Use Manual builder 336547 337705 1637 0.0 3365473.0 0.9X
Use Manual map 410734 411271 758 0.0 4107344.5 0.7X
```
From the results, the performance of `while loop manually to build the map` is not fast enough. Is there a problem with my test code?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1245374210
Great! Seems we can always do `while loop manually`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222629183
```
git grep "zip" | grep ".toMap"| grep -v Suite | grep -v examples | awk -F ':' '{print $1}' | uniq | wc -l
31
```
31 file has `zip(...).toMap` or `zipWithIndex.toMap`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222161075
Can one of the admins verify this patch?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] caican00 commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222240163
> Nice try! Do we have some perf numbers?
@cloud-fan Thanks for your reply. I don't have any perf numbers right now but I'm going to do some performance tests to provide some perf numbers.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1244919208
I seem to know the reason, `data.zip(data)(collection.breakOut)` returns `CanBuildFrom[From, T, To]` instead of `Map[Any, Any]`, so what we should actually test is `val map:Map[K, V] = data.zip(data)(collection.breakOut)`, let me re-run the bench
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1253464163
> OK to continue the work here after a rebase? #37876 is merged
Hmm... I think we can close this pr, https://github.com/apache/spark/pull/37876 is the replacement of this one
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] caican00 commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
caican00 commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1221982822
gently ping @srowen
Can you help to verify this patch?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222622931
> I think we can test the performance of this api in Scala 2.13 to check if there is targeted optimization in Scala 2.13
Checked, Scala 2.13 has the same problem.
@caican00 Does `zipWithIndex.toMap` have the same problem?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1245286689
```
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 150: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 1407 1408 1 0.1 14071.2 1.0X
Use zip + collection.breakOut 1327 1328 2 0.1 13270.1 1.1X
Use Manual builder 1282 1282 0 0.1 12815.3 1.1X
Use Manual map 1734 1735 1 0.1 17339.8 0.8X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 200: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 1769 1769 1 0.1 17688.7 1.0X
Use zip + collection.breakOut 1595 1598 5 0.1 15949.0 1.1X
Use Manual builder 1544 1546 3 0.1 15440.0 1.1X
Use Manual map 2416 2418 2 0.0 24161.1 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 300: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 2701 2705 6 0.0 27007.0 1.0X
Use zip + collection.breakOut 2472 2475 4 0.0 24719.1 1.1X
Use Manual builder 2379 2384 8 0.0 23787.5 1.1X
Use Manual map 3803 3807 5 0.0 38031.9 0.7X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test zip to map with collectionSize = 400: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use zip + toMap 3757 3758 2 0.0 37565.1 1.0X
Use zip + collection.breakOut 3446 3447 3 0.0 34455.1 1.1X
Use Manual builder 3314 3318 5 0.0 33139.8 1.1X
Use Manual map 5283 5287 5 0.0 52832.3 0.7X
```
Add results of input size 150, 200, 300, 400.
@cloud-fan , from bench results:
- If input data size < 500, the performance of using `zip + collection.breakOut` and `while loop manually to build the map with mapbuilder` are close, 10%+ faster than `zip(...).toMap`.
- If input data size >= 500, will be no significant performance gap
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222013683
Nice try! Do we have some perf numbers?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1222305586
I think we can test the performance of this api in Scala 2.13 to check if there is targeted optimization in Scala 2.13
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1227076112
Are you still interested in this pr @caican00
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1245413079
> Great! Seems we can always do `while loop manually`?
Yes, there should be many similar case and I think this should be a valuable optimization
However, before I doing something, I still want to ask @caican00 , will you continue this optimization?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org