You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/07 12:17:17 UTC
[GitHub] [spark] LuciferYang opened a new pull request, #38533: [SPARK-41029][SQL] Optimize the use of `GenericArrayData` constructor for Scala 2.13
LuciferYang opened a new pull request, #38533:
URL: https://github.com/apache/spark/pull/38533
### What changes were proposed in this pull request?
This pr change to use a more appropriate constructor when the input is `ArrayBuffer` or `Empty Collection` to improve the construction performance of `GenericArrayData` with Scala 2.13.
### Why are the changes needed?
Minor performance improvement.
`GenericArrayData ` has the following constructor
https://github.com/apache/spark/blob/57d492556768eb341f525ce7eb5c934089fa9e7e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L30
When the input type is `ArrayBuffer`, the following code is similar in Spark
```
new GenericArrayData(arrayBuffer.toSeq)
```
For Scala 2.12, there will be no performance difference.
However, when Scala 2.13 is used, there will be a performance gap, because 'toSeq' will cause a redundant memory copy.
For the following test case:
```scala
val benchmark = new Benchmark(s"constructor with buffer size = $bufferSize",
valuesPerIteration, output = output)
benchmark.addCase("toSeq and construct") { _ =>
var n = 0
while (n < valuesPerIteration) {
new GenericArrayData(buffer.toSeq)
n += 1
}
}
benchmark.addCase("construct directly") { _ =>
var n = 0
while (n < valuesPerIteration) {
new GenericArrayData(buffer)
n += 1
}
}
```
When bufferSize=10, there is a performance gap of more than 5 times between a and b, and the performance gap increases almost linearly with the increase of bufferSize
There will be more than 5 times performance gap between `new GenericArrayData(buffer.toSeq)` and `new GenericArrayData(buffer)` when `bufferSize = 10` and the performance gap will increase with the increase of bufferSize.
```
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
constructor with buffer size = 10: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
toSeq and construct 2617 2622 7 3.8 261.7 1.0X
construct directly 399 406 11 25.1 39.9 6.6X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
constructor with buffer size = 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
toSeq and construct 12512 12554 60 0.8 1251.2 1.0X
construct directly 779 781 2 12.8 77.9 16.1X
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
constructor with buffer size = 1000: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
toSeq and construct 108882 109400 732 0.1 10888.2 1.0X
construct directly 5717 5731 20 1.7 571.7 19.0X
```
We can safely change `new GenericArrayData(buffer.toSeq)` to `new GenericArrayData(buffer)` due to `ArrayBuffer` is still `scala.collection.Seq` in Scala 2.13.
On the other hand, when input is an empty set, using `Array.empty` is 10% faster than using `Seq.empty`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GitHub Actions
- Manually test the `catalyst` and `sql` module with Scala 2.13
**will add result**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38533:
URL: https://github.com/apache/spark/pull/38533#issuecomment-1306540467
Yes, there is no change, all inputs will be 'toArray`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38533:
URL: https://github.com/apache/spark/pull/38533#discussion_r1015568616
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -158,7 +158,7 @@ case class PandasMode(
modes.append(key)
}
}
- new GenericArrayData(modes.toSeq)
+ new GenericArrayData(modes)
Review Comment:
`modes` is ArrayBuffer.empty[AnyRef]
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen closed pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
Posted by GitBox <gi...@apache.org>.
srowen closed pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
URL: https://github.com/apache/spark/pull/38533
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on a diff in pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #38533:
URL: https://github.com/apache/spark/pull/38533#discussion_r1015808514
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -158,7 +158,7 @@ case class PandasMode(
modes.append(key)
}
}
- new GenericArrayData(modes.toSeq)
+ new GenericArrayData(modes)
Review Comment:
This looks fine. The only concern I can think of is if this means the GenericArrayData accesses a mutable underlying array then from an ArrayBuffer, but this may be no problem given the usages changed here. The ArrayBuffers are local vars.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38533:
URL: https://github.com/apache/spark/pull/38533#issuecomment-1307329489
Thanks @srowen
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38533:
URL: https://github.com/apache/spark/pull/38533#issuecomment-1307328300
Merged to master
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org