You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/07 12:17:17 UTC

[GitHub] [spark] LuciferYang opened a new pull request, #38533: [SPARK-41029][SQL] Optimize the use of `GenericArrayData` constructor for Scala 2.13

LuciferYang opened a new pull request, #38533:
URL: https://github.com/apache/spark/pull/38533

   ### What changes were proposed in this pull request?
   This pr change to use a more appropriate constructor when the input is `ArrayBuffer` or `Empty Collection` to improve the construction performance of `GenericArrayData` with Scala 2.13.
   
   
   
   
   ### Why are the changes needed?
   Minor performance improvement.
   
   `GenericArrayData ` has the following constructor
   
   https://github.com/apache/spark/blob/57d492556768eb341f525ce7eb5c934089fa9e7e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L30
   
   When the input type is `ArrayBuffer`, the following code is similar in Spark
   
   ```
   new GenericArrayData(arrayBuffer.toSeq)
   ```
   
   For Scala 2.12, there will be no performance difference.
   
   However, when Scala 2.13 is used, there will be a performance gap, because 'toSeq' will cause a redundant memory copy.
   
   For the following test case:
   
   ```scala
     val benchmark = new Benchmark(s"constructor with buffer size = $bufferSize",
       valuesPerIteration, output = output)
     benchmark.addCase("toSeq and construct") { _ =>
       var n = 0
       while (n < valuesPerIteration) {
         new GenericArrayData(buffer.toSeq)
         n += 1
       }
     }
   
     benchmark.addCase("construct directly") { _ =>
       var n = 0
       while (n < valuesPerIteration) {
         new GenericArrayData(buffer)
         n += 1
       }
     }
   ```
   
   When bufferSize=10, there is a performance gap of more than 5 times between a and b, and the performance gap increases almost linearly with the increase of bufferSize
   
   There will be more than 5 times performance gap between `new GenericArrayData(buffer.toSeq)` and `new GenericArrayData(buffer)` when `bufferSize = 10` and the performance gap will increase with the increase of bufferSize.
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   constructor with buffer size = 10:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toSeq and construct                                2617           2622           7          3.8         261.7       1.0X
   construct directly                                  399            406          11         25.1          39.9       6.6X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   constructor with buffer size = 100:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toSeq and construct                               12512          12554          60          0.8        1251.2       1.0X
   construct directly                                  779            781           2         12.8          77.9      16.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   constructor with buffer size = 1000:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toSeq and construct                              108882         109400         732          0.1       10888.2       1.0X
   construct directly                                 5717           5731          20          1.7         571.7      19.0X
   ```
   
   
   We can safely change `new GenericArrayData(buffer.toSeq)` to `new GenericArrayData(buffer)` due to `ArrayBuffer` is still `scala.collection.Seq` in Scala 2.13.
   
   
   On the other hand, when input is an empty set, using `Array.empty` is 10% faster than using `Seq.empty`.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   - Pass GitHub Actions
   - Manually test the `catalyst` and `sql` module with Scala 2.13
   
   **will add result**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on PR #38533:
URL: https://github.com/apache/spark/pull/38533#issuecomment-1306540467

   Yes, there is no change, all inputs will be 'toArray`
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on code in PR #38533:
URL: https://github.com/apache/spark/pull/38533#discussion_r1015568616


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -158,7 +158,7 @@ case class PandasMode(
         modes.append(key)
       }
     }
-    new GenericArrayData(modes.toSeq)
+    new GenericArrayData(modes)

Review Comment:
   `modes` is ArrayBuffer.empty[AnyRef]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen closed pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13

Posted by GitBox <gi...@apache.org>.

srowen closed pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13
URL: https://github.com/apache/spark/pull/38533


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a diff in pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13

Posted by GitBox <gi...@apache.org>.

srowen commented on code in PR #38533:
URL: https://github.com/apache/spark/pull/38533#discussion_r1015808514


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -158,7 +158,7 @@ case class PandasMode(
         modes.append(key)
       }
     }
-    new GenericArrayData(modes.toSeq)
+    new GenericArrayData(modes)

Review Comment:
   This looks fine. The only concern I can think of is if this means the GenericArrayData accesses a mutable underlying array then from an ArrayBuffer, but this may be no problem given the usages changed here. The ArrayBuffers are local vars.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13

Posted by GitBox <gi...@apache.org>.

LuciferYang commented on PR #38533:
URL: https://github.com/apache/spark/pull/38533#issuecomment-1307329489

   Thanks @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #38533: [SPARK-41029][SQL] Optimize constructor use of `GenericArrayData` for Scala 2.13

Posted by GitBox <gi...@apache.org>.

srowen commented on PR #38533:
URL: https://github.com/apache/spark/pull/38533#issuecomment-1307328300

   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org