Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/13 04:00:27 UTC

[GitHub] [spark] LuciferYang commented on pull request #37609: [SPARK-40175][SQL]Speed up conversion of Tuple2 to Scala Map

LuciferYang commented on PR #37609:
URL: https://github.com/apache/spark/pull/37609#issuecomment-1244867478

   > The idea is still valid: we can write a while loop manually to build the map, instead of `zip(...).toMap`, if this code path is proven to be performance critical.
   
   @cloud-fan Do you mean something like the following?
   
   ```scala
     // Variant 1: accumulate the pairs in an immutable Map builder.
     private def zipToMapUseMapBuilder[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
       import scala.collection.immutable
       val builder = immutable.Map.newBuilder[K, V]
       val keyIter = keys.iterator
       val valueIter = values.iterator
       // Stop at the shorter input, matching the behavior of zip.
       while (keyIter.hasNext && valueIter.hasNext) {
         builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
       }
       builder.result()
     }
   
     // Variant 2: grow an immutable Map directly, one entry per iteration.
     private def zipToMapUseMap[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
       var elems: Map[K, V] = Map.empty[K, V]
       val keyIter = keys.iterator
       val valueIter = values.iterator
       while (keyIter.hasNext && valueIter.hasNext) {
         elems += (keyIter.next().asInstanceOf[K] -> valueIter.next().asInstanceOf[V])
       }
       elems
     }
   ```
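   
   As a quick sanity check, both helpers can be compared against `zip(...).toMap`. This is only a sketch, not the benchmark code: the object name `ZipToMapCheck` and the sample data are made up for illustration, and the helpers are made public so they can be called from `main`.
   
   ```scala
   object ZipToMapCheck {
     // Copy of the builder-based helper above, made public for the check.
     def zipToMapUseMapBuilder[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
       val builder = scala.collection.immutable.Map.newBuilder[K, V]
       val keyIter = keys.iterator
       val valueIter = values.iterator
       while (keyIter.hasNext && valueIter.hasNext) {
         builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
       }
       builder.result()
     }
   
     // Copy of the helper above that grows an immutable Map directly.
     def zipToMapUseMap[A, B, K, V](keys: Seq[A], values: Seq[B]): Map[K, V] = {
       var elems: Map[K, V] = Map.empty[K, V]
       val keyIter = keys.iterator
       val valueIter = values.iterator
       while (keyIter.hasNext && valueIter.hasNext) {
         elems += (keyIter.next().asInstanceOf[K] -> valueIter.next().asInstanceOf[V])
       }
       elems
     }
   
     def main(args: Array[String]): Unit = {
       // Hypothetical sample data; values is deliberately one element longer than keys.
       val keys = Seq("a", "b", "c")
       val values = Seq(1, 2, 3, 4)
   
       val expected = keys.zip(values).toMap
       val viaBuilder = zipToMapUseMapBuilder[String, Int, String, Int](keys, values)
       val viaMap = zipToMapUseMap[String, Int, String, Int](keys, values)
   
       // Both helpers should agree with zip + toMap, including truncation
       // to the shorter of the two inputs.
       assert(viaBuilder == expected)
       assert(viaMap == expected)
       println(viaBuilder)
     }
   }
   ```
   
   The check only covers correctness. It says nothing about the relative speed of the variants.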
    
   
   I wrote a microbenchmark to compare `data.zip(data).toMap`, `data.zip(data)(collection.breakOut)`, and the two methods above. The results are as follows:
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 1:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                      22             22           1          4.6         217.6       1.0X
   Use zip + collection.breakOut                         8              9           1         11.9          84.4       2.6X
   Use Manual builder                                    3              3           0         32.1          31.2       7.0X
   Use Manual map                                        3              3           0         36.5          27.4       7.9X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 5:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     100            100           1          1.0         998.8       1.0X
   Use zip + collection.breakOut                        11             11           1          9.1         110.5       9.0X
   Use Manual builder                                   76             76           1          1.3         755.6       1.3X
   Use Manual map                                       47             47           1          2.1         468.1       2.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 10:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     123            123           1          0.8        1226.3       1.0X
   Use zip + collection.breakOut                        16             16           1          6.2         160.9       7.6X
   Use Manual builder                                   95             95           1          1.1         947.2       1.3X
   Use Manual map                                       92             94           1          1.1         922.5       1.3X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 20:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     162            162           1          0.6        1615.7       1.0X
   Use zip + collection.breakOut                        26             27           1          3.8         261.3       6.2X
   Use Manual builder                                  132            133           1          0.8        1321.4       1.2X
   Use Manual map                                      185            186           1          0.5        1846.3       0.9X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 50:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     604            606           2          0.2        6042.7       1.0X
   Use zip + collection.breakOut                        76             77           2          1.3         759.9       8.0X
   Use Manual builder                                  534            537           2          0.2        5335.6       1.1X
   Use Manual map                                      510            513           2          0.2        5102.1       1.2X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 100:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     1087           1087           0          0.1       10865.5       1.0X
   Use zip + collection.breakOut                        134            135           1          0.7        1336.2       8.1X
   Use Manual builder                                  1000           1002           3          0.1        9996.8       1.1X
   Use Manual map                                      1081           1083           2          0.1       10813.0       1.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 500:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     4536           4544          10          0.0       45364.9       1.0X
   Use zip + collection.breakOut                        778            784           5          0.1        7783.8       5.8X
   Use Manual builder                                  4347           4347           0          0.0       43470.2       1.0X
   Use Manual map                                      6775           6785          15          0.0       67745.2       0.7X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 1000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     11813          11822          13          0.0      118125.2       1.0X
   Use zip + collection.breakOut                        1590           1601          15          0.1       15898.3       7.4X
   Use Manual builder                                  11431          11450          27          0.0      114312.3       1.0X
   Use Manual map                                      14801          14812          16          0.0      148005.2       0.8X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 5000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     64917          65007         127          0.0      649172.0       1.0X
   Use zip + collection.breakOut                        8127           8130           5          0.0       81265.8       8.0X
   Use Manual builder                                  63836          63959         174          0.0      638356.4       1.0X
   Use Manual map                                      88139          88308         239          0.0      881392.4       0.7X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 10000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     130985         131331         489          0.0     1309847.7       1.0X
   Use zip + collection.breakOut                        16133          16142          13          0.0      161325.8       8.1X
   Use Manual builder                                  136655         136916         369          0.0     1366553.8       1.0X
   Use Manual map                                      190525         190794         380          0.0     1905252.4       0.7X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test zip to map with collectionSize = 20000:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------
   Use zip + toMap                                     306207         306628         595          0.0     3062071.9       1.0X
   Use zip + collection.breakOut                        32482          32498          23          0.0      324818.3       9.4X
   Use Manual builder                                  336547         337705        1637          0.0     3365473.0       0.9X
   Use Manual map                                      410734         411271         758          0.0     4107344.5       0.7X
   ```
   
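   From the results, manually building the map with a while loop is not fast enough: both manual variants win only at collectionSize = 1, and at larger sizes they are roughly on par with or slower than `zip + toMap`, while `zip + collection.breakOut` is the fastest from collectionSize = 5 upward. Is there a problem with my test code?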


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

