You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "pan3793 (via GitHub)" <gi...@apache.org> on 2023/04/24 07:01:56 UTC

[GitHub] [spark] pan3793 opened a new pull request, #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

pan3793 opened a new pull request, #40920:
URL: https://github.com/apache/spark/pull/40920

### What changes were proposed in this pull request?

Remove unnecessary serialize/deserialize of `Path` on parallel gather partition stats.

### Why are the changes needed?

Simplify code, since `Path` is serializable in Hadoop3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #40920:
URL: https://github.com/apache/spark/pull/40920#issuecomment-1524462509

   kindly ping @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175420930


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##########
@@ -789,22 +789,22 @@ case class RepairTableCommand(
     if (partitionSpecsAndLocs.length > threshold) {
       val hadoopConf = spark.sessionState.newHadoopConf()
       val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-      val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+      val locations = partitionSpecsAndLocs.map(_._2).toArray

Review Comment:
   Can we remove `toArray`?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175425968


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##########
@@ -789,22 +789,22 @@ case class RepairTableCommand(
     if (partitionSpecsAndLocs.length > threshold) {
       val hadoopConf = spark.sessionState.newHadoopConf()
       val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-      val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+      val locations = partitionSpecsAndLocs.map(_._2)
 
       // Set the number of parallelism to prevent following file listing from generating many tasks
       // in case of large #defaultParallelism.
-      val numParallelism = Math.min(serializedPaths.length,
+      val numParallelism = Math.min(locations.length,
         Math.min(spark.sparkContext.defaultParallelism, 10000))
       // gather the fast stats for all the partitions otherwise Hive metastore will list all the
       // files for all the new partitions in sequential way, which is super slow.
       logInfo(s"Gather the fast stats in parallel using $numParallelism tasks.")
-      spark.sparkContext.parallelize(serializedPaths, numParallelism)
-        .mapPartitions { paths =>
+      spark.sparkContext.parallelize(locations, numParallelism)
+        .mapPartitions { locationsEachPartition =>
           val pathFilter = getPathFilter(serializableConfiguration.value)
-          paths.map(new Path(_)).map{ path =>
-            val fs = path.getFileSystem(serializableConfiguration.value)
-            val statuses = fs.listStatus(path, pathFilter)
-            (path.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))
+          locationsEachPartition.map { location =>
+            val fs = location.getFileSystem(serializableConfiguration.value)
+            val statuses = fs.listStatus(location, pathFilter)
+            (location.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))

Review Comment:
   How about change the return type to `Map[Path, PartitionStatistics]`, not sure if it would be better
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao closed pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "sunchao (via GitHub)" <gi...@apache.org>.

sunchao closed pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
URL: https://github.com/apache/spark/pull/40920


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175422370


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##########
@@ -789,22 +789,22 @@ case class RepairTableCommand(
     if (partitionSpecsAndLocs.length > threshold) {
       val hadoopConf = spark.sessionState.newHadoopConf()
       val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-      val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+      val locations = partitionSpecsAndLocs.map(_._2).toArray
 
       // Set the number of parallelism to prevent following file listing from generating many tasks
       // in case of large #defaultParallelism.
-      val numParallelism = Math.min(serializedPaths.length,
+      val numParallelism = Math.min(locations.length,

Review Comment:
   Are `locations.length` and `partitionSpecsAndLocs.length` equal?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175443729


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##########
@@ -789,22 +789,22 @@ case class RepairTableCommand(
     if (partitionSpecsAndLocs.length > threshold) {
       val hadoopConf = spark.sessionState.newHadoopConf()
       val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-      val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+      val locations = partitionSpecsAndLocs.map(_._2).toArray
 
       // Set the number of parallelism to prevent following file listing from generating many tasks
       // in case of large #defaultParallelism.
-      val numParallelism = Math.min(serializedPaths.length,
+      val numParallelism = Math.min(locations.length,

Review Comment:
   Yes, changed to evaluate once as `partitionNum`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #40920:
URL: https://github.com/apache/spark/pull/40920#issuecomment-1520446235

   cc @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175444671


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##########
@@ -789,22 +789,22 @@ case class RepairTableCommand(
     if (partitionSpecsAndLocs.length > threshold) {
       val hadoopConf = spark.sessionState.newHadoopConf()
       val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-      val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+      val locations = partitionSpecsAndLocs.map(_._2).toArray

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175442719


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##########
@@ -789,22 +789,22 @@ case class RepairTableCommand(
     if (partitionSpecsAndLocs.length > threshold) {
       val hadoopConf = spark.sessionState.newHadoopConf()
       val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-      val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+      val locations = partitionSpecsAndLocs.map(_._2)
 
       // Set the number of parallelism to prevent following file listing from generating many tasks
       // in case of large #defaultParallelism.
-      val numParallelism = Math.min(serializedPaths.length,
+      val numParallelism = Math.min(locations.length,
         Math.min(spark.sparkContext.defaultParallelism, 10000))
       // gather the fast stats for all the partitions otherwise Hive metastore will list all the
       // files for all the new partitions in sequential way, which is super slow.
       logInfo(s"Gather the fast stats in parallel using $numParallelism tasks.")
-      spark.sparkContext.parallelize(serializedPaths, numParallelism)
-        .mapPartitions { paths =>
+      spark.sparkContext.parallelize(locations, numParallelism)
+        .mapPartitions { locationsEachPartition =>
           val pathFilter = getPathFilter(serializableConfiguration.value)
-          paths.map(new Path(_)).map{ path =>
-            val fs = path.getFileSystem(serializableConfiguration.value)
-            val statuses = fs.listStatus(path, pathFilter)
-            (path.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))
+          locationsEachPartition.map { location =>
+            val fs = location.getFileSystem(serializableConfiguration.value)
+            val statuses = fs.listStatus(location, pathFilter)
+            (location.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))

Review Comment:
   Looks simpler, adopted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

Posted by "sunchao (via GitHub)" <gi...@apache.org>.

sunchao commented on PR #40920:
URL: https://github.com/apache/spark/pull/40920#issuecomment-1524621007

   Oops almost forgot. Merged to master, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org