You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/17 08:58:54 UTC

[GitHub] [spark] coalchan opened a new pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

coalchan opened a new pull request #35549:
URL: https://github.com/apache/spark/pull/35549


   ### What changes were proposed in this pull request?
   Add a spark conf in order to just fetch partitions' name instead of fetching partitions' details. This can reduce requests on hive metastore.
   
   ### Why are the changes needed?
   1. method `listPartitions` is order to get locations of partitions and compute custom partition locations(variable `customPartitionLocations`), but in most cases we do not have custom partition locations.
   2. method `listPartitionNames` just fetchs partitions' name, it can reduce requests on hive metastore db.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, we should config "spark.sql.hasCustomPartitionLocations = false"
   
   ### How was this patch tested?
   1. recompile InsertIntoHadoopFsRelationCommand.scala
   2. update spark-sql_2.12-3.0.2.jar
   3. run insert into cases
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan removed a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan removed a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1063636935


   @stczwd 
   If so, there are also performance problems while overwriting hive static partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
stczwd edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062790013


   Hm, I got your mind. After looking at the relevant logic, `customPartitionLocations` will only be used while overwriting hive static partition.
   Thus, we can use `listPartitions` when `partitionsTrackedByCatalog && staticPartitions.size < partitionColumns.length` and use `listPartitionNames` when `partitionsTrackedByCatalog`.
   ```
   if (partitionsTrackedByCatalog) {
         if (staticPartitions.size < partitionColumns.length) {
           matchingPartitions = sparkSession.sessionState.catalog.listPartitions(
             catalogTable.get.identifier, Some(staticPartitions))
           initialMatchingPartitions = matchingPartitions.map(_.spec)
           customPartitionLocations = getCustomPartitionLocations(
             fs, catalogTable.get, qualifiedOutputPath, matchingPartitions)
         } else {
               // calling listPartitionNames to find initialMatchingPartitions
         }
   }
   ```
   cc @cloud-fan @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan commented on a change in pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan commented on a change in pull request #35549:
URL: https://github.com/apache/spark/pull/35549#discussion_r822366293



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
##########
@@ -92,16 +92,30 @@ case class InsertIntoHadoopFsRelationCommand(
 
     var initialMatchingPartitions: Seq[TablePartitionSpec] = Nil
     var customPartitionLocations: Map[TablePartitionSpec, String] = Map.empty
-    var matchingPartitions: Seq[CatalogTablePartition] = Seq.empty
 
     // When partitions are tracked by the catalog, compute all custom partition locations that
     // may be relevant to the insertion job.
     if (partitionsTrackedByCatalog) {

Review comment:
       Thanks for your response. `tracksPartitionsInCatalog` is set true in [HiveExternalCatalog.restoreHiveSerdeTable](https://github.com/apache/spark/blob/52e7602344036a649bb19ed642dbd74bc7ee9cb1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L800)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062686161


   @stczwd
   spark-sql run `insert into table xxx partition (x1, x2) select 1,2`:
   before optimization:
   ![before](https://user-images.githubusercontent.com/6671051/157405215-df639df4-fda4-48fa-8ffc-4b0757636fda.png)
   
   after optimization:
   ![after](https://user-images.githubusercontent.com/6671051/157405243-97a0676a-8131-4b84-aaa6-83993b7eee00.png)
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062686161


   @stczwd
   spark-sql run `insert into table xxx partition (x1, x2) select 1,2`:
   before optimization:
   ![before](https://user-images.githubusercontent.com/6671051/157405215-df639df4-fda4-48fa-8ffc-4b0757636fda.png)
   
   after optimization:
   ![after](https://user-images.githubusercontent.com/6671051/157405243-97a0676a-8131-4b84-aaa6-83993b7eee00.png)
   
   In this case, the table has 20k+ partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd commented on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
stczwd commented on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1060683191


   Could you fix the GA?
   Besides, any prof showing that this is taking too much time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd commented on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
stczwd commented on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062790013


   Hm, I got your mind. After looking at the relevant logic, `customPartitionLocations` will only be used while overwriting hive static partition.
   Thus, changing `listPartitions` to `listPartitionNames` by default, and calling `listPartitions` before `deleteMatchingPartitions` is a better idea. It will not add much time for static partition overwriting as `listPartitions` for staticPartition won't take any time.
   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
stczwd edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062790013


   Hm, I got your mind. After looking at the relevant logic, `customPartitionLocations` will only be used while overwriting hive static partition.
   Thus, we can use `listPartitions` when `partitionsTrackedByCatalog && staticPartitions.size < partitionColumns.length` and use `listPartitionNames` when `partitionsTrackedByCatalog`.
   ```
   if (partitionsTrackedByCatalog) {
         if (staticPartitions.size == partitionColumns.length) {
           matchingPartitions = sparkSession.sessionState.catalog.listPartitions(
             catalogTable.get.identifier, Some(staticPartitions))
           initialMatchingPartitions = matchingPartitions.map(_.spec)
           customPartitionLocations = getCustomPartitionLocations(
             fs, catalogTable.get, qualifiedOutputPath, matchingPartitions)
         } else {
               // calling listPartitionNames to find initialMatchingPartitions
         }
   }
   ```
   cc @cloud-fan @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062686161


   @stczwd
   spark-sql run `insert into table xxx partition (x1, x2) select 1,2`:
   before optimization:
   ![before](https://user-images.githubusercontent.com/6671051/157655110-e38d547a-8b63-4a23-9571-c667685319f8.png)
   
   after optimization:
   ![after](https://user-images.githubusercontent.com/6671051/157655137-ae0e1d02-68b3-4fb4-a95e-0af564e91bc2.png)
   
   In this case, the table has 20k+ partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
stczwd edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062790013


   Hm, I got your mind. After looking at the relevant logic, `customPartitionLocations` will only be used while overwriting hive static partition.
   Thus, changing `listPartitions` to `listPartitionNames` by default, and calling `listPartitions` before `deleteMatchingPartitions` is a better idea. It will not add much time for static partition overwriting as `listPartitions` for staticPartition won't take any time.
   cc @cloud-fan @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd commented on a change in pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
stczwd commented on a change in pull request #35549:
URL: https://github.com/apache/spark/pull/35549#discussion_r820697125



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
##########
@@ -92,16 +92,30 @@ case class InsertIntoHadoopFsRelationCommand(
 
     var initialMatchingPartitions: Seq[TablePartitionSpec] = Nil
     var customPartitionLocations: Map[TablePartitionSpec, String] = Map.empty
-    var matchingPartitions: Seq[CatalogTablePartition] = Seq.empty
 
     // When partitions are tracked by the catalog, compute all custom partition locations that
     // may be relevant to the insertion job.
     if (partitionsTrackedByCatalog) {

Review comment:
       Based the code on [HiveClientImpl.getTableOption](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L546) and [CatalogTable](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L250), it seems that we won't run into this code while using hive metastore. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan commented on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan commented on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062686161


   @stczwd
   spark-sql run `insert into table xxx partition xxx (x1, x2) select 1,2`:
   before optimization:
   ![before](https://user-images.githubusercontent.com/6671051/157405215-df639df4-fda4-48fa-8ffc-4b0757636fda.png)
   
   after optimization:
   ![after](https://user-images.githubusercontent.com/6671051/157405243-97a0676a-8131-4b84-aaa6-83993b7eee00.png)
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan edited a comment on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan edited a comment on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1062686161


   @stczwd
   spark-sql run `insert into table xxx partition (x1, x2) select 1,2`:
   before optimization:
   ![before](https://user-images.githubusercontent.com/6671051/157405215-df639df4-fda4-48fa-8ffc-4b0757636fda.png)
   
   after optimization:
   ![after](https://user-images.githubusercontent.com/6671051/157405243-97a0676a-8131-4b84-aaa6-83993b7eee00.png)
   
   In this case, the table has 20000+ partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] coalchan commented on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
coalchan commented on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1063636935


   @stczwd 
   If so, there are also performance problems while overwriting hive static partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35549: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35549:
URL: https://github.com/apache/spark/pull/35549#issuecomment-1045156541


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org