Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/22 10:54:13 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

AngersZhuuuu opened a new pull request #35608:
URL: https://github.com/apache/spark/pull/35608


   ### What changes were proposed in this pull request?
   Currently, we verify the output path in `DataSourceAnalysis`:
   ```
   // For dynamic partition overwrite, we do not delete partition directories ahead.
   // We write to staging directories and move to final partition directories after writing
   // job is done. So it is ok to have outputPath try to overwrite inputpath.
   if (overwrite && !insertCommand.dynamicPartitionOverwrite) {
     DDLUtils.verifyNotReadPath(actualQuery, outputPath)
   }
   
   /**
      * Throws exception if outputPath tries to overwrite inputpath.
      */
     def verifyNotReadPath(query: LogicalPlan, outputPath: Path) : Unit = {
       val inputPaths = query.collect {
         case LogicalRelation(r: HadoopFsRelation, _, _, _) =>
           r.location.rootPaths
       }.flatten
   
       if (inputPaths.contains(outputPath)) {
         throw new AnalysisException(
           "Cannot overwrite a path that is also being read from.")
       }
     }
   ```
   
   A static partition insert that reads data from the same table is really a normal use case, and this check troubles users a lot.
   In this PR, static partition inserts can use the same logic as dynamic partition overwrite (write to a staging directory, then move the files into the final partition directory) to avoid this issue.
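
   A minimal sketch of the affected case (the table name, schema, and partition values below are hypothetical, just for illustration; run in spark-shell where `spark` is the SparkSession):

   ```scala
   // Hypothetical repro: a partitioned datasource table where a static partition
   // overwrite reads from the same table. Before this change, DataSourceAnalysis
   // rejects it with "Cannot overwrite a path that is also being read from."
   spark.sql("CREATE TABLE t (id INT, value STRING, pt STRING) USING parquet PARTITIONED BY (pt)")
   spark.sql("INSERT INTO t PARTITION (pt = '2022-01-01') VALUES (1, 'a')")

   // Static partition overwrite whose source is a partition of the same table:
   spark.sql(
     """
       |INSERT OVERWRITE TABLE t PARTITION (pt = '2022-01-02')
       |SELECT id, value FROM t WHERE pt = '2022-01-01'
     """.stripMargin)
   ```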
   
   
   
   
   ### Why are the changes needed?
   Support more ETL cases.
   
   
   ### Does this PR introduce _any_ user-facing change?
   After this patch, users can:

   1. Insert overwrite a static partition with data read from another partition of the same table
   2. Insert overwrite a static partition with data read from the same partition of the same table
   
   
   ### How was this patch tested?
   Added UT
   




[GitHub] [spark] SparksFyz edited a comment on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

Posted by GitBox <gi...@apache.org>.
SparksFyz edited a comment on pull request #35608:
URL: https://github.com/apache/spark/pull/35608#issuecomment-1062841092


   ```
   if (overwrite && !insertCommand.dynamicPartitionOverwrite) {
         DDLUtils.verifyNotReadPath(actualQuery, outputPath)
   }
   ```
   Dynamic partition overwrite does not delete data before the job begins, so it does not hit this verification problem.
   
   Since we write files directly through the output committer, for the case `Insert overwrite static partition from data read from same table's partition`, we could just set `insertCommand.dynamicPartitionOverwrite` to true to avoid deleting the partition data before it is read.
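
   A rough sketch of the staging-directory idea being discussed (the helper name and commit ordering below are an illustration, not Spark's actual committer code):

   ```scala
   import org.apache.hadoop.fs.{FileSystem, Path}

   // Sketch only: all task output is written under a staging directory first; the
   // old partition data is removed and the staged files are renamed into place only
   // at job commit, so the source data is never deleted while it is still being read.
   def commitStaticPartitionOverwrite(
       fs: FileSystem, stagingDir: Path, finalPartitionDir: Path): Unit = {
     if (fs.exists(finalPartitionDir)) {
       fs.delete(finalPartitionDir, true) // delete the old partition data at commit time
     }
     fs.rename(stagingDir, finalPartitionDir) // move the staged output into place
   }
   ```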
   




[GitHub] [spark] AngersZhuuuu commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35608:
URL: https://github.com/apache/spark/pull/35608#issuecomment-1048382511


   Gentle ping @cloud-fan @HyukjinKwon @viirya @dongjoon-hyun. This is a long-standing issue, and the current code is an easy and reasonable way to resolve it. Hoping for your reviews; many Spark users have run into this issue. cc @TongWei1105




[GitHub] [spark] SparksFyz commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

Posted by GitBox <gi...@apache.org>.
SparksFyz commented on pull request #35608:
URL: https://github.com/apache/spark/pull/35608#issuecomment-1062590108


   We also encounter this issue for partitioned tables (maybe converted from HiveTableRelation). Would changing `InsertIntoHadoopFsRelationCommand.dynamicPartitionOverwrite` to true be another way to solve this problem? Dynamic partition overwrite deletes the old data in commitJob, right before the rename.




[GitHub] [spark] AngersZhuuuu commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35608:
URL: https://github.com/apache/spark/pull/35608#issuecomment-1058055631


   ping @cloud-fan 




[GitHub] [spark] AngersZhuuuu commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35608:
URL: https://github.com/apache/spark/pull/35608#issuecomment-1047671950


   cc @CHENXCHEN








[GitHub] [spark] AngersZhuuuu commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35608:
URL: https://github.com/apache/spark/pull/35608#issuecomment-1062606338


   > We also encounter this issue for partitioned tables (maybe converted from HiveTableRelation). Would changing `InsertIntoHadoopFsRelationCommand.dynamicPartitionOverwrite` to true be another way to solve this problem?
   
   The reason we don't hit this issue in Hive is that Hive writes through a staging directory.
   
   > Dynamic partition overwrite deletes the old data in commitJob, right before the rename.
   
   What are you trying to express here?
   

