Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/08/31 17:05:00 UTC
[jira] [Commented] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598511#comment-17598511 ]
Sean R. Owen commented on SPARK-40284:
--------------------------------------
You have a race condition where two requests try to delete then write. I don't think this is a Spark issue.
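The race Sean describes can be reproduced without Spark at all: any uncoordinated delete-then-write "overwrite" has this failure mode. Below is a hypothetical sketch (not Spark internals): a plain local directory stands in for HDFS, and `OverwriteRace`, its file names, and the sleep timings are all illustrative.

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

// Hypothetical sketch: two "overwrite" jobs racing on one target directory.
// A local directory stands in for HDFS; timings are illustrative only.
object OverwriteRace {
  // Delete-then-write overwrite, as in the report: clear the target, set up
  // scratch space, compute for a while, then write the output file.
  def overwrite(target: Path, file: String, data: String, delayMs: Long): Unit = {
    // "Overwrite" first clears the whole target directory...
    if (Files.exists(target)) {
      val walk = Files.walk(target)
      try walk.sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
      finally walk.close()
    }
    Files.createDirectories(target) // ...then creates its own scratch space...
    Thread.sleep(delayMs)           // ...and computes for a while.
    Files.createDirectories(target) // A retried task just recreates the dir...
    Files.write(target.resolve(file), data.getBytes("UTF-8")) // ...and writes.
  }

  // Start a slow job ("SQL1"), then a fast one ("SQL2"), same target.
  def run(): Path = {
    val target = Files.createTempDirectory("race").resolve("out")
    val slow = new Thread(() => overwrite(target, "part-sql1", "id=1", 400))
    val fast = new Thread(() => overwrite(target, "part-sql2", "id=2", 0))
    slow.start(); Thread.sleep(100); fast.start()
    slow.join(); fast.join()
    target
  }

  def main(args: Array[String]): Unit =
    Files.list(run()).forEach(println) // both part files survive the "overwrite"
}
```

After both threads finish, the target directory contains output from both jobs, even though each ran in "overwrite" mode, matching the four-step sequence in the report below.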
> spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-40284
> URL: https://issues.apache.org/jira/browse/SPARK-40284
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 3.0.1
> Reporter: Liu
> Priority: Major
>
> We use Spark as a service: the same Spark service handles multiple requests, and I ran into a problem with this.
> When multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I don't think this matches the definition of an overwrite.
> First I ran write SQL1, then write SQL2, and I found that in the end both results had been written, which seems unreasonable.
> {code:java}
> sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time))
>
> // write SQL1: sleeps 40s before producing its row
> sparkSession.sql("select 1 as id, sleep(40000) as time").write.mode(SaveMode.Overwrite).parquet("path")
>
> // write SQL2: completes immediately
> sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path")
> {code}
> Reading the Spark source, I saw that all of this logic lives in the InsertIntoHadoopFsRelationCommand class.
>
> When the target directory already exists, Spark deletes it outright and then writes into the _temporary directory for its own request. But when multiple requests write concurrently, the data all gets appended. With the write SQL above, the following sequence occurs:
> 1. Write SQL1 executes: Spark creates the _temporary directory for SQL1 and continues.
> 2. Write SQL2 executes: Spark deletes the entire target directory (including SQL1's work) and creates its own _temporary directory.
> 3. SQL2 writes its data.
> 4. SQL1 finishes its computation. Its corresponding _temporary/0/attempt_id directory no longer exists, so the task fails. The task is then retried, but the retry does not delete the _temporary directory, so SQL1's result is appended to the target directory.
>
> Given the process above, could Spark do a directory check before the write task, or use some other mechanism to avoid this kind of problem?
>
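Until something like the directory check the reporter asks about exists, requests that share one application can serialize their own overwrites. A hypothetical application-side sketch (not a Spark API; `PathLocks` and `withPathLock` are invented names), which only protects writers inside the same JVM and does nothing for other applications writing to the same HDFS path:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock

// Hypothetical workaround (not a Spark API): serialize overwrites of the
// same output path across requests in one JVM. Writers in other JVMs are
// not covered, so this is only a local mitigation of the race.
object PathLocks {
  private val locks = new ConcurrentHashMap[String, ReentrantLock]()

  // Run `body` while holding the lock for `path`; concurrent callers on the
  // same path block until the current delete-then-write completes.
  def withPathLock[T](path: String)(body: => T): T = {
    val lock = locks.computeIfAbsent(path, _ => new ReentrantLock())
    lock.lock()
    try body finally lock.unlock()
  }
}
```

Usage would look like `PathLocks.withPathLock("path") { df.write.mode(SaveMode.Overwrite).parquet("path") }`, so each request's delete-and-write runs as one unit and the second overwrite cannot interleave with the first.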
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org