Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/08/31 17:05:00 UTC
[jira] [Commented] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
[ https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598511#comment-17598511 ]
Sean R. Owen commented on SPARK-40284:
--------------------------------------
You have a race condition where two requests try to delete then write. I don't think this is a Spark issue.
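The race Sean describes can be reproduced without Spark at all: any uncoordinated delete-then-write "overwrite" has this failure mode. Below is a hypothetical sketch (not Spark internals): a plain local directory stands in for HDFS, and `OverwriteRace`, its file names, and the sleep timings are all illustrative.

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

// Hypothetical sketch: two "overwrite" jobs racing on one target directory.
// A local directory stands in for HDFS; timings are illustrative only.
object OverwriteRace {
  // Delete-then-write overwrite, as in the report: clear the target, set up
  // scratch space, compute for a while, then write the output file.
  def overwrite(target: Path, file: String, data: String, delayMs: Long): Unit = {
    // "Overwrite" first clears the whole target directory...
    if (Files.exists(target)) {
      val walk = Files.walk(target)
      try walk.sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
      finally walk.close()
    }
    Files.createDirectories(target) // ...then creates its own scratch space...
    Thread.sleep(delayMs)           // ...and computes for a while.
    Files.createDirectories(target) // A retried task just recreates the dir...
    Files.write(target.resolve(file), data.getBytes("UTF-8")) // ...and writes.
  }

  // Start a slow job ("SQL1"), then a fast one ("SQL2"), same target.
  def run(): Path = {
    val target = Files.createTempDirectory("race").resolve("out")
    val slow = new Thread(() => overwrite(target, "part-sql1", "id=1", 400))
    val fast = new Thread(() => overwrite(target, "part-sql2", "id=2", 0))
    slow.start(); Thread.sleep(100); fast.start()
    slow.join(); fast.join()
    target
  }

  def main(args: Array[String]): Unit =
    Files.list(run()).forEach(println) // both part files survive the "overwrite"
}
```

After both threads finish, the target directory contains output from both jobs, even though each ran in "overwrite" mode, matching the four-step sequence in the report below.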
> spark concurrent overwrite mode writes data to files in HDFS format, all request data write success
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-40284
> URL: https://issues.apache.org/jira/browse/SPARK-40284
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 3.0.1
> Reporter: Liu
> Priority: Major
>
> We use Spark as a service: the same Spark service handles multiple requests, and I ran into a problem with this.
> When multiple requests overwrite the same directory at the same time, the results of both overwrite requests may be written successfully. I don't think this matches the definition of an overwrite.
> First I ran write SQL1, then write SQL2, and I found that in the end both results had been written, which seems unreasonable.
> {code:java}
> sparkSession.udf.register("sleep", (time: Long) => Thread.sleep(time))
>
> // write SQL1: sleeps 40s before producing its row
> sparkSession.sql("select 1 as id, sleep(40000) as time").write.mode(SaveMode.Overwrite).parquet("path")
>
> // write SQL2: completes immediately
> sparkSession.sql("select 2 as id, 1 as time").write.mode(SaveMode.Overwrite).parquet("path")
> {code}
> Reading the Spark source, I saw that all of this logic lives in the InsertIntoHadoopFsRelationCommand class.
>
> When the target directory already exists, Spark deletes it outright and then writes into the _temporary directory for its own request. But when multiple requests write concurrently, the data all gets appended. With the write SQL above, the following sequence occurs:
> 1. Write SQL1 executes: Spark creates the _temporary directory for SQL1 and continues.
> 2. Write SQL2 executes: Spark deletes the entire target directory (including SQL1's work) and creates its own _temporary directory.
> 3. SQL2 writes its data.
> 4. SQL1 finishes its computation. Its corresponding _temporary/0/attempt_id directory no longer exists, so the task fails. The task is then retried, but the retry does not delete the _temporary directory, so SQL1's result is appended to the target directory.
>
> Given the process above, could Spark do a directory check before the write task, or use some other mechanism to avoid this kind of problem?
>
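Until something like the directory check the reporter asks about exists, requests that share one application can serialize their own overwrites. A hypothetical application-side sketch (not a Spark API; `PathLocks` and `withPathLock` are invented names), which only protects writers inside the same JVM and does nothing for other applications writing to the same HDFS path:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock

// Hypothetical workaround (not a Spark API): serialize overwrites of the
// same output path across requests in one JVM. Writers in other JVMs are
// not covered, so this is only a local mitigation of the race.
object PathLocks {
  private val locks = new ConcurrentHashMap[String, ReentrantLock]()

  // Run `body` while holding the lock for `path`; concurrent callers on the
  // same path block until the current delete-then-write completes.
  def withPathLock[T](path: String)(body: => T): T = {
    val lock = locks.computeIfAbsent(path, _ => new ReentrantLock())
    lock.lock()
    try body finally lock.unlock()
  }
}
```

Usage would look like `PathLocks.withPathLock("path") { df.write.mode(SaveMode.Overwrite).parquet("path") }`, so each request's delete-and-write runs as one unit and the second overwrite cannot interleave with the first.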
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org