Posted to issues@spark.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2020/10/23 13:40:00 UTC

[jira] [Created] (SPARK-33230) re-instate "spark.sql.sources.writeJobUUID" as unique ID in FileOutputWriter jobs

Steve Loughran created SPARK-33230:
--------------------------------------

             Summary: re-instate "spark.sql.sources.writeJobUUID" as unique ID in FileOutputWriter jobs
                 Key: SPARK-33230
                 URL: https://issues.apache.org/jira/browse/SPARK-33230
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.1, 2.4.7
            Reporter: Steve Loughran


The Hadoop S3A staging committer has problems when more than one Spark SQL query is launched simultaneously, because it uses the job ID to build the path in the cluster filesystem through which commit information is passed from the tasks to the job committer.

If two queries are launched in the same second, their job IDs conflict: the output of job 1 includes whatever job 2 files have been written so far, and job 2 then fails with a FileNotFoundException.
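The collision mode can be sketched as follows. This is an illustrative Python model, not Spark's actual code: it assumes a job tracker ID formatted at second granularity (in the spirit of Hadoop's "yyyyMMddHHmmss" IDs) and contrasts it with per-job UUIDs.

```python
import time
import uuid

def job_tracker_id(t: float) -> str:
    # Timestamp-based job ID at second granularity, similar in spirit to
    # Hadoop's "yyyyMMddHHmmss" job tracker IDs (illustrative only).
    return time.strftime("job_%Y%m%d%H%M%S", time.gmtime(t))

t = 1_600_000_000.0
# Two "queries" launched 200ms apart, within the same second,
# get the same job ID...
id1 = job_tracker_id(t)
id2 = job_tracker_id(t + 0.2)
assert id1 == id2  # collision: both jobs would share a staging path

# ...whereas per-job UUIDs are unique for each write job.
u1, u2 = str(uuid.uuid4()), str(uuid.uuid4())
assert u1 != u2
```

With the timestamp alone, both jobs resolve to the same staging directory, so one job's commit sweeps up the other's in-progress files.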

Proposed:
the job conf should set {{"spark.sql.sources.writeJobUUID"}} to the value of {{WriteJobDescription.uuid}}

That was the property name which used to serve this purpose; any committers already written which use this property will pick it up without needing any changes.
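A minimal sketch of the proposal, with the Hadoop job Configuration modelled as a plain dict and the committer-side path logic purely illustrative (the method and path names here are assumptions, not the real committer API):

```python
import uuid

WRITE_JOB_UUID = "spark.sql.sources.writeJobUUID"

def setup_write_job(conf: dict) -> dict:
    # Spark side: stamp the write job's UUID into the job conf under the
    # long-standing property name, so committers already written against
    # it pick it up without code changes.
    conf[WRITE_JOB_UUID] = str(uuid.uuid4())
    return conf

def staging_path(conf: dict, job_id: str) -> str:
    # Committer side: prefer the UUID when present, fall back to the
    # timestamp-derived job ID otherwise (path layout is hypothetical).
    unique = conf.get(WRITE_JOB_UUID, job_id)
    return f"/tmp/staging/{unique}"

# Two jobs launched in the same second no longer share a staging path:
same_second_id = "job_20201023134000_0001"
p1 = staging_path(setup_write_job({}), same_second_id)
p2 = staging_path(setup_write_job({}), same_second_id)
assert p1 != p2
```

Because the property is read with a fallback, committers that predate the change keep working unmodified against the job ID.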



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org