Posted to issues@spark.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/03/10 13:49:00 UTC

[jira] [Commented] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

    [ https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504273#comment-17504273 ] 

Steve Loughran commented on SPARK-31911:
----------------------------------------

I'm going to close this as fixed now; the Spark changes will have done it.

> Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31911
>                 URL: https://issues.apache.org/jira/browse/SPARK-31911
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Brandon
>            Priority: Major
>
> First of all, thanks for the great work on the S3 committers. I was able to set up the directory staging committer in my environment following the docs at [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast] and tested one of my Spark applications with it. The Spark version is 2.4.4 with Hadoop 3.2.1 and the cloud committer bindings. The application writes multiple DataFrames to ORC/Parquet in S3, submitting them to Spark as parallel write jobs.
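> For reference, here is a minimal sketch of the kind of setup I used (the config keys are the ones named in the linked docs; the app name is a placeholder and the exact values in my environment may differ):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder()
>     .appName("staging-committer-test")  // placeholder name
>     // bind Spark's commit protocol to the Hadoop output committers
>     .config("spark.sql.sources.commitProtocolClass",
>       "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>     .config("spark.sql.parquet.output.committer.class",
>       "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>     // select the directory staging committer in the S3A connector
>     .config("spark.hadoop.fs.s3a.committer.name", "directory")
>     .getOrCreate()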
> I think I'm seeing a bug where the staging committer completes pending uploads more than once. The symptom that led me to this is that the _SUCCESS data files under each table contain overlapping file names belonging to separate tables. From my reading of the code, that is because the filenames in _SUCCESS reflect which multipart uploads were completed in the commit for that particular table; overlapping names therefore mean each commit is also completing the other table's uploads.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, so each write produces one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part-0000 and /b/part-0000.
>  * /b/_SUCCESS contains the same two filenames.
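> A rough sketch of that reproduction (the paths and DataFrames here are illustrative; my real application submits its tables the same way, as parallel write jobs):
>
>   import scala.concurrent.{Await, Future}
>   import scala.concurrent.ExecutionContext.Implicits.global
>   import scala.concurrent.duration.Duration
>
>   val dfA = spark.range(100).toDF("id")  // stand-ins for the real DataFrames
>   val dfB = spark.range(100).toDF("id")
>
>   // fire both write jobs concurrently from the same Spark application
>   val writes = Seq(
>     Future { dfA.write.orc("s3a://bucket/a") },
>     Future { dfB.write.orc("s3a://bucket/b") }
>   )
>   Await.result(Future.sequence(writes), Duration.Inf)
>
>   // afterwards, both s3a://bucket/a/_SUCCESS and s3a://bucket/b/_SUCCESS
>   // list /a/part-0000 and /b/part-0000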
> With S3A logging at debug level, I can see that the commitJob operation for table a completes the uploads of both /a/part-0000 and /b/part-0000, and then commitJob for table b completes the same uploads again. I haven't hit a problem yet, but I wonder whether these duplicate completion requests would become an issue at higher scale, where dozens of commits with hundreds of files may be happening concurrently in the application.
> I believe this may be caused by the way the pendingSet files are stored in the staging directory. In the Hadoop code, they are stored under a single directory named after the jobID. However, for all write jobs executed by the Spark application, the jobID passed to Hadoop is the same - the application ID. Perhaps the staging commit algorithm was built on the assumption that each instance of the algorithm would use a unique, random jobID.
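> To illustrate the collision I suspect (the path layout below is made up for illustration, not the exact layout in the Hadoop source; the point is only that the pendingSet location is keyed by the jobID):
>
>   // hypothetical sketch of where each job parks its .pendingset files
>   def stagingDir(user: String, jobId: String): String =
>     s"/user/$user/tmp/staging/$jobId"  // keyed by jobID only
>
>   // both write jobs run inside one Spark application, so both pass the
>   // same application ID as their jobID (the ID below is made up)
>   val appId = "application_1591000000000_0001"
>   stagingDir("brandon", appId)  // table a -> /user/brandon/tmp/staging/application_...
>   stagingDir("brandon", appId)  // table b -> the same directory
>
>   // each commitJob then lists that one directory and completes every
>   // pendingset it finds there, including the other table's uploads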
> [~stevel@apache.org], [~rdblue] Having seen your names on most of this work (thank you), I would be interested to hear your thoughts on this. Also, this is my first time opening a bug here, so let me know if there's anything else I can do to help report the issue.



