Posted to common-issues@hadoop.apache.org by "André F. (Jira)" <ji...@apache.org> on 2022/12/12 08:54:00 UTC

[jira] [Created] (HADOOP-18568) Magic Committer optional clean up

André F. created HADOOP-18568:
---------------------------------

             Summary: Magic Committer optional clean up 
                 Key: HADOOP-18568
                 URL: https://issues.apache.org/jira/browse/HADOOP-18568
             Project: Hadoop Common
          Issue Type: Wish
          Components: fs/s3
    Affects Versions: 3.3.3
            Reporter: André F.


It seems that deleting the `__magic` folder can take a really long time, depending on the number of tasks/partitions used in a given Spark job. I'm seeing the following behavior on a Spark job (processing ~30TB, with ~420k tasks) using the magic committer:
{code:java}
2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter: Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter: Deleting magic directory s3a://my-bucket/random_hash/__magic: duration 26:43.620s {code}
I don't see an obvious way around this, since deleting S3 objects first requires listing every object under the prefix, which is likely what takes so long. Could we somehow make this cleanup optional? (The idea would be to delegate it to S3 lifecycle policies, so that this overhead is not incurred during the commit phase.)
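For reference, the delegation could look something like the lifecycle rule below. This is only a sketch: the bucket and prefix are placeholders matching the paths in the log above, and since lifecycle filters only match literal prefixes, it assumes the `__magic` directory sits under a known, fixed prefix. The `AbortIncompleteMultipartUpload` part would also clean up the pending multipart uploads the magic committer leaves behind for uncommitted tasks.
{code:json}
{
  "Rules": [
    {
      "ID": "expire-magic-committer-staging",
      "Status": "Enabled",
      "Filter": { "Prefix": "random_hash/__magic/" },
      "Expiration": { "Days": 1 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
{code}
One caveat: lifecycle rules are evaluated roughly once a day, so the `__magic` data would linger for up to a day instead of being removed at commit time, which seems acceptable if the goal is just to avoid the listing/deletion cost on the commit path.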



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org