Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/12/12 11:41:00 UTC

[jira] [Commented] (HADOOP-18568) Magic Committer optional clean up

    [ https://issues.apache.org/jira/browse/HADOOP-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646049#comment-17646049 ] 

Steve Loughran commented on HADOOP-18568:
-----------------------------------------

wow, that is a lot of tasks! your life would be a lot better if you could have fewer of them.

Your proposal makes sense.
Supply a PR with (a rough sketch of the option check follows below):
* a new option in CommitConstants, say "fs.s3a.cleanup.magic.enabled"
* a check for this in MagicS3GuardCommitter.cleanupStagingDirs()
* a test (new, or extending an existing one) which skips the cleanup and verifies the job dir still exists.
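Something along these lines would do it. This is only an illustrative, self-contained sketch: the property name is the one proposed above, while the class name, constant names, default value and helper method are assumptions, not the actual patch.

{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative sketch of the proposed opt-out. Only the property name
 * "fs.s3a.cleanup.magic.enabled" comes from this discussion; the class,
 * constants and default value here are assumptions.
 */
public class MagicCleanupOptionSketch {

  /** Would live in CommitConstants. */
  public static final String MAGIC_COMMITTER_CLEANUP_ENABLED =
      "fs.s3a.cleanup.magic.enabled";

  /** Default keeps today's behaviour: delete the __magic dir after the job. */
  public static final boolean DEFAULT_MAGIC_COMMITTER_CLEANUP_ENABLED = true;

  /** The check MagicS3GuardCommitter.cleanupStagingDirs() would make first. */
  static boolean shouldCleanupMagicDir(Configuration conf) {
    return conf.getBoolean(MAGIC_COMMITTER_CLEANUP_ENABLED,
        DEFAULT_MAGIC_COMMITTER_CLEANUP_ENABLED);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean(MAGIC_COMMITTER_CLEANUP_ENABLED, false);
    System.out.println("clean up __magic dir? " + shouldCleanupMagicDir(conf));
  }
}
{code}

When the flag is false, cleanupStagingDirs() would just log that cleanup is disabled and return, leaving the __magic dir for whatever external process removes it later.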

You have to be confident here that all your Spark jobs are creating unique job IDs. We've had problems there in the past, but recent Spark releases are all good.

I am surprised and impressed by the number of tasks. It's the sheer volume of tasks which is creating your problem: we can only delete a few hundred entries at a time, and there will be two files (filename, filename + ".pending") per file written, plus per-task stuff. Even listing 420k files and loading them as a precursor to committing them is a major overhead.
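A back-of-envelope count shows why; all the figures below are assumptions for illustration (one file per task, the page sizes), not measurements from your job.

{code:java}
/**
 * Rough estimate of the request count for the __magic cleanup. The
 * files-per-task figure and page sizes are assumptions, not measured values.
 */
public class MagicCleanupEstimate {
  public static void main(String[] args) {
    long tasks = 420_000L;            // from the job described above
    long objectsPerFileWritten = 2;   // the file plus its ".pending" summary
    long objects = tasks * objectsPerFileWritten;   // assumes one file per task; ignores per-task extras
    long listPageSize = 1_000;        // S3 LIST page limit
    long deletePageSize = 250;        // assumed bulk delete page size ("a few hundred")
    System.out.printf("objects under __magic ~ %,d%n", objects);
    System.out.printf("LIST requests         ~ %,d%n", objects / listPageSize);
    System.out.printf("DELETE requests       ~ %,d%n", objects / deletePageSize);
  }
}
{code}

Hundreds of thousands of keys means thousands of HTTP requests in the cleanup phase alone, which is consistent with a deletion taking tens of minutes.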

We are about to do a 3.3.5 release with some major enhancements to the magic committer in terms of performance creating files (no overwrite checks, even when the parquet lib requests them), mkdirs (they all become no-ops) and others, plus more parallelism; see HADOOP-17833 for the work. It also tries to collect more IOStatistics on operations, but it looks like it omits the cleanup timings because we write the stats into the _SUCCESS file before starting that cleanup. Maybe for successful jobs we could kick off the cleanup before writing the file.

(Note that the 3.3.5 release adds the option to save the _SUCCESS files into a history dir elsewhere. If those files explicitly listed the job dir, then some internal script to list the files, read the field and delete the dirs would be straightforward.)

Looking forward to seeing your work. Afraid it has missed the 3.3.5 cutoff, but there will be an inevitable 3.3.6 release before long.

oh, and any stats on job improvements on 3.3.5 RC0 would be nice - any regressions even more so!


> Magic Committer optional clean up 
> ----------------------------------
>
>                 Key: HADOOP-18568
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18568
>             Project: Hadoop Common
>          Issue Type: Wish
>          Components: fs/s3
>    Affects Versions: 3.3.3
>            Reporter: André F.
>            Priority: Minor
>
> It seems that deleting the `__magic` folder, depending on the number of tasks/partitions used on a given Spark job, can take a really long time. I'm seeing the following behavior on a given Spark job (processing ~30TB, with ~420k tasks) using the magic committer:
> {code:java}
> 2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter: Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
> 2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter: Deleting magic directory s3a://my-bucket/random_hash/__magic: duration 26:43.620s {code}
> I don't see a way out of it since the deletion of s3 objects needs to list all objects under a prefix and this is what may be taking too much time. Could we somehow make this cleanup optional? (the idea would be to delegate it through s3 lifecycle policies in order to not create this overhead on the commit phase).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org