Posted to dev@spark.apache.org by Steve Loughran <st...@cloudera.com.INVALID> on 2020/12/04 13:59:43 UTC

AWS Consistent S3 & Apache Hadoop's S3A connector

as sent to hadoop-general.

TL;DR. S3 is consistent; S3A now works perfectly with S3Guard turned off.
If not, file a JIRA. Rename still isn't real, so don't rely on it or on
create(path, overwrite=false) for atomic operations.

-------

If you've missed the announcement, AWS S3 storage is now strongly
consistent: https://aws.amazon.com/s3/consistency/

That's full CRUD consistency, consistent listing, and no 404 caching.

You don't get: rename, or an atomic create-no-overwrite. Applications need
to know that and code for it.
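
A minimal sketch of why create(path, overwrite=false) can't serve as a
lock on S3 (bucket and path here are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateNoOverwrite {
      public static void main(String[] args) throws Exception {
        Path path = new Path("s3a://example-bucket/output/_SUCCESS");
        FileSystem fs = path.getFileSystem(new Configuration());
        // overwrite=false is enforced by a HEAD probe at create() time,
        // not an atomic put-if-absent: two clients can both pass the
        // probe and both "win" the final PUT.
        try (FSDataOutputStream out = fs.create(path, false)) {
          out.writeUTF("done");
        }
      }
    }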

This is enabled for all S3 buckets; no need to change endpoints or any
other settings. No extra cost, no performance impact. This is the biggest
change in S3 semantics since it launched.

What does this mean for the Hadoop S3A connector?


   1. We've been testing it for a while, no problems have surfaced.
   2. There's no need for S3Guard; leave the default settings alone. If you
   were using it, turn it off, restart *everything* and then you can delete
   the DDB table. (A config sketch follows this list.)
   3. Without S3Guard, listings may get a bit slower.
   4. There's been a lot of work in branch-3.3 on speeding up listings
   against raw S3, especially for code which uses listStatusIterator() and
   listFiles (HADOOP-17400).
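
For anyone doing step 2 explicitly, a hedged config sketch; the null store
is the stock Hadoop class and is already the default if S3Guard was never
enabled:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // "no S3Guard": go straight to S3 for all metadata operations
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore");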


It'll be time to get Hadoop 3.3.1 out the door for people to play with;
it's got a fair few other s3a-side enhancements.

People are still using S3Guard and it needs to be maintained for now, but
we'll have to be fairly ruthless about what isn't going to get closed as
WONTFIX. I'm worried here about anyone using S3Guard against non-AWS
consistent stores. If you are, send me an email.

And so for releases/PRs, doing test runs with and without S3Guard is
important. I've added an optional backwards-incompatible change recently
for better scalability: HADOOP-13230, S3A to optionally retain directory
markers, which adds markers=keep/delete to the test matrix. This is a pain,
though as you can choose two options at a time it's manageable.
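
For reference, the marker policy is a single switch; a sketch (option name
per HADOOP-13230):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // "keep" retains directory markers for scalability; "delete" is the
    // backwards-compatible behaviour which older clients expect.
    conf.set("fs.s3a.directory.marker.retention", "keep");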

Apache HBase
============

You still need the HBoss extension in front of the S3A connector to use
ZooKeeper to lock files during compaction.


Apache Spark
============

Any workflows which chained together reads directly after writes/overwrites
of files should now work reliably with raw S3.
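
For instance, this read-after-write round trip (path hypothetical) no
longer needs retry loops or eventual-consistency workarounds:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadAfterWrite {
      public static void main(String[] args) throws Exception {
        Path p = new Path("s3a://example-bucket/data/part-00000");
        FileSystem fs = p.getFileSystem(new Configuration());
        try (FSDataOutputStream out = fs.create(p, true)) {
          out.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        // The new object is immediately visible: no stale listings, no
        // FileNotFoundException on open().
        try (FSDataInputStream in = fs.open(p)) {
          System.out.println(in.read());
        }
      }
    }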


   - The classic FileOutputCommitter commit-by-rename algorithms aren't
   going to fail with FileNotFoundException during task commit.
   - They will still use COPY to rename work, so take O(data) time to
   commit files. Without atomic directory rename, the v1 commit algorithm
   can't isolate the commit operations of two task attempts. So it's unsafe
   and very slow.
   - The v2 commit algorithm is slow and doesn't have isolation between
   task attempt commits against any filesystem. If different task attempts
   are generating unique filenames (possibly to work around S3 update
   inconsistencies), it's not safe. Turn that option off.
   - The S3A committers' algorithms are happy talking directly to S3. But:
   SPARK-33402 is needed to fix a race condition in the staging committer.
   (A Spark settings sketch follows this list.)
   - The "Magic" committer, which has relied on a consistent store, is
   safe. There's a fix in HADOOP-17318 for the staging committer; hadoop-aws
   builds with that in will work safely with older spark versions.
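
As promised above, a hedged sketch of wiring an S3A committer up from
Spark; the class names are from the spark-hadoop-cloud module, so check
them against your Spark build:

    import org.apache.spark.SparkConf;

    SparkConf sparkConf = new SparkConf()
        // "magic"; "directory"/"partitioned" select the staging variants
        .set("spark.hadoop.fs.s3a.committer.name", "magic")
        // needed on hadoop-aws releases where magic is off by default
        .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        .set("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .set("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");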


Any formats which commit work by writing a file with a unique name &
updating a reference to it in a consistent store (iceberg &c) are still
going to work great. Naming is irrelevant and commit-by-writing-a-file is
S3's best story.

(+ SPARK-33135 and other uses of incremental listing will get the benefits
of async prefetching of the next page of list results)
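
Incremental listing here means the RemoteIterator-returning calls; a
minimal sketch (bucket hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class IncrementalList {
      public static void main(String[] args) throws Exception {
        Path dir = new Path("s3a://example-bucket/table/");
        FileSystem fs = dir.getFileSystem(new Configuration());
        // listFiles pages through the results; the branch-3.3 work can
        // fetch the next page while this one is still being processed.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
        while (it.hasNext()) {
          LocatedFileStatus st = it.next();
          System.out.println(st.getPath() + " " + st.getLen());
        }
      }
    }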

Distcp
======

There'll be no cached 404s to break uploads, even if you don't have the
relevant fixes to stop HEAD requests before creating files (HADOOP-16932
and the revert of HADOOP-8143) or update inconsistency (HADOOP-16775).

   - If your distcp version supports -direct, use it to avoid rename
   performance penalties.
   - If your distcp version doesn't have HADOOP-15209 it can issue needless
   DELETE calls to S3 after a big update, and end up being throttled badly.
   Upgrade if you can.
   - If people are seeing problems: issues.apache.org + component HADOOP is
   where to file JIRAs; please tag the version of hadoop libraries you've been
   running with.


thanks,

-Steve

Re: AWS Consistent S3 & Apache Hadoop's S3A connector

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
On Mon, 7 Dec 2020 at 07:36, Chang Chen <ba...@gmail.com> wrote:

> Since S3A now works perfectly with S3Guard turned off, could the Magic
> Committer work with S3Guard off? If yes, will performance degenerate? Or
> if HADOOP-17400 is fixed, will it have comparable performance?
>

Yes, works really well.

* It doesn't have problems with race conditions in job IDs (SPARK-3320)
because it does all its work under the dest dir and only supports one job
at a time there.


Performance-wise:

* Expect no degradation if you are not working with directories marked as
authoritative (hive does that for managed tables). Indeed, you will save on
DDB writes.
* HADOOP-17400 speeds up all listing code, but for maximum directory
listing performance you need to use the (existing) incremental listing
APIs. See SPARK-33135 for some work there which matches this.

The list performance enhancements will only ship in hadoop-3.3.1. If you
use the incremental list APIs today (listStatusIterator, listFiles)
everything is lined up; HDFS scales better, and it helps motivate the ABFS
dev team to do the same.

There are some extra fixes coming in related to this; credit to Dongjoon
for contributing and/or reviewing this work.

HADOOP-17258. Magic S3Guard Committer to overwrite existing pendingSet file
on task commit
HADOOP-17318. Support concurrent S3A commit jobs with same app attempt ID
(for staging; for magic you can disable aborting all uploads under the
dest dir & so have >1 job use the same dest dir; see the snippet after
this list)
HADOOP-16798. S3A Committer thread pool shutdown problems.
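
As mentioned, a hedged sketch of the HADOOP-17318 switch; the option name
is an assumption from that JIRA, so verify it against your hadoop-aws
release notes:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // assumption: with this off, one magic job's commit won't abort the
    // pending uploads of other jobs writing under the same dest dir
    conf.setBoolean("fs.s3a.committer.abort.pending.uploads", false);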

I'm also actively working on HADOOP-17414, Magic committer files don't have
the count of bytes written collected by spark:
https://github.com/apache/hadoop/pull/2530

Spark doesn't track bytes written as it is only measuring the 0-byte marker
file.

The Hadoop-side patch:

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
the data in the marker file.

There's a matching Spark change which looks for the header via the getXAttr
API if the output file is 0 bytes long. If the header is present and parses
to a positive long, it's used as the declared output size.
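
A hedged sketch of that lookup from the client side; the XAttr name is the
"header." prefix plus the custom header named above, and markerPath is a
hypothetical 0-byte marker file:

    import java.nio.charset.StandardCharsets;

    // some stores may throw instead of returning null for a missing
    // attribute, so guard accordingly
    byte[] raw = fs.getXAttr(markerPath,
        "header.x-hadoop-s3a-magic-data-length");
    if (raw != null) {
      long bytesWritten = Long.parseLong(
          new String(raw, StandardCharsets.UTF_8));
      // report bytesWritten as the declared output size
    }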

Hadoop branch-3.3 also has a very leading-edge patch to stop deleting
superfluous directory markers when files are created. See
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/directory_markers.md
for details.
This will avoid throttling when many files are being written to the same
bit of an S3 bucket, and stop creating tombstone markers in versioned S3
buckets. Those tombstones slow down subsequent LIST calls, and the
slowdown grows over time. This is new, needs a patch on older clients to
stop mistaking a marker for an empty dir, and needs broader testing. It is
in all maintained hadoop 3.x branches, but not yet shipped other than in
hadoop-3.3.2.

If you do want leading-edge performance, yes, grab those latest patches in
your own build. I plan to cut a new 3.3.x release soon to get it into
people's hands. It will be the one with ARM/M1 binary support in the libs
and codecs. Building and testing now means that the problems you find get
fixed before that release. Hey, you even have an excuse for the new
MacBooks: "I wanted to test Spark on it".

-Steve

Re: AWS Consistent S3 & Apache Hadoop's S3A connector

Posted by Chang Chen <ba...@gmail.com>.
Since S3A now works perfectly with S3Guard turned off, could the Magic
Committer work with S3Guard off? If yes, will performance degenerate? Or
if HADOOP-17400 is fixed, will it have comparable performance?

Steve Loughran <st...@cloudera.com.invalid> wrote on Fri, 4 Dec 2020 at
22:00:

> TL;DR. S3 is consistent; S3A now works perfectly with S3Guard turned off.
> If not, file a JIRA. Rename still isn't real, so don't rely on it or on
> create(path, overwrite=false) for atomic operations.
>
> [...]