Posted to common-issues@hadoop.apache.org by "Gabor Bota (JIRA)" <ji...@apache.org> on 2019/01/14 17:08:00 UTC

[jira] [Comment Edited] (HADOOP-15999) [s3a] Better support for out-of-band operations

    [ https://issues.apache.org/jira/browse/HADOOP-15999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742309#comment-16742309 ] 

Gabor Bota edited comment on HADOOP-15999 at 1/14/19 5:07 PM:
--------------------------------------------------------------

This integration test documents and defines how S3Guard should behave when out-of-band (OOB) operations occur.

The behavior of S3AFileSystem.getFileStatus is as follows, where:
Client A uses S3Guard
Client B does not use S3Guard (it talks directly to S3)

* OOB OVERWRITE, authoritative mode:
** Client A creates file F1
** Client B overwrites F1 with F2 (same or different file size)
** Client A's getFileStatus returns the F1 metadata

* OOB OVERWRITE, NOT authoritative mode:
** Client A creates file F1
** Client B overwrites F1 with F2 (same or different file size)
** Client A's getFileStatus returns the F2 metadata. In non-authoritative mode we check S3 for the file. If the file's modification time in S3 is later than the one recorded in S3Guard, we can safely return the S3 file metadata and update the cache.

* OOB DELETE, authoritative mode:
** Client A creates file F
** Client B deletes F
** Client A's getFileStatus reports that the file is still there

* OOB DELETE, NOT authoritative mode:
** Client A creates file F
** Client B deletes F
** Client A's getFileStatus reports that the file is still there

As you can see, authoritative and NOT authoritative modes behave the same in the OOB DELETE case.
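
To make the four cases easier to follow, here is a minimal in-memory sketch of the getFileStatus decision. It is not the code from the patch: the class {{OobGetFileStatusSketch}}, the {{Meta}} type and the two maps standing in for the MetadataStore and for S3 are hypothetical names made up for illustration. It also assumes that a path missing from the MetadataStore simply falls through to S3, which the cases above do not cover.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; not the actual S3AFileSystem/S3Guard code. */
public class OobGetFileStatusSketch {

  /** Minimal stand-in for file metadata: length and modification time. */
  public static final class Meta {
    public final long length;
    public final long modTime;

    public Meta(long length, long modTime) {
      this.length = length;
      this.modTime = modTime;
    }
  }

  /** Stand-ins for the S3Guard MetadataStore and for the S3 bucket. */
  private final Map<String, Meta> metadataStore = new HashMap<>();
  private final Map<String, Meta> s3 = new HashMap<>();

  /** Client A: a write through S3Guard updates both S3 and the store. */
  public void guardedPut(String path, Meta meta) {
    s3.put(path, meta);
    metadataStore.put(path, meta);
  }

  /** Client B: an out-of-band write that only S3 sees. */
  public void oobPut(String path, Meta meta) {
    s3.put(path, meta);
  }

  /** Client B: an out-of-band delete that only S3 sees. */
  public void oobDelete(String path) {
    s3.remove(path);
  }

  /** Client A's getFileStatus, following the cases listed above. */
  public Meta getFileStatus(String path, boolean authoritative) {
    Meta cached = metadataStore.get(path);
    if (cached == null) {
      // Assumption: nothing cached, so fall through to S3.
      return s3.get(path);
    }
    if (authoritative) {
      // Authoritative mode: trust the store and never consult S3.
      return cached;
    }
    Meta remote = s3.get(path);
    if (remote != null && remote.modTime > cached.modTime) {
      // S3 holds a newer copy: return it and refresh the cache.
      metadataStore.put(path, remote);
      return remote;
    }
    // Also covers OOB delete: the cached entry is still reported.
    return cached;
  }
}
```

With F1 written through {{guardedPut}} and F2 written through {{oobPut}} with a later modification time, {{getFileStatus(path, true)}} returns the F1 metadata and {{getFileStatus(path, false)}} returns the F2 metadata. After {{oobDelete}}, both modes still return the cached entry.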

File listing: see ITestS3GuardTtl for the file listing cases (e.g. S3AFileSystem.listFiles). Users are able to configure how long they want a listing to be authoritative.
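
A time-limited authoritative listing can be pictured roughly as below. This is only a sketch: {{ListingTtlSketch}} and its method are hypothetical names, and it assumes the time limit is counted from the moment the cached listing was last updated, which may differ from what the real implementation does.

```java
/** Hypothetical sketch of a time-limited authoritative listing. */
public class ListingTtlSketch {

  /** How long, in milliseconds, a cached listing stays authoritative. */
  private final long ttlMillis;

  public ListingTtlSketch(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  /**
   * Returns true while the cached listing may still be treated as
   * authoritative; once the TTL has elapsed the caller would have
   * to go back to S3.
   */
  public boolean isStillAuthoritative(long lastUpdatedMillis, long nowMillis) {
    return nowMillis - lastUpdatedMillis < ttlMillis;
  }
}
```

For example, with a TTL of 1000 ms, a listing last updated at time 0 is still authoritative at 999 ms and no longer authoritative at 1000 ms.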

I had to modify two tests in {{ITestS3GuardListConsistency}} because of this modified behaviour in S3AFileSystem/S3Guard authoritative mode.

[~fabbri], please review this patch if you have time.


> [s3a] Better support for out-of-band operations
> -----------------------------------------------
>
>                 Key: HADOOP-15999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15999
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.1.0
>            Reporter: Sean Mackrory
>            Assignee: Gabor Bota
>            Priority: Major
>         Attachments: HADOOP-15999.001.patch, out-of-band-operations.patch
>
>
> S3Guard was initially done on the premise that a new MetadataStore would be the source of truth, and that it wouldn't provide guarantees if updates were done without using S3Guard.
> I've been seeing increased demand for better support for scenarios where operations are done on the data that can't reasonably be done with S3Guard involved. For example:
> * A file is deleted using S3Guard, and replaced by some other tool. S3Guard can't tell the difference between the new file and delete / list inconsistency and continues to treat the file as deleted.
> * An S3Guard-ed file is overwritten by a longer file by some other tool. When reading the file, only the length of the original file is read.
> We could possibly have smarter behavior here by querying both S3 and the MetadataStore (even in cases where we may currently only query the MetadataStore in getFileStatus) and use whichever one has the higher modified time.
> This kills the performance boost we currently get in some workloads with the short-circuited getFileStatus, but we could keep it with authoritative mode which should give a larger performance boost. At least we'd get more correctness without authoritative mode and a clear declaration of when we can make the assumptions required to short-circuit the process. If we can't consider S3Guard the source of truth, we need to defer to S3 more.
> We'd need to be extra sure of any locality / time zone issues if we start relying on mod_time more directly, but currently we're tracking the modification time as returned by S3 anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
