Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2016/12/12 16:53:58 UTC

[jira] [Comment Edited] (HADOOP-13336) S3A to support per-bucket configuration

    [ https://issues.apache.org/jira/browse/HADOOP-13336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742062#comment-15742062 ] 

Steve Loughran edited comment on HADOOP-13336 at 12/12/16 4:53 PM:
-------------------------------------------------------------------

This also matters for HADOOP-13345, where different buckets will have different metadata caching policies, including "none", so I'm increasing its priority.

Possibilities, all of which assume falling back to the standard s3a options as the default. This means there is no way to undefine an option.

h3. per-bucket config. 

Lets you define everything for a bucket. 

Examples

* {{s3a://olap2/data/2017}} : s3a URL {{s3a://olap2/data/2017}}, with config set {{fs.s3a.bucket.olap2}} in configuration
* {{s3a://landsat}} : s3a URL {{s3a://landsat}}, with config set {{fs.s3a.bucket.landsat}} for anonymous credentials and no dynamo
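
A minimal sketch of how the fallback could work, using a plain {{java.util.Map}} in place of the Hadoop configuration classes. It assumes a property layout of {{fs.s3a.bucket.<bucket>.<option>}} overriding {{fs.s3a.<option>}}; that layout, the class name {{PerBucketConfig}} and the method {{resolve}} are illustrative, not something already in S3A.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of S3A. */
final class PerBucketConfig {
  static final String BASE_PREFIX = "fs.s3a.";
  static final String BUCKET_PREFIX = "fs.s3a.bucket.";

  private PerBucketConfig() {
  }

  /**
   * Returns a copy of source in which every
   * fs.s3a.bucket.&lt;bucket&gt;.&lt;option&gt; entry has been copied over
   * fs.s3a.&lt;option&gt;. Options without a per-bucket entry keep their
   * base value, so an option can be overridden but never undefined.
   */
  static Map<String, String> resolve(Map<String, String> source, String bucket) {
    String prefix = BUCKET_PREFIX + bucket + ".";
    Map<String, String> resolved = new HashMap<>(source);
    for (Map.Entry<String, String> entry : source.entrySet()) {
      String key = entry.getKey();
      if (key.startsWith(prefix) && key.length() > prefix.length()) {
        String option = key.substring(prefix.length());
        resolved.put(BASE_PREFIX + option, entry.getValue());
      }
    }
    return resolved;
  }
}
```

With {{fs.s3a.endpoint}} set globally and {{fs.s3a.bucket.olap2.endpoint}} set for one bucket, {{resolve(conf, "olap2")}} would return the bucket-specific endpoint, while {{resolve(conf, "landsat")}} would keep the global one.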



Pro
* Conceptually simple
* easy to get started
* trivial to move between other s3 clients, just change the prefix/redeclare the prefix binding

Con
* Expensive/complicated to maintain configurations.
* Need to delve into the configuration file to see what the mappings are. I can see this mattering a lot in support calls related to authentication.

h3. config via domain name in URL

This is what swift does: you define a domain, with the domain defining everything.


* {{s3a://olap2.dynamo/data/2017}} with config set {{fs.s3a.binding.dynamo}}
* {{s3a://landsat.anon}} with config set {{fs.s3a.binding.anon}} for anonymous credentials and no dynamo
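
As a rough sketch of what the URL handling might look like: split the host on its last dot, treating the final label as the binding name and the rest as the bucket. The class {{DomainBinding}} and the rule "last label is the binding" are assumptions made for illustration. Because S3 bucket names may themselves contain dots, this split is ambiguous for such buckets.

```java
import java.net.URI;

/** Hypothetical holder for illustration only; not part of S3A. */
final class DomainBinding {
  final String bucket;
  /** Binding name, or null when the host has no suffix. */
  final String binding;

  private DomainBinding(String bucket, String binding) {
    this.bucket = bucket;
    this.binding = binding;
  }

  /** Splits the URI host at its last dot into bucket and binding. */
  static DomainBinding parse(URI uri) {
    String host = uri.getHost();
    if (host == null) {
      throw new IllegalArgumentException("No host in " + uri);
    }
    int dot = host.lastIndexOf('.');
    if (dot <= 0 || dot == host.length() - 1) {
      // No usable suffix: fall back to the default s3a options.
      return new DomainBinding(host, null);
    }
    return new DomainBinding(host.substring(0, dot), host.substring(dot + 1));
  }

  /** Property prefix holding this binding's options (assumed layout). */
  String configPrefix() {
    return binding == null ? "fs.s3a." : "fs.s3a.binding." + binding + ".";
  }
}
```

Under these assumptions, {{DomainBinding.parse(URI.create("s3a://olap2.dynamo/data/2017"))}} yields bucket {{olap2}} and a config prefix of {{fs.s3a.binding.dynamo.}}.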

Pro:
* shared config across multiple buckets
* easy to see when buckets have different config options without having to delve into the configuration file to see what the mappings are.
* Matches {{swift://}}
* Similar-ish to {{wasb}}

Con:
* the need to explicitly declare a domain stops you transparently moving a bucket to a different set of options, unless you add a way to also bind a bucket to a "configuration domain", behind the scenes.
* S3 supports FQDNs already
* not going to be compatible with previous versions or with external s3 clients (e.g. EMR)

h3. Config via user:pass property in URL

This is a bit like Azure, where the FQDN defines the binding, and the username defines the bucket. Here I'm proposing the ability to define a new user which declares the binding info.

Examples

* {{s3a://dynamo@olap2/data/2017}} : s3a URL {{s3a://olap2/data/2017}}, with config set {{fs.s3a.binding.dynamo}}
* {{s3a://anon@landsat}} : s3a URL {{s3a://landsat}}, with config set {{fs.s3a.binding.anon}} for anonymous credentials.
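
A sketch of the parsing this would need, under some assumptions: the user-info part of the URL names the binding, it is stripped to recover the plain bucket URL, and anything containing a colon is treated as the existing AWSID:secret form and left alone. The class {{UserInfoBinding}} and that colon rule are illustrative choices, not existing S3A behaviour.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical holder for illustration only; not part of S3A. */
final class UserInfoBinding {
  /** Binding name, or null when the URL does not name one. */
  final String binding;
  /** The URL with any binding name stripped from it. */
  final URI bucketUri;

  private UserInfoBinding(String binding, URI bucketUri) {
    this.binding = binding;
    this.bucketUri = bucketUri;
  }

  static UserInfoBinding parse(URI uri) {
    String userInfo = uri.getUserInfo();
    if (userInfo == null || userInfo.isEmpty() || userInfo.indexOf(':') >= 0) {
      // No binding named, or an id:secret pair; leave the URL untouched.
      return new UserInfoBinding(null, uri);
    }
    try {
      // Rebuild the URI without the user-info component.
      URI stripped = new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
          uri.getPath(), uri.getQuery(), uri.getFragment());
      return new UserInfoBinding(userInfo, stripped);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Cannot strip user info from " + uri, e);
    }
  }
}
```

Here {{UserInfoBinding.parse(URI.create("s3a://dynamo@olap2/data/2017"))}} would give binding {{dynamo}} and bucket URL {{s3a://olap2/data/2017}}, matching the first example above.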


Pro:
* Better for sharing configuration options across buckets
* consistent model with the AWSID:secret mechanism today
* see at a glance which configuration set is in use, and easy to change.
* no complications related to domain naming
* Easy to switch between configuration sets on the command line, without adding new properties.

Con:
* needs different URLs if you don't want the default.

h3. Fundamentally rework Hadoop configuration to support a hierarchical configuration mechanism.

I'm not really proposing this; I just wanted to mention it as the nominal ultimate option, in place of what we have today, where different things (HA, Swift, Azure, etc.) all define their own mechanisms for customisation.





> S3A to support per-bucket configuration
> ---------------------------------------
>
>                 Key: HADOOP-13336
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13336
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>
> S3a now supports different regions, by way of declaring the endpoint, but you can't do things like read in one region and write back in another (e.g. a distcp backup), because only one region can be specified in a configuration.
> If s3a supported region declaration in the URL, e.g. s3a://b1.frankfurt s3a://b2.seoul , then this would be possible. 
> Swift does this with a full filesystem binding/config: endpoints, username, etc, in the XML file. Would we need to do that much? It'd be simpler initially to use a domain suffix of a URL to set the region of a bucket from the domain and have the aws library sort the details out itself, maybe with some config options for working with non-AWS infra.


