Posted to common-issues@hadoop.apache.org by "Sean Mackrory (JIRA)" <ji...@apache.org> on 2017/02/25 00:32:44 UTC

[jira] [Updated] (HADOOP-14094) Rethink S3GuardTool options

     [ https://issues.apache.org/jira/browse/HADOOP-14094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Mackrory updated HADOOP-14094:
-----------------------------------
    Attachment: HADOOP-14094-HADOOP-13345.001.patch

Attaching a patch to get started on this. So far, it does the following:

* Change s3a -> s3guard when invoking the command from the Hadoop CLI (e.g. hadoop s3guard init ...)
* Change single-letter options to full-word options to avoid future collisions (e.g. hadoop s3guard init -read 500 -write 100). I considered using double dashes, as that's the convention for long-form options, but *most* other Hadoop commands use single dashes, so it seems better to stay consistent with the rest of Hadoop.
* Change the argument to initMetadataStore from "create" to "forceCreate". In other words, init can force the table to be created if it doesn't exist, but other commands will create the table if and only if fs.s3a.s3guard.ddb.table.create=true (or equivalent). Of course, this muddies the line between the tool and implementation details a bit, which is one thing I'd like to improve.
* Change all docs to use s3a://bucket instead of s3a://bucket/path/ - maybe we should allow commands to operate on subdirectories, but for now everything is at the bucket / table level, so we shouldn't advertise otherwise.
* Improve the way exceptions are logged, especially in the case of an invalid argument. Before, the -h flag wasn't doing anything and you wouldn't see usage information for a specific command if you provided a bad argument, but now you do.
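
To illustrate the "forceCreate" behaviour described in the list above, here is a minimal sketch of the decision. It is not code from the patch: the class and method names are hypothetical, and a plain Map stands in for the Hadoop Configuration object. Only the config key is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch: shows when a missing
// table would be created under the "forceCreate" semantics.
final class TableCreationPolicy {
    // Config key as quoted in the message.
    static final String TABLE_CREATE_KEY = "fs.s3a.s3guard.ddb.table.create";

    /**
     * @param forceCreate true when called from "init", false for other commands
     * @param conf        stand-in for the Hadoop configuration (assumption)
     * @return whether a non-existent table should be created
     */
    static boolean shouldCreateTable(boolean forceCreate, Map<String, String> conf) {
        if (forceCreate) {
            return true; // init always creates the table if it is missing
        }
        // Other commands defer to the configured auto-creation setting.
        return Boolean.parseBoolean(conf.getOrDefault(TABLE_CREATE_KEY, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("init, no config:     " + shouldCreateTable(true, conf));
        System.out.println("import, no config:   " + shouldCreateTable(false, conf));
        conf.put(TABLE_CREATE_KEY, "true");
        System.out.println("import, auto-create: " + shouldCreateTable(false, conf));
    }
}
```

The point of the sketch is that only init overrides the configuration; every other command honours whatever the user has set.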

Before committing this, I would still like to:

* Add more detail to USAGE messages, as well as output logged to confirm the results of running the command
* Work out how to make the commands more generic. It's nice if everything implementation-specific can be encoded into a URI, but that still requires custom code for any implementation.

I'd also like to make the commands more generic and make sure it's clear exactly how a connection is going to be configured (e.g. for various combinations of providing / not providing an endpoint in configs, providing / not providing an S3 URL, providing / not providing the -e flag). A few thoughts I've had along these lines:

* Split the -m option into a separate option for the implementation (e.g. -impl DynamoDB; we can have shortcuts for built-in classes, but also accept a full class name, e.g. -impl com.example.CustomMetadataStore), and then have other implementation-specific flags for things like the table name.
* Remove the -e option, and require that endpoints be specified with "-D fs.s3a.s3guard.ddb.endpoint=...". It's very much an implementation-specific configuration, but it fills a role that can also be filled by specifying the S3 URL, so it's a bit messy too. The -e option is already not always allowed to be used (there was some discussion along those lines in HADOOP-13995).
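
As a rough illustration of the -impl idea above, the lookup could work along these lines. This is only a sketch of the proposal, not existing S3GuardTool code: the resolver class is hypothetical, the shortcut table is an assumption, and the DynamoDB class name is what I'd expect the built-in store to map to.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical resolver for the proposed -impl option: known shortcuts map
// to built-in classes, anything else is treated as a full class name.
final class MetadataStoreImplResolver {
    // Assumed shortcut table; only the "DynamoDB" shortcut is mentioned above.
    private static final Map<String, String> SHORTCUTS = Map.of(
        "dynamodb", "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");

    /** Returns the class name to load for the given -impl argument. */
    static String resolve(String implArg) {
        if (implArg == null || implArg.trim().isEmpty()) {
            throw new IllegalArgumentException("-impl requires a value");
        }
        String trimmed = implArg.trim();
        // Shortcuts are matched case-insensitively (a design assumption).
        String builtIn = SHORTCUTS.get(trimmed.toLowerCase(Locale.ROOT));
        return builtIn != null ? builtIn : trimmed;
    }

    public static void main(String[] args) {
        System.out.println(resolve("DynamoDB"));
        System.out.println(resolve("com.example.CustomMetadataStore"));
    }
}
```

Implementation-specific flags such as the table name would then be handed to whichever class the resolver returns, which keeps the tool itself free of DynamoDB-specific parsing.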

> Rethink S3GuardTool options
> ---------------------------
>
>                 Key: HADOOP-14094
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14094
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Sean Mackrory
>         Attachments: HADOOP-14094-HADOOP-13345.001.patch
>
>
> I think we need to rework the S3GuardTool options. A couple of problems I've observed in the patches I've done on top of that and seeing other developers trying it out:
> * We should probably wrap the current commands in an S3Guard-specific command, since 'init', 'destroy', etc. don't touch the buckets at all.
> * Convert to whole-word options, as the single-letter options are already getting overloaded. Some patches I've submitted have added functionality where the obvious flag is already in use (e.g. -r for both region and read throughput, -m for both minutes and metadata store URI). I may do this early as part of HADOOP-14090.
> * We have some options that must be in the config in some cases, and can be in the command in other cases. But I've seen someone try to specify the table name in the config and leave out the -m option, with no luck. Also, since commands hard-code table auto-creation, you might have configured table auto-creation, try to import to a non-existent table, and it tells you table auto-creation is off.
> We need a more consistent policy for how things should get configured that addresses these problems and future-proofs the command a bit more.


