Posted to issues@impala.apache.org by "Laszlo Gaal (JIRA)" <ji...@apache.org> on 2017/12/11 21:30:00 UTC

[jira] [Resolved] (IMPALA-6061) Impala needs to handle deprecation of s3n in hadoop 3.0

     [ https://issues.apache.org/jira/browse/IMPALA-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laszlo Gaal resolved IMPALA-6061.
---------------------------------
       Resolution: Fixed
         Assignee: Laszlo Gaal
    Fix Version/s: Impala 2.11.0

This issue is resolved by fixing IMPALA-6067, which required exactly these changes. I'm not setting the resolution to "duplicate" as the issues are quite different, but feel free to change the resolution if you disagree.

Fixed by:
https://git-wip-us.apache.org/repos/asf?p=impala.git;a=commit;h=e81b7c6b682fa4bf06db0fbd30818fe698890b00

IMPALA-6067: Enable S3 access via IAM roles for EC2 VMs

For some time Impala in a production environment has been able
to access data stored in Amazon S3 buckets using credentials specified
in a number of ways:
- storing Amazon access keys in environment variables or
  in core-site.xml,
- using proprietary management tools to store Amazon access keys
  securely,
- using Amazon IAM roles bound to VMs running in EC2.

The development minicluster environment used the first approach,
which risked leaking these keys.

This change enables Impala builds to use IAM
roles to access S3 buckets when running on an Amazon EC2 virtual
machine. The changes mainly ensure that environment variables carrying
the traditional AWS credentials do not conflict with credentials supplied
by the IAM role attached to the VM instance.

IAM role based credentials are accessible through the EC2
instance metadata mechanism; for further details see Amazon's docs at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials
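As a rough illustration of the mechanism described in the linked
documentation, the sketch below builds the instance metadata URL that
serves the temporary credentials of an IAM role. It is not part of the
Impala change: the function names are hypothetical, and the role name
used in the usage note is a placeholder. The fetch helper only works
from inside an EC2 VM, and instances that enforce session tokens for
metadata access need an extra token header that is omitted here.

```shell
#!/usr/bin/env bash
# Link-local address of the EC2 instance metadata service.
METADATA_BASE="http://169.254.169.254/latest/meta-data"

# Print the metadata URL serving the temporary credentials of the
# IAM role named in $1 (hypothetical helper, for illustration only).
iam_credentials_url() {
  local role_name="$1"
  printf '%s/iam/security-credentials/%s\n' "$METADATA_BASE" "$role_name"
}

# Fetch the credentials document (JSON) for the role named in $1.
# Assumption: curl is installed and the script runs on an EC2 VM.
fetch_iam_credentials() {
  curl --silent --fail --max-time 2 "$(iam_credentials_url "$1")"
}
```

For example, "iam_credentials_url my-role" prints the URL for a role
called my-role; calling fetch_iam_credentials outside EC2 simply fails
after the timeout.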

The change also removes the remaining references to the s3n: provider.
In the FE tests all URIs referring to s3n: are replaced with their
s3a: equivalents, except for a single negative test in
AnalyzeStmtsTest.java, which is removed.

In addition to the code changes, the s3n: and s3a: credential properties
are also removed from core-site.xml.tmpl. The s3a: provider can pick up
AWS S3 credentials from environment variables or IAM properties bound
to the VM instance, which is a more flexible approach.

As environment variables have precedence over IAM roles, care must be
taken when managing the canonical environment variables carrying
AWS credentials. There are two requirements to be reconciled:
1. The FE tests have code that examines s3a: URIs; this code needs
   existing, but not necessarily valid AWS credentials.
2. When the Impala test suite is executed on an EC2 VM, AWS credentials
   can be supplied via IAM roles. These credentials can be used only
   if the AWS_* environment variables are unset (do not exist).

The tradeoff is managed following these rules:
1. When AWS_* environment variables are set before invoking the
   Impala configuration scripts, their value is preserved and
   the config scripts ensure that the variables are exported.
2. If the AWS_* variables are missing or empty, they will be unset
   to ensure that credentials supplied by Amazon's IAM roles can be
   accessed, unless rule 3 applies.
3. If the scripts are running outside of EC2 (so there can be
   no IAM roles) and TARGET_FILESYSTEM is not set to "s3", the missing
   or empty AWS_* credential variables are forcibly set to dummy
   values to allow the FE tests to succeed. This combination is most
   often the case on a developer's local workstation.

The removal of S3 credential parameters from core-site.xml[.tmpl]
also allows users to set up their own credentials there; the config
scripts will not change those settings.

Environment variables carrying AWS security credentials will be set
up according to the following table:

      Instance runs | outside EC2     || in EC2          |
--------------------+--------+--------++--------+--------+
  TARGET_FILESYSTEM |   S3   | not S3 ||   S3   | not S3 |
--------------------+--------+--------++--------+--------+
                    |        |        ||        |        |
              empty | unset  | dummy  ||  unset |  unset |
AWS_*               |        |        ||        |        |
env   --------------+--------+--------++--------+--------+
var                 |        |        ||        |        |
          not empty | export | export || export | export |
                    |        |        ||        |        |
--------------------+--------+--------++--------+--------+

Legend: unset:  the variable is unset
        export: the variable is exported with its current value
        dummy:  the variable is set to a preset dummy value and
                exported

Running on an EC2 VM is indicated by setting RUNNING_IN_EC2 to "true" and
exporting it before impala_config.sh is invoked.
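
The rules and the table above can be sketched in shell roughly as
follows. This is an illustration only, not the code from
impala_config.sh. It assumes that "AWS_*" stands for the two standard
variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, that an unset
TARGET_FILESYSTEM means "hdfs", and it uses a made-up dummy value; the
function name is hypothetical as well.

```shell
#!/usr/bin/env bash
# Assumption: these two variables are what the text calls "AWS_*".
AWS_VARS=(AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY)
# Hypothetical placeholder; the real preset dummy value may differ.
DUMMY_VALUE="dummy-value"

setup_aws_credential_vars() {
  local running_in_ec2="${RUNNING_IN_EC2:-false}"
  # Assumption: an unset TARGET_FILESYSTEM is treated as "hdfs".
  local target_fs="${TARGET_FILESYSTEM:-hdfs}"
  local var
  for var in "${AWS_VARS[@]}"; do
    if [[ -n "${!var:-}" ]]; then
      # "export": the variable is not empty, keep its value and export it.
      export "$var"
    elif [[ "$running_in_ec2" != "true" && "$target_fs" != "s3" ]]; then
      # "dummy": outside EC2 and not targeting S3, so the FE tests
      # only need credentials that exist, not ones that are valid.
      export "$var=$DUMMY_VALUE"
    else
      # "unset": leave the way clear for IAM role credentials.
      unset "$var"
    fi
  done
}
```

With RUNNING_IN_EC2 set to "true" and the AWS_* variables empty,
calling setup_aws_credential_vars leaves both variables unset, which
matches the "Running in EC2" columns of the table.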

The change also moves the logic performing the S3 access checks into a separate
script file: bin/check-s3-access.sh. This file now contains all the S3-specific
logic, including the network access needed to check whether the requested S3
bucket can be accessed.

Testing:

Performed local builds for HDFS as well as automated builds against
HDFS and S3, using both IAM roles and explicit AWS_* credentials for
authentication.
Verified that FE tests that parse s3a: URLs are still successful in
all these combinations (when they are run).

Change-Id: I14cd9d4453a91baad3c379aa7e4944993fca95ae
Reviewed-on: http://gerrit.cloudera.org:8080/8294
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>
Reviewed-by: Zach Amsden <za...@cloudera.com>
Tested-by: Impala Public Jenkins

> Impala needs to handle deprecation of s3n in hadoop 3.0
> -------------------------------------------------------
>
>                 Key: IMPALA-6061
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6061
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.10.0
>            Reporter: Joe McDonnell
>            Assignee: Laszlo Gaal
>            Priority: Critical
>             Fix For: Impala 2.11.0
>
>
> Recently, support for s3n was removed from Hadoop by HADOOP-14738. All calls to s3n APIs will now throw an error:
> "The s3n:// client to Amazon S3 is no longer available: please migrate to the s3a:// client"
> This change impacts some Impala frontend tests that cover s3n. Some fail because they expect s3n to work. Some fail because the error message is different from the one expected. The failing tests are in these two test files:
> org.apache.impala.analysis.AnalyzeDDLTest
> org.apache.impala.analysis.AnalyzeStmtsTest
> Since this is only in a recent version of hadoop, Impala still needs to maintain support for s3n, but it needs to be able to run tests on this new version of hadoop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)