Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2021/08/04 14:29:52 UTC

[GitHub] [hadoop] steveloughran commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint

steveloughran commented on a change in pull request #3260:
URL: https://github.com/apache/hadoop/pull/3260#discussion_r682650167



##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ArnResource.java
##########
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a;
+
+import javax.annotation.Nonnull;
+
+import com.amazonaws.arn.Arn;
+import com.amazonaws.regions.RegionUtils;
+
+/**
+ * Represents an ARN resource; this can be an access point or a bucket.
+ */
+public class ArnResource {
+
+  /**
+   * Resource name.
+   */
+  private final String name;
+
+  /**
+   * Resource owner account id.
+   */
+  private final String ownerAccountId;
+
+  /**
+   * Resource region.
+   */
+  private final String region;
+
+  /**
+   * Full Arn for the resource.
+   */
+  private final String fullArn;
+
+  /**
+   * Partition for the resource. Allowed partitions: aws, aws-cn, aws-us-gov
+   */
+  private final String partition;
+
+  /**
+   * Because of the different ways an endpoint can be constructed depending on partition we're
+   * relying on the AWS SDK to produce the endpoint. In this case we need a region key of the form
+   * {@code String.format("accesspoint-%s", awsRegion)}
+   */
+  private final String accessPointRegionKey;
+
+  private ArnResource(String name, String owner, String region, String partition, String fullArn) {
+    this.name = name;
+    this.ownerAccountId = owner;
+    this.region = region;
+    this.partition = partition;
+    this.fullArn = fullArn;
+    this.accessPointRegionKey = String.format("accesspoint-%s", region);
+  }
+
+  /**
+   * Resource name.
+   * @return resource name.
+   */
+  public String getName() {
+    return name;
+  }
+
+  /**
+   * Return owner's account id.
+   */
+  public String getOwnerAccountId() {
+    return ownerAccountId;
+  }
+
+  /**
+   * Resource region.
+   * @return resource region.
+   */
+  public String getRegion() {
+    return region;
+  }
+
+  /**
+   * Full arn for resource.
+   * @return arn for resource.
+   */
+  public String getFullArn() {
+    return fullArn;
+  }
+
+  /**
+   * Formatted endpoint for the resource.
+   * @return resource endpoint.
+   */
+  public String getEndpoint() {
+    return RegionUtils.getRegion(accessPointRegionKey)
+        .getServiceEndpoint("s3");
+  }
+
+  /**
+   * Parses the passed `arn` string into a full ArnResource
+   * @param arn - string representing an Arn resource
+   * @throws IllegalArgumentException - if the Arn is malformed or any of the region, accountId and

Review comment:
       nit: move below @return
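   For illustration, the conventional Javadoc tag order is `@param`, then `@return`, then `@throws`. A minimal, self-contained sketch of an ARN parser with that ordering (plain `String` handling standing in for the AWS SDK `Arn` class; class and method names are illustrative, not the actual patch):

```java
/** Illustrative stand-in for the factory method under review. */
public class ArnParseSketch {

  /**
   * Parses the passed ARN string and extracts its resource part.
   * @param arn string of the form arn:partition:service:region:account:resource
   * @return the resource portion of the ARN
   * @throws IllegalArgumentException if the ARN does not have six colon-separated components
   */
  public static String resourceOf(String arn) {
    // Limit 6 so a resource containing ':' is kept intact
    String[] parts = arn.split(":", 6);
    if (parts.length != 6 || !"arn".equals(parts[0])) {
      throw new IllegalArgumentException("Malformed ARN: " + arn);
    }
    return parts[5];
  }

  public static void main(String[] args) {
    System.out.println(resourceOf("arn:aws:s3:eu-west-1:123456789012:accesspoint/test"));
  }
}
```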

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -400,6 +410,14 @@ public void initialize(URI name, Configuration originalConf)
       LOG.debug("Initializing S3AFileSystem for {}", bucket);
       // clone the configuration into one with propagated bucket options
       Configuration conf = propagateBucketOptions(originalConf, bucket);
+
+      String apArn = conf.getTrimmed(ACCESS_POINT_ARN, "");
+      if (!apArn.isEmpty()) {
+        accessPoint = ArnResource.accessPointFromArn(apArn);
+        LOG.info("Using AccessPoint ARN \"{}\" for bucket {}", apArn, bucket);

Review comment:
       not sure about info vs debug; info is safest. 
   
   the .toString() value should also include it.
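   One possible shape for surfacing the access point in `toString()`; the class and field names here are illustrative, not the actual `S3AFileSystem` internals:

```java
/** Illustrative holder showing an access point ARN included in toString(). */
public class FsStateSketch {
  private final String bucket;
  private final String accessPointArn;  // null when no access point is configured

  public FsStateSketch(String bucket, String accessPointArn) {
    this.bucket = bucket;
    this.accessPointArn = accessPointArn;
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder("S3AFileSystem{bucket=").append(bucket);
    if (accessPointArn != null) {
      // Only mention the AP when one is actually in use
      sb.append(", accessPoint=").append(accessPointArn);
    }
    return sb.append('}').toString();
  }

  public static void main(String[] args) {
    System.out.println(new FsStateSketch("mybucket",
        "arn:aws:s3:eu-west-1:123456789012:accesspoint/test"));
  }
}
```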

##########
File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AConfiguration.java
##########
@@ -361,6 +362,14 @@ public void shouldBeAbleToSwitchOnS3PathStyleAccessViaConfigProperty()
       // isn't in the same region as the s3 client default. See
       // http://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html
       assertEquals(HttpStatus.SC_MOVED_PERMANENTLY, e.getStatusCode());
+    } catch (final IllegalArgumentException e) {
+      // Path style addressing does not work with AP ARNs
+      if (conf.get(ACCESS_POINT_ARN, "").isEmpty()) {
+        LOG.error("Caught unexpected exception: ", e);
+        throw e;
+      }
+
+      assertTrue(e.getMessage().contains("ARN of type accesspoint cannot be passed as a bucket"));

Review comment:
       there's a GenericTestUtils check which will rethrow the exception if it doesn't contain the string; ensures the root exception isn't lost
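   The helper being referred to is `GenericTestUtils.assertExceptionContains`; a self-contained approximation of its behaviour, chaining the original throwable as the cause so the root exception is never lost from the test report:

```java
/** Self-contained approximation of an exception-message check that keeps the root cause. */
public class ExceptionCheckSketch {

  /** Fails with the original throwable as cause if its message lacks the expected text. */
  public static void assertExceptionContains(String expected, Throwable t) {
    String msg = t.getMessage();
    if (msg == null || !msg.contains(expected)) {
      AssertionError e = new AssertionError(
          "Expected text \"" + expected + "\" not found in: " + msg);
      e.initCause(t);  // root exception preserved for diagnosis
      throw e;
    }
  }

  public static void main(String[] args) {
    assertExceptionContains("accesspoint",
        new IllegalArgumentException("ARN of type accesspoint cannot be passed as a bucket"));
    System.out.println("check passed");
  }
}
```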

##########
File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/TestArnResource.java
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a;
+
+import com.amazonaws.regions.Regions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.hadoop.test.LambdaTestUtils.intercept;
+
+public class TestArnResource extends Assert {

Review comment:
       extend org.apache.hadoop.test.HadoopTestBase to get thread naming and timeout

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -360,6 +365,11 @@
   private AuditManagerS3A auditManager =
       AuditIntegration.stubAuditManager();
 
+  /**
+   * Bucket AccessPoint

Review comment:
       nit: add .

##########
File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AContractGetFileStatusV1List.java
##########
@@ -54,6 +60,9 @@ protected Configuration createConfiguration() {
 
     // Use v1 List Objects API
     conf.setInt(LIST_VERSION, 1);
+    if (!conf.get(ACCESS_POINT_ARN, "").isEmpty()) {

Review comment:
       `S3ATestUtils.assume("skipped due to ARN", <condition>)`
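   A self-contained sketch of that assume-style guard; the real `S3ATestUtils.assume` delegates to JUnit's `Assume`, so the exception type here is a stand-in:

```java
/** Minimal stand-in for an assume(...) test guard. */
public class AssumeSketch {

  /** Thrown to mark a test as skipped rather than failed. */
  public static class SkippedException extends RuntimeException {
    public SkippedException(String message) { super(message); }
  }

  /** Skips (throws) when the condition does not hold. */
  public static void assume(String message, boolean condition) {
    if (!condition) {
      throw new SkippedException(message);
    }
  }

  public static void main(String[] args) {
    String accessPointArn = "";  // empty: no AP configured, so the test may proceed
    assume("skipped due to ARN", accessPointArn.isEmpty());
    System.out.println("test body runs");
  }
}
```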

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -1154,7 +1197,8 @@ public String getBucketLocation(String bucketName) throws IOException {
     final String region = trackDurationAndSpan(
         STORE_EXISTS_PROBE, bucketName, null, () ->
             invoker.retry("getBucketLocation()", bucketName, true, () ->
-                s3.getBucketLocation(bucketName)));
+                // If accessPoint then region is known from Arn
+                accessPoint != null ? accessPoint.getRegion() : s3.getBucketLocation(bucketName)));

Review comment:
       nit: split on lines with ? and : at the start, to make it more obvious this is a ?: sequence
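   The layout being asked for, with `?` and `:` starting their own lines; the S3 client call is stubbed out with a `Supplier` so the snippet stands alone:

```java
import java.util.function.Supplier;

/** Demonstrates the requested ?: line layout for the region lookup. */
public class TernaryLayoutSketch {

  static String regionFor(String accessPointRegion, Supplier<String> bucketLocationLookup) {
    // If an access point is configured, the region is already known from its ARN
    return accessPointRegion != null
        ? accessPointRegion
        : bucketLocationLookup.get();
  }

  public static void main(String[] args) {
    System.out.println(regionFor("eu-west-1", () -> "unused"));
    System.out.println(regionFor(null, () -> "us-east-1"));
  }
}
```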

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
##########
@@ -1576,6 +1576,81 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures
 that if the default endpoint is changed to a new region, data store in
 US-east is still reachable.
 
+## <a name="accesspoints"></a>Configuring S3 AccessPoints usage with S3a
+S3a now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage, which
+improves VPC integration with S3 and simplifies your data's permission model, because different
+policies can now be applied at the Access Point level. For more information about why to use them,
+see the official documentation.
+
+Accessing data through an access point is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN using the following configuration property:
+```xml
+<property>
+    <name>fs.s3a.accesspoint.arn</name>
+    <value> {ACCESSPOINT_ARN_HERE} </value>
+    <description>Configure S3a traffic to use this AccessPoint</description>
+</property>
+```
+
+Be mindful that this configures **all access** to S3a, and in turn S3, to go through that ARN.
+So for example `s3a://yourbucket/key` will now use your configured ARN when getting data from S3
+instead of your bucket. The flip side to this is that if you're working with multiple buckets

Review comment:
       presumably you'd always want to do a per-bucket mapping, as otherwise you don't get 1 AP ARN.
   
   Maybe we should mandate that it MUST always be on a per-bucket basis (looked up via a conf.getTrimmedString("fs.s3a." + bucket + ".access-point.arn", "")
   
   we would have no global ARN, so no bugreps about always getting the same bucket
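   The per-bucket lookup being proposed could look roughly like this; the key name follows the reviewer's suggestion above, and a plain `Map` stands in for Hadoop's `Configuration`:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a per-bucket-only access point ARN lookup (Map standing in for Configuration). */
public class PerBucketArnSketch {

  static String accessPointArnFor(Map<String, String> conf, String bucket) {
    // Per-bucket key only; no global fallback, so each bucket must opt in explicitly
    String value = conf.get("fs.s3a." + bucket + ".access-point.arn");
    return value == null ? "" : value.trim();
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("fs.s3a.mybucket.access-point.arn",
        "arn:aws:s3:eu-west-1:123456789012:accesspoint/test");
    System.out.println(accessPointArnFor(conf, "mybucket"));
    System.out.println(accessPointArnFor(conf, "otherbucket").isEmpty());
  }
}
```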

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -2570,6 +2614,11 @@ protected S3ListResult continueListObjects(S3ListRequest request,
               OBJECT_CONTINUE_LIST_REQUEST,
               () -> {
                 if (useListV1) {
+                  if (accessPoint != null) {
+                    // AccessPoints are not compatible with V1List
+                    throw new InvalidRequestException("ListV1 is not supported by AccessPoints");

Review comment:
       do you think maybe during initialize it should just downgrade? 
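   The downgrade-at-initialize idea might look like this sketch (method and variable names are illustrative, not the actual patch):

```java
/** Sketch: downgrade list v1 to v2 at initialize time when an access point is configured. */
public class ListVersionSketch {

  static int effectiveListVersion(int configuredVersion, boolean hasAccessPoint) {
    if (hasAccessPoint && configuredVersion == 1) {
      // Access points are not compatible with the v1 list API;
      // fall back here rather than fail on the first list call
      System.err.println("List v1 is unsupported with access points; using v2");
      return 2;
    }
    return configuredVersion;
  }

  public static void main(String[] args) {
    System.out.println(effectiveListVersion(1, true));
    System.out.println(effectiveListVersion(1, false));
  }
}
```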

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
##########
@@ -1098,4 +1098,11 @@ private Constants() {
    */
   public static final String AWS_S3_CENTRAL_REGION = "us-east-1";
 
+  /**
+   * AccessPoint ARN for the bucket. When set globally, all requests to S3 go through
+   * the AccessPoint pointed to by the ARN. If set as a bucket override then only requests for
+   * that bucket will go through the AccessPoint. Default value " ".

Review comment:
       default is "", not " " ?
   
   Prefer a separate default const, and for this constant add a @value so any javadocs show it
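   The shape being asked for, a separate default constant plus `{@value}` in the Javadoc; the constant names here are illustrative:

```java
/** Sketch of the constant pair suggested in the review. */
public final class AccessPointConstantsSketch {

  /**
   * AccessPoint ARN for the bucket.
   * Value: {@value}.
   */
  public static final String ACCESS_POINT_ARN = "fs.s3a.accesspoint.arn";

  /**
   * Default for {@link #ACCESS_POINT_ARN}: the empty string, meaning no access point.
   */
  public static final String DEFAULT_ACCESS_POINT_ARN = "";

  private AccessPointConstantsSketch() { }

  public static void main(String[] args) {
    System.out.println(ACCESS_POINT_ARN);
  }
}
```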

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -730,11 +749,24 @@ protected void verifyBucketExists()
    */
   @Retries.RetryTranslated
   protected void verifyBucketExistsV2()
-          throws UnknownStoreException, IOException {
+      throws UnknownStoreException, IOException {
     if (!invoker.retry("doesBucketExistV2", bucket, true,
         trackDurationOfOperation(getDurationTrackerFactory(),
             STORE_EXISTS_PROBE.getSymbol(),
-            () -> s3.doesBucketExistV2(bucket)))) {
+            () -> {
+              // Bug in SDK: doesBucketExistV2() always returns `true` for AccessPoint ARNs,
+              // so expand the implementation to handle ARNs and plain buckets correctly

Review comment:
       aah




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


