Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2021/07/16 11:14:24 UTC

[GitHub] [hadoop] bogthe commented on a change in pull request #2706: HADOOP-13887. Support S3 client side encryption (S3-CSE) using AWS-SDK

bogthe commented on a change in pull request #2706:
URL: https://github.com/apache/hadoop/pull/2706#discussion_r671136384



##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java
##########
@@ -389,8 +397,10 @@ public void close() throws IOException {
         // PUT the final block
         if (hasBlock &&
             (block.hasData() || multiPartUpload.getPartsSubmitted() == 0)) {
-          //send last part
-          uploadCurrentBlock();
+          // send last part and set the value of isLastPart to true in case of
+          // CSE being enabled, since we are sure it is last part as parts
+          // are being uploaded serially in CSE.
+          uploadCurrentBlock(isCSEEnabled);

Review comment:
       Why not set it always to `true` since it's the last part? i.e. `uploadCurrentBlock(true)`
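
For context, the suggestion amounts to the following sketch (paraphrasing the diff above; not the committed change):

```
// Sketch only: this branch runs exclusively for the final block, so the
// "last part" flag could be hard-coded instead of tied to the CSE setting.
if (hasBlock &&
    (block.hasData() || multiPartUpload.getPartsSubmitted() == 0)) {
  // provably the last part, whether or not CSE is enabled
  uploadCurrentBlock(true);
}
```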

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -70,6 +76,53 @@ by Amazon's Key Management Service, a key referenced by name in the uploading cl
 * SSE-C : the client specifies an actual base64 encoded AES-256 key to be used
 to encrypt and decrypt the data.
 
+Encryption options
+
+|  type | encryption | config on write | config on read |
+|-------|---------|-----------------|----------------|
+| `SSE-S3` | server side, AES256 | encryption algorithm | none |
+| `SSE-KMS` | server side, KMS key | key used to encrypt/decrypt | none |
+| `SSE-C` | server side, custom key | encryption algorithm and secret | encryption algorithm and secret |
+| `CSE-KMS` | client side, KMS key | encryption algorithm and key ID | encryption algorithm |
+
+With server-side encryption, the data is uploaded to S3 unencrypted (but wrapped by the HTTPS
+encryption channel).
+The data is dynamically encrypted and decrypted in the S3 Store, as needed.

Review comment:
       nit: Maybe this needs to be made a bit clearer that data is always encrypted inside of S3 and decrypted when it's being retrieved? 

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
##########
@@ -356,6 +359,11 @@
   private AuditManagerS3A auditManager =
       AuditIntegration.stubAuditManager();
 
+  /**
+   * Is this S3AFS instance using S3 client side encryption?

Review comment:
       nit*: `S3A FS`

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -70,6 +76,53 @@ by Amazon's Key Management Service, a key referenced by name in the uploading cl
 * SSE-C : the client specifies an actual base64 encoded AES-256 key to be used
 to encrypt and decrypt the data.
 
+Encryption options
+
+|  type | encryption | config on write | config on read |
+|-------|---------|-----------------|----------------|
+| `SSE-S3` | server side, AES256 | encryption algorithm | none |
+| `SSE-KMS` | server side, KMS key | key used to encrypt/decrypt | none |
+| `SSE-C` | server side, custom key | encryption algorithm and secret | encryption algorithm and secret |
+| `CSE-KMS` | client side, KMS key | encryption algorithm and key ID | encryption algorithm |
+
+With server-side encryption, the data is uploaded to S3 unencrypted (but wrapped by the HTTPS
+encryption channel).
+The data is dynamically encrypted and decrypted in the S3 Store, as needed.
+
+A server side algorithm can be enabled by default for a bucket, so that
+whenever data is uploaded unencrypted a default encryption algorithm is added.
+When data is encrypted with S3-SSE or SSE-KMS it is transparent to all clients
+downloading the data.
+SSE-C is different in that every client must know the secret key needed to decrypt the data.
+
+Working with SSE-C data hard because every client must be configured to use the

Review comment:
       nit: `Working with SSE-C data hard` -> `Working with SSE-C data is harder` or `Working with SSE-C data is more difficult`
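
To make "every client must be configured" concrete, here is a minimal client-side sketch; the property names are the long-standing S3A ones, and the key value is a placeholder (a real key must be a base64-encoded AES-256 key):

```
import org.apache.hadoop.conf.Configuration;

// Every reader and writer of SSE-C objects needs both settings.
Configuration conf = new Configuration();
conf.set("fs.s3a.server-side-encryption-algorithm", "SSE-C");
conf.set("fs.s3a.server-side-encryption.key",
    "base64-encoded-256-bit-key-placeholder");
```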

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -70,6 +76,53 @@ by Amazon's Key Management Service, a key referenced by name in the uploading cl
 * SSE-C : the client specifies an actual base64 encoded AES-256 key to be used
 to encrypt and decrypt the data.
 
+Encryption options
+
+|  type | encryption | config on write | config on read |
+|-------|---------|-----------------|----------------|
+| `SSE-S3` | server side, AES256 | encryption algorithm | none |
+| `SSE-KMS` | server side, KMS key | key used to encrypt/decrypt | none |
+| `SSE-C` | server side, custom key | encryption algorithm and secret | encryption algorithm and secret |
+| `CSE-KMS` | client side, KMS key | encryption algorithm and key ID | encryption algorithm |
+
+With server-side encryption, the data is uploaded to S3 unencrypted (but wrapped by the HTTPS
+encryption channel).
+The data is dynamically encrypted and decrypted in the S3 Store, as needed.
+
+A server side algorithm can be enabled by default for a bucket, so that
+whenever data is uploaded unencrypted a default encryption algorithm is added.
+When data is encrypted with S3-SSE or SSE-KMS it is transparent to all clients
+downloading the data.
+SSE-C is different in that every client must know the secret key needed to decrypt the data.
+
+Working with SSE-C data hard because every client must be configured to use the
+algorithm and supply the key. In particular, it is very hard to mix SSE-C
+encrypted objects in the same S3 bucket with objects encrypted with other
+algorithms or unencrypted; the S3A client
+(and other applications) gets very confused.
+
+KMS-based key encryption is powerful as access to a key can be restricted to
+specific users/IAM roles. However, use of the key is billed and can be
+throttled. Furthermore, as a client seeks around a file, the KMS key *may* be
+used multiple times.
+
+S3 Client side encryption (CSE-KMS) is an experimental feature added in July
+2021.
+
+This encrypts the data on the client, before transmitting to S3, where it is
+stored encrypted. The data is decrypted after downloading when it is being
+read back.
+
+in CSE-KMS, the ID of an AWS-KMS key is provided to the S3A client; 
+the client communicates with AWS-KMS to request a new encryption key, which
+KMS returns along with the same key encrypted with the KMS key.
+The S3 client encrypts the payload *and* attaches the KMS-encrypted version
+of the key as a header to the object.
+
+When downloading data, this header is extracted, passed to AWS KMS, and,
+if the client has the appropriate permissions, the symmetric key
+retrieved and returned.

Review comment:
       nit: `retrieved and returned.` -> `retrieved.` ?
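
A minimal sketch of enabling CSE-KMS from the client side, using the `S3_ENCRYPTION_ALGORITHM` and `S3_ENCRYPTION_KEY` constants the new tests import; the bucket name and key ARN are placeholders:

```
// Assumes: import static org.apache.hadoop.fs.s3a.Constants.*;
Configuration conf = new Configuration();
conf.set(S3_ENCRYPTION_ALGORITHM, "CSE-KMS");
conf.set(S3_ENCRYPTION_KEY,       // ID/ARN of the AWS-KMS key (placeholder)
    "arn:aws:kms:eu-west-2:000000000000:key/placeholder-key-id");
FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
```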

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -419,7 +472,81 @@ the data, so guaranteeing that access to the data can be read by everyone
 granted access to that key, and nobody without access to it.
 
 
-###<a name="changing-encryption"></a> Use rename() to encrypt files with new keys
+### <a name="fattr"></a> Using `hadoop fs -getfattr` to view encryption information.
+
+The S3A client retrieves all HTTP headers from an object and returns
+them in the "XAttr" list of attributes, prefixed with `header.`.
+This makes them retrievable in the `getXAttr()` API calls, which
+are available on the command line through the `hadoop fs -getfattr -d` command.
+
+This makes viewing the encryption headers of a file straightforward
+
+Here is an example of the operation invoked on a file where the client is using CSE-KMS:
+```
+bin/hadoop fs -getfattr -d s3a://test-london/file2
+
+2021-07-14 12:59:01,554 [main] WARN  s3.AmazonS3EncryptionClientV2 (AmazonS3EncryptionClientV2.java:warnOnLegacyCryptoMode(409)) - The S3 Encryption Client is configured to read encrypted data with legacy encryption modes through the CryptoMode setting. If you don't have objects encrypted with these legacy modes, you should disable support for them to enhance security. See https://docs.aws.amazon.com/general/latest/gr/aws_sdk_cryptography.html
+2021-07-14 12:59:01,558 [main] WARN  s3.AmazonS3EncryptionClientV2 (AmazonS3EncryptionClientV2.java:warnOnRangeGetsEnabled(401)) - The S3 Encryption Client is configured to support range get requests. Range gets do not provide authenticated encryption properties even when used with an authenticated mode (AES-GCM). See https://docs.aws.amazon.com/general/latest/gr/aws_sdk_cryptography.html
+# file: s3a://test-london/file2
+header.Content-Length="0"
+header.Content-Type="application/octet-stream"
+header.ETag="63b3f4bd6758712c98f1be86afad095a"
+header.Last-Modified="Wed Jul 14 12:56:06 BST 2021"
+header.x-amz-cek-alg="AES/GCM/NoPadding"
+header.x-amz-iv="ZfrgtxvcR41yNVkw"
+header.x-amz-key-v2="AQIDAHjZrfEIkNG24/xy/pqPPLcLJulg+O5pLgbag/745OH4VQFuiRnSS5sOfqXJyIa4FOvNAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM7b5GZzD4eAGTC6gaAgEQgDte9Et/dhR7LAMU/NaPUZSgXWN5pRnM4hGHBiRNEVrDM+7+iotSTKxFco+VdBAzwFZfHjn0DOoSZEukNw=="
+header.x-amz-matdesc="{"aws:x-amz-cek-alg":"AES/GCM/NoPadding"}"
+header.x-amz-server-side-encryption="AES256"
+header.x-amz-tag-len="128"
+header.x-amz-unencrypted-content-length="0"
+header.x-amz-version-id="zXccFCB9eICszFgqv_paG1pzaUKY09Xa"
+header.x-amz-wrap-alg="kms+context"
+```
+
+Analysis
+
+1. The WARN messages are the AWS SDK warning that, because the S3A client uses an encryption algorithm which seek() requires,
+the SDK considers it less secure than the most recent algorithm(s). Ignore.
+   
+* `header.x-amz-server-side-encryption="AES256"` : the file has been encrypted with S3-SSE. This is set up as the S3 default encryption,
+so even when CSE is enabled, the data is doubly encrypted.
+* `header.x-amz-cek-alg="AES/GCM/NoPadding"`: client-side encrypted with the `AES/GCM/NoPadding` algorithm.
+* `header.x-amz-iv="ZfrgtxvcR41yNVkw"`: the initialization vector used to encrypt the payload.
+* `header.x-amz-key-v2="AQIDAHjZrfEIkNG24/xy/pqPPLcLJulg+O5pLgbag/745OH4VQFuiRnSS5sOfqXJyIa4FOvNAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM7b5GZzD4eAGTC6gaAgEQgDte9Et/dhR7LAMU/NaPUZSgXWN5pRnM4hGHBiRNEVrDM+7+iotSTKxFco+VdBAzwFZfHjn0DOoSZEukNw=="`: the encryption key itself, encrypted with the KMS key.
+* `header.x-amz-unencrypted-content-length="0"`: this is the length of the unencrypted data. The S3A client *DOES NOT* use this header;
+  it always removes 16 bytes from non-empty files when declaring the length.
+* `header.x-amz-wrap-alg="kms+context"`: the algorithm used to encrypt the CSE key.
+* `header.x-amz-version-id="zXccFCB9eICszFgqv_paG1pzaUKY09Xa"`: the bucket is versioned; this is the version ID.
+
+And a directory encrypted with S3-SSE only
+
+```
+bin/hadoop fs -getfattr -d s3a://test-london/user/stevel/target/test/data/sOCOsNgEjv
+
+# file: s3a://test-london/user/stevel/target/test/data/sOCOsNgEjv
+header.Content-Length="0"
+header.Content-Type="application/x-directory"
+header.ETag="d41d8cd98f00b204e9800998ecf8427e"
+header.Last-Modified="Tue Jul 13 20:12:07 BST 2021"
+header.x-amz-server-side-encryption="AES256"
+header.x-amz-version-id="KcDOVmznIagWx3gP1HlDqcZvm1mFWZ2a"
+```
+
+A file with no-encryption (on a bucket without versioning but with intelligent tiering)

Review comment:
       nit: missing `:`
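
The same headers can be read programmatically; a short sketch against the XAttr API described above (the path matches the CLI example, and `fs` is assumed to be an S3A `FileSystem` instance):

```
Path file = new Path("s3a://test-london/file2");

// Fetch one header...
byte[] alg = fs.getXAttr(file, "header.x-amz-cek-alg");
System.out.println("CEK algorithm: " + new String(alg, StandardCharsets.UTF_8));

// ...or everything `-getfattr -d` prints.
Map<String, byte[]> attrs = fs.getXAttrs(file);
attrs.forEach((name, value) ->
    System.out.println(name + "=" + new String(value, StandardCharsets.UTF_8)));
```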

##########
File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AClientSideEncryption.java
##########
@@ -0,0 +1,301 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *       http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.hadoop.fs.contract.ContractTestUtils;
+import org.apache.hadoop.fs.statistics.IOStatisticAssertions;
+import org.apache.hadoop.fs.statistics.IOStatistics;
+
+import static org.apache.hadoop.fs.contract.ContractTestUtils.createFile;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.dataset;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.rm;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.verifyFileContents;
+import static org.apache.hadoop.fs.contract.ContractTestUtils.writeDataset;
+import static org.apache.hadoop.fs.s3a.Constants.MULTIPART_MIN_SIZE;
+import static org.apache.hadoop.fs.s3a.Constants.S3_ENCRYPTION_ALGORITHM;
+import static org.apache.hadoop.fs.s3a.Constants.S3_ENCRYPTION_KEY;
+import static org.apache.hadoop.fs.s3a.S3ATestUtils.assume;
+import static org.apache.hadoop.fs.s3a.S3ATestUtils.getTestPropertyBool;
+import static org.apache.hadoop.test.LambdaTestUtils.intercept;
+
+/**
+ * Tests to verify S3 Client-Side Encryption (CSE).
+ */
+public abstract class ITestS3AClientSideEncryption extends AbstractS3ATestBase {
+
+  private static final List<Integer> SIZES =
+      new ArrayList<>(Arrays.asList(0, 1, 255, 4095));
+
+  private static final int BIG_FILE_SIZE = 15 * 1024 * 1024;
+  private static final int SMALL_FILE_SIZE = 1024;
+
+  /**
+   * Testing S3 CSE on different file sizes.
+   */
+  @Test
+  public void testEncryption() throws Throwable {
+    describe("Test to verify client-side encryption for different file sizes.");
+    for (int size : SIZES) {
+      validateEncryptionForFileSize(size);
+    }
+  }
+
+  /**
+   * Testing the S3 client side encryption over rename operation.
+   */
+  @Test
+  public void testEncryptionOverRename() throws Throwable {
+    describe("Test for AWS CSE on Rename Operation.");
+    maybeSkipTest();
+    S3AFileSystem fs = getFileSystem();
+    Path src = path(getMethodName());
+    byte[] data = dataset(SMALL_FILE_SIZE, 'a', 'z');
+    writeDataset(fs, src, data, data.length, SMALL_FILE_SIZE,
+        true, false);
+
+    ContractTestUtils.verifyFileContents(fs, src, data);
+    Path dest = path(src.getName() + "-copy");
+    fs.rename(src, dest);
+    ContractTestUtils.verifyFileContents(fs, dest, data);
+    assertEncrypted(dest);
+  }
+
+  /**
+   * Test to verify if we get same content length of files in S3 CSE using
+   * listStatus and listFiles on the parent directory.
+   */
+  @Test
+  public void testDirectoryListingFileLengths() throws IOException {
+    describe("Test to verify directory listing calls gives correct content "
+        + "lengths");
+    maybeSkipTest();
+    S3AFileSystem fs = getFileSystem();
+    Path parentDir = path(getMethodName());
+
+    // Creating files in the parent directory that will be used to assert
+    // content length.
+    for (int i : SIZES) {
+      Path child = new Path(parentDir, getMethodName() + i);
+      writeThenReadFile(child, i);
+    }
+
+    // Getting the content lengths of files inside the directory via FileStatus.
+    List<Integer> fileLengthDirListing = new ArrayList<>();
+    for (FileStatus fileStatus : fs.listStatus(parentDir)) {
+      fileLengthDirListing.add((int) fileStatus.getLen());
+    }
+    // Assert the file length we got against expected file length for
+    // ListStatus.
+    Assertions.assertThat(fileLengthDirListing)
+        .describedAs("File lengths aren't the same "
+            + "as expected from FileStatus dir. listing")
+        .containsExactlyInAnyOrderElementsOf(SIZES);
+
+    // Getting the content lengths of files inside the directory via ListFiles.
+    RemoteIterator<LocatedFileStatus> listDir = fs.listFiles(parentDir, true);
+    List<Integer> fileLengthListLocated = new ArrayList<>();
+    while (listDir.hasNext()) {
+      LocatedFileStatus fileStatus = listDir.next();
+      fileLengthListLocated.add((int) fileStatus.getLen());
+    }
+    // Assert the file length we got against expected file length for
+    // LocatedFileStatus.
+    Assertions.assertThat(fileLengthListLocated)
+        .describedAs("File lengths isn't same "
+            + "as expected from LocatedFileStatus dir. listing")
+        .containsExactlyInAnyOrderElementsOf(SIZES);
+
+  }
+
+  /**
+   * Test to verify multipart upload through S3ABlockOutputStream and
+   * verifying the contents of the uploaded file.
+   */
+  @Test
+  public void testBigFilePutAndGet() throws IOException {
+    maybeSkipTest();
+    assume("Scale test disabled: to enable set property " +
+        KEY_SCALE_TESTS_ENABLED, getTestPropertyBool(
+        getConfiguration(),
+        KEY_SCALE_TESTS_ENABLED,
+        DEFAULT_SCALE_TESTS_ENABLED));
+    S3AFileSystem fs = getFileSystem();
+    Path filePath = path(getMethodName());
+    byte[] fileContent = dataset(BIG_FILE_SIZE, 'a', 26);
+    int offsetSeek = fileContent[BIG_FILE_SIZE - 4];
+
+    // PUT a 15MB file using CSE to force multipart in CSE.
+    createFile(fs, filePath, true, fileContent);
+    LOG.info("Multi-part upload successful...");
+
+    try (FSDataInputStream in = fs.open(filePath)) {
+      // Verify random IO.
+      in.seek(BIG_FILE_SIZE - 4);
+      assertEquals("Byte at a specific position not equal to actual byte",
+          offsetSeek, in.read());
+      in.seek(0);
+      assertEquals("Byte at a specific position not equal to actual byte",
+          'a', in.read());
+
+      // Verify seek-read between two multipart blocks.
+      in.seek(MULTIPART_MIN_SIZE - 1);
+      int byteBeforeBlockEnd = fileContent[MULTIPART_MIN_SIZE];
+      assertEquals("Byte before multipart block end mismatch",
+          byteBeforeBlockEnd - 1, in.read());
+      assertEquals("Byte at multipart end mismatch",
+          byteBeforeBlockEnd, in.read());
+      assertEquals("Byte after multipart end mismatch",
+          byteBeforeBlockEnd + 1, in.read());
+
+      // Verify end of file seek read.
+      in.seek(BIG_FILE_SIZE + 1);
+      assertEquals("Byte at eof mismatch",
+          -1, in.read());
+
+      // Verify full read.
+      in.readFully(0, fileContent);
+      verifyFileContents(fs, filePath, fileContent);
+    }
+  }
+
+  /**
+   * Testing how unencrypted and encrypted data behaves when read through
+   * CSE enabled and disabled FS respectively.
+   */
+  @Test
+  public void testEncryptionEnabledAndDisabledFS() throws Exception {

Review comment:
       This is awesome
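
For readers without the full diff (the test body is truncated above), the premise can be sketched like this; the names and sizes here are hypothetical, not the PR's code:

```
// Data written by a CSE-enabled FS should look 16 bytes longer
// (the AES-GCM tag) to a client with CSE disabled.
Configuration plainConf = new Configuration(getConfiguration());
plainConf.unset(S3_ENCRYPTION_ALGORITHM);
try (FileSystem plainFs = FileSystem.newInstance(
    getFileSystem().getUri(), plainConf)) {
  long rawLen = plainFs.getFileStatus(encryptedFile).getLen();
  Assertions.assertThat(rawLen)
      .describedAs("CSE-disabled client sees the AES-GCM tag padding")
      .isEqualTo(SMALL_FILE_SIZE + 16);
}
```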

##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java
##########
@@ -524,10 +525,15 @@ public static S3AFileStatus createFileStatus(Path keyPath,
       long blockSize,
       String owner,
       String eTag,
-      String versionId) {
+      String versionId,
+      boolean isCSEEnabled) {

Review comment:
       Is it worth creating a ticket in JIRA now so we don't forget about it?

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -70,6 +76,53 @@ by Amazon's Key Management Service, a key referenced by name in the uploading cl
 * SSE-C : the client specifies an actual base64 encoded AES-256 key to be used
 to encrypt and decrypt the data.
 
+Encryption options
+
+|  type | encryption | config on write | config on read |
+|-------|---------|-----------------|----------------|
+| `SSE-S3` | server side, AES256 | encryption algorithm | none |
+| `SSE-KMS` | server side, KMS key | key used to encrypt/decrypt | none |
+| `SSE-C` | server side, custom key | encryption algorithm and secret | encryption algorithm and secret |
+| `CSE-KMS` | client side, KMS key | encryption algorithm and key ID | encryption algorithm |
+
+With server-side encryption, the data is uploaded to S3 unencrypted (but wrapped by the HTTPS
+encryption channel).
+The data is dynamically encrypted and decrypted in the S3 Store, as needed.
+
+A server side algorithm can be enabled by default for a bucket, so that
+whenever data is uploaded unencrypted a default encryption algorithm is added.
+When data is encrypted with S3-SSE or SSE-KMS it is transparent to all clients
+downloading the data.
+SSE-C is different in that every client must know the secret key needed to decrypt the data.
+
+Working with SSE-C data hard because every client must be configured to use the
+algorithm and supply the key. In particular, it is very hard to mix SSE-C
+encrypted objects in the same S3 bucket with objects encrypted with other
+algorithms or unencrypted; the S3A client
+(and other applications) gets very confused.
+
+KMS-based key encryption is powerful as access to a key can be restricted to
+specific users/IAM roles. However, use of the key is billed and can be
+throttled. Furthermore, as a client seeks around a file, the KMS key *may* be
+used multiple times.
+
+S3 Client side encryption (CSE-KMS) is an experimental feature added in July
+2021.
+
+This encrypts the data on the client, before transmitting to S3, where it is
+stored encrypted. The data is decrypted after downloading when it is being
+read back.
+
+in CSE-KMS, the ID of an AWS-KMS key is provided to the S3A client; 

Review comment:
       nit: Capital `i` 

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -419,7 +472,81 @@ the data, so guaranteeing that access to the data can be read by everyone
 granted access to that key, and nobody without access to it.
 
 
-###<a name="changing-encryption"></a> Use rename() to encrypt files with new keys
+### <a name="fattr"></a> Using `hadoop fs -getfattr` to view encryption information.
+
+The S3A client retrieves all HTTP headers from an object and returns
+them in the "XAttr" list of attributes, prefixed with `header.`.
+This makes them retrievable in the `getXAttr()` API calls, which
+are available on the command line through the `hadoop fs -getfattr -d` command.
+
+This makes viewing the encryption headers of a file straightforward
+
+Here is an example of the operation invoked on a file where the client is using CSE-KMS:
+```
+bin/hadoop fs -getfattr -d s3a://test-london/file2
+
+2021-07-14 12:59:01,554 [main] WARN  s3.AmazonS3EncryptionClientV2 (AmazonS3EncryptionClientV2.java:warnOnLegacyCryptoMode(409)) - The S3 Encryption Client is configured to read encrypted data with legacy encryption modes through the CryptoMode setting. If you don't have objects encrypted with these legacy modes, you should disable support for them to enhance security. See https://docs.aws.amazon.com/general/latest/gr/aws_sdk_cryptography.html
+2021-07-14 12:59:01,558 [main] WARN  s3.AmazonS3EncryptionClientV2 (AmazonS3EncryptionClientV2.java:warnOnRangeGetsEnabled(401)) - The S3 Encryption Client is configured to support range get requests. Range gets do not provide authenticated encryption properties even when used with an authenticated mode (AES-GCM). See https://docs.aws.amazon.com/general/latest/gr/aws_sdk_cryptography.html
+# file: s3a://test-london/file2
+header.Content-Length="0"
+header.Content-Type="application/octet-stream"
+header.ETag="63b3f4bd6758712c98f1be86afad095a"
+header.Last-Modified="Wed Jul 14 12:56:06 BST 2021"
+header.x-amz-cek-alg="AES/GCM/NoPadding"
+header.x-amz-iv="ZfrgtxvcR41yNVkw"
+header.x-amz-key-v2="AQIDAHjZrfEIkNG24/xy/pqPPLcLJulg+O5pLgbag/745OH4VQFuiRnSS5sOfqXJyIa4FOvNAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM7b5GZzD4eAGTC6gaAgEQgDte9Et/dhR7LAMU/NaPUZSgXWN5pRnM4hGHBiRNEVrDM+7+iotSTKxFco+VdBAzwFZfHjn0DOoSZEukNw=="
+header.x-amz-matdesc="{"aws:x-amz-cek-alg":"AES/GCM/NoPadding"}"
+header.x-amz-server-side-encryption="AES256"
+header.x-amz-tag-len="128"
+header.x-amz-unencrypted-content-length="0"
+header.x-amz-version-id="zXccFCB9eICszFgqv_paG1pzaUKY09Xa"
+header.x-amz-wrap-alg="kms+context"
+```
+
+Analysis
+
+1. The WARN messages are the AWS SDK warning that, because the S3A client uses an encryption algorithm which seek() requires,
+the SDK considers it less secure than the most recent algorithm(s). Ignore.
+   
+* `header.x-amz-server-side-encryption="AES256"` : the file has been encrypted with S3-SSE. This is set up as the S3 default encryption,
+so even when CSE is enabled, the data is doubly encrypted.
+* `header.x-amz-cek-alg="AES/GCM/NoPadding"`: client-side encrypted with the `AES/GCM/NoPadding` algorithm.
+* `header.x-amz-iv="ZfrgtxvcR41yNVkw"`: the initialization vector used to encrypt the payload.
+* `header.x-amz-key-v2="AQIDAHjZrfEIkNG24/xy/pqPPLcLJulg+O5pLgbag/745OH4VQFuiRnSS5sOfqXJyIa4FOvNAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM7b5GZzD4eAGTC6gaAgEQgDte9Et/dhR7LAMU/NaPUZSgXWN5pRnM4hGHBiRNEVrDM+7+iotSTKxFco+VdBAzwFZfHjn0DOoSZEukNw=="`: the encryption key itself, encrypted with the KMS key.
+* `header.x-amz-unencrypted-content-length="0"`: this is the length of the unencrypted data. The S3A client *DOES NOT* use this header;
+  it always removes 16 bytes from non-empty files when declaring the length.
+* `header.x-amz-wrap-alg="kms+context"`: the algorithm used to encrypt the CSE key.
+* `header.x-amz-version-id="zXccFCB9eICszFgqv_paG1pzaUKY09Xa"`: the bucket is versioned; this is the version ID.
+
+And a directory encrypted with S3-SSE only

Review comment:
       nit: missing `:`

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -435,6 +562,78 @@ for reading as for writing, and you must supply that key for reading. There
 you need to copy one bucket to a different bucket, one with a different key.
 Use `distCp` for this, with per-bucket encryption policies.
 
+## <a name="cse"></a> Amazon S3 Client Side Encryption
+
+### Introduction
+Amazon S3 Client Side Encryption (S3-CSE) is used to encrypt data on the
+client side and then transmit it to S3 storage. The same encrypted data
+is transmitted back to the client when reading, and then
+decrypted on the client side.
+
+S3-CSE uses `AmazonS3EncryptionClientV2.java` as the AmazonS3 client. The
+encryption and decryption is done by the AWS SDK. As of July 2021, only the
+CSE-KMS method is supported.
+
+A key reason this feature (HADOOP-13887) has been unavailable for a long time
+is that the AWS S3 client pads uploaded objects with a 16 byte footer. This
+meant that files were shorter when being read than when listed
+through any of the list API calls/getFileStatus(). This broke many
+applications, including anything seeking near the end of a file to read a
+footer, as ORC and Parquet do.
+
+There is now a workaround: compensate for the footer in listings when CSE is enabled.
+
+- When listing files and directories, 16 bytes are subtracted from the length
+of all non-empty objects (greater than or equal to 16 bytes).
+- Directory markers MAY be longer than 0 bytes.
+
+This "appears" to work; secondly it does in the testing as of July 2021. However
+, the length of files when listed through the S3A client is now going to be
+shorter than the length of files listed with other clients -including S3A
+clients where S3-CSE has not been enabled.
+
+### Features
+
+- Supports client side encryption with keys managed in AWS KMS.
+- Encryption settings are propagated into jobs through any issued delegation tokens.
+- Encryption information is stored as headers in the uploaded object.
+
+### Limitations
+
+- Performance will be reduced. All encrypt/decrypt is now being done on the
+ client.
+- Writing files may be slower, as only a single block can be encrypted and
+ uploaded at a time

Review comment:
       nit: missing `.` or `;`
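
The listing workaround described in this hunk boils down to a small length adjustment; a sketch of the idea (assumed shape, not lifted from the PR):

```
// Subtract the 16 byte AES-GCM tag from any object at least that long.
private static long unpaddedLength(long reportedSize, boolean isCSEEnabled) {
  final long CSE_PADDING = 16; // tag appended by the SDK on upload
  return (isCSEEnabled && reportedSize >= CSE_PADDING)
      ? reportedSize - CSE_PADDING
      : reportedSize;
}
```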

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -419,7 +472,81 @@ the data, so guaranteeing that access to thee data can be read by everyone
 granted access to that key, and nobody without access to it.
 
 
-###<a name="changing-encryption"></a> Use rename() to encrypt files with new keys
+### <a name="fattr"></a> Using `hadoop fs -getfattr` to view encryption information.
+
+The S3A client retrieves all HTTP headers from an object and returns
+them in the "XAttr" list of attributes, prefixed with `header.`.
+This makes them retrievable in the `getXAttr()` API calls, which
+are available on the command line through the `hadoop fs -getfattr -d` command.
+
+This makes viewing the encryption headers of a file straightforward

Review comment:
       nit: Missing `.`

##########
File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/encryption.md
##########
@@ -70,6 +76,53 @@ by Amazon's Key Management Service, a key referenced by name in the uploading cl
 * SSE-C : the client specifies an actual base64 encoded AES-256 key to be used
 to encrypt and decrypt the data.
 
+Encryption options
+
+|  type | encryption | config on write | config on read |
+|-------|---------|-----------------|----------------|
+| `SSE-S3` | server side, AES256 | encryption algorithm | none |
+| `SSE-KMS` | server side, KMS key | key used to encrypt/decrypt | none |
+| `SSE-C` | server side, custom key | encryption algorithm and secret | encryption algorithm and secret |
+| `CSE-KMS` | client side, KMS key | encryption algorithm and key ID | encryption algorithm |
+
+With server-side encryption, the data is uploaded to S3 unencrypted (but wrapped by the HTTPS
+encryption channel).
+The data is dynamically encrypted and decrypted in the S3 Store, as needed.
+
+A server side algorithm can be enabled by default for a bucket, so that
+whenever data is uploaded unencrypted a default encryption algorithm is added.
+When data is encrypted with S3-SSE or SSE-KMS it is transparent to all clients
+downloading the data.
+SSE-C is different in that every client must know the secret key needed to decrypt the data.
+
+Working with SSE-C data hard because every client must be configured to use the
+algorithm and supply the key. In particular, it is very hard to mix SSE-C
+encrypted objects in the same S3 bucket with objects encrypted with other
+algorithms or unencrypted; the S3A client
+(and other applications) gets very confused.
+
+KMS-based key encryption is powerful as access to a key can be restricted to
+specific users/IAM roles. However, use of the key is billed and can be
+throttled. Furthermore, as a client seeks around a file, the KMS key *may* be

Review comment:
       I'm not sure if Hadoop does it, but KMS keys can be cached by the client, which reduces billing and throttling. If Hadoop supports this, it might be worth mentioning here; otherwise it would be nice to add (not in this PR, of course!).
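
Purely as an illustration of the caching idea (not Hadoop or AWS SDK code): a data-key cache keyed on the wrapped-key blob, so repeated seeks over one object reuse a single KMS Decrypt call:

```
import java.util.concurrent.ConcurrentHashMap;

class WrappedKeyCache {
  private final ConcurrentHashMap<String, byte[]> cache =
      new ConcurrentHashMap<>();

  byte[] unwrap(String wrappedKeyBase64) {
    return cache.computeIfAbsent(wrappedKeyBase64, this::kmsDecrypt);
  }

  private byte[] kmsDecrypt(String wrappedKeyBase64) {
    // placeholder for the real AWS KMS Decrypt call
    throw new UnsupportedOperationException("KMS Decrypt goes here");
  }
}
```

The trade-off: fewer KMS calls and less throttling, at the cost of holding plaintext data keys in client memory.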




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org