Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/16 11:55:16 UTC

[GitHub] [iceberg] ismailsimsek commented on a change in pull request #1891: AWS: documentation page for AWS module

ismailsimsek commented on a change in pull request #1891:
URL: https://github.com/apache/iceberg/pull/1891#discussion_r544238698



##########
File path: site/docs/aws.md
##########
@@ -0,0 +1,212 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+ 
+# Iceberg AWS Integrations
+
+Iceberg provides integration with different AWS services through the `iceberg-aws` module. 
+This section describes how to use Iceberg with AWS.
+
+## Runtime Packages
+
+The first thing to note is that the `iceberg-aws` module is not bundled with any engine runtime.
+To use any features described in later sections, you need to include the following packages yourself:
+
+* the [Iceberg AWS package](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-aws)
+* the [AWS SDK bundle](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle), or individual AWS client packages if you prefer a minimal dependency footprint (please note that Iceberg uses the new AWS SDK v2 instead of v1)
+
+For example, in Spark 3, you can start the SQL shell with:
+
+```sh
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable
+```
+
+## Glue Catalog
+
+Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
+When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html),
+an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
+and an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion).
+You can start using the Glue catalog by specifying `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`.
+More details about loading the catalog can be found in the individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
+
+### Glue Catalog ID
+
+It is very common for an organization to store all of its tables in a single Glue catalog in a single AWS account and run data computation in many different accounts.
+In this case, you need to specify a Glue catalog ID when initializing `GlueCatalog`.
+The Glue catalog ID to use is the AWS account ID.
+This is because each AWS account has exactly one Glue catalog in each AWS region,
+and the region is determined by the Glue client making the call.
+If you would like to access a Glue catalog in a different region, you should configure your AWS client accordingly; see more details in [AWS client configurations](#aws-client-configurations).
+It is also common to [assume a role](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) for cross-account access. See [AssumeRoleConfigurer](#assumeroleconfigurer) for how to set up assume-role credentials in Iceberg.
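+
+For illustration, the catalog ID could be supplied as an additional catalog property on the `spark-sql` command shown in the [Runtime Packages](#runtime-packages) section. The property key `gluecatalog.id` below is only an assumption for this sketch; check the `GlueCatalog` configuration reference for the exact key.
+
+```sh
+# hypothetical property name, added to the spark-sql command from the Runtime Packages section
+--conf spark.sql.catalog.my_catalog.gluecatalog.id=123456789012
+```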
+
+### Skip Archive
+
+By default, Glue stores all the table versions created, and users can roll back a table to any historical version if needed.
+However, if you are streaming data to Iceberg, this can quickly create a large number of Glue table versions.
+Therefore, it is recommended to turn off the archive feature in Glue by setting `` to false.
+For more details, please read [Glue Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the [UpdateTable API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
+
+
+### DynamoDB for locking Glue tables
+
+Glue does not provide a strong guarantee against concurrent updates to a table.
+Although it throws a `ConcurrentModificationException` when it detects two processes updating a table at the same time,
+there is no guarantee that one update will not clobber the other.
+Therefore, a DynamoDB lock is enabled by default for Glue, so that for every commit,
+`GlueCatalog` first obtains a lock using a helper DynamoDB table and then tries to safely modify the Glue table.
+Users must specify a table name through the catalog property `gluecatalog.lock.table` as the helper DynamoDB lock table to use.
+It is recommended to use the same DynamoDB table for operations in the same Glue catalog,
+and to use a different table for a Glue catalog in another account or region.
+If the lock table with the given name does not exist in DynamoDB, a new table is created with billing mode set to [Pay-per-Request](https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing).
+The lock has the following additional properties:
+
+* `gluecatalog.lock.wait-ms`: max time to wait for lock acquisition, defaults to 3 minutes
+* `gluecatalog.lock.expire-ms`: max time a table can be locked by a process, defaults to 20 minutes
+
+If your use case consists only of single-process, low-frequency (e.g. hourly, daily) updates to a table,
+you can also turn off this locking feature by setting `gluecatalog.lock.enabled` to false.
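+
+For example, here is a sketch of configuring the lock table and timeouts for the Spark catalog from the [Runtime Packages](#runtime-packages) section; the values shown are arbitrary examples:
+
+```sh
+# add to the spark-sql command shown in the Runtime Packages section
+--conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable \
+--conf spark.sql.catalog.my_catalog.gluecatalog.lock.wait-ms=60000 \
+--conf spark.sql.catalog.my_catalog.gluecatalog.lock.expire-ms=600000
+```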
+
+### Warehouse Location
+
+By default, the Glue catalog uses `S3FileIO` and only allows a warehouse location in S3.
+To store data in a different local or cloud store, the Glue catalog can be switched to `HadoopFileIO`
+or any custom FileIO using the mechanism described in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
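+
+As a sketch, assuming the `io-impl` catalog property described in the next section is used to select the FileIO implementation, pointing the Glue catalog at an HDFS warehouse might look like:
+
+```sh
+# hypothetical warehouse path; io-impl usage assumed from the S3 FileIO section below
+--conf spark.sql.catalog.my_catalog.warehouse=hdfs://namenode:8020/warehouse/path \
+--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
+```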
+
+## S3 FileIO
+
+Iceberg allows users to write data to S3 through `S3FileIO`.
+`GlueCatalog` uses this FileIO by default, and other catalogs can load it using the `io-impl` catalog property.
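+
+For example, here is a sketch of loading `S3FileIO` into a Hive-backed Spark catalog through the `io-impl` property:
+
+```sh
+--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+--conf spark.sql.catalog.my_catalog.type=hive \
+--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
+```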
+
+### Progressive Multipart Upload
+
+`S3FileIO` implements a customized progressive multipart upload algorithm to upload data.
+Data files are uploaded in parts in parallel as soon as each part is ready,
+and each file part is deleted as soon as its upload completes.
+This provides maximized upload speed and minimized local disk usage during uploads.
+Here are the configurations users can tune for this feature (see the example after this list):
+
+* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams)
+* `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, defaults to 32MB
+* `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, defaults to 1.5
+* `s3fileio.staging.dir`: the directory to hold temporary files, defaults to the value of Java's `java.io.tmpdir` property
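+
+Here is a sketch of tuning these properties through Spark catalog configuration; the values are arbitrary examples, and the part size is assumed to be specified in bytes:
+
+```sh
+# add to the spark-sql command shown in the Runtime Packages section
+--conf spark.sql.catalog.my_catalog.s3fileio.multipart.num-threads=16 \
+--conf spark.sql.catalog.my_catalog.s3fileio.multipart.part.size=67108864 \
+--conf spark.sql.catalog.my_catalog.s3fileio.staging.dir=/tmp/iceberg-staging
+```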
+
+### S3 Server Side Encryption
+
+`S3FileIO` supports all three S3 server-side encryption modes:
+
+* [SSE-S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html): When you use Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3), each object is encrypted with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt your data.
+* [SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html): Server-Side Encryption with Customer Master Keys (CMKs) Stored in AWS Key Management Service (SSE-KMS) is similar to SSE-S3, but with some additional benefits and charges for using this service. There are separate permissions for the use of a CMK that provides added protection against unauthorized access of your objects in Amazon S3. SSE-KMS also provides you with an audit trail that shows when your CMK was used and by whom. Additionally, you can create and manage customer managed CMKs or use AWS managed CMKs that are unique to you, your service, and your Region.
+* [SSE-C](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html): With Server-Side Encryption with Customer-Provided Keys (SSE-C), you manage the encryption keys and Amazon S3 manages the encryption, as it writes to disks, and decryption, when you access your objects.
+
+To enable server side encryption, use the following configuration properties:
+
+* `s3fileio.sse.type`: `none`, `s3`, `kms` or `custom`, defaults to `none`
+* `s3fileio.sse.key`: a KMS Key ID or ARN for `kms` type (defaults to `aws/s3`), or a custom base-64 AES256 symmetric key for `custom` type
+* `s3fileio.sse.md5`: if SSE type is `custom`, this value must be set as the base-64 MD5 digest of the symmetric key to ensure integrity
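+
+For example, here is a sketch of enabling SSE-KMS with a specific key; the key ARN is a placeholder:
+
+```sh
+# add to the spark-sql command shown in the Runtime Packages section
+--conf spark.sql.catalog.my_catalog.s3fileio.sse.type=kms \
+--conf spark.sql.catalog.my_catalog.s3fileio.sse.key=arn:aws:kms:us-east-1:123456789012:key/my-key-id
+```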
+
+### S3 Access Control List
+
+`S3FileIO` supports S3 access control lists (ACL) for detailed access control.
+Users can choose the ACL level by setting the `s3fileio.acl` property.
+For more details, please read the [S3 ACL Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html).
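+
+As a sketch, the property could be set to one of the S3 canned ACLs; the value format below is an assumption, so verify it against the linked documentation:
+
+```sh
+# assumed value format, for illustration only
+--conf spark.sql.catalog.my_catalog.s3fileio.acl=bucket-owner-full-control
+```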
+
+### ObjectStoreLocationProvider
+
+S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+This means data stored in a traditional Hive storage layout has poor read and write throughput, since data files of the same partition are placed under the same prefix.
+Iceberg uses the Hive storage layout by default, but it can be switched to use the `ObjectStoreLocationProvider`.
+In this mode, a hash string is added to the beginning of each file path, so that files are evenly distributed across all prefixes in an S3 bucket.
+This results in minimized throttling and maximized throughput for S3-related IO operations.
+For more details, please follow the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section to enable this feature.
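+
+As a sketch, assuming the object-store layout is toggled by the `write.object-storage.enabled` table property (the linked LocationProvider section is authoritative), an existing table could opt in like this:
+
+```sh
+# assumes my_catalog is already configured as shown in the Runtime Packages section
+spark-sql -e "ALTER TABLE my_catalog.db.sample SET TBLPROPERTIES ('write.object-storage.enabled'='true')"
+```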
+
+### S3 Strong Consistency
+
+In November 2020, S3 announced [strong consistency](https://aws.amazon.com/s3/consistency/) for all GET and LIST operations, and Iceberg has been updated to fully leverage this feature.
+When creating a new output file using `OutputFile.create()`, a strong consistency check is performed and an `AlreadyExistsException` is thrown if the file already exists in S3.
+
+### Hadoop S3A
+
+Before `S3FileIO` was introduced, many Iceberg users chose `HadoopFileIO` to write data to S3 through the [S3A FileSystem](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java).
+As introduced in the sections above, `S3FileIO` adopts the latest AWS clients and S3 features for optimized security and performance,
+and is thus recommended over S3A for S3 use cases.
+
+`S3FileIO` is compatible with legacy URI schemes written by S3A, 
+so any existing tables with `s3a://` or `s3n://` file paths are treated as equivalent `s3://` file paths.    
+
+If for any reason you have to use S3A, here are the instructions:
+
+1. To store data using S3A, specify the `warehouse` catalog property to be an S3A path, e.g. `s3a://my-bucket/my-warehouse`.
+2. For `HiveCatalog`, to also store metadata using S3A, specify the Hadoop config `hive.metastore.warehouse.dir` to be an S3A path.
+3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency and configure AWS settings based on the [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure to check the version; S3A configuration varies a lot depending on the version you use).
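+
+Putting these steps together, a sketch of a Spark SQL session using S3A might look like the following; the hadoop-aws version must match your Hadoop distribution, and S3A credential settings are omitted:
+
+```sh
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.hadoop:hadoop-aws:3.2.0 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.type=hive \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3a://my-bucket/my-warehouse
+```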
+
+
+## AWS client configurations
+
+Many organizations have customized ways of obtaining AWS credentials and region information and of configuring details about AWS clients for features like proxy access, retry, etc.
+Therefore, Iceberg exposes a configurer interface so that users can plug in any client configuration in a centralized place.
+Users can set the `client.configurer` property to the class name of the custom configurer.
+
+For example, a configurer can do something like the following:
+
+```java
+package com.my.team;
+
+import software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider;
+import software.amazon.awssdk.awscore.client.builder.AwsClientBuilder;
+import software.amazon.awssdk.awscore.client.builder.AwsSyncClientBuilder;
+import software.amazon.awssdk.services.s3.S3ClientBuilder;
+
+public class MyCustomClientConfigurer implements AwsClientConfigurer {
+
+  @Override
+  public <T extends AwsClientBuilder & AwsSyncClientBuilder> T configure(T clientBuilder) {
+    // set some custom S3-only configurations
+    if (clientBuilder instanceof S3ClientBuilder) {
+      S3ClientBuilder s3ClientBuilder = (S3ClientBuilder) clientBuilder;
+      // configure something
+    }
+
+    // set the same credential provider for all clients
+    clientBuilder.credentialsProvider(ContainerCredentialsProvider.builder().build());
+    return clientBuilder;
+  }
+}
+```
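+
+To use this configurer, package the class into a jar available on the application classpath and point the `client.configurer` catalog property at it. Here is a sketch for the Spark SQL shell, where the jar path is a placeholder:
+
+```sh
+spark-sql --jars /path/to/my-team-aws-config.jar \
+    --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.client.configurer=com.my.team.MyCustomClientConfigurer
+```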
+
+### AssumeRoleConfigurer
+
+As a common use case, Iceberg provides `AssumeRoleConfigurer` as an example configurer. It has the following catalog properties:
+
+* `client.assume-role.arn`: role ARN to assume
+* `client.assume-role.timeout-sec`: the number of seconds an assume-role session lasts; after the timeout, a new session is automatically fetched by an STS client
+* `client.assume-role.external-id`: optional external ID for the role to assume
+* `client.assume-role.region`: a region for all clients (except the STS client) to use
+
+When this configurer is used, an STS client is initialized with the default [credentials chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html) and [region chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html),
+and all other clients (Glue, DynamoDB, S3, etc.) use the configured assume-role credentials and region.
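+
+For example, here is a sketch of assuming a role for cross-account access; the fully qualified class name of `AssumeRoleConfigurer` is an assumption here, and the role ARN is a placeholder:
+
+```sh
+# the configurer class package is assumed; verify it against the iceberg-aws module
+--conf spark.sql.catalog.my_catalog.client.configurer=org.apache.iceberg.aws.AssumeRoleConfigurer \
+--conf spark.sql.catalog.my_catalog.client.assume-role.arn=arn:aws:iam::123456789012:role/my-iceberg-role \
+--conf spark.sql.catalog.my_catalog.client.assume-role.region=us-east-1 \
+--conf spark.sql.catalog.my_catalog.client.assume-role.timeout-sec=900
+```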
+
+## Run Iceberg on AWS

Review comment:
       Thank you @jackye1995, it looks great. This is more of a question: is it also possible to run Iceberg with an AWS "Glue Job" if we use `spark.jars.packages` to provide the Iceberg library?



