Posted to commits@iceberg.apache.org by bl...@apache.org on 2021/01/09 00:50:14 UTC
[iceberg] branch master updated: Docs: Add page for AWS integration
(#1891)
This is an automated email from the ASF dual-hosted git repository.
blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new 674c9b6 Docs: Add page for AWS integration (#1891)
674c9b6 is described below
commit 674c9b6e19354cbef851114ff8a1ba2ecdba01fa
Author: Jack Ye <yz...@amazon.com>
AuthorDate: Fri Jan 8 16:50:05 2021 -0800
Docs: Add page for AWS integration (#1891)
---
site/docs/aws.md | 256 +++++++++++++++++++++++++++++++++++++++++++++
site/docs/configuration.md | 14 +++
site/docs/css/extra.css | 4 +
site/mkdocs.yml | 4 +-
4 files changed, 277 insertions(+), 1 deletion(-)
diff --git a/site/docs/aws.md b/site/docs/aws.md
new file mode 100644
index 0000000..57bfa2a
--- /dev/null
+++ b/site/docs/aws.md
@@ -0,0 +1,256 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg AWS Integrations
+
+Iceberg provides integration with different AWS services through the `iceberg-aws` module.
+This section describes how to use Iceberg with AWS.
+
+## Enabling AWS Integration
+
+The `iceberg-aws` module is bundled with Spark and Flink engine runtimes.
+However, the AWS clients are not bundled so that you can use the same client version as your application.
+You will need to provide the AWS v2 SDK because that is what Iceberg depends on.
+You can choose to use the [AWS SDK bundle](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle),
+or individual AWS client packages (Glue, S3, DynamoDB, KMS, STS) if you would like to have a minimal dependency footprint.
+
+For example, to use AWS features with Spark 3 and AWS clients version 2.15.40, you can start the Spark SQL shell with:
+
+```sh
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+ --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+ --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
+ --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+ --conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager \
+ --conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable
+```
+
+In the shell command above, `--packages` specifies the additional AWS bundle dependency at version `2.15.40`.
+
+## Glue Catalog
+
+Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
+When used, an Iceberg namespace is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html),
+an Iceberg table is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
+and every Iceberg table version is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion).
+You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`,
+just like what is shown in the [enabling AWS integration](#enabling-aws-integration) section above.
+More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
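+
+For illustration, creating a Glue-backed catalog from Flink SQL might look like the following sketch (the catalog name and warehouse path are placeholders):
+
+```sql
+CREATE CATALOG glue_catalog WITH (
+  'type'='iceberg',
+  'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
+  'warehouse'='s3://my-bucket/my/key/prefix'
+);
+```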
+
+### Glue Catalog ID
+
+There is a unique Glue metastore in each AWS account and each AWS region.
+By default, `GlueCatalog` chooses the Glue metastore to use based on the user's default AWS client credential and region setup.
+You can specify the Glue catalog ID through `glue.id` catalog property to point to a Glue catalog in a different AWS account.
+The Glue catalog ID is your numeric AWS account ID.
+If the Glue catalog is in a different region, you should configure your AWS client to point to the correct region;
+see more details in [AWS client customization](#aws-client-customization).
+
+### Skip Archive
+
+By default, Glue stores all table versions created, and users can roll a table back to any historical version if needed.
+However, if you are streaming data to Iceberg, this can quickly create a large number of Glue table versions.
+Therefore, it is recommended to turn off the archive feature in Glue by setting `glue.skip-archive` to `true`.
+For more details, please read [Glue Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the [UpdateTable API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
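+
+As a sketch, this setting could be added to the Spark SQL shell command shown earlier with one extra flag (reusing the hypothetical `my_catalog` name):
+
+```sh
+--conf spark.sql.catalog.my_catalog.glue.skip-archive=true
+```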
+
+### DynamoDB for Commit Locking
+
+Glue does not have a strong guarantee over concurrent updates to a table.
+Although it throws `ConcurrentModificationException` when detecting two processes updating a table at the same time,
+there is no guarantee that one update would not clobber the other update.
+Therefore, [DynamoDB](https://aws.amazon.com/dynamodb) can be used for locking, so that for every commit,
+`GlueCatalog` first obtains a lock using a helper DynamoDB table and then tries to safely modify the Glue table.
+
+This feature requires the following lock related catalog properties:
+
+1. Set `lock-impl` as `org.apache.iceberg.aws.glue.DynamoLockManager`.
+2. Set `lock.table` as the DynamoDB table name you would like to use. If the lock table with the given name does not exist in DynamoDB, a new table is created with billing mode set as [pay-per-request](https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing).
+
+Other lock related catalog properties can also be used to adjust locking behaviors such as heartbeat interval.
+For more details, please refer to [Lock catalog properties](../configuration/#lock-catalog-properties).
+
+### Warehouse Location
+
+Similar to all other catalog implementations, `warehouse` is a required catalog property to determine the root path of the data warehouse in storage.
+By default, Glue only allows a warehouse location in S3 because of the use of `S3FileIO`.
+To store data in a different local or cloud store, the Glue catalog can be switched to use `HadoopFileIO` or any custom FileIO by setting the `io-impl` catalog property.
+Details about this feature can be found in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
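+
+For example, switching the Glue catalog to `HadoopFileIO` in Spark could look like the following sketch (any custom FileIO class can be substituted):
+
+```sh
+--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
+```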
+
+### Table Location
+
+By default, the root location for a table `my_table` of namespace `my_ns` is at `my-warehouse-location/my-ns.db/my-table`.
+This default root location can be changed at both namespace and table level.
+
+To use a different path prefix for all tables under a namespace, use the AWS console or any Glue client SDK to update the `locationUri` attribute of the corresponding Glue database.
+For example, you can update the `locationUri` of `my_ns` to `s3://my-ns-bucket`,
+then any newly created table will have a default root location under the new prefix.
+For instance, a new table `my_table_2` will have its root location at `s3://my-ns-bucket/my_table_2`.
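+
+As a sketch, the database location above could be updated with the AWS CLI (the database name and bucket are placeholders):
+
+```sh
+aws glue update-database --name my_ns \
+  --database-input '{"Name":"my_ns","LocationUri":"s3://my-ns-bucket"}'
+```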
+
+To use a completely different root path for a specific table, set the `location` table property to the desired root path.
+For example, in Spark SQL you can do:
+
+```sql
+CREATE TABLE my_catalog.my_ns.my_table (
+ id bigint,
+ data string,
+ category string)
+USING iceberg
+OPTIONS ('location'='s3://my-special-table-bucket')
+PARTITIONED BY (category);
+```
+
+For engines like Spark that support the `LOCATION` keyword, the above SQL statement is equivalent to:
+
+```sql
+CREATE TABLE my_catalog.my_ns.my_table (
+ id bigint,
+ data string,
+ category string)
+USING iceberg
+LOCATION 's3://my-special-table-bucket'
+PARTITIONED BY (category);
+```
+
+## S3 FileIO
+
+Iceberg allows users to write data to S3 through `S3FileIO`.
+`GlueCatalog` by default uses this `FileIO`, and other catalogs can load this `FileIO` using the `io-impl` catalog property.
+
+### Progressive Multipart Upload
+
+`S3FileIO` implements a customized progressive multipart upload algorithm to upload data.
+Data files are uploaded by parts in parallel as soon as each part is ready,
+and each file part is deleted as soon as its upload process completes.
+This provides maximized upload speed and minimized local disk usage during uploads.
+Here are the configurations that users can tune related to this feature:
+
+| Property | Default | Description |
+| --------------------------------- | -------------------------------------------------- | ------------------------------------------------------ |
+| s3.multipart.num-threads | the available number of processors in the system | number of threads to use for uploading parts to S3 (shared across all output streams) |
+| s3.multipart.part-size-bytes | 32MB | the size of a single part for multipart upload requests |
+| s3.multipart.threshold | 1.5 | the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload |
+| s3.staging-dir | `java.io.tmpdir` property value | the directory to hold temporary files |
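+
+For instance, the part size could be raised to 64MB through the Spark catalog configuration (a sketch reusing the hypothetical `my_catalog` name from earlier examples):
+
+```sh
+--conf spark.sql.catalog.my_catalog.s3.multipart.part-size-bytes=67108864
+```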
+
+### S3 Server Side Encryption
+
+`S3FileIO` supports all 3 S3 server side encryption modes:
+
+* [SSE-S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html): When you use Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3), each object is encrypted with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt your data.
+* [SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html): Server-Side Encryption with Customer Master Keys (CMKs) Stored in AWS Key Management Service (SSE-KMS) is similar to SSE-S3, but with some additional benefits and charges for using this service. There are separate permissions for the use of a CMK that provide added protection against unauthorized access of your objects in Amazon S3. SSE-KMS also provides you with an audit trail that shows when your CMK was used and by whom.
+* [SSE-C](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html): With Server-Side Encryption with Customer-Provided Keys (SSE-C), you manage the encryption keys and Amazon S3 manages the encryption, as it writes to disks, and decryption, when you access your objects.
+
+To enable server side encryption, use the following configuration properties:
+
+| Property | Default | Description |
+| --------------------------------- | ---------------------------------------- | ------------------------------------------------------ |
+| s3.sse.type | `none` | `none`, `s3`, `kms` or `custom` |
+| s3.sse.key | `aws/s3` for `kms` type, null otherwise | A KMS Key ID or ARN for `kms` type, or a custom base-64 AES256 symmetric key for `custom` type. |
+| s3.sse.md5 | null | If SSE type is `custom`, this value must be set as the base-64 MD5 digest of the symmetric key to ensure integrity. |
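+
+As a sketch, SSE-KMS with a specific key could be enabled through catalog properties like the following (the key ARN placeholder must be replaced with a real KMS key):
+
+```sh
+--conf spark.sql.catalog.my_catalog.s3.sse.type=kms
+--conf spark.sql.catalog.my_catalog.s3.sse.key=<kms-key-arn>
+```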
+
+### S3 Access Control List
+
+`S3FileIO` supports S3 access control list (ACL) for detailed access control.
+Users can choose the ACL level by setting the `s3.acl` property.
+For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html).
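+
+For example, granting the bucket owner full control over objects written by `S3FileIO` might use the canned ACL value below (a sketch, not the only valid value):
+
+```sh
+--conf spark.sql.catalog.my_catalog.s3.acl=bucket-owner-full-control
+```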
+
+### Object Store File Layout
+
+S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+This means data stored in a traditional Hive storage layout has poor read and write throughput, since data files of the same partition are placed under the same prefix.
+Iceberg uses the Hive storage layout by default, but it can be switched to use the `ObjectStoreLocationProvider` instead.
+In this mode, a hash string is added to the beginning of each file path, so that files are equally distributed across all prefixes in an S3 bucket.
+This results in minimized throttling and maximized throughput for S3-related IO operations.
+Here is an example Spark SQL command to create a table with this feature enabled:
+
+```sql
+CREATE TABLE my_catalog.my_ns.my_table (
+ id bigint,
+ data string,
+ category string)
+USING iceberg
+OPTIONS (
+ 'write.object-storage.enabled'=true,
+ 'write.object-storage.path'='s3://my-table-data-bucket')
+PARTITIONED BY (category);
+```
+
+For more details, please refer to the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section.
+
+### S3 Strong Consistency
+
+In November 2020, S3 announced [strong consistency](https://aws.amazon.com/s3/consistency/) for all read operations, and Iceberg has been updated to fully leverage this feature.
+There are no redundant consistency waits or checks that might negatively impact performance during IO operations.
+
+### Hadoop S3A FileSystem
+
+Before `S3FileIO` was introduced, many Iceberg users chose to use `HadoopFileIO` to write data to S3 through the [S3A FileSystem](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java).
+As introduced in the previous sections, `S3FileIO` adopts the latest AWS clients and S3 features for optimized security and performance,
+and is thus recommended over the S3A FileSystem for S3 use cases.
+
+`S3FileIO` writes data with the `s3://` URI scheme, but it is also compatible with schemes written by the S3A FileSystem.
+This means that `S3FileIO` can still read table manifests containing `s3a://` or `s3n://` file paths.
+This makes it easy to switch from S3A to `S3FileIO`.
+
+If for any reason you have to use S3A, here are the instructions:
+
+1. To store data using S3A, specify the `warehouse` catalog property to be an S3A path, e.g. `s3a://my-bucket/my-warehouse`
+2. For `HiveCatalog`, to also store metadata using S3A, specify the Hadoop config property `hive.metastore.warehouse.dir` to be an S3A path.
+3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine.
+4. Configure AWS settings based on [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version, S3A configuration varies a lot based on the version you use).
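+
+For example, steps 1 and 3 above might translate to a Spark SQL shell command like the following sketch (the hadoop-aws version is illustrative and must match your Hadoop distribution):
+
+```sh
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.hadoop:hadoop-aws:3.2.0 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.type=hadoop \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3a://my-bucket/my-warehouse
+```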
+
+## AWS Client Customization
+
+Many organizations have customized their way of configuring AWS clients with their own credential provider, access proxy, retry strategy, etc.
+Iceberg allows users to plug in their own implementation of `org.apache.iceberg.aws.AwsClientFactory` by setting the `client.factory` catalog property.
+
+### Cross-Account and Cross-Region Access
+
+It is a common use case for organizations to have a centralized AWS account for Glue metastore and S3 buckets, and use different AWS accounts and regions for different teams to access those resources.
+In this case, a [cross-account IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html) is needed to access those centralized resources.
+Iceberg provides an AWS client factory `AssumeRoleAwsClientFactory` to support this common use case.
+This also serves as an example for users who would like to implement their own AWS client factory.
+
+This client factory has the following configurable catalog properties:
+
+| Property | Default | Description |
+| --------------------------------- | ---------------------------------------- | ------------------------------------------------------ |
+| client.assume-role.arn | null, requires user input | ARN of the role to assume, e.g. arn:aws:iam::123456789:role/myRoleToAssume |
+| client.assume-role.region | null, requires user input | All AWS clients except the STS client will use the given region instead of the default region chain |
+| client.assume-role.external-id | null | An optional [external ID](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user_externalid.html) |
+| client.assume-role.timeout-sec | 1 hour | Timeout of each assume role session. At the end of the timeout, a new set of role session credentials will be fetched through a STS client. |
+
+By using this client factory, an STS client is initialized with the default credentials and region to assume the specified role.
+The Glue, S3 and DynamoDB clients are then initialized with the assume-role credential and region to access resources.
+Here is an example to start Spark shell with this client factory:
+
+```shell
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+ --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+ --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
+ --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+ --conf spark.sql.catalog.my_catalog.client.factory=org.apache.iceberg.aws.AssumeRoleAwsClientFactory \
+ --conf spark.sql.catalog.my_catalog.client.assume-role.arn=arn:aws:iam::123456789:role/myRoleToAssume \
+ --conf spark.sql.catalog.my_catalog.client.assume-role.region=ap-northeast-1
+```
+
+## Run Iceberg on AWS
+
+[Amazon EMR](https://aws.amazon.com/emr/) can provision clusters with [Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html) (EMR 6 for Spark 3, EMR 5 for Spark 2),
+[Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html), [Flink](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html),
+and [Trino](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html), all of which can run Iceberg.
+
+[Amazon Kinesis Data Analytics](https://aws.amazon.com/about-aws/whats-new/2019/11/you-can-now-run-fully-managed-apache-flink-applications-with-apache-kafka/) provides a platform
+to run fully managed Apache Flink applications. You can include Iceberg in your application Jar and run it in the platform.
diff --git a/site/docs/configuration.md b/site/docs/configuration.md
index 8c75f04..db05529 100644
--- a/site/docs/configuration.md
+++ b/site/docs/configuration.md
@@ -35,6 +35,20 @@ The properties can be manually constructed or passed in from a compute engine li
Spark uses its session properties as catalog properties, see more details in the [Spark configuration](#spark-configuration) section.
Flink passes in catalog properties through `CREATE CATALOG` statement, see more details in the [Flink](../flink/#creating-catalogs-and-using-catalogs) section.
+### Lock catalog properties
+
+Here are the catalog properties related to locking. They are used by some catalog implementations to control the locking behavior during commits.
+
+| Property | Default | Description |
+| --------------------------------- | ------------------ | ------------------------------------------------------ |
+| lock-impl | null | a custom implementation of the lock manager; the actual interface depends on the catalog used |
+| lock.table | null | an auxiliary table for locking, such as in [AWS DynamoDB lock manager](../aws/#dynamodb-for-commit-locking) |
+| lock.acquire-interval-ms | 5 seconds | the interval to wait between each attempt to acquire a lock |
+| lock.acquire-timeout-ms | 3 minutes | the maximum time to try acquiring a lock |
+| lock.heartbeat-interval-ms | 3 seconds | the interval to wait between each heartbeat after acquiring a lock |
+| lock.heartbeat-timeout-ms | 15 seconds | the maximum time without a heartbeat to consider a lock expired |
+
+
## Table properties
Iceberg tables support table properties to configure table behavior, like the default split size for readers.
diff --git a/site/docs/css/extra.css b/site/docs/css/extra.css
index 4fa7c80..3d79de0 100644
--- a/site/docs/css/extra.css
+++ b/site/docs/css/extra.css
@@ -28,6 +28,10 @@
float: left;
}
+.navbar-right {
+ display: none;
+}
+
.navbar-brand {
margin-right: 1em;
}
diff --git a/site/mkdocs.yml b/site/mkdocs.yml
index 1ce0513..93b5e2e 100644
--- a/site/mkdocs.yml
+++ b/site/mkdocs.yml
@@ -40,8 +40,8 @@ markdown_extensions:
- admonition
- pymdownx.tilde
nav:
- - About: index.md
- Project:
+ - About: index.md
- Community: community.md
- Releases: releases.md
- Trademarks: trademarks.md
@@ -62,6 +62,8 @@ nav:
- Trino: https://trino.io/docs/current/connector/iceberg.html
- Flink: flink.md
- Hive: hive.md
+ - Integrations:
+ - AWS: aws.md
- API:
- Javadoc: /javadoc/
- Java API intro: api.md