Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/17 02:45:37 UTC

[GitHub] [iceberg] HeartSaVioR commented on a change in pull request #1891: AWS: documentation page for AWS module

HeartSaVioR commented on a change in pull request #1891:
URL: https://github.com/apache/iceberg/pull/1891#discussion_r544770511



##########
File path: site/docs/aws.md
##########
@@ -0,0 +1,212 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+ 
+# Iceberg AWS Integrations
+
+Iceberg provides integration with different AWS services through the `iceberg-aws` module. 
+This section describes how to use Iceberg with AWS.
+
+## Runtime Packages
+
+The first thing to note is that the `iceberg-aws` module is not bundled with any engine runtime.
+To use the features described in later sections, you need to include the following packages yourself:
+
+* the [iceberg AWS package](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-aws)
+* the [AWS SDK bundle](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle), or individual AWS client packages if you would like a minimal dependency footprint. Note that we use the newer AWS SDK v2, not v1.
+
+For example, in Spark 3, you can start the SQL shell with:
+
+```sh
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable
+```
+
+## Glue Catalog
+
+Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
+When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html), 
+an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
+and an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion). 
+You can start using the Glue catalog by setting `catalog-impl` to `org.apache.iceberg.aws.glue.GlueCatalog`. 
+More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
+
+### Glue Catalog ID
+
+It is very common for an organization to store all the tables in a single Glue catalog in a single AWS account and run data computation in many different accounts. 
+In this case, you need to specify a Glue catalog ID when initializing `GlueCatalog`.
+The Glue catalog ID you should use is the AWS account ID.
+This is because in each AWS account, there is a single Glue catalog in each AWS region,
+but the region is pre-determined by the Glue web client that is making the call.
+If you would like to access a Glue catalog in a different region, you should configure your AWS client; see [AWS client configuration](#aws-client-configurations) for details.
+It is also common to [assume a role](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) for cross-account access. See [AssumeRoleConfigurer](#assumeroleconfigurer) for how to set up assume-role credentials in Iceberg.
+
+### Skip Archive
+
+By default, Glue stores every table version that is created, and users can roll back a table to any historical version if needed.
+However, if you are streaming data to Iceberg, this will easily create a lot of Glue table versions.
+Therefore, it is recommended to turn off the archive feature in Glue by setting `` to false.
+For more details, please read [Glue Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the [UpdateTable API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
+
+
+### DynamoDB for locking Glue tables
+
+Glue does not have a strong guarantee over concurrent updates to a table. 
+Although it throws `ConcurrentModificationException` when detecting two processes updating a table at the same time,
+there is no guarantee that one update would not clobber the other update.
+Therefore, DynamoDB locking is enabled by default for Glue, so that for every commit, 
+`GlueCatalog` first obtains a lock using a helper DynamoDB table and then tries to safely modify the Glue table.
+Users must specify the helper DynamoDB lock table through the catalog property `gluecatalog.lock.table`.
+It is recommended to use the same DynamoDB table for operations in the same Glue catalog,
+and use a different table for a different Glue catalog in another account or region.
+If the lock table with the given name does not exist in DynamoDB, a new table is created with billing mode set as [Pay-per-Request](https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing).
+The lock has the following additional properties:
+
+* `gluecatalog.lock.wait-ms`: maximum time to wait for lock acquisition, defaults to 3 minutes
+* `gluecatalog.lock.expire-ms`: maximum time a table can be locked by a process, defaults to 20 minutes
+
+If your use case consists only of single-process, low-frequency (e.g. hourly, daily) updates to a table,
+you can also turn off this locking feature by setting `gluecatalog.lock.enabled` to `false`.
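+
+As a rough sketch of how the lock properties above might be combined, the following extends the earlier Spark 3 example.
+It assumes the additional lock properties are passed as catalog properties under the same `spark.sql.catalog.my_catalog.` prefix as `gluecatalog.lock.table`; the millisecond values are illustrative choices (1 minute and 10 minutes), not recommendations:
+
+```sh
+# Illustrative values only: wait up to 60000 ms (1 minute) to acquire the lock,
+# and let a held lock expire after 600000 ms (10 minutes).
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.wait-ms=60000 \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.expire-ms=600000
+```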
+
+### Warehouse Location
+
+By default, `GlueCatalog` uses `S3FileIO` and only allows a warehouse location in S3. 
+To store data in a different local or cloud store, the Glue catalog can switch to `HadoopFileIO` 
+or any custom FileIO using the mechanism described in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
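+
+A minimal sketch of such a switch in Spark might look like the following. It assumes that `GlueCatalog` honors the generic `io-impl` catalog property from the custom FileIO mechanism; the HDFS warehouse path is a hypothetical placeholder:
+
+```sh
+# Assumption: io-impl selects the FileIO implementation for this catalog.
+# The hdfs:// warehouse location below is a made-up example path.
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=hdfs://my-namenode:8020/my/warehouse \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable
+```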
+
+## S3 FileIO

Review comment:
       I was about to ask about the rationale for S3 FileIO compared to the Hadoop filesystem API with S3 support in #1945, but this section covers it. Thanks!
   
   It is probably also worth mentioning whether the Hadoop FS API with S3 is sufficient, or whether S3 FileIO is required to avoid consistency glitches. That would help end users determine whether including the aws module is a requirement for working with S3.



