You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "ChristinaTech (via GitHub)" <gi...@apache.org> on 2023/05/07 03:27:10 UTC

[GitHub] [iceberg] ChristinaTech commented on a diff in pull request #6742: AWS: support force register table in GlueCatalog

ChristinaTech commented on code in PR #6742:
URL: https://github.com/apache/iceberg/pull/6742#discussion_r1186773219


##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -111,7 +117,8 @@ public GlueCatalog() {}
 
   @Override
   public void initialize(String name, Map<String, String> properties) {
-    this.catalogProperties = ImmutableMap.copyOf(properties);
+    this.catalogProperties = new HashMap<>();
+    catalogProperties.putAll(properties);

Review Comment:
   Making CatalogProperties non-immutable in order to effectively side-load information into the AWS client factory post-initialization is a dangerous precedent to set. In addition, it doesn't seem to actually accomplish anything besides allowing the GlueCatalog to suddenly switch regions post-initialization, which is likely to introduce some dangerous side effects.



##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -437,6 +444,81 @@ public void renameTable(TableIdentifier from, TableIdentifier to) {
     LOG.info("Successfully renamed table from {} to {}", from, to);
   }
 
+  @Override
+  public org.apache.iceberg.Table registerTable(
+      TableIdentifier identifier, String metadataFileLocation) {
+    Preconditions.checkArgument(
+        isValidIdentifier(identifier), "Table identifier to register is invalid: " + identifier);
+    Preconditions.checkArgument(
+        metadataFileLocation != null && !metadataFileLocation.isEmpty(),
+        "Cannot register an empty metadata file location as a table");
+
+    // keep the original behavior when force-register-table flag is off
+    if (!awsProperties.glueCatalogForceRegisterTable()) {
+      return super.registerTable(identifier, metadataFileLocation);
+    }
+
+    String factoryImpl =
+        PropertyUtil.propertyAsString(catalogProperties, AwsProperties.CLIENT_FACTORY, null);
+    if (factoryImpl != null && factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {

Review Comment:
   It hurts extensibility to make logic specific to a particular implementation class. If, for example, a customer needs to extend `AssumeRoleAwsClientFactory` to add some functionality for their own use, this logic will break as the class name will no longer match.



##########
aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java:
##########
@@ -187,6 +187,18 @@ public class AwsProperties implements Serializable {
    */
   public static final String GLUE_CATALOG_ENDPOINT = "glue.endpoint";
 
+  /**
+   * If set, Glue will always update the catalog table if the table already exists in glue catalog.
+   * By default, Glue catalog will only be able to create new table and will throw
+   * AlreadyExistsException when register an existing table name.
+   */
+  public static final String GLUE_CATALOG_FORCE_REGISTER_TABLE = "glue.force-register-table";
+
+  public static final boolean GLUE_CATALOG_FORCE_REGISTER_TABLE_DEFAULT = false;
+
+  /** Configure the Glue Catalog S3 FileIO Region to allow cross region s3 access */
+  public static final String GLUE_CATALOG_FILE_IO_REGION = "glue.catalog-file-io-region";

Review Comment:
   So we already have `client.region` which sets the region for all services. I am going to be introducing a change in the near future that will allow setting per-region for the default AWS Client Factory (and we can extend it to the Assume Role client factory as well), so we probably want a more general per-service naming scheme. Based on how existing parameters are formatted where `glue.` and `s3.` are already established prefixes, most likely something like: `glue.region`, `s3.region`, `kms.region`, etc...



##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -437,6 +444,81 @@ public void renameTable(TableIdentifier from, TableIdentifier to) {
     LOG.info("Successfully renamed table from {} to {}", from, to);
   }
 
+  @Override
+  public org.apache.iceberg.Table registerTable(
+      TableIdentifier identifier, String metadataFileLocation) {
+    Preconditions.checkArgument(
+        isValidIdentifier(identifier), "Table identifier to register is invalid: " + identifier);
+    Preconditions.checkArgument(
+        metadataFileLocation != null && !metadataFileLocation.isEmpty(),
+        "Cannot register an empty metadata file location as a table");
+
+    // keep the original behavior when force-register-table flag is off
+    if (!awsProperties.glueCatalogForceRegisterTable()) {
+      return super.registerTable(identifier, metadataFileLocation);
+    }
+
+    String factoryImpl =
+        PropertyUtil.propertyAsString(catalogProperties, AwsProperties.CLIENT_FACTORY, null);
+    if (factoryImpl != null && factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {
+      // overwrite client assume_role_region for file IO to make cross region call
+      String catalogFileIORegion = awsProperties.getGlueCatalogFileIORegion();
+      if (catalogFileIORegion != null) {
+        catalogProperties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, catalogFileIORegion);
+      }
+    }
+
+    TableOperations ops = newTableOps(identifier);
+    InputFile metadataFile = ops.io().newInputFile(metadataFileLocation);
+    TableMetadata metadata = TableMetadataParser.read(ops.io(), metadataFile);
+
+    Map<String, String> tableParameters =
+        ImmutableMap.of(
+            BaseMetastoreTableOperations.TABLE_TYPE_PROP,
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toLowerCase(Locale.ENGLISH),
+            BaseMetastoreTableOperations.METADATA_LOCATION_PROP,
+            metadataFileLocation);
+
+    String databaseName =
+        IcebergToGlueConverter.getDatabaseName(
+            identifier, awsProperties.glueCatalogSkipNameValidation());
+    String tableName =
+        IcebergToGlueConverter.getTableName(
+            identifier, awsProperties.glueCatalogSkipNameValidation());
+
+    TableInput tableInput =
+        TableInput.builder()
+            .applyMutation(
+                builder ->
+                    IcebergToGlueConverter.setTableInputInformation(
+                        builder, metadata, tableParameters))
+            .name(tableName)
+            .tableType(GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE)
+            .parameters(tableParameters)
+            .build();
+
+    try {
+      glue.createTable(
+          CreateTableRequest.builder().databaseName(databaseName).tableInput(tableInput).build());
+    } catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException e) {
+      GetTableResponse response =
+          glue.getTable(
+              GetTableRequest.builder().databaseName(databaseName).name(tableName).build());
+      String versionId = response.table().versionId();
+      glue.updateTable(
+          UpdateTableRequest.builder()
+              .databaseName(databaseName)
+              .tableInput(tableInput)
+              .versionId(versionId)
+              .build());
+    } catch (EntityNotFoundException e) {

Review Comment:
   So this logic is a combination of a fork of the logic in `BaseMetastoreCatalog.registerTable` and `GlueTableOperations.persistGlueTable`. As `GlueTableOperations` also has access to `AwsProperties`, I would recommend refactoring the logic in GlueTableOperations so that `persistGlueTable` can conditionally fall back to its update mode as that improves code reuse.
   
   If the concern is the creation of an extra metadata file, it looks like `GlueTableOperations` already checks whether it needs to during the `writeNewMetadataIfRequired` function and considering tableMetadata is always set for `registerTable`, it will never choose to write a new metadata file anyways.



##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -437,6 +444,81 @@ public void renameTable(TableIdentifier from, TableIdentifier to) {
     LOG.info("Successfully renamed table from {} to {}", from, to);
   }
 
+  @Override
+  public org.apache.iceberg.Table registerTable(
+      TableIdentifier identifier, String metadataFileLocation) {
+    Preconditions.checkArgument(
+        isValidIdentifier(identifier), "Table identifier to register is invalid: " + identifier);
+    Preconditions.checkArgument(
+        metadataFileLocation != null && !metadataFileLocation.isEmpty(),
+        "Cannot register an empty metadata file location as a table");
+
+    // keep the original behavior when force-register-table flag is off
+    if (!awsProperties.glueCatalogForceRegisterTable()) {
+      return super.registerTable(identifier, metadataFileLocation);
+    }
+
+    String factoryImpl =
+        PropertyUtil.propertyAsString(catalogProperties, AwsProperties.CLIENT_FACTORY, null);
+    if (factoryImpl != null && factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {
+      // overwrite client assume_role_region for file IO to make cross region call
+      String catalogFileIORegion = awsProperties.getGlueCatalogFileIORegion();
+      if (catalogFileIORegion != null) {
+        catalogProperties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, catalogFileIORegion);
+      }

Review Comment:
   Can't this logic and the associated parameter be removed and replaced with just setting `AwsProperties.CLIENT_ASSUME_ROLE_REGION` directly? This change is not in any way scoped to this call or AWS service, it effectively creates a case where what region the GlueCatalog is calling can suddenly change post-initialization after the first time `registerTable` is called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org