You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/13 19:48:44 UTC

[GitHub] [iceberg] JonasJ-ap opened a new pull request, #6179: AWS: Re-tag files when renaming tables in GlueCatalog

JonasJ-ap opened a new pull request, #6179:
URL: https://github.com/apache/iceberg/pull/6179

   Follows PR #4402 . 
   As mentioned in https://github.com/apache/iceberg/pull/4402#issuecomment-1261096282:
   In `GlueCatalog`, if `s3.write.table-name-tag-enabled` and `s3.write.namespace-name-tag-enabled` are enabled, all the files related to the table will be tagged with the table name and the namespace name. We need to update these tags when we rename the table.
   
   This PR add S3 tag updates for `renameTable` operation in `GlueCatalog`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on code in PR #6179:
URL: https://github.com/apache/iceberg/pull/6179#discussion_r1021714066


##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -624,4 +642,153 @@ public void setConf(Configuration conf) {
   protected Map<String, String> properties() {
     return catalogProperties == null ? ImmutableMap.of() : catalogProperties;
   }
+
+  private void updateTableTag(TableIdentifier from, TableIdentifier to) {
+    // should update tag when the rename process is successful
+    TableOperations ops = newTableOps(to);
+    TableMetadata lastMetadata = null;
+    try {
+      lastMetadata = ops.current();
+    } catch (NotFoundException e) {
+      LOG.warn(
+          "Failed to load table metadata for table: {}, continuing rename without re-tag", to, e);
+    }
+    Set<Tag> oldTags = Sets.newHashSet();
+    Set<Tag> newTags = Sets.newHashSet();
+    boolean skipNameValidation = awsProperties.glueCatalogSkipNameValidation();
+    if (awsProperties.s3WriteTableTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(to, skipNameValidation))
+              .build());
+    }
+
+    if (awsProperties.s3WriteNamespaceTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(to, skipNameValidation))
+              .build());
+    }
+
+    if (lastMetadata != null && ops.io() instanceof S3FileIO) {
+      updateTableTag((S3FileIO) ops.io(), lastMetadata, oldTags, newTags);
+    }
+  }
+
+  private void updateTableTag(
+      S3FileIO io, TableMetadata metadata, Set<Tag> oldTags, Set<Tag> newTags) {
+    Set<String> manifestListsToUpdate = Sets.newHashSet();
+    Set<ManifestFile> manifestsToUpdate = Sets.newHashSet();
+    for (Snapshot snapshot : metadata.snapshots()) {
+      // add all manifests to the delete set because both data and delete files should be removed
+      Iterables.addAll(manifestsToUpdate, snapshot.allManifests(io));
+      // add the manifest list to the delete set, if present
+      if (snapshot.manifestListLocation() != null) {
+        manifestListsToUpdate.add(snapshot.manifestListLocation());
+      }
+    }
+
+    LOG.info("Manifests to update: {}", Joiner.on(", ").join(manifestsToUpdate));
+
+    boolean gcEnabled =
+        PropertyUtil.propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT);
+
+    if (gcEnabled) {
+      // update data files only if we are sure this won't corrupt other tables
+      updateFilesTag(io, manifestsToUpdate, oldTags, newTags);
+    }
+
+    updateFilesTag(
+        io,
+        Iterables.transform(manifestsToUpdate, ManifestFile::path),
+        "manifest",
+        true,
+        oldTags,
+        newTags);
+    updateFilesTag(io, manifestListsToUpdate, "manifest list", true, oldTags, newTags);
+    updateFilesTag(
+        io,
+        Iterables.transform(metadata.previousFiles(), TableMetadata.MetadataLogEntry::file),
+        "previous metadata",
+        true,
+        oldTags,
+        newTags);
+    updateFileTag(io, metadata.metadataFileLocation(), "metadata", oldTags, newTags);
+  }
+
+  private void updateFilesTag(
+      S3FileIO io, Set<ManifestFile> allManifests, Set<Tag> oldTags, Set<Tag> newTags) {
+    Map<String, Boolean> updatedFiles =
+        new MapMaker().concurrencyLevel(ThreadPools.WORKER_THREAD_POOL_SIZE).weakKeys().makeMap();
+
+    Tasks.foreach(allManifests)
+        .noRetry()
+        .suppressFailureWhenFinished()
+        .executeWith(ThreadPools.getWorkerPool())
+        .onFailure(
+            (item, exc) ->
+                LOG.warn("Failed to get updated files: this may cause orphaned data files", exc))
+        .run(
+            manifest -> {
+              if (manifest.content() == ManifestContent.DATA) {
+                List<String> pathsToUpdated = Lists.newArrayList();
+                CloseableIterable<String> filePaths = ManifestFiles.readPaths(manifest, io);
+                for (String rawFilePath : filePaths) {
+                  // intern the file path because the weak key map uses identity (==) instead of
+                  // equals
+                  String path = rawFilePath.intern();
+                  Boolean alreadyUpdated = updatedFiles.putIfAbsent(path, true);
+                  if (alreadyUpdated == null || !alreadyUpdated) {
+                    pathsToUpdated.add(path);
+                  }
+                }
+                updateFilesTag(io, pathsToUpdated, "data", false, oldTags, newTags);
+              }
+            });
+  }
+
+  private void updateFilesTag(
+      S3FileIO io,
+      Iterable<String> files,
+      String type,
+      boolean concurrent,

Review Comment:
   looks like we always use `true` for `concurrent`, can we just have that code path for now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on code in PR #6179:
URL: https://github.com/apache/iceberg/pull/6179#discussion_r1021704987


##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -624,4 +642,153 @@ public void setConf(Configuration conf) {
   protected Map<String, String> properties() {
     return catalogProperties == null ? ImmutableMap.of() : catalogProperties;
   }
+
+  private void updateTableTag(TableIdentifier from, TableIdentifier to) {
+    // should update tag when the rename process is successful
+    TableOperations ops = newTableOps(to);
+    TableMetadata lastMetadata = null;
+    try {
+      lastMetadata = ops.current();
+    } catch (NotFoundException e) {
+      LOG.warn(
+          "Failed to load table metadata for table: {}, continuing rename without re-tag", to, e);
+    }
+    Set<Tag> oldTags = Sets.newHashSet();
+    Set<Tag> newTags = Sets.newHashSet();
+    boolean skipNameValidation = awsProperties.glueCatalogSkipNameValidation();
+    if (awsProperties.s3WriteTableTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(to, skipNameValidation))
+              .build());
+    }
+
+    if (awsProperties.s3WriteNamespaceTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(to, skipNameValidation))
+              .build());
+    }
+
+    if (lastMetadata != null && ops.io() instanceof S3FileIO) {
+      updateTableTag((S3FileIO) ops.io(), lastMetadata, oldTags, newTags);
+    }
+  }
+
+  private void updateTableTag(
+      S3FileIO io, TableMetadata metadata, Set<Tag> oldTags, Set<Tag> newTags) {
+    Set<String> manifestListsToUpdate = Sets.newHashSet();
+    Set<ManifestFile> manifestsToUpdate = Sets.newHashSet();
+    for (Snapshot snapshot : metadata.snapshots()) {
+      // add all manifests to the delete set because both data and delete files should be removed
+      Iterables.addAll(manifestsToUpdate, snapshot.allManifests(io));
+      // add the manifest list to the delete set, if present
+      if (snapshot.manifestListLocation() != null) {
+        manifestListsToUpdate.add(snapshot.manifestListLocation());
+      }
+    }
+
+    LOG.info("Manifests to update: {}", Joiner.on(", ").join(manifestsToUpdate));
+
+    boolean gcEnabled =
+        PropertyUtil.propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT);
+
+    if (gcEnabled) {
+      // update data files only if we are sure this won't corrupt other tables
+      updateFilesTag(io, manifestsToUpdate, oldTags, newTags);
+    }
+
+    updateFilesTag(
+        io,
+        Iterables.transform(manifestsToUpdate, ManifestFile::path),
+        "manifest",
+        true,
+        oldTags,
+        newTags);
+    updateFilesTag(io, manifestListsToUpdate, "manifest list", true, oldTags, newTags);
+    updateFilesTag(
+        io,
+        Iterables.transform(metadata.previousFiles(), TableMetadata.MetadataLogEntry::file),
+        "previous metadata",
+        true,
+        oldTags,
+        newTags);
+    updateFileTag(io, metadata.metadataFileLocation(), "metadata", oldTags, newTags);

Review Comment:
   there is a list of metadata file location we need to update based on history log



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on code in PR #6179:
URL: https://github.com/apache/iceberg/pull/6179#discussion_r1021715987


##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -624,4 +642,153 @@ public void setConf(Configuration conf) {
   protected Map<String, String> properties() {
     return catalogProperties == null ? ImmutableMap.of() : catalogProperties;
   }
+
+  private void updateTableTag(TableIdentifier from, TableIdentifier to) {
+    // should update tag when the rename process is successful
+    TableOperations ops = newTableOps(to);
+    TableMetadata lastMetadata = null;
+    try {
+      lastMetadata = ops.current();
+    } catch (NotFoundException e) {
+      LOG.warn(
+          "Failed to load table metadata for table: {}, continuing rename without re-tag", to, e);
+    }
+    Set<Tag> oldTags = Sets.newHashSet();
+    Set<Tag> newTags = Sets.newHashSet();
+    boolean skipNameValidation = awsProperties.glueCatalogSkipNameValidation();
+    if (awsProperties.s3WriteTableTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(to, skipNameValidation))
+              .build());
+    }
+
+    if (awsProperties.s3WriteNamespaceTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(to, skipNameValidation))
+              .build());
+    }
+
+    if (lastMetadata != null && ops.io() instanceof S3FileIO) {

Review Comment:
   we should probably determine tag update upfront, based on FileIO used as well as the flags in https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java#L511-L526



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on PR #6179:
URL: https://github.com/apache/iceberg/pull/6179#issuecomment-1382318184

   I'd like to discuss this a bit more, since we do have some actual customer use cases for this, because overall the S3 tagging related features in Iceberg integrate very well with S3 lifecycle policy and bucket permission, RENAME seems to be the only piece that is missing for the whole end to end flow to work.
   
   > I think that if a rename requires anything but a metadata operation, then the Iceberg catalog should not allow it and should throw an exception that rename is not supported.
   
   Totally agree. The reason I was relatively not against this approach was that RENAME is not a frequent operation for most users, and they are willing to wait long to complete the full rename if it can update related tags or other metadata information in storage.
   
   > Users aren't going to know that this is needs to perform a potentially huge number of S3 operations. 
   
   Correct, maybe one way to resolve this is to throw exception by default and user has to turn on the feature to be aware of the implications.
   
   This logic will probably exist outside Iceberg repository if it is not contributed in. @rdblue please let me know if there is any intermediate grounds that could be taken for this use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on code in PR #6179:
URL: https://github.com/apache/iceberg/pull/6179#discussion_r1021702623


##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -624,4 +642,153 @@ public void setConf(Configuration conf) {
   protected Map<String, String> properties() {
     return catalogProperties == null ? ImmutableMap.of() : catalogProperties;
   }
+
+  private void updateTableTag(TableIdentifier from, TableIdentifier to) {
+    // should update tag when the rename process is successful
+    TableOperations ops = newTableOps(to);
+    TableMetadata lastMetadata = null;
+    try {
+      lastMetadata = ops.current();
+    } catch (NotFoundException e) {
+      LOG.warn(
+          "Failed to load table metadata for table: {}, continuing rename without re-tag", to, e);
+    }
+    Set<Tag> oldTags = Sets.newHashSet();
+    Set<Tag> newTags = Sets.newHashSet();
+    boolean skipNameValidation = awsProperties.glueCatalogSkipNameValidation();
+    if (awsProperties.s3WriteTableTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_TABLE)
+              .value(IcebergToGlueConverter.getTableName(to, skipNameValidation))
+              .build());
+    }
+
+    if (awsProperties.s3WriteNamespaceTagEnabled()) {
+      oldTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(from, skipNameValidation))
+              .build());
+      newTags.add(
+          Tag.builder()
+              .key(AwsProperties.S3_TAG_ICEBERG_NAMESPACE)
+              .value(IcebergToGlueConverter.getDatabaseName(to, skipNameValidation))
+              .build());
+    }
+
+    if (lastMetadata != null && ops.io() instanceof S3FileIO) {
+      updateTableTag((S3FileIO) ops.io(), lastMetadata, oldTags, newTags);
+    }
+  }
+
+  private void updateTableTag(
+      S3FileIO io, TableMetadata metadata, Set<Tag> oldTags, Set<Tag> newTags) {
+    Set<String> manifestListsToUpdate = Sets.newHashSet();
+    Set<ManifestFile> manifestsToUpdate = Sets.newHashSet();
+    for (Snapshot snapshot : metadata.snapshots()) {
+      // add all manifests to the delete set because both data and delete files should be removed
+      Iterables.addAll(manifestsToUpdate, snapshot.allManifests(io));
+      // add the manifest list to the delete set, if present
+      if (snapshot.manifestListLocation() != null) {
+        manifestListsToUpdate.add(snapshot.manifestListLocation());
+      }
+    }
+
+    LOG.info("Manifests to update: {}", Joiner.on(", ").join(manifestsToUpdate));
+
+    boolean gcEnabled =

Review Comment:
   I guess you got the logic from table file deletion, but it does not apply here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap closed pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog

Posted by GitBox <gi...@apache.org>.
JonasJ-ap closed pull request #6179: AWS: Re-tag files when renaming tables in GlueCatalog
URL: https://github.com/apache/iceberg/pull/6179


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org