You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "amogh-jahagirdar (via GitHub)" <gi...@apache.org> on 2023/02/24 15:18:37 UTC

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6880: Delta Migration: Handle delta lake's automatic log clean and VACUUM

amogh-jahagirdar commented on code in PR #6880:
URL: https://github.com/apache/iceberg/pull/6880#discussion_r1117139031


##########
delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeTableAction.java:
##########
@@ -213,6 +218,52 @@ private PartitionSpec getPartitionSpecFromDeltaSnapshot(Schema schema) {
     return builder.build();
   }
 
+  /**
+   * Commit the initial delta snapshot to iceberg transaction. It tries the snapshot starting from
+   * {@code deltaStartVersion} to {@code latestVersion} and commit the first constructable one.
+   *
+   * <p>There are two cases that the delta snapshot is not constructable:
+   *
+   * <ul>
+   *   <li>the version is earlier than the earliest checkpoint
+   *   <li>the corresponding data files are deleted by {@code VACUUM}
+   * </ul>
+   *
+   * <p>For more information, please refer to delta lake's <a
+   * href="https://docs.delta.io/latest/delta-batch.html#-data-retention">Data Retention</a>
+   *
+   * @param latestVersion the latest version of the delta lake table
+   * @param transaction the iceberg transaction
+   * @return the initial version of the delta lake table that is successfully committed to iceberg
+   */
+  private long commitInitialDeltaSnapshotToIcebergTransaction(
+      long latestVersion, Transaction transaction) {
+    long constructableStartVersion = deltaStartVersion;
+    while (constructableStartVersion <= latestVersion) {
+      try {
+        List<AddFile> initDataFiles =
+            deltaLog.getSnapshotForVersionAsOf(constructableStartVersion).getAllFiles();
+        List<DataFile> filesToAdd = Lists.newArrayList();
+        for (AddFile addFile : initDataFiles) {
+          DataFile dataFile = buildDataFileFromAction(addFile, transaction.table());
+          filesToAdd.add(dataFile);
+        }
+
+        // AppendFiles case
+        AppendFiles appendFiles = transaction.newAppend();
+        filesToAdd.forEach(appendFiles::appendFile);
+        appendFiles.commit();
+
+        return constructableStartVersion;
+      } catch (NotFoundException | IllegalArgumentException | DeltaStandaloneException e) {

Review Comment:
   Just want to confirm we don't need to explicitly handle `IOException` here for when the file isn't found, we'll get a `NotFoundException`?



##########
delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeTableAction.java:
##########
@@ -213,6 +218,52 @@ private PartitionSpec getPartitionSpecFromDeltaSnapshot(Schema schema) {
     return builder.build();
   }
 
+  /**
+   * Commit the initial delta snapshot to iceberg transaction. It tries the snapshot starting from
+   * {@code deltaStartVersion} to {@code latestVersion} and commit the first constructable one.
+   *
+   * <p>There are two cases that the delta snapshot is not constructable:
+   *
+   * <ul>
+   *   <li>the version is earlier than the earliest checkpoint
+   *   <li>the corresponding data files are deleted by {@code VACUUM}
+   * </ul>
+   *
+   * <p>For more information, please refer to delta lake's <a
+   * href="https://docs.delta.io/latest/delta-batch.html#-data-retention">Data Retention</a>
+   *
+   * @param latestVersion the latest version of the delta lake table
+   * @param transaction the iceberg transaction
+   * @return the initial version of the delta lake table that is successfully committed to iceberg
+   */
+  private long commitInitialDeltaSnapshotToIcebergTransaction(
+      long latestVersion, Transaction transaction) {
+    long constructableStartVersion = deltaStartVersion;
+    while (constructableStartVersion <= latestVersion) {
+      try {
+        List<AddFile> initDataFiles =
+            deltaLog.getSnapshotForVersionAsOf(constructableStartVersion).getAllFiles();
+        List<DataFile> filesToAdd = Lists.newArrayList();

Review Comment:
   Should `filesToAdd` be a set instead of a list, since the files could be re-used across snapshots?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org