You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "amogh-jahagirdar (via GitHub)" <gi...@apache.org> on 2023/01/20 20:23:42 UTC

[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

amogh-jahagirdar opened a new pull request, #6634:
URL: https://github.com/apache/iceberg/pull/6634

   Fix for https://github.com/apache/iceberg/issues/6632
   
   Moving to draft as I read more of the code to make sure this handles different failure cases properly and will add tests if determined this is the right way to fix this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1083324041


##########
core/src/test/java/org/apache/iceberg/TestTransaction.java:
##########
@@ -771,4 +771,14 @@ public void testSimpleTransactionNotDeletingMetadataOnUnknownSate() throws IOExc
     Assert.assertTrue("Manifest file should exist", new File(manifests.get(0).path()).exists());
     Assert.assertEquals("Should have 2 files in metadata", 2, countAllMetadataFiles(tableDir));
   }
+
+  @Test
+  public void testAppendTransactionMultipleBranches(){
+    Table table = load();
+    Transaction txn = table.newTransaction();
+    txn.newAppend().toBranch("b1").appendFile(FILE_A).toBranch("b1").commit();
+    txn.newAppend().toBranch("b2").appendFile(FILE_B).toBranch("b2").commit();
+    txn.newAppend().toBranch("b2").appendFile(FILE_C).toBranch("b2").commit();
+    txn.commitTransaction();
+  }

Review Comment:
   Ignore this, was just playing around here. The bulk of this change will be in the tests. I think for transactions we need to parameterize to test transactions which commit to a single branch AND we need to have tests for committing operations on different branches in a single transaction (which is where things become more interesting). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1083329928


##########
core/src/test/java/org/apache/iceberg/TestTransaction.java:
##########
@@ -771,4 +771,14 @@ public void testSimpleTransactionNotDeletingMetadataOnUnknownSate() throws IOExc
     Assert.assertTrue("Manifest file should exist", new File(manifests.get(0).path()).exists());
     Assert.assertEquals("Should have 2 files in metadata", 2, countAllMetadataFiles(tableDir));
   }
+
+  @Test
+  public void testAppendTransactionMultipleBranches(){
+    Table table = load();
+    Transaction txn = table.newTransaction();
+    txn.newAppend().toBranch("b1").appendFile(FILE_A).toBranch("b1").commit();
+    txn.newAppend().toBranch("b2").appendFile(FILE_B).toBranch("b2").commit();
+    txn.newAppend().toBranch("b2").appendFile(FILE_C).toBranch("b2").commit();
+    txn.commitTransaction();
+  }

Review Comment:
   sounds like we should first merge #5234 before we can have proper tests for this fix?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#issuecomment-1400750805

   @amogh-jahagirdar, this looks great! I think it's about ready to merge but there are a couple of naming nits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1083658719


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -551,10 +555,19 @@ public void commit(TableMetadata underlyingBase, TableMetadata metadata) {
       }
 
       // track the intermediate snapshot ids for rewriting the snapshot log
-      // an id is intermediate if it isn't the base snapshot id and it is replaced by a new current
-      Long oldId = currentId(current);
-      if (oldId != null && !oldId.equals(currentId(metadata)) && !oldId.equals(currentId(base))) {
-        intermediateSnapshotIds.add(oldId);
+      // an id is intermediate if it isn't the head of the branch in base and it is replaced by a new head of the branch in current

Review Comment:
   Thanks @rdblue that makes a lot of sense, my previous logic was a more complex way of just capturing the new snapshots across all the branches but we can quite simply just get that by diffing the set of snapshots that were committed and the pre-transaction snapshots.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1084384263


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -433,17 +425,18 @@ private void commitSimpleTransaction() {
     // the commit succeeded
 
     try {
-      if (currentSnapshotId.get() != -1) {
-        intermediateSnapshotIds.add(currentSnapshotId.get());
+      // clean up the data files that were deleted by each operation. first, get the list of
+      // committed manifests to ensure that no committed manifest is deleted.
+      // A manifest could be deleted in one successful operation commit, but reused in another
+      // successful commit of that operation if the whole transaction is retried.
+      Set<Long> currentSnapshots = Sets.newHashSet();

Review Comment:
   Oh yes `newSnapshots` is much better. This is the delta from before and current, not the actual current snapshots



##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -433,17 +425,18 @@ private void commitSimpleTransaction() {
     // the commit succeeded
 
     try {
-      if (currentSnapshotId.get() != -1) {
-        intermediateSnapshotIds.add(currentSnapshotId.get());
+      // clean up the data files that were deleted by each operation. first, get the list of
+      // committed manifests to ensure that no committed manifest is deleted.
+      // A manifest could be deleted in one successful operation commit, but reused in another
+      // successful commit of that operation if the whole transaction is retried.
+      Set<Long> currentSnapshots = Sets.newHashSet();

Review Comment:
   Oh yes `newSnapshots` is much better. This is the delta between before and current, not the actual current snapshots



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#issuecomment-1400775143

   Thanks @rdblue for the review! since the focus of this PR was properly determining the set of committed files when committing a transaction I think we should be good. Regardless I think we probably want some more coverage on transactions spanning multiple branches, I can raise that in a separate PR? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1084384263


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -433,17 +425,18 @@ private void commitSimpleTransaction() {
     // the commit succeeded
 
     try {
-      if (currentSnapshotId.get() != -1) {
-        intermediateSnapshotIds.add(currentSnapshotId.get());
+      // clean up the data files that were deleted by each operation. first, get the list of
+      // committed manifests to ensure that no committed manifest is deleted.
+      // A manifest could be deleted in one successful operation commit, but reused in another
+      // successful commit of that operation if the whole transaction is retried.
+      Set<Long> currentSnapshots = Sets.newHashSet();

Review Comment:
   Oh yes `newSnapshots` is much better. This is the delta from before and now not the current snapshots



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1083331843


##########
api/src/main/java/org/apache/iceberg/SnapshotUpdate.java:
##########
@@ -71,4 +71,8 @@ default ThisT toBranch(String branch) {
             "Cannot commit to branch %s: %s does not support branch commits",
             branch, this.getClass().getName()));
   }
+
+  default String targetBranch() {
+    return SnapshotRef.MAIN_BRANCH;
+  }

Review Comment:
   I think it's fine to keep it in the same PR if we need to add this API to fix the logic, it makes the purpose more clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1083023727


##########
api/src/main/java/org/apache/iceberg/SnapshotUpdate.java:
##########
@@ -71,4 +71,8 @@ default ThisT toBranch(String branch) {
             "Cannot commit to branch %s: %s does not support branch commits",
             branch, this.getClass().getName()));
   }
+
+  default String targetBranch() {
+    return SnapshotRef.MAIN_BRANCH;
+  }

Review Comment:
   We should probably separate this API PR if we determine this is the right way to solve this issue



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue merged PR #6634:
URL: https://github.com/apache/iceberg/pull/6634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1084384263


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -433,17 +425,18 @@ private void commitSimpleTransaction() {
     // the commit succeeded
 
     try {
-      if (currentSnapshotId.get() != -1) {
-        intermediateSnapshotIds.add(currentSnapshotId.get());
+      // clean up the data files that were deleted by each operation. first, get the list of
+      // committed manifests to ensure that no committed manifest is deleted.
+      // A manifest could be deleted in one successful operation commit, but reused in another
+      // successful commit of that operation if the whole transaction is retried.
+      Set<Long> currentSnapshots = Sets.newHashSet();

Review Comment:
   Oh yes `newSnapshots` is much better. This is the delta from before and now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#issuecomment-1400861489

   Thanks, @amogh-jahagirdar!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1084371360


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -433,17 +425,18 @@ private void commitSimpleTransaction() {
     // the commit succeeded
 
     try {
-      if (currentSnapshotId.get() != -1) {
-        intermediateSnapshotIds.add(currentSnapshotId.get());
+      // clean up the data files that were deleted by each operation. first, get the list of
+      // committed manifests to ensure that no committed manifest is deleted.
+      // A manifest could be deleted in one successful operation commit, but reused in another
+      // successful commit of that operation if the whole transaction is retried.
+      Set<Long> currentSnapshots = Sets.newHashSet();

Review Comment:
   This isn't really "current". I think it's the snapshots that were added in the transaction. Minor, but I'd probably call it `newSnapshots`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6634: Core: Fix for deleting files when commiting transactions with multiple branches

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1084372467


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -395,9 +393,8 @@ private void commitSimpleTransaction() {
       return;
     }
 
-    // this is always set to the latest commit attempt's snapshot id.
-    AtomicLong currentSnapshotId = new AtomicLong(-1L);
-
+    Set<Long> preTxnSnapshots =

Review Comment:
   We avoid using short names that are hard to type, like `txn`. It doesn't save that much space and it makes the code harder to work with in general. I'd recommend choosing a better word instead, like `existingSnapshots` or `startingSnapshots`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6634: Core, API: Fix for tracking intermediate snapshots when a transaction spans multiple branches

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6634:
URL: https://github.com/apache/iceberg/pull/6634#discussion_r1083582534


##########
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##########
@@ -551,10 +555,19 @@ public void commit(TableMetadata underlyingBase, TableMetadata metadata) {
       }
 
       // track the intermediate snapshot ids for rewriting the snapshot log
-      // an id is intermediate if it isn't the base snapshot id and it is replaced by a new current
-      Long oldId = currentId(current);
-      if (oldId != null && !oldId.equals(currentId(metadata)) && !oldId.equals(currentId(base))) {
-        intermediateSnapshotIds.add(oldId);
+      // an id is intermediate if it isn't the head of the branch in base and it is replaced by a new head of the branch in current

Review Comment:
   I don't think that we need to keep intermediate snapshot IDs like this anymore.
   
   I took a look at this and the intermediate IDs are currently used to ensure that we don't delete files from any committed snapshot. Before we added metadata change tracking to TableMetadata, this list was also used to rewrite the history / snapshot log. But that's handled in `TableMetadata.Builder` now, so this is only used for fixing up the deletes.
   
   Fixing up deletes is a much simpler problem. We don't need to know whether a reference was "intermediate" anymore -- that was for the history fixes. So what we need is a list of the new snapshots committed by the transaction to feed into `committedFiles`. We can do that more easily by taking the current set of snapshots and removing the old set of snapshots.
   
   @amogh-jahagirdar, does that make sense? We can do this all in `commitSimpleTransaction` and remove `intermediateSnapshotIds`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org