You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "namrathamyske (via GitHub)" <gi...@apache.org> on 2023/02/01 06:30:22 UTC

[GitHub] [iceberg] namrathamyske opened a new pull request, #6717: spark 3.3 read by snapshot ref bug fix for schema change

namrathamyske opened a new pull request, #6717:
URL: https://github.com/apache/iceberg/pull/6717

   In PR https://github.com/apache/iceberg/pull/5150 we introduced read from branch or tag. There is a bug where **we cannot read the snapshot specific schema changes** when doing 
   
   `spark.read().option("branch", <branch_name>)`. The fix was applied similar to https://github.com/apache/iceberg/pull/3722
   
   Still yet to add support for tag. If given a heads up for branch, will go ahead with tag 
   cc  @jackieo168
   @rdblue @jackye1995 @amogh-jahagirdar Let me know what you think 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114655196


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -722,6 +757,16 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       long snapshotIdAsOfTime = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
       return new SparkTable(table, snapshotIdAsOfTime, !cacheEnabled);
 
+    } else if (branch != null) {
+      Snapshot branchSnapshot = table.snapshot(branch);
+      Preconditions.checkArgument(
+          branchSnapshot != null, "Cannot find snapshot associated with branch name: %s", branch);
+      return new SparkTable(table, branchSnapshot.snapshotId(), !cacheEnabled);
+    } else if (tag != null) {

Review Comment:
   This doesn't follow the spacing of the other branches. Can you introduce similar whitespace?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1440446002

   I should be able to review this one today as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103691700


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;

Review Comment:
   nit: can use `Optional<Long>` instead of using -1 to represent no snapshot ID



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1440995783

   Thanks, @namrathamyske and @jackye1995! Thanks for reviewing, @amogh-jahagirdar @rdblue!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103692411


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;
+
     if (snapshotId != null) {
-      return Pair.of(table, snapshotId);
+      snapshotIdFromTimeTravel = snapshotId;
     } else if (asOfTimestamp != null) {
-      return Pair.of(table, SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp));
+      snapshotIdFromTimeTravel = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
+    } else if (branch != null) {
+      Preconditions.checkArgument(
+          table.snapshot(branch) != null, "branch not associated with a snapshot");

Review Comment:
   nit: should capitalize first word, and also provide info in the error message, so something like `Cannot find snapshot associated with branch name %s". Similar comment for the error check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114994029


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,12 +151,25 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));

Review Comment:
   This was the last statement so there was no `continue`. However, it is no longer the last statement after this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1106177816


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java:
##########
@@ -124,11 +128,15 @@ private Spark3Util.CatalogAndIdentifier catalogAndIdentifier(CaseInsensitiveStri
 
     Long snapshotId = propertyAsLong(options, SparkReadOptions.SNAPSHOT_ID);
     Long asOfTimestamp = propertyAsLong(options, SparkReadOptions.AS_OF_TIMESTAMP);
+    String branch = options.get(SparkReadOptions.BRANCH);
+    String tag = options.get(SparkReadOptions.TAG);
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot-id (%s) and as-of-timestamp (%s)",
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), branch (%s) and tag (%s)",

Review Comment:
   I think "Can specify only one of snapshot-id (%s), as-of-timestamp (%s), branch, tag..." is more clear imo



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -654,6 +659,21 @@ private Table load(Identifier ident) {
         return new SparkTable(table, snapshotId, !cacheEnabled);
       }
 
+      Matcher branch = BRANCH.matcher(ident.name());
+      if (branch.matches()) {
+        Snapshot branchSnapshot = table.snapshot(branch.group(1));
+        if (branchSnapshot != null) {
+          return new SparkTable(table, branchSnapshot.snapshotId(), !cacheEnabled);
+        }
+      }
+
+      Matcher tag = TAG.matcher(ident.name());
+      if (tag.matches()) {
+        Snapshot tagSnapshot = table.snapshot(tag.group(1));
+        if (tagSnapshot != null) {
+          return new SparkTable(table, tagSnapshot.snapshotId(), !cacheEnabled);
+        }
+      }

Review Comment:
   Nit: Newline after the if block/before the comment starts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114652628


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,12 +151,24 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = TAG.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);

Review Comment:
   This can be misleading. It always prints branch, but could be triggered by tag.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103693013


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;

Review Comment:
   I see, this is a comment from Amogh
   
   > Nit: could we just have each if block match and resolve the snapshot ID, and then just have one return new SparkTable(table, snapshotId, !cacheEnabled) at the end. That'll reduce duplication considering we need to also handle the tag case also.
   
   I actually think the other way, because if we look at this current change, it introduces more changes and variables without the stated benefit of really removing much duplication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103691492


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -46,7 +48,8 @@ public class SparkCachedTableCatalog implements TableCatalog {
   private static final Splitter COMMA = Splitter.on(",");
   private static final Pattern AT_TIMESTAMP = Pattern.compile("at_timestamp_(\\d+)");
   private static final Pattern SNAPSHOT_ID = Pattern.compile("snapshot_id_(\\d+)");
-

Review Comment:
   nit: should not change newline above `TABLE_CACHE`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103738003


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;
+
     if (snapshotId != null) {
-      return Pair.of(table, snapshotId);
+      snapshotIdFromTimeTravel = snapshotId;
     } else if (asOfTimestamp != null) {
-      return Pair.of(table, SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp));
+      snapshotIdFromTimeTravel = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
+    } else if (branch != null) {
+      Preconditions.checkArgument(
+          table.snapshot(branch) != null, "branch not associated with a snapshot");
+      snapshotIdFromTimeTravel = table.snapshot(branch).snapshotId();
+    } else if (tag != null) {
+      Preconditions.checkArgument(
+          table.snapshot(tag) != null, "tag not associated with a snapshot");

Review Comment:
   Same comment as @jackye1995 let's include the ref name in the error itself so it's obvious. `Cannot find snapshot associated with tag: %s`



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);

Review Comment:
   Should be `TAG` instead of `BRANCH ` here



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;

Review Comment:
   Ah yeah, most of the duplication comes from having the different time travel options but ultimately we end up having to have a branching logic anyways to assign the snapshot id. I'm good to revert back to the existing way, we should generally strive to minimize changes as long as it's still readable which it is in the existing case as well. sorry for the confusion there @namrathamyske ! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1092832125


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -654,6 +658,13 @@ private Table load(Identifier ident) {
         return new SparkTable(table, snapshotId, !cacheEnabled);
       }
 
+      Matcher branch = BRANCH.matcher(ident.name());
+      if (branch.matches()) {
+        Snapshot snapshot = table.snapshot(branch.group(1));
+        if (snapshot != null) {
+          return new SparkTable(table, snapshot.snapshotId(), !cacheEnabled);
+        }
+      }

Review Comment:
   Nit: could we just have each if block match and resolve the snapshot ID, and then just have one `return new SparkTable(table, snapshotId, !cacheEnabled)` at the end. That'll reduce duplication considering we need to also handle the tag case also.



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -722,6 +741,9 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       long snapshotIdAsOfTime = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
       return new SparkTable(table, snapshotIdAsOfTime, !cacheEnabled);
 
+    } else if (branch != null && table.snapshot(branch) != null) {
+      return new SparkTable(table, table.snapshot(branch).snapshotId(), !cacheEnabled);
+
     } else {
       return new SparkTable(table, snapshotId, !cacheEnabled);

Review Comment:
   Can we resolve to a snapshot ID and then return the table? Imo that makes the logic a bit easier to read since it isolates the resolving the snapshot ID from the various options. 
   
   



##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java:
##########
@@ -370,4 +370,42 @@ public void testSnapshotSelectionByTimestampAndBranchOrTagFails() throws IOExcep
         .isInstanceOf(IllegalArgumentException.class)
         .hasMessageStartingWith("Cannot override ref, already set snapshot id=");
   }
+
+  @Test
+  public void testSnapshotSelectionByBranchWithSchemaChange() throws IOException {
+    String tableLocation = temp.newFolder("iceberg-table").toString();
+
+    HadoopTables tables = new HadoopTables(CONF);
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Table table = tables.create(SCHEMA, spec, tableLocation);
+
+    // produce the first snapshot
+    List<SimpleRecord> firstBatchRecords =
+        Lists.newArrayList(
+            new SimpleRecord(1, "a"), new SimpleRecord(2, "b"), new SimpleRecord(3, "c"));
+    Dataset<Row> firstDf = spark.createDataFrame(firstBatchRecords, SimpleRecord.class);
+    firstDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
+
+    table.manageSnapshots().createBranch("branch", table.currentSnapshot().snapshotId()).commit();
+
+    Dataset<Row> currentSnapshotResult =
+        spark.read().format("iceberg").option("branch", "branch").load(tableLocation);
+    List<SimpleRecord> currentSnapshotRecords =
+        currentSnapshotResult.orderBy("id").as(Encoders.bean(SimpleRecord.class)).collectAsList();
+    List<SimpleRecord> expectedRecords = Lists.newArrayList();
+    expectedRecords.addAll(firstBatchRecords);
+    Assert.assertEquals(
+        "Current snapshot rows should match", expectedRecords, currentSnapshotRecords);
+
+    table.updateSchema().deleteColumn("data").commit();
+
+    Dataset<Row> deleteSnapshotResult =
+        spark.read().format("iceberg").option("branch", "branch").load(tableLocation);
+    List<SimpleRecord> deletedSnapshotRecords =
+        deleteSnapshotResult.orderBy("id").as(Encoders.bean(SimpleRecord.class)).collectAsList();
+    List<SimpleRecord> expectedRecordsAfterDeletion = Lists.newArrayList();
+    expectedRecordsAfterDeletion.addAll(firstBatchRecords);
+    Assert.assertEquals(
+        "Current snapshot rows should match", expectedRecords, deletedSnapshotRecords);

Review Comment:
   I think some inline comments here can make it easier to understand what the test is doing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1101943185


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -722,6 +741,9 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       long snapshotIdAsOfTime = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
       return new SparkTable(table, snapshotIdAsOfTime, !cacheEnabled);
 
+    } else if (branch != null && table.snapshot(branch) != null) {
+      return new SparkTable(table, table.snapshot(branch).snapshotId(), !cacheEnabled);

Review Comment:
   If the branch is non-null but the table doesn't have a branch with that name, then I think this must fail rather than returning the `else` branch, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1115002962


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -369,6 +369,8 @@ private static CaseInsensitiveStringMap addSnapshotId(
       scanOptions.put(SparkReadOptions.SNAPSHOT_ID, value);
       scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
 

Review Comment:
   This is grouped incorrectly. We should have an empty line before the return statement and all remove calls should be grouped together.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1440854041

   A few minor nits and should be good to go. Thanks everyone!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103758126


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -161,9 +180,19 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       return Pair.of(table, snapshotId);
     } else if (asOfTimestamp != null) {
       return Pair.of(table, SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp));
-    } else {
-      return Pair.of(table, null);
+    } else if (branch != null) {
+      Snapshot branchSnapshot = table.snapshot(branch);
+      Preconditions.checkArgument(
+          branchSnapshot != null, "Cannot find snapshot associated with branch name: %s", branch);
+      return Pair.of(table, branchSnapshot.snapshotId());
+    } else if (tag != null) {
+      Snapshot tagSnapshot = table.snapshot(tag);
+      Preconditions.checkArgument(
+          tagSnapshot != null, "Cannot find snapshot associated with tag name: %s", tag);
+      return Pair.of(table, tagSnapshot.snapshotId());
     }
+
+    return Pair.of(table, null);

Review Comment:
   nit: I think we can avoid this line change by putting it in the else clause



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -722,9 +757,19 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       long snapshotIdAsOfTime = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
       return new SparkTable(table, snapshotIdAsOfTime, !cacheEnabled);
 
-    } else {
-      return new SparkTable(table, snapshotId, !cacheEnabled);
+    } else if (branch != null) {
+      Snapshot branchSnapshot = table.snapshot(branch);
+      Preconditions.checkArgument(
+          branchSnapshot != null, "Cannot find snapshot associated with branch name: %s", branch);
+      return new SparkTable(table, branchSnapshot.snapshotId(), !cacheEnabled);
+    } else if (tag != null) {
+      Snapshot tagSnapshot = table.snapshot(tag);
+      Preconditions.checkArgument(
+          tagSnapshot != null, "Cannot find snapshot associated with tag name: %s", tag);
+      return new SparkTable(table, tagSnapshot.snapshotId(), !cacheEnabled);
     }
+
+    return new SparkTable(table, snapshotId, !cacheEnabled);

Review Comment:
   nit: I think we can avoid this line change by putting it in the else clause



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103757921


##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java:
##########
@@ -368,6 +370,64 @@ public void testSnapshotSelectionByTimestampAndBranchOrTagFails() throws IOExcep
                     .load(tableLocation)
                     .show())
         .isInstanceOf(IllegalArgumentException.class)
-        .hasMessageStartingWith("Cannot override ref, already set snapshot id=");
+        .hasMessageStartingWith("Can specify at most one of snapshot-id");
+  }
+
+  @Test
+  public void testSnapshotSelectionByBranchOrTagWithSchemaChange() throws IOException {

Review Comment:
   can we have separated tests for branch and tag?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1101943508


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -368,7 +368,7 @@ private static CaseInsensitiveStringMap addSnapshotId(
       scanOptions.putAll(options.asCaseSensitiveMap());
       scanOptions.put(SparkReadOptions.SNAPSHOT_ID, value);
       scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
-

Review Comment:
   Please revert the whitespace change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackieo168 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackieo168 (via GitHub)" <gi...@apache.org>.

jackieo168 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1096420864


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -105,6 +108,7 @@ public class SparkCatalog extends BaseCatalog {
   private static final Splitter COMMA = Splitter.on(",");
   private static final Pattern AT_TIMESTAMP = Pattern.compile("at_timestamp_(\\d+)");
   private static final Pattern SNAPSHOT_ID = Pattern.compile("snapshot_id_(\\d+)");
+  private static final Pattern BRANCH = Pattern.compile("branch_(.*)");

Review Comment:
   Do we also want to make similar changes to `SparkCachedTableCatalog`?



##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java:
##########
@@ -370,4 +370,42 @@ public void testSnapshotSelectionByTimestampAndBranchOrTagFails() throws IOExcep
         .isInstanceOf(IllegalArgumentException.class)
         .hasMessageStartingWith("Cannot override ref, already set snapshot id=");
   }
+
+  @Test
+  public void testSnapshotSelectionByBranchWithSchemaChange() throws IOException {
+    String tableLocation = temp.newFolder("iceberg-table").toString();
+
+    HadoopTables tables = new HadoopTables(CONF);
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Table table = tables.create(SCHEMA, spec, tableLocation);
+
+    // produce the first snapshot
+    List<SimpleRecord> firstBatchRecords =
+        Lists.newArrayList(
+            new SimpleRecord(1, "a"), new SimpleRecord(2, "b"), new SimpleRecord(3, "c"));
+    Dataset<Row> firstDf = spark.createDataFrame(firstBatchRecords, SimpleRecord.class);
+    firstDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
+
+    table.manageSnapshots().createBranch("branch", table.currentSnapshot().snapshotId()).commit();
+
+    Dataset<Row> currentSnapshotResult =
+        spark.read().format("iceberg").option("branch", "branch").load(tableLocation);
+    List<SimpleRecord> currentSnapshotRecords =
+        currentSnapshotResult.orderBy("id").as(Encoders.bean(SimpleRecord.class)).collectAsList();
+    List<SimpleRecord> expectedRecords = Lists.newArrayList();
+    expectedRecords.addAll(firstBatchRecords);
+    Assert.assertEquals(
+        "Current snapshot rows should match", expectedRecords, currentSnapshotRecords);
+
+    table.updateSchema().deleteColumn("data").commit();
+
+    Dataset<Row> deleteSnapshotResult =
+        spark.read().format("iceberg").option("branch", "branch").load(tableLocation);
+    List<SimpleRecord> deletedSnapshotRecords =
+        deleteSnapshotResult.orderBy("id").as(Encoders.bean(SimpleRecord.class)).collectAsList();
+    List<SimpleRecord> expectedRecordsAfterDeletion = Lists.newArrayList();
+    expectedRecordsAfterDeletion.addAll(firstBatchRecords);
+    Assert.assertEquals(
+        "Current snapshot rows should match", expectedRecords, deletedSnapshotRecords);

Review Comment:
   @namrathamyske please see if you can leverage the unit tests added in our fork version.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114999272


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -701,12 +724,25 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       if (id.matches()) {
         snapshotId = Long.parseLong(id.group(1));
       }
+
+      Matcher branchRef = BRANCH.matcher(meta);

Review Comment:
   Same comment about `continue`. The line with `SNAPSHOT_ID` is no longer the last one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] namrathamyske commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "namrathamyske (via GitHub)" <gi...@apache.org>.

namrathamyske commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103322852


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -368,7 +368,7 @@ private static CaseInsensitiveStringMap addSnapshotId(
       scanOptions.putAll(options.asCaseSensitiveMap());
       scanOptions.put(SparkReadOptions.SNAPSHOT_ID, value);
       scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
-

Review Comment:
   i have removed branch property 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103692664


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;

Review Comment:
   Actually, after reading all the if else statements, I am not sure if we really need this, why not just do `return Pair.of(table, table.snapshot(branch).snapshotId())` and `return Pairs.of(table, table.snapshot(tag).snapshotId())` in the branch and tag case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1411576490

   Thanks for the fix @namrathamyske ! left some comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] namrathamyske commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "namrathamyske (via GitHub)" <gi...@apache.org>.

namrathamyske commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1412985751

   Thanks @jackieo168 for coming up with this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103693013


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;

Review Comment:
   I see, this is a comment from Amogh
   
   > Nit: could we just have each if block match and resolve the snapshot ID, and then just have one return new SparkTable(table, snapshotId, !cacheEnabled) at the end. That'll reduce duplication considering we need to also handle the tag case also.
   
   I actually think the other way, because if we look at this current change based on the suggestion, it introduces more changes and variables without the stated benefit of really removing much duplication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114993458


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,12 +151,25 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);

Review Comment:
   Question: I see we had a `continue` statement after the first match to avoid checking other branches. Does it make sense to repeat the same for new branches except the last one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1115015676


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,12 +151,25 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);

Review Comment:
   nice catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114655926


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java:
##########
@@ -124,11 +128,15 @@ private Spark3Util.CatalogAndIdentifier catalogAndIdentifier(CaseInsensitiveStri
 
     Long snapshotId = propertyAsLong(options, SparkReadOptions.SNAPSHOT_ID);
     Long asOfTimestamp = propertyAsLong(options, SparkReadOptions.AS_OF_TIMESTAMP);
+    String branch = options.get(SparkReadOptions.BRANCH);
+    String tag = options.get(SparkReadOptions.TAG);
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot-id (%s) and as-of-timestamp (%s)",
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), branch (%s) and tag (%s)",

Review Comment:
   Also missing the comma after branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114654570


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -701,12 +723,25 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       if (id.matches()) {
         snapshotId = Long.parseLong(id.group(1));
       }
+
+      Matcher branchRef = BRANCH.matcher(meta);
+      if (branchRef.matches()) {
+        branch = branchRef.group(1);
+      }
+
+      Matcher tagRef = TAG.matcher(meta);
+      if (tagRef.matches()) {
+        tag = tagRef.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot-id and as-of-timestamp: %s",
-        ident.location());
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), branch (%s) and tag (%s)",

Review Comment:
   Missing a comma after `branch (%s)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi merged pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi merged PR #6717:
URL: https://github.com/apache/iceberg/pull/6717


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1421091541

   @aokolnychyi you might be also interested in this because #6651 depends on this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1102022039


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -722,6 +741,9 @@ private Table loadFromPathIdentifier(PathIdentifier ident) {
       long snapshotIdAsOfTime = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
       return new SparkTable(table, snapshotIdAsOfTime, !cacheEnabled);
 
+    } else if (branch != null && table.snapshot(branch) != null) {
+      return new SparkTable(table, table.snapshot(branch).snapshotId(), !cacheEnabled);

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103692411


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;
+
     if (snapshotId != null) {
-      return Pair.of(table, snapshotId);
+      snapshotIdFromTimeTravel = snapshotId;
     } else if (asOfTimestamp != null) {
-      return Pair.of(table, SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp));
+      snapshotIdFromTimeTravel = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
+    } else if (branch != null) {
+      Preconditions.checkArgument(
+          table.snapshot(branch) != null, "branch not associated with a snapshot");

Review Comment:
   nit: should capitalize first word, and also provide info in the error message, so something like `Cannot find snapshot associated with branch name: %s`. Similar comment for the error check.



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCachedTableCatalog.java:
##########
@@ -144,23 +149,49 @@ private Pair<Table, Long> load(Identifier ident) throws NoSuchTableException {
       if (snapshotBasedMatcher.matches()) {
         snapshotId = Long.parseLong(snapshotBasedMatcher.group(1));
       }
+
+      Matcher branchBasedMatcher = BRANCH.matcher(meta);
+      if (branchBasedMatcher.matches()) {
+        branch = branchBasedMatcher.group(1);
+      }
+
+      Matcher tagBasedMatcher = BRANCH.matcher(meta);
+      if (tagBasedMatcher.matches()) {
+        tag = tagBasedMatcher.group(1);
+      }
     }
 
     Preconditions.checkArgument(
-        asOfTimestamp == null || snapshotId == null,
-        "Cannot specify both snapshot and timestamp for time travel: %s",
-        ident);
+        Stream.of(snapshotId, asOfTimestamp, branch, tag).filter(Objects::nonNull).count() <= 1,
+        "Can specify at most one of snapshot-id (%s), as-of-timestamp (%s), and snapshot-ref (%s)",
+        snapshotId,
+        asOfTimestamp,
+        branch);
 
     Table table = TABLE_CACHE.get(key);
 
     if (table == null) {
       throw new NoSuchTableException(ident);
     }
 
+    long snapshotIdFromTimeTravel = -1L;
+
     if (snapshotId != null) {
-      return Pair.of(table, snapshotId);
+      snapshotIdFromTimeTravel = snapshotId;
     } else if (asOfTimestamp != null) {
-      return Pair.of(table, SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp));
+      snapshotIdFromTimeTravel = SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
+    } else if (branch != null) {
+      Preconditions.checkArgument(
+          table.snapshot(branch) != null, "branch not associated with a snapshot");

Review Comment:
   nit: should capitalize first word, and also provide info in the error message, so something like `Cannot find snapshot associated with branch name: %s`. Similar comment for the tag error check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] namrathamyske commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "namrathamyske (via GitHub)" <gi...@apache.org>.

namrathamyske commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1103322852


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -368,7 +368,7 @@ private static CaseInsensitiveStringMap addSnapshotId(
       scanOptions.putAll(options.asCaseSensitiveMap());
       scanOptions.put(SparkReadOptions.SNAPSHOT_ID, value);
       scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
-

Review Comment:
   i have removed branch property 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] namrathamyske commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "namrathamyske (via GitHub)" <gi...@apache.org>.

namrathamyske commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1426455414

   @aokolnychyi can you take a look at this ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1114657085


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java:
##########
@@ -140,6 +148,14 @@ private Spark3Util.CatalogAndIdentifier catalogAndIdentifier(CaseInsensitiveStri
       selector = AT_TIMESTAMP + asOfTimestamp;
     }
 
+    if (branch != null) {
+      selector = BRANCH + branch;

Review Comment:
   These names are used in other places for patterns, but here they are prefixes. I think it would help to name these `BRANCH_PREFIX` and `TAG_PREFIX`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.

jackye1995 commented on PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#issuecomment-1440632377

   I just asked for permission to push to this branch, addressing the comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6717: spark 3.3 read by snapshot ref schema

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #6717:
URL: https://github.com/apache/iceberg/pull/6717#discussion_r1115002962


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -369,6 +369,8 @@ private static CaseInsensitiveStringMap addSnapshotId(
       scanOptions.put(SparkReadOptions.SNAPSHOT_ID, value);
       scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
 

Review Comment:
   This is grouped incorrectly. We should have an empty line before the return statement and all remove calls should be grouped together.
   
   ```
   Map<String, String> scanOptions = Maps.newHashMap();
   scanOptions.putAll(options.asCaseSensitiveMap());
   scanOptions.put(SparkReadOptions.SNAPSHOT_ID, value);
   scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
   scanOptions.remove(SparkReadOptions.BRANCH);
   scanOptions.remove(SparkReadOptions.TAG);
   
   return new CaseInsensitiveStringMap(scanOptions);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org