You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/20 01:02:08 UTC

[GitHub] [iceberg] rdblue opened a new pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

rdblue opened a new pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775


   The util class `SnapshotUtil` is shared, so it needs to have fairly strict method contracts. This replaces `SnapshotUtil.firstSnapshotAfterTimestamp` with `oldestAncestorAfter`:
   * The new implementation throws `IllegalStateException` if the correct ancestor cannot be determined
   * The new name is clear that the snapshots considered are ancestors, not all snapshots
   
   This also updates places that called `firstSnapshotAfterTimestamp` and attempts to have the same behavior by catching the `IllegalStateException` (cannot determine ancestor) and uses the oldest ancestor instead. However, in updating the Spark streaming code, I noticed a few bugs:
   * `planFiles` will call `.snapshotId()` without checking the snapshot, which can be null if the timestamp is newer than the current, resulting in a `NullPointerException`
   * The initial offset store checks for a future timestamp, but `planFiles` does not
   * The initial offset store handles null `fromTimestamp`, but `planFiles` does not
   * The `initialFutureStartOffset` creates an offset after the current snapshot, but not necessarily after the given future time
   
   I'm also attempting to fix those issues. This simplifies the code by removing several static helper methods. These assisted readability, but made assumptions about whether the table has a current snapshot and so were prone to `NullPointerExceptions`. Instead, this adds `determineStartingOffset` that is responsible for getting a starting offset or returning `StreamingOffset.START_OFFSET` if it cannot be determined because of the table state or a timestamp in the future.
   
   Now, `StreamingOffset.START_OFFSET` means that the job cannot start because the start offset has not been determined, and there is no "initial future offset".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772594645



##########
File path: spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -169,8 +169,7 @@ public void stop() {
   private List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset) {
     List<FileScanTask> fileScanTasks = Lists.newArrayList();
     StreamingOffset batchStartOffset = StreamingOffset.START_OFFSET.equals(startOffset) ?
-        new StreamingOffset(SnapshotUtil.firstSnapshotAfterTimestamp(table, fromTimestamp).snapshotId(), 0, false) :
-        startOffset;
+        determineStartingOffset(table, fromTimestamp) : startOffset;

Review comment:
       I wonder if it would be simpler if we just drop the ternary operator here and have "suppliedOffset" be a parameter like
   ```java
   determineStartingOffset(table, fromTimestamp, startOffset) {
   ... logic
   }
   
   determineStartingOffset(table, fromTimetsamp) {
      determineStartingOffset(table, fromTimetsamp, null)
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772692074



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp

Review comment:
       OK, got it that it's a new method rather than change.  To me the previous javadoc is more clear, as "older" gives the impression of strictly greater than.  (there's also a removal of a comment below on this topic).
   
   ```
   // Return the oldest snapshot which satisfies >= targetTimestamp
   ```

##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp

Review comment:
       OK, got it that it's a new method rather than change.  To me the previous javadoc is more clear, as "older" gives the impression of strictly greater than. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772586836



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp

Review comment:
       I would add the word "committed" again here and before all the "afters". I don't know why I can't handle both concepts in my head at the same time, but I keep visualizing a linked list fo snapshots.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772695828



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() < timestampMillis) {
         return lastSnapshot;
+      } else if (snapshot.timestampMillis() == timestampMillis) {
+        return snapshot;

Review comment:
       OK, thanks for context that it's a new method rather than changing the existing one.  
   
   The previous code is slightly easier (one less case for user to understand, and in my understanding it still works but just takes one more cycle), but not a big deal as the new case is straight forward.  
   
   If it was me, I'd prefer the clarity of Stream methods, but I guess we do mostly manual traversals in Iceberg due to performance.
   ```
   currentAncestors(table).toStream().filter(Snapshot::timestampMillis() <=  timestampMillis).findFirst()
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772692074



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp

Review comment:
       OK, got it that it's a new method rather than change.  To me the previous javadoc is more clear, as "older" gives the impression of strictly greater than.  (there's also a removal of a comment below on this topic).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772032673



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,35 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot after the given timestamp.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() <= timestampMillis) {
         return lastSnapshot;
       }
+
       lastSnapshot = snapshot;
     }
 
-    // Return the oldest snapshot if the target timestamp is less than the oldest snapshot of the table
-    return lastSnapshot;
+    if (lastSnapshot != null && lastSnapshot.parentId() == null) {
+      // this is the first snapshot in the table, return it

Review comment:
       I am a little worried about having a function which works for a given input but only until the starting snapshot is expired. For example
   ```
   oldestAncestorAfter(table,  Long.MinValue) // Returns first snapshot
   expireSnapshots() // Expire first snapshot
   oldestAncestorAfter(table,  Long.MinValue) // Throws exception
   ```
   
   I think if we want to standardize this should probably also throw an exception




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#issuecomment-998476814


   @Me Reminder to back port test changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772616644



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp

Review comment:
       I'm trying to understand the removal of ">=" from this and other comments, is it intended?
   
   If so, it doens't seem to match the code below which still seems to return ">="
   ```
   else if (snapshot.timestampMillis() == timestampMillis() 
      return snapshot;
   ```
   
   

##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() < timestampMillis) {
         return lastSnapshot;
+      } else if (snapshot.timestampMillis() == timestampMillis) {
+        return snapshot;

Review comment:
       Maybe I'm missing something, but curious what case does adding this condition fix?
   
   Even in the old code, seems like it would return a snapshot which satisfies timestamp >= snapshot.timestamp? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772632039



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() < timestampMillis) {
         return lastSnapshot;
+      } else if (snapshot.timestampMillis() == timestampMillis) {
+        return snapshot;

Review comment:
       This isn't changing the method as much as completely replacing the old code with a different implementation that I suggested originally. I didn't want to go through and figure out what had changed and why, I just wanted to make it work.
   
   The clause here catches the case where the current snapshot is the one to return because its timestamp matches the requested timestamp. In that case, there's no need to have an earlier parent so we short-circuit early.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#issuecomment-998377008


   Thanks, @RussellSpitzer! It's looking a lot better now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772631414



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,35 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot after the given timestamp.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() <= timestampMillis) {
         return lastSnapshot;
       }
+
       lastSnapshot = snapshot;
     }
 
-    // Return the oldest snapshot if the target timestamp is less than the oldest snapshot of the table
-    return lastSnapshot;
+    if (lastSnapshot != null && lastSnapshot.parentId() == null) {
+      // this is the first snapshot in the table, return it

Review comment:
       I agree that we should change it, I just think we can just always throw the exception so
   ```
   oldestAncestorAfter(table,  Long.MinValue) // Throw exception
   expireSnapshots() // Expire first snapshot
   oldestAncestorAfter(table,  Long.MinValue) // Throw exception
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772630738



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp

Review comment:
       I replaced the previous Javadoc with the original one that I wrote. I can be more clear here if needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772030596



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,35 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot after the given timestamp.

Review comment:
       "First snapshot created after" I was confused about this description before because "after" could refer to the traversal which would be the snapshot one older than the timestamp rather than the  snapshot one younger than the timestamp which I believe is the intent here




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772636577



##########
File path: spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -169,8 +169,7 @@ public void stop() {
   private List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset) {
     List<FileScanTask> fileScanTasks = Lists.newArrayList();
     StreamingOffset batchStartOffset = StreamingOffset.START_OFFSET.equals(startOffset) ?
-        new StreamingOffset(SnapshotUtil.firstSnapshotAfterTimestamp(table, fromTimestamp).snapshotId(), 0, false) :
-        startOffset;
+        determineStartingOffset(table, fromTimestamp) : startOffset;

Review comment:
       I think that the problem is that we have 3 "start" offsets here:
   1. `StreamingOffset.START_OFFSET` is a fake offset for when we don't have enough information, like when the table has no snapshots or the timestamp is in the future
   2. The "starting" offset referred to by `determineStartingOffset` is the concrete replacement for `START_OFFSET`, if the there is enough information to start; otherwise it returns `START_OFFSET` as a placeholder again
   3. `startOffset` is the known starting point for this batch, from the last batch's end offset or initialization
   
   I don't think that it makes sense to mix `startOffset` into the method because they're different things and `determineStartingOffset` is already fairly complicated. We could have a separate method, `determineBatchStartOffset` instead, but that would basically be this expression.
   
   A better fix for this confusion is probably to rename the method. Maybe `determineInitialOffset` is more clear?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772528200



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,35 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot after the given timestamp.

Review comment:
       I see what you mean. We need to specify what we're ordering by. I'll update this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772594645



##########
File path: spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
##########
@@ -169,8 +169,7 @@ public void stop() {
   private List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset) {
     List<FileScanTask> fileScanTasks = Lists.newArrayList();
     StreamingOffset batchStartOffset = StreamingOffset.START_OFFSET.equals(startOffset) ?
-        new StreamingOffset(SnapshotUtil.firstSnapshotAfterTimestamp(table, fromTimestamp).snapshotId(), 0, false) :
-        startOffset;
+        determineStartingOffset(table, fromTimestamp) : startOffset;

Review comment:
       I wonder if it would be simpler if we just drop the ternary operator here and have "suppliedOffset" be a parameter like
   ```java
   private Offset determineStartingOffset(table, fromTimestamp, startOffset) {
     if (startOffset != null && !StreamingOffset.equals(startOffset)){
       return startOffset;
     }
   ... logic
   }
   
   private Offset determineStartingOffset(table, fromTimetsamp) {
      determineStartingOffset(table, fromTimetsamp, null)
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772732763



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() < timestampMillis) {
         return lastSnapshot;
+      } else if (snapshot.timestampMillis() == timestampMillis) {
+        return snapshot;

Review comment:
       In this case, we're also more carefully traversing the ancestors. We aren't finding the snapshot before timestampMillis, we're finding the child of the first snapshot before timestamp millis. And if the parent doesn't exist, it throws an exception.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772589004



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,35 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot after the given timestamp.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() <= timestampMillis) {
         return lastSnapshot;
       }
+
       lastSnapshot = snapshot;
     }
 
-    // Return the oldest snapshot if the target timestamp is less than the oldest snapshot of the table
-    return lastSnapshot;
+    if (lastSnapshot != null && lastSnapshot.parentId() == null) {
+      // this is the first snapshot in the table, return it

Review comment:
       I see your point here, but the result is based on the table state that gets passed in. If the table state is missing information, then we can't make it consistent.
   
   Here's another way to think about it:
   
   ```
   t1 = commitSnapshotOne()
   t2 = commitSnapshotTwo()
   oldestAncestorAfter(table, Long.MinValue) // returns snapshot one
   expireSnapshots(t2 - 1)
   oldestAncestorAfter(table, Long.MinValue) // returns snapshot two
   ```
   
   I think that the behavior above is worse than throwing an exception based on the table state because it is silently inconsisent. At least throwing an exception tells you why it isn't returning the expected value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3775: Core: Replace SnapshotUtil firstSnapshotAfterTimestamp

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3775:
URL: https://github.com/apache/iceberg/pull/3775#discussion_r772619548



##########
File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
##########
@@ -102,43 +102,37 @@ public static Snapshot oldestAncestor(Table table) {
   }
 
   /**
-   * Traverses the history of the table's current snapshot and:
-   * 1. returns null, if no snapshot exists or target timestamp is more recent than the current snapshot.
-   * 2. else return the first snapshot which satisfies {@literal >=} targetTimestamp.
-   * <p>
-   * Given the snapshots (with timestamp): [S1 (10), S2 (11), S3 (12), S4 (14)]
-   * <p>
-   * firstSnapshotAfterTimestamp(table, x {@literal <=} 10) = S1
-   * firstSnapshotAfterTimestamp(table, 11) = S2
-   * firstSnapshotAfterTimestamp(table, 13) = S4
-   * firstSnapshotAfterTimestamp(table, 14) = S4
-   * firstSnapshotAfterTimestamp(table, x {@literal >} 14) = null
-   * <p>
-   * where x is the target timestamp in milliseconds and Si is the snapshot
+   * Traverses the history of the table's current snapshot and finds the first snapshot committed after the given time.
    *
    * @param table a table
-   * @param targetTimestampMillis a timestamp in milliseconds
-   * @return the first snapshot which satisfies {@literal >=} targetTimestamp, or null if the current snapshot is
-   * more recent than the target timestamp
+   * @param timestampMillis a timestamp in milliseconds
+   * @return the first snapshot after the given timestamp, or null if the current snapshot is older than the timestamp
+   * @throws IllegalStateException if the first ancestor after the given time can't be determined
    */
-  public static Snapshot firstSnapshotAfterTimestamp(Table table, Long targetTimestampMillis) {
-    Snapshot currentSnapshot = table.currentSnapshot();
-    // Return null if no snapshot exists or target timestamp is more recent than the current snapshot
-    if (currentSnapshot == null || currentSnapshot.timestampMillis() < targetTimestampMillis) {
+  public static Snapshot oldestAncestorAfter(Table table, long timestampMillis) {
+    if (table.currentSnapshot() == null) {
+      // there are no snapshots or ancestors
       return null;
     }
 
-    // Return the oldest snapshot which satisfies >= targetTimestamp
     Snapshot lastSnapshot = null;
     for (Snapshot snapshot : currentAncestors(table)) {
-      if (snapshot.timestampMillis() < targetTimestampMillis) {
+      if (snapshot.timestampMillis() < timestampMillis) {
         return lastSnapshot;
+      } else if (snapshot.timestampMillis() == timestampMillis) {
+        return snapshot;

Review comment:
       Maybe I'm missing something, but curious what case does this new condition fix?
   
   Even in the old code, seems like it would return a snapshot which satisfies timestamp >= snapshot.timestamp? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org