You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "danielcweeks (via GitHub)" <gi...@apache.org> on 2023/02/12 00:48:39 UTC

[GitHub] [iceberg] danielcweeks opened a new pull request, #6811: Allow lazy loading of snapshots in TableMetadata

danielcweeks opened a new pull request, #6811:
URL: https://github.com/apache/iceberg/pull/6811

   After discussing with @rdblue about the value of loading refs and snapshots lazily as proposed in #6706, it makes more sense to always load refs and the referenced snapshots, but allow lazy loading of all other snapshots.
   
   This PR is a much smaller alternative to #6706 that keeps all snapshot management within the `TableMetadata` class since we're only concerned with snapshots being loaded.
   
   The optional snapshot supplier should always load all snapshots of the table if called, which will happen only in the event that:
   1. All snapshots need to be traversed (via `TableMetadata::snapshots()`)
   2. A specific snapshot is referenced that does not exist in the current snapshot index
   3. The TableMetadata is used as the base for building new metadata (via `TableMetadata.buildFrom(...)`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103852878


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -235,16 +236,18 @@ public String toString() {
   private final List<SortOrder> sortOrders;
   private final Map<String, String> properties;
   private final long currentSnapshotId;
-  private final List<Snapshot> snapshots;
-  private final Map<Long, Snapshot> snapshotsById;
   private final Map<Integer, Schema> schemasById;
   private final Map<Integer, PartitionSpec> specsById;
   private final Map<Integer, SortOrder> sortOrdersById;
   private final List<HistoryEntry> snapshotLog;
   private final List<MetadataLogEntry> previousFiles;
-  private final Map<String, SnapshotRef> refs;
   private final List<StatisticsFile> statisticsFiles;
   private final List<MetadataUpdate> changes;
+  private final SerializableSupplier<List<Snapshot>> snapshotsSupplier;
+  private List<Snapshot> snapshots;
+  private Map<Long, Snapshot> snapshotsById;
+  private Map<String, SnapshotRef> refs;
+  private boolean snapshotsLoaded;

Review Comment:
   Can you default this in the constructor?
   
   Also, I think it would make sense to set it to `true` if `snapshotsSupplier` is null. That way it is clear that if there is no supplier the snapshot list should be complete.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "danielcweeks (via GitHub)" <gi...@apache.org>.
danielcweeks commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103864436


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -235,16 +236,18 @@ public String toString() {
   private final List<SortOrder> sortOrders;
   private final Map<String, String> properties;
   private final long currentSnapshotId;
-  private final List<Snapshot> snapshots;
-  private final Map<Long, Snapshot> snapshotsById;
   private final Map<Integer, Schema> schemasById;
   private final Map<Integer, PartitionSpec> specsById;
   private final Map<Integer, SortOrder> sortOrdersById;
   private final List<HistoryEntry> snapshotLog;
   private final List<MetadataLogEntry> previousFiles;
-  private final Map<String, SnapshotRef> refs;
   private final List<StatisticsFile> statisticsFiles;
   private final List<MetadataUpdate> changes;
+  private final SerializableSupplier<List<Snapshot>> snapshotsSupplier;
+  private List<Snapshot> snapshots;

Review Comment:
   Ok, I've updated to be `volatile` and `ensureSnapshotsLoaded` should be synchronized to make it thread safe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103853090


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -481,9 +488,31 @@ public Snapshot currentSnapshot() {
   }
 
   public List<Snapshot> snapshots() {
+    ensureSnapshotsLoaded();
+
     return snapshots;
   }
 
+  private void ensureSnapshotsLoaded() {

Review Comment:
   This should probably be `synchronized` because it is okay to share tables across threads.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#issuecomment-1427118191

   +1 when tests are passing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103856648


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -885,7 +921,7 @@ private Builder(TableMetadata base) {
       this.sortOrders = Lists.newArrayList(base.sortOrders);
       this.properties = Maps.newHashMap(base.properties);
       this.currentSnapshotId = base.currentSnapshotId;
-      this.snapshots = Lists.newArrayList(base.snapshots);
+      this.snapshots = Lists.newArrayList(base.snapshots());

Review Comment:
   Is this needed?
   
   It will always load snapshots when building a new table metadata and that doesn't seem right. I think it would be correct to do it that way, but you don't really need to load them as long as the `snapshotsSupplier` is passed here as well. If all of the snapshots are needed to write the new metadata to a file, the metadata parser will call `snapshots()` at which point they will be loaded.
   
   The only problem I can think of is that we're replacing the snapshot list completely in `ensureSnapshotsLoaded` and we would need to update that to basically union the initially loaded snapshots with the new ones that are before the sequence number. That's a bit more complex and we could do that later if we wanted to.
   
   I think the argument for lazily loading snapshots is that this is in the critical path during a commit, so hitting a REST endpoint is going to increase the chances of conflicts. On the other hand, this is a simpler way to make snapshot loading lazy for all read cases.
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103854081


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -235,16 +236,18 @@ public String toString() {
   private final List<SortOrder> sortOrders;
   private final Map<String, String> properties;
   private final long currentSnapshotId;
-  private final List<Snapshot> snapshots;
-  private final Map<Long, Snapshot> snapshotsById;
   private final Map<Integer, Schema> schemasById;
   private final Map<Integer, PartitionSpec> specsById;
   private final Map<Integer, SortOrder> sortOrdersById;
   private final List<HistoryEntry> snapshotLog;
   private final List<MetadataLogEntry> previousFiles;
-  private final Map<String, SnapshotRef> refs;
   private final List<StatisticsFile> statisticsFiles;
   private final List<MetadataUpdate> changes;
+  private final SerializableSupplier<List<Snapshot>> snapshotsSupplier;

Review Comment:
   Do you think we should make this `null` after it is called, or is there value in keeping it around for debugging purposes? Setting it to null would ensure that this doesn't needlessly keep values from being garbage collected, although I can't think of anything large that would get captured in the lambda.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "danielcweeks (via GitHub)" <gi...@apache.org>.
danielcweeks commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103864887


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -885,7 +921,7 @@ private Builder(TableMetadata base) {
       this.sortOrders = Lists.newArrayList(base.sortOrders);
       this.properties = Maps.newHashMap(base.properties);
       this.currentSnapshotId = base.currentSnapshotId;
-      this.snapshots = Lists.newArrayList(base.snapshots);
+      this.snapshots = Lists.newArrayList(base.snapshots());

Review Comment:
   For now, it's safest to eagerly load snapshots when building new metadata.  The issue with carrying over the snapshot supplier is that we don't perform the proper validation and pruning unless we extract the logic that is done in the base metadata.
   
   I think that would be a good follow up to see if it's safe, but I don't think we need to do that immediately.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue merged PR #6811:
URL: https://github.com/apache/iceberg/pull/6811


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "danielcweeks (via GitHub)" <gi...@apache.org>.
danielcweeks commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103864183


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -235,16 +236,18 @@ public String toString() {
   private final List<SortOrder> sortOrders;
   private final Map<String, String> properties;
   private final long currentSnapshotId;
-  private final List<Snapshot> snapshots;
-  private final Map<Long, Snapshot> snapshotsById;
   private final Map<Integer, Schema> schemasById;
   private final Map<Integer, PartitionSpec> specsById;
   private final Map<Integer, SortOrder> sortOrdersById;
   private final List<HistoryEntry> snapshotLog;
   private final List<MetadataLogEntry> previousFiles;
-  private final Map<String, SnapshotRef> refs;
   private final List<StatisticsFile> statisticsFiles;
   private final List<MetadataUpdate> changes;
+  private final SerializableSupplier<List<Snapshot>> snapshotsSupplier;

Review Comment:
   I think it's worth de-referencing since there's no reason to hold the reference after.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6811: Allow lazy loading of snapshots in TableMetadata

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6811:
URL: https://github.com/apache/iceberg/pull/6811#discussion_r1103854206


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -235,16 +236,18 @@ public String toString() {
   private final List<SortOrder> sortOrders;
   private final Map<String, String> properties;
   private final long currentSnapshotId;
-  private final List<Snapshot> snapshots;
-  private final Map<Long, Snapshot> snapshotsById;
   private final Map<Integer, Schema> schemasById;
   private final Map<Integer, PartitionSpec> specsById;
   private final Map<Integer, SortOrder> sortOrdersById;
   private final List<HistoryEntry> snapshotLog;
   private final List<MetadataLogEntry> previousFiles;
-  private final Map<String, SnapshotRef> refs;
   private final List<StatisticsFile> statisticsFiles;
   private final List<MetadataUpdate> changes;
+  private final SerializableSupplier<List<Snapshot>> snapshotsSupplier;
+  private List<Snapshot> snapshots;

Review Comment:
   I think these variables should also be `transient` since they are modified by `ensureSnapshotsLoaded` and that could be called by multiple threads.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org