You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/24 22:33:30 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #1245: Refactor RemoveSnapshots for easier extension

RussellSpitzer opened a new pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245


   Previously all of the code for looking up which manifests need to be scanned
   and the code for scanning them was deeply tied to the RemoveSnapshots object. This made
   it difficult to write any new implementations which had similar functionality. In
   this patch we extract out many of the subfunctions and more complicated pieces of code
   into a new utility class to allow from access from other modules.
   
   The code for actually scanning manifests was extracted into ManifestExpirationManager
   to allow access to Manifest information without exposing Manifest internals.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r461930099



##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.ExpireSnapshotUtil;
+import org.apache.iceberg.util.ExpireSnapshotUtil.ManifestExpirationChanges;
+import org.apache.iceberg.util.ExpireSnapshotUtil.SnapshotExpirationChanges;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager implements Serializable {
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+  private final SnapshotExpirationChanges snapshotChanges;
+  private final Set<Long> validSnapshotIds;
+  private final ManifestExpirationChanges manifestChanges;
+  private final Map<Integer, PartitionSpec> specLookup;
+  private final FileIO io;
+
+  public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){
+    this.io = io;
+    this.specLookup = current.specsById();
+    this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original);
+    this.validSnapshotIds = snapshotChanges.getValidSnapshotIds();
+    if (snapshotChanges.getExpiredSnapshots().size() != 0) {
+      this.manifestChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration(
+          snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io);
+    } else {
+      this.manifestChanges = null;
+    }
+  }
+
+  public ManifestExpirationChanges getManifestChanges() {
+    return manifestChanges;
+  }
+
+  public SnapshotExpirationChanges getSnapshotChanges() {
+    return snapshotChanges;
+  }
+
+  public Set<String> scanManifestsForAbandonedDeletedFiles() {
+    if (manifestChanges != null) {
+      return scanManifestsForAbandonedDeletedFiles(manifestChanges.manifestsToScan());
+    } else {
+      return Sets.newHashSet();
+    }
+  }
+
+  public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
+    Set<String> filesToDelete = new HashSet<>();

Review comment:
       Is it thread-safe, btw?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-676829494


   I think we should close it out since I'm not sure there is much benefit
   without the other usecase
   
   On Wed, Aug 19, 2020, 6:55 PM Ryan Blue <no...@github.com> wrote:
   
   > @RussellSpitzer <https://github.com/RussellSpitzer>, are you still
   > interested in this or should we close it now that #1264
   > <https://github.com/apache/iceberg/pull/1264> is in?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/iceberg/pull/1245#issuecomment-676818843>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AADE2YJYDXSQVGF6S5Q5UMTSBRQ7DANCNFSM4PHD5P4Q>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r464528819



##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.io.UncheckedIOException;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.stream.Collectors;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager implements Serializable {
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  private final Set<Long> validSnapshotIds;
+  private final Map<Integer, PartitionSpec> specLookup;
+  private final FileIO io;
+
+  private ManifestExpirationManager(Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup, FileIO io) {
+    this.validSnapshotIds = validSnapshotIds;
+    this.specLookup = specLookup;
+    this.io = io;
+  }
+
+  /**
+   * Creates a Manifest Scanner for analyzing manifests for files which are no longer needed
+   * after expiration.
+   * @param current metadata for the table being expired
+   * @param io FileIO for the table, for Serializable usecases make sure this is also serializable for the framework

Review comment:
       Unfortunately we can't control the underlying implementation of IO here and some instances will require a specfic manipulation to make sure they serialize correctly.
   
   I added this specifically (instead of just allowing passing in ops) because for Spark we use SparkUtil.serializableIO to check whether the IO is of the HadoopIO class and make it serializable if so. We can't do that here because it would require dependencies from other modules and also be more or less specific to Spark. I think our only real option is to leave it up to the caller to make sure that portion of the class is correct for their application. :/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-665260613






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r464478304



##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed
+  private ExpireSnapshotUtil() {}
+
+  /**
+   * Determines the manifest files which need to be inspected because they refer to data files which
+   * can be removed after a Snapshot Expiration.
+   *
+   * Our goal is to determine which manifest files we actually need to read through because they
+   * may refer to files which are no longer accessible from any valid snapshot and do not effect
+   * the current table.
+   *
+   * For this we need to look through
+   *   1. Snapshots which have not expired but contain manifests from expired snapshots
+   *   2. Snapshots which have expired and contain manifests referring to now orphaned files
+   *
+   * @param snapshotChanges       Information about which Snapshots are still valid and which have been expired
+   * @param original              The table metadata from before the snapshot expiration
+   * @param io                    FileIO for reading manifest info
+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(
+      SnapshotExpirationChanges snapshotChanges, TableMetadata original, FileIO io) {
+
+    AncestorInfo ancestorInfo = getAncestorInfo(original);
+
+    Set<Snapshot> currentSnapshots = snapshotChanges.getValidSnapshots();
+    Set<Snapshot> expiredSnapshots = snapshotChanges.getExpiredSnapshots();
+    Set<Long> validIds = snapshotChanges.getValidSnapshotIds();
+    Set<Long> expiredIds = snapshotChanges.getExpiredSnapshotIds();
+    Set<Long> ancestorIds = ancestorInfo.getAncestorIds();
+
+    //Snapshots which are not expired but refer to manifests from expired snapshots
+    Set<ManifestFile> validManifests = getValidManifests(currentSnapshots, io);
+    Set<ManifestFile> manifestsToScan = findValidManifestsInExpiredSnapshots(validManifests, ancestorInfo, validIds);
+
+    //Snapshots which are expired and do not effect the current table
+    List<Snapshot> snapshotsNotInTableState = findSnapshotsNotInTableState(expiredSnapshots, ancestorInfo);
+    ManifestExpirationChanges manifestExpirationChanges =
+        findExpiredManifestsInUnusedSnapshots(snapshotsNotInTableState, validManifests, ancestorIds, expiredIds, io);
+
+    manifestExpirationChanges.manifestsToScan().addAll(manifestsToScan);
+    return manifestExpirationChanges;
+  }
+
+  /**
+   * Compares the Snapshots from the two TableMetadata objects and identifies the snapshots
+   * still in use and those no longer in use
+   * @param current Metadata from a table after an expiration of snapshots
+   * @param original Metadata from the table before expiration of snapshots
+   * @return Information about which Snapshots have Expired and which are Still Valid
+   */
+  public static SnapshotExpirationChanges getExpiredSnapshots(TableMetadata current, TableMetadata original) {
+
+    Set<Snapshot> validSnapshots = Sets.newHashSet();
+    for (Snapshot snapshot : current.snapshots()) {
+      validSnapshots.add(snapshot);
+    }
+
+    Set<Snapshot> expiredSnapshots = Sets.newHashSet();
+    for (Snapshot snapshot : original.snapshots()) {
+      if (!validSnapshots.contains(snapshot)) {

Review comment:
       We are comparing the same objects in memory here, but we could do a more complicated hash and equals if we want. Originally this was just a match made on the SnapshotId but since everything is dealing with the same references I thought this would be a bit clearer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r463816507



##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.ExpireSnapshotUtil;
+import org.apache.iceberg.util.ExpireSnapshotUtil.ManifestExpirationChanges;
+import org.apache.iceberg.util.ExpireSnapshotUtil.SnapshotExpirationChanges;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager implements Serializable {
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+  private final SnapshotExpirationChanges snapshotChanges;
+  private final Set<Long> validSnapshotIds;
+  private final ManifestExpirationChanges manifestChanges;
+  private final Map<Integer, PartitionSpec> specLookup;
+  private final FileIO io;
+
+  public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){
+    this.io = io;
+    this.specLookup = current.specsById();
+    this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original);
+    this.validSnapshotIds = snapshotChanges.getValidSnapshotIds();
+    if (snapshotChanges.getExpiredSnapshots().size() != 0) {
+      this.manifestChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration(
+          snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io);
+    } else {
+      this.manifestChanges = null;
+    }
+  }
+
+  public ManifestExpirationChanges getManifestChanges() {
+    return manifestChanges;
+  }
+
+  public SnapshotExpirationChanges getSnapshotChanges() {
+    return snapshotChanges;
+  }
+
+  public Set<String> scanManifestsForAbandonedDeletedFiles() {
+    if (manifestChanges != null) {
+      return scanManifestsForAbandonedDeletedFiles(manifestChanges.manifestsToScan());
+    } else {
+      return Sets.newHashSet();
+    }
+  }
+
+  public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
+    Set<String> filesToDelete = new HashSet<>();

Review comment:
       Good catch, this usage requires a Concurrent set




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-669298759


   https://github.com/apache/iceberg/pull/1264 - Expire Snapshots Action without using RemoveSnapshots Code Path


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-668663342


   Brief Discussion with @rdblue + @aokolnychyi lead to another approach we are going to try out for finding expired files. We will instead of doing the old logic, read a table from a specific point in history and get all datafilee from that, then do an anti join with the current state to determine the set of no longer used files.
   
   To do this there will be need to be another PR focused on being able to read a table from a specific manifest


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-663760162


   @aokolnychyi just the Refactor


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-665295397


   @rdblue I think I moved a bit in the wrong direction here and I want to make sure we are on the same page as to the end goal of the refactor. I'll be on the call tomorrow if you want to discuss in person or if you want to keep discussing here that is fine as well.
   
   I think I do want to revert the last commit and maybe concentrate on still isolating the Manifest Scanning code from the Determining which Manifests to Scan code. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-676818843


   @RussellSpitzer, are you still interested in this or should we close it now that #1264 is in?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-667340945


   @aokolnychyi  + @rdblue I worked on cleaning this up, I simplified the public methods so they should hopefully be easier to use. As well I tried to reduce the input to many of the internal private functions to just the inputs they required, rather than the parent classes which contain everything. I also moved out the ancestorInfo into a separate wrapper class so we don't recalculate that either.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r461848815



##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed

Review comment:
       Nit: missing space between `//` and `Utility`.

##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed
+  private ExpireSnapshotUtil() {}
+
+  /**
+   * Determines the manifest files which need to be inspected because they refer to data files which
+   * can be removed after a Snapshot Expiration.
+   *
+   * Our goal is to determine which manifest files we actually need to read through because they
+   * may refer to files which are no longer accessible from any valid snapshot and do not effect
+   * the current table.
+   *
+   * For this we need to look through
+   *   1. Snapshots which have not expired but contain manifests from expired snapshots
+   *   2. Snapshots which have expired and contain manifests referring to now orphaned files
+   *
+   * @param validIds              The Ids of the Snapshots which have not been expired
+   * @param expiredIds            The Ids of the Snapshots which have been expired
+   * @param current       The table metadata from after the snapshot expiration
+   * @param original      The table metadata from before the snapshot expiration
+   * @param io                    FileIO for reading manifest info
+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds,
+      Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) {

Review comment:
       Why pass both valid/expired IDs as well as the current/original table metadata? The valid and expired IDs can come from the table metadata.

##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed
+  private ExpireSnapshotUtil() {}
+
+  /**
+   * Determines the manifest files which need to be inspected because they refer to data files which
+   * can be removed after a Snapshot Expiration.
+   *
+   * Our goal is to determine which manifest files we actually need to read through because they
+   * may refer to files which are no longer accessible from any valid snapshot and do not effect
+   * the current table.
+   *
+   * For this we need to look through
+   *   1. Snapshots which have not expired but contain manifests from expired snapshots
+   *   2. Snapshots which have expired and contain manifests referring to now orphaned files
+   *
+   * @param validIds              The Ids of the Snapshots which have not been expired
+   * @param expiredIds            The Ids of the Snapshots which have been expired
+   * @param current       The table metadata from after the snapshot expiration
+   * @param original      The table metadata from before the snapshot expiration
+   * @param io                    FileIO for reading manifest info
+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds,
+      Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) {
+
+    Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(original.currentSnapshot(), original::snapshot));
+    Set<Long> pickedAncestorIds = getPickedAncestorIds(original, ancestorIds);
+
+    List<Snapshot> currentSnapshots = current.snapshots();
+
+    //Snapshots which are not expired but refer to manifests from expired snapshots
+    Set<ManifestFile> validManifests = getValidManifests(currentSnapshots, io);
+    Set<ManifestFile> manifestsToScan = findValidManifestsInExpiredSnapshots(validManifests, ancestorIds,

Review comment:
       It would be good to have a comment here to explain why the valid manifests are in expired snapshots. Maybe the name should be more clear, too: `validManifestsCreatedByExpiredSnapshots`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r461189838



##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager {
+
+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  /**
+   * Scans a set of manifest files for deleted data files which do not refer to a valid snapshot and can be actually
+   * removed from the filesystem. This implementation will use a concurrent approach and will ignore any
+   * failed manifest reads.
+   *
+   * @param manifests        Manifests to scan, all of these files will be scanned
+   * @param validSnapshotIds Snapshot Ids currently considered valid in Table metadata
+   * @param specLookup       A mapping between partitionID and the spec describing that partition
+   * @param io               IO used for reading the files
+   * @return The set of all files that can be safely deleted
+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {

Review comment:
       This ends up being a bit more complicated than I anticipated if we want to make sure that "ManfiestExpirationChanges" and the Manager are both not only serializable but also splittable ... I think if I do a bit more refactoring I can make this happen but I believe this will make it harder/ more confusing in the future. 
   
   For example I'll need this class to not only generate the full list of manifests which need to be scanned but also have a method which only operates on some of the generated list.
   
   Also we end up loosing a bit of control about how we serialize the component parts, we would have always broadcast everything if we wrap the entire module up.
   
   I can go ahead and finish up the refactor but I would like to check with you before I devote more time to this.
   
   My code at the moment is a bit like
   
   ```
   //These are all serializable
     private final SnapshotExpirationChanges snapshotChanges; 
     private final ManifestExpirationChanges manifestExpirationChanges;
     private final Map<Integer, PartitionSpec> specLookup;
     private final FileIO io;
   
     public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){
       this.io = io;
       this.specLookup = current.specsById();
       this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original);
       this.manifestExpirationChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration(
           snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io)
       )
     //We can simplify the functions above and move them into this class
   
   // But then we need to be able to get manifestExpirationChanges because we want to parallelize information from them
     getManifestExpirationChanges()
   
    // And now our scan method has to take a list of manifests still or we have to pass some kind of partition/split info
     public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
   
     }
   ```
   
   Let me know if this is kind of what you were thinking about
   

##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager {
+
+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  /**
+   * Scans a set of manifest files for deleted data files which do not refer to a valid snapshot and can be actually
+   * removed from the filesystem. This implementation will use a concurrent approach and will ignore any
+   * failed manifest reads.
+   *
+   * @param manifests        Manifests to scan, all of these files will be scanned
+   * @param validSnapshotIds Snapshot Ids currently considered valid in Table metadata
+   * @param specLookup       A mapping between partitionID and the spec describing that partition
+   * @param io               IO used for reading the files
+   * @return The set of all files that can be safely deleted
+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {

Review comment:
       I pushed a commit with this as a work in progress. I tried to keep the class as stateless as possible but it will have some null handling behaviors to maintain old effects.

##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager {
+
+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  /**
+   * Scans a set of manifest files for deleted data files which do not refer to a valid snapshot and can be actually
+   * removed from the filesystem. This implementation will use a concurrent approach and will ignore any
+   * failed manifest reads.
+   *
+   * @param manifests        Manifests to scan, all of these files will be scanned
+   * @param validSnapshotIds Snapshot Ids currently considered valid in Table metadata
+   * @param specLookup       A mapping between partitionID and the spec describing that partition
+   * @param io               IO used for reading the files
+   * @return The set of all files that can be safely deleted
+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {

Review comment:
       See https://github.com/apache/iceberg/pull/1264 For a demonstration of how we would use this in Spark

##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed
+  private ExpireSnapshotUtil() {}
+
+  /**
+   * Determines the manifest files which need to be inspected because they refer to data files which
+   * can be removed after a Snapshot Expiration.
+   *
+   * Our goal is to determine which manifest files we actually need to read through because they
+   * may refer to files which are no longer accessible from any valid snapshot and do not effect
+   * the current table.
+   *
+   * For this we need to look through
+   *   1. Snapshots which have not expired but contain manifests from expired snapshots
+   *   2. Snapshots which have expired and contain manifests referring to now orphaned files
+   *
+   * @param validIds              The Ids of the Snapshots which have not been expired
+   * @param expiredIds            The Ids of the Snapshots which have been expired
+   * @param current       The table metadata from after the snapshot expiration
+   * @param original      The table metadata from before the snapshot expiration
+   * @param io                    FileIO for reading manifest info
+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds,
+      Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) {

Review comment:
       Yes we could recompute that here, I think @aokolnychyi and I were discussing this before and we were trying to reduce the duplication of some calculations. But we could easily just have it recompute changes here. 
   
   The difficulty here, and in most of the code, is that the previous behavior was
   
   Find Snapshot Differences 
   Log Expired Snapshots
   If expiredSnapshots is not empty & deleteFiles:
      Figure out Ancestors // Only requires the current Snapshot
      Figure out CherryPicked Ancestors // requires Metadata
     
   
   In the old code each step of the way we build up some of the information used further down the pathway, sometimes
   the references are serializable and sometimes they aren't. Now we want something we can serialize and split, so because of that we have to be very careful about what things we keep in the class, and what things are only used during Init. 
   
    Ideally I wouldn't want to pass through the tablemetdata at all since it's a very heavy class and not serializable but there are a few points where we can't avoid it. For example in computing cherry picked ancestors we need to use a map that exists within the metadata to lookup snapshots from their Id's.  
   
   Let me see if I can refactor this function so that it doesn't use the metadata at all and only takes in the information from the "SnapshotChanges" discovered previously. I think the cherry picked ancestors is the only one that ends up being a bit hairy.
      




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r464047469



##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed
+  private ExpireSnapshotUtil() {}
+
+  /**
+   * Determines the manifest files which need to be inspected because they refer to data files which
+   * can be removed after a Snapshot Expiration.
+   *
+   * Our goal is to determine which manifest files we actually need to read through because they
+   * may refer to files which are no longer accessible from any valid snapshot and do not effect
+   * the current table.
+   *
+   * For this we need to look through
+   *   1. Snapshots which have not expired but contain manifests from expired snapshots
+   *   2. Snapshots which have expired and contain manifests referring to now orphaned files
+   *
+   * @param snapshotChanges       Information about which Snapshots are still valid and which have been expired
+   * @param original              The table metadata from before the snapshot expiration
+   * @param io                    FileIO for reading manifest info
+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(
+      SnapshotExpirationChanges snapshotChanges, TableMetadata original, FileIO io) {
+
+    AncestorInfo ancestorInfo = getAncestorInfo(original);
+
+    Set<Snapshot> currentSnapshots = snapshotChanges.getValidSnapshots();
+    Set<Snapshot> expiredSnapshots = snapshotChanges.getExpiredSnapshots();
+    Set<Long> validIds = snapshotChanges.getValidSnapshotIds();
+    Set<Long> expiredIds = snapshotChanges.getExpiredSnapshotIds();
+    Set<Long> ancestorIds = ancestorInfo.getAncestorIds();
+
+    //Snapshots which are not expired but refer to manifests from expired snapshots
+    Set<ManifestFile> validManifests = getValidManifests(currentSnapshots, io);
+    Set<ManifestFile> manifestsToScan = findValidManifestsInExpiredSnapshots(validManifests, ancestorInfo, validIds);
+
+    //Snapshots which are expired and do not effect the current table
+    List<Snapshot> snapshotsNotInTableState = findSnapshotsNotInTableState(expiredSnapshots, ancestorInfo);
+    ManifestExpirationChanges manifestExpirationChanges =
+        findExpiredManifestsInUnusedSnapshots(snapshotsNotInTableState, validManifests, ancestorIds, expiredIds, io);
+
+    manifestExpirationChanges.manifestsToScan().addAll(manifestsToScan);
+    return manifestExpirationChanges;
+  }
+
+  /**
+   * Compares the Snapshots from the two TableMetadata objects and identifies the snapshots
+   * still in use and those no longer in use
+   * @param current Metadata from a table after an expiration of snapshots
+   * @param original Metadata from the table before expiration of snapshots
+   * @return Information about which Snapshots have Expired and which are Still Valid
+   */
+  public static SnapshotExpirationChanges getExpiredSnapshots(TableMetadata current, TableMetadata original) {
+
+    Set<Snapshot> validSnapshots = Sets.newHashSet();
+    for (Snapshot snapshot : current.snapshots()) {
+      validSnapshots.add(snapshot);
+    }
+
+    Set<Snapshot> expiredSnapshots = Sets.newHashSet();
+    for (Snapshot snapshot : original.snapshots()) {
+      if (!validSnapshots.contains(snapshot)) {

Review comment:
       Seems like we don't implement `hashcode` and `equals` for `BaseSnapshot`.  Won't that be a problem here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-665260613






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r464131542



##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.io.UncheckedIOException;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.stream.Collectors;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager implements Serializable {
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  private final Set<Long> validSnapshotIds;
+  private final Map<Integer, PartitionSpec> specLookup;
+  private final FileIO io;
+
+  private ManifestExpirationManager(Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup, FileIO io) {
+    this.validSnapshotIds = validSnapshotIds;
+    this.specLookup = specLookup;
+    this.io = io;
+  }
+
+  /**
+   * Creates a Manifest Scanner for analyzing manifests for files which are no longer needed
+   * after expiration.
+   * @param current metadata for the table being expired
+   * @param io FileIO for the table, for Serializable usecases make sure this is also serializable for the framework

Review comment:
       I don't follow the javadoc on the param completely.
   > for Serializable usecases make sure this is also serializable for the framework
   
   Is that meant as a TODO?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1245: Refactor RemoveSnapshots for easier extension

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#issuecomment-677809128


   Sounds good to me. I'll close it. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org