You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/29 07:54:10 UTC

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1245: Refactor RemoveSnapshots for easier extension

RussellSpitzer commented on a change in pull request #1245:
URL: https://github.com/apache/iceberg/pull/1245#discussion_r461189838



##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager {
+
+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  /**
+   * Scans a set of manifest files for deleted data files which do not refer to a valid snapshot and can be actually
+   * removed from the filesystem. This implementation will use a concurrent approach and will ignore any
+   * failed manifest reads.
+   *
+   * @param manifests        Manifests to scan, all of these files will be scanned
+   * @param validSnapshotIds Snapshot Ids currently considered valid in Table metadata
+   * @param specLookup       A mapping between partitionID and the spec describing that partition
+   * @param io               IO used for reading the files
+   * @return The set of all files that can be safely deleted
+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {

Review comment:
       This ends up being a bit more complicated than I anticipated if we want to make sure that "ManfiestExpirationChanges" and the Manager are both not only serializable but also splittable ... I think if I do a bit more refactoring I can make this happen but I believe this will make it harder/ more confusing in the future. 
   
   For example I'll need this class to not only generate the full list of manifests which need to be scanned but also have a method which only operates on some of the generated list.
   
   Also we end up loosing a bit of control about how we serialize the component parts, we would have always broadcast everything if we wrap the entire module up.
   
   I can go ahead and finish up the refactor but I would like to check with you before I devote more time to this.
   
   My code at the moment is a bit like
   
   ```
   //These are all serializable
     private final SnapshotExpirationChanges snapshotChanges; 
     private final ManifestExpirationChanges manifestExpirationChanges;
     private final Map<Integer, PartitionSpec> specLookup;
     private final FileIO io;
   
     public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){
       this.io = io;
       this.specLookup = current.specsById();
       this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original);
       this.manifestExpirationChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration(
           snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io)
       )
     //We can simplify the functions above and move them into this class
   
   // But then we need to be able to get manifestExpirationChanges because we want to parallelize information from them
     getManifestExpirationChanges()
   
    // And now our scan method has to take a list of manifests still or we have to pass some kind of partition/split info
     public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
   
     }
   ```
   
   Let me know if this is kind of what you were thinking about
   

##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager {
+
+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  /**
+   * Scans a set of manifest files for deleted data files which do not refer to a valid snapshot and can be actually
+   * removed from the filesystem. This implementation will use a concurrent approach and will ignore any
+   * failed manifest reads.
+   *
+   * @param manifests        Manifests to scan, all of these files will be scanned
+   * @param validSnapshotIds Snapshot Ids currently considered valid in Table metadata
+   * @param specLookup       A mapping between partitionID and the spec describing that partition
+   * @param io               IO used for reading the files
+   * @return The set of all files that can be safely deleted
+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {

Review comment:
       I pushed a commit with this as a work in progress. I tried to keep the class as stateless as possible but it will have some null handling behaviors to maintain old effects.

##########
File path: core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.commons.compress.utils.Sets;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class ManifestExpirationManager {
+
+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
+
+  /**
+   * Scans a set of manifest files for deleted data files which do not refer to a valid snapshot and can be actually
+   * removed from the filesystem. This implementation will use a concurrent approach and will ignore any
+   * failed manifest reads.
+   *
+   * @param manifests        Manifests to scan, all of these files will be scanned
+   * @param validSnapshotIds Snapshot Ids currently considered valid in Table metadata
+   * @param specLookup       A mapping between partitionID and the spec describing that partition
+   * @param io               IO used for reading the files
+   * @return The set of all files that can be safely deleted
+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {

Review comment:
       See https://github.com/apache/iceberg/pull/1264 For a demonstration of how we would use this in Spark

##########
File path: core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java
##########
@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.GenericManifestFile;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed
+  private ExpireSnapshotUtil() {}
+
+  /**
+   * Determines the manifest files which need to be inspected because they refer to data files which
+   * can be removed after a Snapshot Expiration.
+   *
+   * Our goal is to determine which manifest files we actually need to read through because they
+   * may refer to files which are no longer accessible from any valid snapshot and do not effect
+   * the current table.
+   *
+   * For this we need to look through
+   *   1. Snapshots which have not expired but contain manifests from expired snapshots
+   *   2. Snapshots which have expired and contain manifests referring to now orphaned files
+   *
+   * @param validIds              The Ids of the Snapshots which have not been expired
+   * @param expiredIds            The Ids of the Snapshots which have been expired
+   * @param current       The table metadata from after the snapshot expiration
+   * @param original      The table metadata from before the snapshot expiration
+   * @param io                    FileIO for reading manifest info
+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds,
+      Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) {

Review comment:
       Yes we could recompute that here, I think @aokolnychyi and I were discussing this before and we were trying to reduce the duplication of some calculations. But we could easily just have it recompute changes here. 
   
   The difficulty here, and in most of the code, is that the previous behavior was
   
   Find Snapshot Differences 
   Log Expired Snapshots
   If expiredSnapshots is not empty & deleteFiles:
      Figure out Ancestors // Only requires the current Snapshot
      Figure out CherryPicked Ancestors // requires Metadata
     
   
   In the old code each step of the way we build up some of the information used further down the pathway, sometimes
   the references are serializable and sometimes they aren't. Now we want something we can serialize and split, so because of that we have to be very careful about what things we keep in the class, and what things are only used during Init. 
   
    Ideally I wouldn't want to pass through the tablemetdata at all since it's a very heavy class and not serializable but there are a few points where we can't avoid it. For example in computing cherry picked ancestors we need to use a map that exists within the metadata to lookup snapshots from their Id's.  
   
   Let me see if I can refactor this function so that it doesn't use the metadata at all and only takes in the information from the "SnapshotChanges" discovered previously. I think the cherry picked ancestors is the only one that ends up being a bit hairy.
      




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org