You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/16 03:04:42 UTC

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2841: API: add an action API for rewrite deletes

jackye1995 commented on a change in pull request #2841:
URL: https://github.com/apache/iceberg/pull/2841#discussion_r709716318



##########
File path: core/src/main/java/org/apache/iceberg/actions/RewriteDeleteStrategy.java
##########
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.Table;
+
+public interface RewriteDeleteStrategy {
+
+  /**
+   * Returns the name of this rewrite deletes strategy
+   */
+  String name();
+
+  /**
+   * Returns the table being modified by this rewrite strategy
+   */
+  Table table();
+
+  /**
+   * Select the deletes to rewrite.
+   *
+   * @param dataFiles iterable of FileScanTasks for data files in a given partition
+   * @return iterable of original delete file to be replaced.
+   */
+  Iterable<DeleteFile> selectDeletesToRewrite(Iterable<FileScanTask> dataFiles);
+
+  /**
+   * Define how to rewrite the deletes.
+   *
+   * @param deleteFilesToRewrite a group of files to be rewritten together
+   * @return iterable of delete files used to replace the original delete files.
+   */
+  Set<DeleteFile> rewriteDeletes(Set<DeleteFile> deleteFilesToRewrite);

Review comment:
       When I was talking about that use case I was also in experimental phase, but so far what I see is that in practice the 2 most commonly used ones are still (1) equality -> position, (2) position -> data. Splitting equality deletes is most of the time not worth the computation effort because it's already doing a planning and it's more cost efficient to just rewrite it to position deletes. 
   
   Going back to the conversation we are having for the interface for `selectDeletesToRewrite`, it feels like we should have 2 different interfaces for rewriting equality and position deletes so that their interfaces can be a bit more specialized for their own use cases?

##########
File path: core/src/main/java/org/apache/iceberg/actions/RewriteDeleteStrategy.java
##########
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.Table;
+
+public interface RewriteDeleteStrategy {
+
+  /**
+   * Returns the name of this rewrite deletes strategy
+   */
+  String name();
+
+  /**
+   * Returns the table being modified by this rewrite strategy
+   */
+  Table table();
+
+  /**
+   * Select the deletes to rewrite.
+   *
+   * @param dataFiles iterable of FileScanTasks for data files in a given partition
+   * @return iterable of original delete file to be replaced.
+   */
+  Iterable<DeleteFile> selectDeletesToRewrite(Iterable<FileScanTask> dataFiles);

Review comment:
       I think as long as we follow a similar pattern it's fine, but it's more important to make the interface suitable for deletes. Here we can consider 2 cases:
   1. rewrite equality -> position: we actually need the reverse of the current information, which is a list of data files associated with the delete file.
   2. rewrite position -> data: the current interface using FileScanTask seems suitable because we need to generate something similar for rewrite anyway.
   
   I see a few different ways we can go with this:
   1. use this current interface, the "scan task" is just a holder of delete and data files organized in the form ready for a scan. equality -> position can then reorganize information.
   2. invent a new interface suitable for both use cases, something like a `Iterable<RewriteDeleteTask>`
   3. instead of having a single rewrite strategy for both, having one for each, so that we can have something customized for equality delete.

##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDeletes.java
##########
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Set;
+import org.apache.iceberg.DeleteFile;
+
+public interface RewriteDeletes extends SnapshotUpdate<RewriteDeletes, RewriteDeletes.Result> {
+
+  /**
+   * Set the implementation class name for rewrite strategy.
+   *
+   * @return this for method chaining
+   */
+  RewriteDeletes strategy(String strategyImpl);

Review comment:
       +1 for not having a custom impl here because if you think some strategy is valuable, it's likely also valuable for other users. Or maybe could you provide more context about what customization you would like to do in addition?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org