You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/19 21:28:39 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #2255: API: Add v2 Actions API

aokolnychyi opened a new pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255


   This PR is the first step in reworking our actions. It introduces a set of interfaces in the API module that should be implemented by query engine integrations.
   
   **Note**: new interfaces don't end with `...Action` to avoid conflicts with existing classes. I will gradually move action by action to the new API while keeping the compatibility (old actions will wrap new ones). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790133773


   The problem with using Spark 3 `Identifier` is that Spark does not actually expose it anywhere except the catalog implementations, meaning we don't really have a good engine API to expose for our users.
   
   Using our `TableIdentifier` to refer to non-Iceberg tables does not seem ideal either but let me think about it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582164733



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       There is also an option of having engine-specific return types. However, that is going to complicate things further. It means we need engine-specific actions and to propagate `ThisT` everywhere in base actions. Plus, the return type may be substantially different from engine to engine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-786312470


   > Having options in the Action directly. That seems simpler than having ConfigurableAction.
   
   +1
   
   > Accepting Strings instead of Tables in all actions so that each action loads the table.
   
   I actually like that some accept a `Table`. I'm trying to think of a better way to structure the ones that take strings. I agree with the idea that it is easier to use that way, but the problem is that we need a way to parse the string into an identifier and to load a catalog. Maybe we should have a version that accepts a catalog and an identifier? I'm not sure what to do here.
   
   > Shall we make action results closeable?
   
   Let's skip it for now if we don't need it.
   
   > Using with prefix for some method names. I'd say let's not use it to be consistent.
   
   +1 for fewer words.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582184258



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Well, it is hard for me to predict how useful that would be for other engines. We don't really use it in Spark and maybe if we treat these results as summaries, that should be fine. In the worst case, we can cast action to an engine specific type and call a method that returns a custom object. For example, we have `expire` that returns `DataFrame` in Spark.

##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Well, it is hard for me to predict how useful that would be for other engines. We don't really use it in Spark and maybe if we treat these results as summaries, that should be fine.
   
   In the worst case, we can cast action to an engine specific type and call a method that returns a custom object. For example, we have `expire` that returns `DataFrame` in Spark.

##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Well, it is hard for me to predict how useful that would be for other engines. We don't really use it in Spark and maybe if we treat these results as summaries, that should be fine.
   
   In the worst case, we can cast action to an engine-specific type and call a method that returns a custom object. For example, we have `expire` that returns `DataFrame` in Spark.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r583259716



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface SnapshotUpdate<ThisT, R> extends Action<R> {
+  /**
+   * Set a summary property in the snapshot produced by this action.
+   *
+   * @param property a String property name
+   * @param value a String property value
+   * @return this for method chaining
+   */
+  ThisT set(String property, String value);

Review comment:
       Yeah, that's what I feel too. Then I'll keep the current way.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579493447



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       There will be configs specific to engines. For example, whether to use `toLocalIterator` in Spark or not. It does not make sense to expose such a method in this API as other engines may not support it.
   
   There are two approaches we can take:
   - Support a map of options in the API.
   - Have engine-specific interfaces that extend base action interfaces. For example, `ExpireSnapshotsSparkAction` will extend `ExpireSnapshotsAction`.
   
   I went with the first option as it seems simpler. I can reconsider depending on what the community thinks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790080538


   Well, maybe using just identifiers won't be sufficient as the catalog resolution can depend on the context. In case of procedures, we default the catalog to the procedure catalog, not the default catalog. However, we can solve that differently.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582197766



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {

Review comment:
       Well, we could also default it to something like `db.tbl_iceberg_snapshot`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580549229



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.

Review comment:
       nit: new snapshot




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579497809



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends

Review comment:
       This one will be reworked by @RussellSpitzer as he is working on a new proposal. So we may delay discussing this particular one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579494515



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       I am using `Iterable` vs some more specific implementations. Let me know what everybody thinks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790085690


   Here is an example of what I mean:
   
   ```
   public interface ActionsProvider<I> {
   
     default SnapshotTable<I> snapshotTable(I sourceTableIdent) {
       ...
     }
   
     default MigrateTable<I> migrateTable(I tableIdent) {
       ...
     }
   
     default RewriteManifests rewriteManifests(I tableIdent) {
       ...
     }
   
     default RewriteManifests rewriteManifests(Table table) {
       ...
     }
   
     ...
   }
   
   class SparkActions implements ActionsProvider<org.apache.spark.sql.connector.Identifier> {
   }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580400415



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       Your example should be a breaking change which is better highlighted by changing the API instead of hiding it behind an interface and depend on the documentation to describe the change.
   
   But that is just an extreme example and I can come up with cases for both approaches, so no clean preference on my side.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-786322311


   >  I'm trying to think of a better way to structure the ones that take strings. I agree with the idea that it is easier to use that way, but the problem is that we need a way to parse the string into an identifier and to load a catalog...
   
   That's the part I am not sure about. I considered using Iceberg's table identifier but it is in the `catalog` package, which makes it weird. Plus, one identifier will point to a non-Iceberg table. Apart from that, only implementations will parse identifiers and the syntax could (?) actually vary from engine to engine. It is unlikely we can make catalog loading and identifier parsing generic.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582312818



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       Resolving this. Thanks to everybody who participated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790067651


   Could everyone please let me know their thoughts? I'd like to finish with this PR and submit follow-ups.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579502392



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       I will deprecate the existing one once the implementation is ready.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580744578



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       > Cases like Iterable<ManifestFile> rewrittenManifests() can also benefit from using CloseableIterable. I feel if we want to do it for one, we should do it for all, because we cannot just assume one result might need to close resource, the other does not. If an engine does it for one action, it is likely that it will have this behavior for all actions.
   
   Given that it adds complexity, I feel like this should be considered on a case by case basis. For orphan files, it may make sense as we can stream the result but I am not sure about `rewrittenManifests`, for example. Most likely, we would need to materialize the list of manifests to rewrite at some point as we need to process all entries in them. Plus, holding a list of rewritten manifests in memory is really cheap even for large tables.
   
   > Why not just use Iterable, and let Result be the object to implement close()?
   
   I think this is a better idea than using `ClosebleIterable`. In this case, we probably should make all action results closable for consistency.
   
   What does everybody on this thread think?
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580464393



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       I think `CloseableIterable` may make sense. We use it mostly in `TableScan`, though. The only downside that it will complicate the usage a bit. Some of those iterables can be large so I'd want to avoid extra copies. Let me think this through.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790178251


   We should probably pick between strings and Iceberg identifiers. I lean towards strings as we sometimes refer to non-Iceberg tables but it is not a strong opinion. Even Spark uses strings in its `DataFrameWriter` APIs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582315361



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface SnapshotUpdate<ThisT, R> extends Action<R> {
+  /**
+   * Set a summary property in the snapshot produced by this action.
+   *
+   * @param property a String property name
+   * @param value a String property value
+   * @return this for method chaining
+   */
+  ThisT set(String property, String value);

Review comment:
       I've switched to the new approach.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580455189



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       I meant the following interface:
   
   ```
   interface Result {
     Iterable<String> orphanFileLocations();
   }
   ```
   
   Can safely evolve into (just an example, may not be the best idea):
   
   ```
   interface Result {
     Iterable<String> orphanFileLocations(); // reports the same set which is a union of locations reported by methods below
     Iterable<String> orphanDataFileLocations();
     Iterable<String> orphanManifestLocations();
     Iterable<String> orphanManifestListLocations();
   }
   ```
   
   If we always returned just `List<String>`, we would be constrained to evolve the result type even when it is not a breaking change.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790072177


   Actually, we may not need the catalog. We can simply use the identifier.
   
   ```
   public interface ActionsProvider<I> {
   
     default SnapshotTable snapshotTable(I sourceTableIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
     }
   
     default MigrateTable migrateTable(I tableIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
     }
   ```
   
   And then we can offer methods that accept `Table` as well as `Identifier` for other actions.
   
   ```
     default RewriteManifests rewriteManifests(I tableIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement rewriteManifests");
     }
   
     default RewriteManifests rewriteManifests(Table table) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement rewriteManifests");
     }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582165752



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {

Review comment:
       Yeah, I think a String is a reasonable choice here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r586791206



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   * <p>
+   * If not set, all files will be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.
+   * <p>
+   * If not set, defaults to false.
+   *
+   * @param caseSensitive caseSensitive
+   * @return this for method chaining
+   */
+  RewriteDataFiles caseSensitive(boolean caseSensitive);
+
+  /**
+   * Pass a PartitionSpec id to specify which PartitionSpec should be used in DataFile rewrite
+   * <p>
+   * If not set, defaults to the table's default spec ID.
+   *
+   * @param specId PartitionSpec id to rewrite
+   * @return this for method chaining
+   */
+  RewriteDataFiles outputSpecId(int specId);
+
+  /**
+   * Specify the target data file size in bytes.
+   * <p>
+   * If not set, defaults to the table's target file size.
+   *
+   * @param targetSizeInBytes size in bytes of rewrite data file
+   * @return this for method chaining
+   */
+  RewriteDataFiles targetSizeInBytes(long targetSizeInBytes);
+
+  /**
+   * Specify the number of "bins" considered when trying to pack the next file split into a task. Increasing this
+   * usually makes tasks a bit more even by considering more ways to pack file regions into a single task with extra
+   * planning cost.
+   * <p>
+   * This configuration can reorder the incoming file regions, to preserve order for lower/upper bounds in file
+   * metadata, user can use a lookback of 1.
+   *
+   * @param splitLookback number of "bins" considered when trying to pack the next file split into a task.
+   * @return this for method chaining
+   */
+  RewriteDataFiles splitLookback(int splitLookback);

Review comment:
       I think we may want to hide this in options during the redesign, @RussellSpitzer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790110682


   It seems rather complicated either way, although with a single Type parameter we could still pass a parameterized Pair (Catalog + Identifier) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582181081



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends
+    ConfigurableAction<RewriteDataFiles, RewriteDataFiles.Result>,
+    SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.
+   *
+   * @param caseSensitive caseSensitive
+   * @return this for method chaining
+   */
+  RewriteDataFiles caseSensitive(boolean caseSensitive);
+
+  /**
+   * Pass a PartitionSpec id to specify which PartitionSpec should be used in DataFile rewrite
+   *
+   * @param specId PartitionSpec id to rewrite
+   * @return this for method chaining
+   */
+  RewriteDataFiles outputSpecId(int specId);
+
+  /**
+   * Specify the target rewrite data file size in bytes

Review comment:
       Mention that the default is the table's target size.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790067044


   I think the only open point on this PR is whether we have any better ideas around snapshot/migrate methods. I think implementations like `SparkActions` can offer methods that accept Spark `Catalog` and `Identifier`. That's what we are going to use in our procedures. We may consider either exposing such methods alongside the methods that accept strings or we can introduce type parameters in `ActionProvider`.
   
   ```
   public interface ActionsProvider<C, I> {
   
     default SnapshotTable snapshotTable(C sourceCatalog, I sourceIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
     }
   
     default MigrateTable migrateTable(C catalog, I ident) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
     }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582172982



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       Do you mean `withOptions` to `Action` or engine-specific methods, @rdblue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582198152



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {

Review comment:
       Ok, let me resolve this one then.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-785403354


   I've addressed some of the comments. Still open questions:
   
   - Having `options` in the `Action` directly. That seems simpler than having `ConfigurableAction`.
   - Accepting Strings instead of Tables in all actions so that each action loads the table. My worry is that it will require a proper catalog config for resolution. I think most people will leverage SQL extensions for simple cases so it feels more flexible to pass the `Table` object. However, I agree it is not consistent with `snapshotTable` and `migrateTable`.
   - Shall we make action results closeable? Seems like that would really complicate the usage so maybe we can skip doing that unless needed by other query engines.
   - Using `with` prefix for some method names. I'd say let's not use it to be consistent.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579520231



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends

Review comment:
       Thanks, Sorry I haven't put this up yet but my goal is a more general action which allows many different types of underlying algorithms for compacting data files. Should have a pr up for discussion with some prototypes next week.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790080538


   Well, maybe using just identifiers won't be sufficient as the catalog resolution can depend on the context. In case of procedures, we default the catalog to the procedure catalog, not the default catalog.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582319075



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Predicate;
+import org.apache.iceberg.ManifestFile;
+
+/**
+ * An action that rewrites manifests.
+ */
+public interface RewriteManifests extends
+    ConfigurableAction<RewriteManifests, RewriteManifests.Result>,
+    SnapshotUpdate<RewriteManifests, RewriteManifests.Result> {
+
+  /**
+   * Rewrites manifests for a given spec id.

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-785267733


   Overall this looks like a good step. Thanks @aokolnychyi! I think we just need to fix a few minor things and then it's ready to go.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r583304649



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       I'll resolve this one as we agreed in another thread.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580616487



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       We don't have a use case where this would be evaluated lazily yet. I can make it `ClosableIterable` if we think other implementations may benefit from this but it will definitely make using this API harder.
   
   My motivation for changing the return type to `Iterable` is that it allows us to change the underlying representation. For example, we could migrate from `List` to `Set` in the internal implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580832534



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       @aokolnychyi I think we could do either by name, enum or even by interface eg `actionCatalog.action(RewriteDataFiles.class)`. Presumably the catalog would need to expose a list of actions it supports as well. Not sure what others think on this idea? If configurations in configurable actions are engine configurations we may be able to add that to the catalog interface rather than each action?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582170994



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.

Review comment:
       Single-threaded service or in the current thread?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582168019



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       I think we could move this to the default Action interface if we wanted to make it a bit simpler.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580524808



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Well, I think it makes sense to use `ClosableIterable` in this particular case. At the same time, other places in this PR could continue to use `Iterable` as it is unlikely they will benefit from lazy materialization.
   
   Let me know what you think, @rymurr @rdsr @RussellSpitzer.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582193184



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
+  }
+
+  /**
+   * Instantiates an action to remove orphan files.
+   */
+  default RemoveOrphanFiles removeOrphanFiles(Table table) {

Review comment:
       Will it require proper catalog config in the session?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582167290



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
+  }
+
+  /**
+   * Instantiates an action to remove orphan files.
+   */
+  default RemoveOrphanFiles removeOrphanFiles(Table table) {

Review comment:
       If we are passing names that will be parsed to the other methods, why not load the table in the action for these as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582176203



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Do we need this much complication? I don't think that we would necessarily want to make this lazily evaluated because we expect the set of orphan files to be small and a concrete type is easiest for a caller to handle. I have no problem with this being a `List`, but `Iterable` is also fine with me to be able to change it to a `Set` later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579525757



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       For example, `RemoveOrphanFiles` returns `List<String>` directly. It is impossible to change the return type without breaking the API. I think having a layer of abstraction would be safer. Who knows what will we come up with? What if we decide to report orphan metadata and data files separately? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580426650



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Lazy could make sense, [and some impls of Iterable could be lazy], but in this case, who owns responsibility releasing the underlying  resource?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582166731



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {

Review comment:
       It's too bad we can't have a required method. I'd much rather not rely on order of arguments here and pass the destination name separately:
   
   ```
   SparkActions.get().snapshotTable("a.b").as("a.c").execute();
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790941992


   Thanks everyone for reviewing, @RussellSpitzer @rdsr @rymurr @jackye1995 @pvary @rdblue!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582169995



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {

Review comment:
       I think I prefer it like this anyway. Any time we can get rid of extra words, we should.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582183318



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.

Review comment:
       I agree with this. It's more clear to say `new {@link Snapshot}` than `new table snapshot`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579495089



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {

Review comment:
       As I wrote in the PR description, we don't have `Action` in the name for backward compatibility.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580430175



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Closeable iterable then?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582316564



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {

Review comment:
       Removed for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790072177


   Actually, we may not need the catalog. We can simply use the identifier.
   
   ```
   public interface ActionsProvider<I> {
   
     default SnapshotTable snapshotTable(I sourceTableIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
     }
   
     default MigrateTable migrateTable(I tableIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
     }
   ```
   
   And then we can offer methods that accept `Table` as well as `Identifier` for other actions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580464393



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       I think `CloseableIterable` may make sense. We use it mostly in `TableScan`. The only downside that it will complicate the usage a bit. Some of those iterables can be large so I'd want to avoid extra copies. Let me think this through.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580810861



##########
File path: api/src/main/java/org/apache/iceberg/actions/MigrateTable.java
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action that migrates an existing table to Iceberg.
+ */
+public interface MigrateTable extends Action<MigrateTable.Result> {
+  /**
+   * Adds additional properties to the newly created Iceberg table. Any properties with
+   * the same key name will be overwritten.
+   *
+   * @param properties a map of properties to be included
+   * @return this for method chaining
+   */
+  MigrateTable withProperties(Map<String, String> properties);
+
+  /**
+   * Adds an additional property to the newly created Iceberg table. Any properties
+   * with the same key name will be overwritten.
+   *
+   * @param key   the key of the property to add
+   * @param value the value of the property to add
+   * @return this for method chaining
+   */
+  MigrateTable withProperty(String key, String value);

Review comment:
       Yeah that looks good to me!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790197733


   Let me know if there are any other comments I should address.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-786322311


   >  I'm trying to think of a better way to structure the ones that take strings. I agree with the idea that it is easier to use that way, but the problem is that we need a way to parse the string into an identifier and to load a catalog...
   
   That's the part I am not sure about. I considered using Iceberg's table identifier but is in the `catalog` package, which makes it weird. Plus, one identifier will point to a non-Iceberg table. Apart from that, only implementations will parse identifiers and the syntax could? actually vary from engine to engine. It is unlikely we can make catalog loading and identifier parsing generic.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582171744



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {

Review comment:
       I think your concern is valid. Supporting `snapshotTable("db.tbl").as("db.tbl_2")` isn't ideal as we don't pass the required param right away but I'll be okay throwing an exception if the destination name is not set by calling `as`.
   That will be a bit better than what we have now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790085690


   Here is an example of what I mean:
   
   ```
   public interface ActionsProvider<I> {
   
     default SnapshotTable<I> snapshotTable(I sourceTableIdent) {
       ...
     }
   
     default MigrateTable<I> migrateTable(I tableIdent) {
       ...
     }
   
     default RewriteManifests rewriteManifests(I tableIdent) {
       ...
     }
   
     default RewriteManifests rewriteManifests(Table table) {
       ...
     }
   
     ...
   }
   
   class SparkActions implements ActionsProvider<org.apache.spark.sql.connector.Identifier> {
   }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580496746



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       Could you elaborate a bit more on the idea, @rymurr? Would we load actions by name?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582316531



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface SnapshotUpdate<ThisT, R> extends Action<R> {
+  /**
+   * Set a summary property in the snapshot produced by this action.
+   *
+   * @param property a String property name
+   * @param value a String property value
+   * @return this for method chaining
+   */
+  ThisT set(String property, String value);

Review comment:
       I have no strong opnions about this although I do think we use the preposition-less(?) names with accessors usually rather than setters.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582310012



##########
File path: api/src/main/java/org/apache/iceberg/actions/Action.java
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface Action<ThisT, R> {

Review comment:
       This interface is still implemented by old actions that we will deprecate.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582316414



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {

Review comment:
       Ok, then let's keep it this way as we don't really have alternatives.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582318951



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends
+    ConfigurableAction<RewriteDataFiles, RewriteDataFiles.Result>,
+    SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.
+   *
+   * @param caseSensitive caseSensitive
+   * @return this for method chaining
+   */
+  RewriteDataFiles caseSensitive(boolean caseSensitive);
+
+  /**
+   * Pass a PartitionSpec id to specify which PartitionSpec should be used in DataFile rewrite
+   *
+   * @param specId PartitionSpec id to rewrite
+   * @return this for method chaining
+   */
+  RewriteDataFiles outputSpecId(int specId);
+
+  /**
+   * Specify the target rewrite data file size in bytes
+   *
+   * @param targetSize size in bytes of rewrite data file
+   * @return this for method chaining
+   */
+  RewriteDataFiles targetSizeInBytes(long targetSize);
+
+  /**
+   * Specify the number of "bins" considered when trying to pack the next file split into a task. Increasing this
+   * usually makes tasks a bit more even by considering more ways to pack file regions into a single task with extra
+   * planning cost.
+   * <p>
+   * This configuration can reorder the incoming file regions, to preserve order for lower/upper bounds in file
+   * metadata, user can use a lookback of 1.
+   *
+   * @param splitLookback number of "bins" considered when trying to pack the next file split into a task.
+   * @return this for method chaining
+   */
+  RewriteDataFiles splitLookback(int splitLookback);
+
+  /**
+   * Specify the minimum file size to count to pack into one "bin". If the read file size is smaller than this specified
+   * threshold, Iceberg will use this value to do count.
+   * <p>
+   * this configuration controls the number of files to compact for each task, small value would lead to a high
+   * compaction, the default value is 4MB.
+   *
+   * @param splitOpenFileCost minimum file size to count to pack into one "bin".
+   * @return this for method chaining
+   */
+  RewriteDataFiles splitOpenFileCost(long splitOpenFileCost);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns rewritten data files.
+     */
+    Iterable<DataFile> rewrittenDataFiles();
+    /**

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790090408


   @aokolnychyi Does that mean the caller needs to append the catalog to the identifier based on it's call context?
   
   Ie if i am running Migrate Table Procedure from my_catalog and I get database.table as the arg, then the procedure makes a new identifier my_catalog.database.table and passes it through?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579486917



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {

Review comment:
       One of the issues with the old API is that always assumed it operated on a single existing table. This method was offered as a static method there, which looked weird. In the new API, it is no longer a static method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582218961



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface SnapshotUpdate<ThisT, R> extends Action<R> {
+  /**
+   * Set a summary property in the snapshot produced by this action.
+   *
+   * @param property a String property name
+   * @param value a String property value
+   * @return this for method chaining
+   */
+  ThisT set(String property, String value);

Review comment:
       Good point, this has to be changed.
   
   I am actually no longer sure having `with` prefix is a good idea as we have it for some configs but not others. Shall we consider dropping `with` from names? Then we will have `snapshotProperty`, `tableProperties`, `options`.
   
   ```
   actions.rewriteManifests()
       .stagingLocation("custom-staging-location")
       .option("use-caching", "true")
       .snapshotProperty("label", "value")
       .execute();
   ```
   
   ```
   actions.removeOrphanFiles()
       .location("custom-location")
       .option("use-caching", "true")
       .execute();
   ```
   
   ```
   actions.snapshot("db.tbl")
       .as("db.tbl_2")
       .option("use-caching", "true")
       .location("custom-location")
       .tableProperties(props)
       .execute();
   ```
   
   ```
   actions.migrate("db.tbl")
       .option("use-caching", "true")
       .tableProperty("p1", "v1")
       .tableProperty("p2", "v2")
       .execute();
   ```
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579493447



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       There will be configs specific to engines. For example, whether to use `toLocalIterator` in Spark or not. It does not make sense to expose such a method in this API as other engines may not support it.
   
   There are two approaches we can take:
   - Support a map of options in the API and parse those options in implementations.
   - Have engine-specific interfaces that extend base action interfaces. For example, `ExpireSnapshotsSparkAction` will extend `ExpireSnapshotsAction`.
   
   I went with the first option as it seems simpler. I can reconsider depending on what the community thinks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580827986



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582176995



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends
+    ConfigurableAction<RewriteDataFiles, RewriteDataFiles.Result>,
+    SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.

Review comment:
       Should this mention the default?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579486390



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       I wanted to call it `Actions` but there is an existing class with that name in the same package :( 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582171744



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {

Review comment:
       I think your concern is valid.
   
   Supporting `snapshotTable("db.tbl").as("db.tbl_2")` isn't ideal as we don't pass the required param right away but I'll be okay throwing an exception if the destination name is not set by calling `as`. That will be a bit better than what we have now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582172982



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       Do you mean `withOptions` or engine-specific methods, @rdblue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580746055



##########
File path: api/src/main/java/org/apache/iceberg/actions/MigrateTable.java
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action that migrates an existing table to Iceberg.
+ */
+public interface MigrateTable extends Action<MigrateTable.Result> {
+  /**
+   * Adds additional properties to the newly created Iceberg table. Any properties with
+   * the same key name will be overwritten.
+   *
+   * @param properties a map of properties to be included
+   * @return this for method chaining
+   */
+  MigrateTable withProperties(Map<String, String> properties);
+
+  /**
+   * Adds an additional property to the newly created Iceberg table. Any properties
+   * with the same key name will be overwritten.
+   *
+   * @param key   the key of the property to add
+   * @param value the value of the property to add
+   * @return this for method chaining
+   */
+  MigrateTable withProperty(String key, String value);

Review comment:
       I think we actually have the overwrite behavior here as if there is a property with the same name, it will be replaced.  Maybe, we should improve the doc?
   
   ```
   Sets a table property in the newly created Iceberg table. Any properties with the same key will be overwritten.
   ```
   
   Is that any better, @jackye1995?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579494172



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       I think representing results as interfaces will allow us to evolve them easier.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579489875



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {

Review comment:
       I am debating whether we need a separate interface. There are actions that extend both `ConfigurableAction` and `SnapshotUpdate` (e.g. `RewriteDataFiles`).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580514160



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       Sounds like an intermediate consensus then. I'll keep it as is for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r587802074



##########
File path: api/src/main/java/org/apache/iceberg/actions/Action.java
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface Action<ThisT, R> {

Review comment:
       We intend to remove this eventually?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r587827121



##########
File path: api/src/main/java/org/apache/iceberg/actions/Action.java
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface Action<ThisT, R> {

Review comment:
       I am not sure. I think it is reasonable to have a common parent even for new actions. What are your thoughts?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582315919



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotTable.java
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action that creates an independent snapshot of an existing table.
+ */
+public interface SnapshotTable extends Action<SnapshotTable.Result> {
+  /**
+   * Sets the table location.
+   *
+   * @param location a table location
+   * @return this for method chaining
+   */
+  SnapshotTable withLocation(String location);
+
+  /**
+   * Adds additional properties to the newly created Iceberg table. Any properties with
+   * the same key name will be overwritten.
+   *
+   * @param properties a map of properties to be included
+   * @return this for method chaining
+   */
+  SnapshotTable withProperties(Map<String, String> properties);
+
+  /**
+   * Adds an additional property to the newly created Iceberg table. Any properties
+   * with the same key name will be overwritten.
+   *
+   * @param key   the key of the property to add
+   * @param value the value of the property to add
+   * @return this for method chaining
+   */
+  SnapshotTable withProperty(String key, String value);

Review comment:
       Replaced with `tableProperty`.

##########
File path: api/src/main/java/org/apache/iceberg/actions/MigrateTable.java
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action that migrates an existing table to Iceberg.
+ */
+public interface MigrateTable extends Action<MigrateTable.Result> {
+  /**
+   * Adds additional properties to the newly created Iceberg table. Any properties with
+   * the same key name will be overwritten.
+   *
+   * @param properties a map of properties to be included
+   * @return this for method chaining
+   */
+  MigrateTable withProperties(Map<String, String> properties);
+
+  /**
+   * Adds an additional property to the newly created Iceberg table. Any properties
+   * with the same key name will be overwritten.
+   *
+   * @param key   the key of the property to add
+   * @param value the value of the property to add
+   * @return this for method chaining
+   */
+  MigrateTable withProperty(String key, String value);

Review comment:
       Done.

##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582315479



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends
+    ConfigurableAction<RewriteDataFiles, RewriteDataFiles.Result>,
+    SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.
+   *
+   * @param caseSensitive caseSensitive
+   * @return this for method chaining
+   */
+  RewriteDataFiles caseSensitive(boolean caseSensitive);
+
+  /**
+   * Pass a PartitionSpec id to specify which PartitionSpec should be used in DataFile rewrite
+   *
+   * @param specId PartitionSpec id to rewrite
+   * @return this for method chaining
+   */
+  RewriteDataFiles outputSpecId(int specId);
+
+  /**
+   * Specify the target rewrite data file size in bytes

Review comment:
       Done.

##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-786322311


   >  I'm trying to think of a better way to structure the ones that take strings. I agree with the idea that it is easier to use that way, but the problem is that we need a way to parse the string into an identifier and to load a catalog...
   
   That's the part I am not sure about. I considered using Iceberg's table identifier but it is in the `catalog` package, which makes it weird. Plus, one identifier will point to a non-Iceberg table. Apart from that, only implementations will parse identifiers and the syntax could? actually vary from engine to engine. It is unlikely we can make catalog loading and identifier parsing generic.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582172187



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       I prefer this approach because it puts the context in the API rather than in documentation. To get the locations, it is clear that they are the `orphanFileLocations` rather than a list of strings (if you don't have Javadoc to see what the strings are).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790072177


   Actually, we may not need the catalog. We can simply use the identifier.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582195442



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       Let me resolve this one to simplify the review. Thanks, @rymurr @rdblue!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582162516



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       I agree with Anton here. If we were to load by name, we wouldn't get a specific interface to configure the action. We could get the interface if we loaded by passing in the interface, but then I don't see how we'd be much better off. The API would be awkward (you have to find and supply the action interface you want) and the benefit is avoiding a few methods here.
   
   I think what Anton has in the PR is a good step forward instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579516161



##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       I think this follows some of our discussion yesterday. I would prefer that any options that a user would not modify in 90% of use cases be an "option", I think engine specific things fall into that category. The example of `toLocalIterator` is a good one since it takes a much more advanced Spark user to understand the requirement so we should hide that in the public api exposed via any java IDE.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580253033



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Is it possible that the orphaned files could be generated lazily? Would a `Stream` be more appropriate?

##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       One thing that I would be concerned about here is this interface grows forever with every new action we add. I can imagine some actions are going to be engine (or a subgroup of engines) specific as well. I wonder if its possible to have a more dynamic way of registering actions (eg an action catalog of sorts) or some categories and an interface per category?

##########
File path: api/src/main/java/org/apache/iceberg/actions/ConfigurableAction.java
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table that accepts options that control the execution.
+ *
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface ConfigurableAction<ThisT, R> extends Action<R> {
+  /**
+   * Configures this action with an extra option.
+   *
+   * @param name an option name
+   * @param value an option value
+   * @return this for method chaining
+   */
+  ThisT withOption(String name, String value);

Review comment:
       I don't love an opque map here but I see the purpose.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580744578



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       > Cases like Iterable<ManifestFile> rewrittenManifests() can also benefit from using CloseableIterable. I feel if we want to do it for one, we should do it for all, because we cannot just assume one result might need to close resource, the other does not. If an engine does it for one action, it is likely that it will have this behavior for all actions.
   
   Given that it adds complexity, I feel like this should be considered on a case by case basis. For orphan files, it may make sense as we can stream the result but I am not sure about cases like `rewrittenManifests`. Most likely, we would need to materialize the list of manifests to rewrite at some point as we need to process all entries in them.
   
   > Why not just use Iterable, and let Result be the object to implement close()?
   
   I think this is a better idea than using `ClosebleIterable`. In this case, we probably should make all action results closable for consistency.
   
   What does everybody on this thread think?
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582183800



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface SnapshotUpdate<ThisT, R> extends Action<R> {
+  /**
+   * Set a summary property in the snapshot produced by this action.
+   *
+   * @param property a String property name
+   * @param value a String property value
+   * @return this for method chaining
+   */
+  ThisT set(String property, String value);

Review comment:
       In other places, we use `with` and descriptive name (option, property). Should this be `withSnapshotProperty`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580524808



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       Well, I think it makes sense to use `ClosableIterable` in this particular case. At the same time, other places in this PR could continue to use `Iterable`.
   
   Let me know what you think, @rymurr @rdsr @RussellSpitzer.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582155042



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       I am open to considering this idea but I am not sure it is going to help.
   
   It sounds like our `SparkProcedures` class. Loading by name/enum/class, will most likely give us a parent action interface, not a specific interface to call. Also, we will probably need to change how arguments are passed. And each time we add a new action, we will have to add a new constant to the enum, which is more or less equivalent to adding a new method here with a default implementation.
   
   I see this API kind of close to our `Table` API. That one also has a number of methods but it is pretty convenient. That said, I do get the concern of having a lot of methods in this API but we can address it differently. For example, we will support multiple compaction strategies. Instead of offering a separate action for each type, we can make the configuration flexible.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790096083


   > Ie if i am running Migrate Table Procedure from my_catalog and I get database.table as the arg, then the procedure makes a new identifier my_catalog.database.table and passes it through?
   
   Well, that would work but not very clean. We can keep Catalog and Identifier separate as well. The main idea is to add params to `ActionsProvider` and use query engine specific classes instead of strings. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-782385240


   It would be great to get as many eyes as possible on this one. Especially, from folks working on query engines other than Spark.
   
   @rdblue @RussellSpitzer @pvary @openinx @holdenk @shardulm94 @danielcweeks @rdsr


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579488391



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {

Review comment:
       I don't really like using Strings as identifiers as it requires implementations to parse it but I am not sure other options are better. I considered using `TableIdentifier` in Iceberg defines but it is in the catalog module so it will be a bit weird to use it here. Also, parsing seems inevitable as there may be engine-specific parsing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r586791206



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   * <p>
+   * If not set, all files will be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.
+   * <p>
+   * If not set, defaults to false.
+   *
+   * @param caseSensitive caseSensitive
+   * @return this for method chaining
+   */
+  RewriteDataFiles caseSensitive(boolean caseSensitive);
+
+  /**
+   * Pass a PartitionSpec id to specify which PartitionSpec should be used in DataFile rewrite
+   * <p>
+   * If not set, defaults to the table's default spec ID.
+   *
+   * @param specId PartitionSpec id to rewrite
+   * @return this for method chaining
+   */
+  RewriteDataFiles outputSpecId(int specId);
+
+  /**
+   * Specify the target data file size in bytes.
+   * <p>
+   * If not set, defaults to the table's target file size.
+   *
+   * @param targetSizeInBytes size in bytes of rewrite data file
+   * @return this for method chaining
+   */
+  RewriteDataFiles targetSizeInBytes(long targetSizeInBytes);
+
+  /**
+   * Specify the number of "bins" considered when trying to pack the next file split into a task. Increasing this
+   * usually makes tasks a bit more even by considering more ways to pack file regions into a single task with extra
+   * planning cost.
+   * <p>
+   * This configuration can reorder the incoming file regions, to preserve order for lower/upper bounds in file
+   * metadata, user can use a lookback of 1.
+   *
+   * @param splitLookback number of "bins" considered when trying to pack the next file split into a task.
+   * @return this for method chaining
+   */
+  RewriteDataFiles splitLookback(int splitLookback);

Review comment:
       I think we may want to hide this in options during the redesign, @RussellSpitzer. Same for the one below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790133773


   The problem with using Spark 3 `Identifier` is that Spark does not actually expose it anywhere except the catalog implementations, meaning we don't really have a good engine API to expose to our users.
   
   Using our `TableIdentifier` to refer to non-Iceberg tables does not seem ideal either but let me think about it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790067044


   I think the only open point on this PR is whether we have any better ideas around snapshot/migrate methods. I think implementations like `SparkActions` can offer methods that accept Spark `Catalog` and `Identifier`. That's what we are going to use in our procedures.
   
   We may consider either exposing such methods alongside the methods that accept strings or we can introduce type parameters in `ActionProvider`.
   
   ```
   public interface ActionsProvider<C, I> {
   
     default SnapshotTable snapshotTable(C sourceCatalog, I sourceIdent) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
     }
   
     default MigrateTable migrateTable(C catalog, I ident) {
       throw new UnsupportedOperationException(this.getClass().getName() + " does not implement migrateTable");
     }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-782385240


   It would be great to get as many eyes as possible on this one. Especially, from folks working on query engines other than Spark.
   
   cc @rdblue @RussellSpitzer @pvary @openinx @holdenk @shardulm94 @danielcweeks @rdsr @jackye1995 @rymurr @karuppayya @JingsongLi


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579519418



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends ConfigurableAction<ExpireSnapshots, ExpireSnapshots.Result> {
+  /**
+   * Expires a specific {@link Snapshot} identified by id.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireSnapshotId(long)}
+   *
+   * @param snapshotId id of the snapshot to expire
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireSnapshotId(long snapshotId);
+
+  /**
+   * Expires all snapshots older than the given timestamp.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#expireOlderThan(long)}
+   *
+   * @param timestampMillis a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  ExpireSnapshots expireOlderThan(long timestampMillis);
+
+  /**
+   * Retains the most recent ancestors of the current snapshot.
+   * <p>
+   * If a snapshot would be expired because it is older than the expiration timestamp, but is one of
+   * the {@code numSnapshots} most recent ancestors of the current state, it will be retained. This
+   * will not cause snapshots explicitly identified by id from expiring.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#retainLast(int)}
+   *
+   * @param numSnapshots the number of snapshots to retain
+   * @return this for method chaining
+   */
+  ExpireSnapshots retainLast(int numSnapshots);
+
+  /**
+   * Passes an alternative delete implementation that will be used for manifests and data files.
+   * <p>
+   * Manifest files that are no longer used by valid snapshots will be deleted. Data files that were
+   * deleted by snapshots that are expired will be deleted.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#deleteWith(Consumer)}
+   *
+   * @param deleteFunc a function that will be called to delete manifests and data files
+   * @return this for method chaining
+   */
+  ExpireSnapshots deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * Passes an alternative executor service that will be used for manifests and data files deletion.
+   * <p>
+   * If this method is not called, unnecessary manifests and data files will still be deleted using
+   * a single threaded executor service.
+   * <p>
+   * Identical to {@link org.apache.iceberg.ExpireSnapshots#executeDeleteWith(ExecutorService)}
+   *
+   * @param executorService the service to use
+   * @return this for method chaining
+   */
+  ExpireSnapshots executeDeleteWith(ExecutorService executorService);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {

Review comment:
       Is this a concern? I'm not sure we have had this issue yet. I don't think there is a real negative to doing this I just don't have a lot of imaginative thinking here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582168605



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {

Review comment:
       agreed :-) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582311404



##########
File path: api/src/main/java/org/apache/iceberg/actions/ExpireSnapshots.java
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.concurrent.ExecutorService;
+import java.util.function.Consumer;
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that expires snapshots in a table.
+ * <p>
+ * Similar to {@link org.apache.iceberg.ExpireSnapshots} but may use a query engine to distribute
+ * parts of the work.
+ */
+public interface ExpireSnapshots extends Action<ExpireSnapshots, ExpireSnapshots.Result> {

Review comment:
       I don't have `<ThisT>` in `ExpireSnapshots` as we don't expect engine-specific interfaces.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582318648



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends
+    ConfigurableAction<RewriteDataFiles, RewriteDataFiles.Result>,
+    SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580688034



##########
File path: api/src/main/java/org/apache/iceberg/actions/MigrateTable.java
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action that migrates an existing table to Iceberg.
+ */
+public interface MigrateTable extends Action<MigrateTable.Result> {
+  /**
+   * Adds additional properties to the newly created Iceberg table. Any properties with
+   * the same key name will be overwritten.
+   *
+   * @param properties a map of properties to be included
+   * @return this for method chaining
+   */
+  MigrateTable withProperties(Map<String, String> properties);
+
+  /**
+   * Adds an additional property to the newly created Iceberg table. Any properties
+   * with the same key name will be overwritten.
+   *
+   * @param key   the key of the property to add
+   * @param value the value of the property to add
+   * @return this for method chaining
+   */
+  MigrateTable withProperty(String key, String value);

Review comment:
       nit: based on your description, I think this method should be named `addProperty`, `withXXX` usually indicates overwriting the value.

##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       > Well, I think it makes sense to use `ClosableIterable` in this particular case. At the same time, other places in this PR could continue to use `Iterable` as it is unlikely they will benefit from lazy materialization.
   
   Why? Cases like `Iterable<ManifestFile> rewrittenManifests()` can also benefit from using `CloseableIterable`. I feel if we want to do it for one, we should do it for all, because we cannot just assume one result might need to close resource, the other does not. If an engine does it for one action, it is likely that it will have this behavior for all actions.
   
   With that being said, I agree that using `CloseableIterable` would make the interface harder to use. Why not just use `Iterable`, and let `Result` be the object to implement `close()`? If one engine has a closable iterable, it can be closed there.

##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotTable.java
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action that creates an independent snapshot of an existing table.
+ */
+public interface SnapshotTable extends Action<SnapshotTable.Result> {
+  /**
+   * Sets the table location.
+   *
+   * @param location a table location
+   * @return this for method chaining
+   */
+  SnapshotTable withLocation(String location);
+
+  /**
+   * Adds additional properties to the newly created Iceberg table. Any properties with
+   * the same key name will be overwritten.
+   *
+   * @param properties a map of properties to be included
+   * @return this for method chaining
+   */
+  SnapshotTable withProperties(Map<String, String> properties);
+
+  /**
+   * Adds an additional property to the newly created Iceberg table. Any properties
+   * with the same key name will be overwritten.
+   *
+   * @param key   the key of the property to add
+   * @param value the value of the property to add
+   * @return this for method chaining
+   */
+  SnapshotTable withProperty(String key, String value);

Review comment:
       nit: same comment as before, can be `addProperty`

##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {

Review comment:
       I am not sure why `TableIdentifier` is in the catalog module, but here I agree that it seems the most flexible to use a string. For Spark it in the end uses `spark.sessionState().sqlParser().parseMultipartIdentifier(string)` anyway.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r587948085



##########
File path: api/src/main/java/org/apache/iceberg/actions/Action.java
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+
+/**
+ * An action performed on a table.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface Action<ThisT, R> {

Review comment:
       Yeah, I think it's reasonable to have a common parent. Your comment made me think you planned to remove it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdsr commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r580548924



##########
File path: api/src/main/java/org/apache/iceberg/actions/RemoveOrphanFiles.java
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Consumer;
+
+/**
+ * An action that removes orphan files in a table.
+ * <p>
+ * A metadata or data file is considered orphan if it is not reachable by any valid snapshot.
+ * The set of actual files is built by listing the underlying storage which makes this operation
+ * expensive.
+ */
+public interface RemoveOrphanFiles extends Action<RemoveOrphanFiles.Result> {
+  /**
+   * Passes a location which should be scanned for orphan files.
+   * <p>
+   * If not set, the root table location will be scanned potentially removing both orphan data and
+   * metadata files.
+   *
+   * @param location the location where to look for orphan files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles location(String location);
+
+  /**
+   * Removes orphan files only if they are older than the given timestamp.
+   * <p>
+   * This is a safety measure to avoid removing files that are being added to the table.
+   * For example, there may be a concurrent operation adding new files while the action searches
+   * for orphan files. New files may not be referenced by the metadata yet but they are not orphan.
+   * <p>
+   * If not set, defaults to a timestamp 3 days ago.
+   *
+   * @param olderThanTimestamp a long timestamp, as returned by {@link System#currentTimeMillis()}
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles olderThan(long olderThanTimestamp);
+
+  /**
+   * Passes an alternative delete implementation that will be used for orphan files.
+   * <p>
+   * This method allows to customize the delete func. For example, one may set a custom delete func
+   * and collect all orphan files into a set instead of physically removing them.
+   * <p>
+   * If not called, uses the table's {@link org.apache.iceberg.io.FileIO io} implementation.
+   *
+   * @param deleteFunc a function that will be called to delete files
+   * @return this for method chaining
+   */
+  RemoveOrphanFiles deleteWith(Consumer<String> deleteFunc);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns locations of orphan files.
+     */
+    Iterable<String> orphanFileLocations();

Review comment:
       I was looking at the V1 Action impl for RemoveOrphanFiles. Seems like the orphan file list is computed upfornt. @aokolnychyi do you see the list being lazy in this particular case? My argument was mostly for lazy streams (or lazy impl of `Iterables`). If all our results are computed eagerly then returning a List is fine as that would remove the question of resource cleanup and ownership. I'm fine with `Iterables` also]




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582181411



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.expressions.Expression;
+
+/**
+ * An action that rewrites data files.
+ */
+public interface RewriteDataFiles extends
+    ConfigurableAction<RewriteDataFiles, RewriteDataFiles.Result>,
+    SnapshotUpdate<RewriteDataFiles, RewriteDataFiles.Result> {
+
+  /**
+   * Pass a row filter to filter {@link DataFile}s to be rewritten.
+   * <p>
+   * Note that all files that may contain data matching the filter may be rewritten.
+   *
+   * @param expr a row filter to filter out data files
+   * @return this for method chaining
+   */
+  RewriteDataFiles filter(Expression expr);
+
+  /**
+   * Enables or disables case sensitive expression binding.
+   *
+   * @param caseSensitive caseSensitive
+   * @return this for method chaining
+   */
+  RewriteDataFiles caseSensitive(boolean caseSensitive);
+
+  /**
+   * Pass a PartitionSpec id to specify which PartitionSpec should be used in DataFile rewrite
+   *
+   * @param specId PartitionSpec id to rewrite
+   * @return this for method chaining
+   */
+  RewriteDataFiles outputSpecId(int specId);
+
+  /**
+   * Specify the target rewrite data file size in bytes
+   *
+   * @param targetSize size in bytes of rewrite data file
+   * @return this for method chaining
+   */
+  RewriteDataFiles targetSizeInBytes(long targetSize);
+
+  /**
+   * Specify the number of "bins" considered when trying to pack the next file split into a task. Increasing this
+   * usually makes tasks a bit more even by considering more ways to pack file regions into a single task with extra
+   * planning cost.
+   * <p>
+   * This configuration can reorder the incoming file regions, to preserve order for lower/upper bounds in file
+   * metadata, user can use a lookback of 1.
+   *
+   * @param splitLookback number of "bins" considered when trying to pack the next file split into a task.
+   * @return this for method chaining
+   */
+  RewriteDataFiles splitLookback(int splitLookback);
+
+  /**
+   * Specify the minimum file size to count to pack into one "bin". If the read file size is smaller than this specified
+   * threshold, Iceberg will use this value to do count.
+   * <p>
+   * this configuration controls the number of files to compact for each task, small value would lead to a high
+   * compaction, the default value is 4MB.
+   *
+   * @param splitOpenFileCost minimum file size to count to pack into one "bin".
+   * @return this for method chaining
+   */
+  RewriteDataFiles splitOpenFileCost(long splitOpenFileCost);
+
+  /**
+   * The action result that contains a summary of the execution.
+   */
+  interface Result {
+    /**
+     * Returns rewritten data files.
+     */
+    Iterable<DataFile> rewrittenDataFiles();
+    /**

Review comment:
       Nit: missing newline between methods.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r579488391



##########
File path: api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Table;
+
+/**
+ * An API that should be implemented by query engine integrations for providing actions.
+ */
+public interface ActionsProvider {
+
+  /**
+   * Instantiates an action to snapshot an existing table.
+   */
+  default SnapshotTable snapshotTable(String sourceTableIdent, String destTableIdent) {
+    throw new UnsupportedOperationException(this.getClass().getName() + " does not implement snapshotTable");
+  }
+
+  /**
+   * Instantiates an action to migrate an existing table to Iceberg.
+   */
+  default MigrateTable migrateTable(String tableIdent) {

Review comment:
       I don't really like using Strings as identifiers as it requires implementations to parse it but I am not sure other options are better. I considered using `TableIdentifier` Iceberg defines but it is in the catalog module so it will be a bit weird to use it here. Also, parsing seems inevitable as there may be engine-specific parsing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790178251


   We should probably pick between strings and Iceberg identifiers. I lean towards strings as we sometimes refer to non-Iceberg tables but it is not a strong opinion.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790180247


   Let's use Strings for now. That's what we already do in the actions and we didn't have a problem with it before.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582181755



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.function.Predicate;
+import org.apache.iceberg.ManifestFile;
+
+/**
+ * An action that rewrites manifests.
+ */
+public interface RewriteManifests extends
+    ConfigurableAction<RewriteManifests, RewriteManifests.Result>,
+    SnapshotUpdate<RewriteManifests, RewriteManifests.Result> {
+
+  /**
+   * Rewrites manifests for a given spec id.

Review comment:
       I think this should mention the default value, which is the table's default spec ID.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#discussion_r582219639



##########
File path: api/src/main/java/org/apache/iceberg/actions/SnapshotUpdate.java
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import org.apache.iceberg.Snapshot;
+
+/**
+ * An action that produces snapshots. This interface contains common methods for all
+ * actions that create a new table {@link Snapshot}.
+ *
+ * @param <ThisT> the child Java API class, returned by method chaining
+ * @param <R> the Java type of the result produced by this action
+ */
+public interface SnapshotUpdate<ThisT, R> extends Action<R> {
+  /**
+   * Set a summary property in the snapshot produced by this action.
+   *
+   * @param property a String property name
+   * @param value a String property value
+   * @return this for method chaining
+   */
+  ThisT set(String property, String value);

Review comment:
       cc @RussellSpitzer who developed the original snapshot/migrate API. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2255: API: Add v2 Actions API

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2255:
URL: https://github.com/apache/iceberg/pull/2255#issuecomment-790918079


   Looks good to me!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org