You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/10 22:38:42 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #2314: Spark: Refactor action for expiring snapshots

aokolnychyi opened a new pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314


   This PR refactors the action for expiring snapshots and moves it to the new API.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591984524



##########
File path: spark/src/main/java/org/apache/iceberg/actions/Actions.java
##########
@@ -79,7 +80,8 @@ public RewriteDataFilesAction rewriteDataFiles() {
   }
 
   public ExpireSnapshotsAction expireSnapshots() {
-    return new ExpireSnapshotsAction(spark, table);
+    BaseExpireSnapshotsSparkAction delegate = new BaseExpireSnapshotsSparkAction(spark, table);
+    return new ExpireSnapshotsAction(delegate);

Review comment:
       Oh, yeah, we will mark `Actions` deprecated once we have a new alternative. We will have a totally new entry point called `SparkActions` in another package. Right now, I am migrating action by action to reduce the amount of changes in a single pr. Once all actions are done, I'll create a new public API for users to use and deprecate the rest.
   
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592779843



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       I agree with @aokolnychyi 's reasoning here as I had to do a lot of those static methods as some of our actions could descend from the base class and other couldn't, I'd prefer each frame work be allowed to have it's own base class while shared logic belongs to a class in the "core module" and either is accessed through composition or pure functions




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592775947



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

Review comment:
       Here, I followed the pattern we have in many classes in `core` like `BaseFile`, `BaseTable`, `BaseRewriteManifests`. Also, there is a class called `ExpireSnapshotsActionResult` in this package in Spark already. We deprecate it with the introduction of this new one.
   
   I agree query engines will most likely use the same result classes (unless they are doing something really specific).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592775947



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

Review comment:
       Here, I followed the naming we have in `core` like `BaseFile`, `BaseTable`, `BaseRewriteManifests`.
   
   Also, there is a class called `ExpireSnapshotsActionResult` in this package in Spark already. We deprecate it with the introduction of this new one.
   
   I agree query engines will most likely use the same result classes (unless they are doing something really specific).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592938662



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       In my local refactoring, I've put common logic for the current compaction code in a utility class and that seems to work pretty well and gives us freedom in terms of designing the action hierarchy for query engines.
   
   Let me know what you think, @openinx @rymurr.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592311927



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       Yeah, seems like a lot of this is not spark specific. Coudl we have a more generic `BaseExpireSnapshotsAction` that this class extends?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592773628



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {
+
+  private final long deletedDataFilesCount;

Review comment:
       Yep, you are correct. I keep the scope of this PR to the existing logic. 
   
   @RussellSpitzer is also working on redesigning the rewrite data files action to also support sort-based compactions and per partition compactions. Shall 4 of us meet next week to discuss this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591969881



##########
File path: spark/src/main/java/org/apache/iceberg/actions/Actions.java
##########
@@ -79,7 +80,8 @@ public RewriteDataFilesAction rewriteDataFiles() {
   }
 
   public ExpireSnapshotsAction expireSnapshots() {
-    return new ExpireSnapshotsAction(spark, table);
+    BaseExpireSnapshotsSparkAction delegate = new BaseExpireSnapshotsSparkAction(spark, table);
+    return new ExpireSnapshotsAction(delegate);

Review comment:
       Not sure about the Iceberg deprecating process. Here is my understanding, we need to create a new method with new return type for this user-facing API, which can be used by user moving forward. Meanwhile we deprecate this method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595636768



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

Review comment:
       I am fine converting `Result` into a class but I am not sure we will gain much by doing that. The current class is in the core module so it is accessible to everyone. Making `Result` a class may make inheritance a bit harder. I do like our current pattern of having `BaseXXX` classes in core and interfaces in the API module.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#issuecomment-800794216


   Thanks everyone for reviewing! There is one open question about converting `Result` interface into a class. It is not directly related to this PR so I merged it to unblock subsequent changes. Let's continue to discuss [here](https://github.com/apache/iceberg/pull/2314#discussion_r592111677).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591985598



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       Temporarily yes since we are using it in another package. Once we get rid of the old actions, we should be able to move this and make it non-public.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595600844



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       If this is public but will be removed, we should mark it deprecated so people know not to rely on it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591957092



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       Do we need it to be public?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591985598



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       Temporarily yes since we are using it in another package. Once we get rid of the old actions, we should be able to make this non-public.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591957092



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       Do we need it to be public? Looks like there is no scope change needed in this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592772420



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       I am not sure it is a good idea to build a common hierarchy for different query engines. We tried that in the past but it led to a really weird situation with our Spark actions now. For example, we can no longer use our BaseSparkAction in some cases due to single class inheritance in Java. So some Spark actions extend that some don't. That forced us to create more static methods in places where we don't need them. There will be more problems like this, I guess.
   
   I agree about sharing code wherever possible, though. I'd prefer to do that using utility classes instead of building a common hierarchy. Refactoring common code into utility classes is beyond this PR. I tried to basically create a new action while keeping the backward compatibility. Also, designing utility classes seems easier when we know what parts Flink can reuse. This action heavily depends on Spark `Row` and metadata tables, for example.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595601245



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -56,44 +36,15 @@
  * require a shuffle so parallelism can be controlled through spark.sql.shuffle.partitions. The expiration is done
  * locally using a direct call to RemoveSnapshots. The snapshot expiration will be fully committed before any deletes
  * are issued. Deletes are still performed locally after retrieving the results from the Spark executors.
+ *
+ * @deprecated since 0.12.0

Review comment:
       We should also tell the reader what to use instead and when to expect this to be removed. That also helps us find things that should be removed for a release by grepping for "will be removed in 0.13.0".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592113447



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {
+
+  private final long deletedDataFilesCount;

Review comment:
       Seems we don't consider the format v2 in this patch, right ?  That's OK if true because we could support v2 in a separate issue.  For the RewriteDataFilesAction, now @chenjunjiedada and I have prepared few PRs to support format v2's Rewrite actions.  
   
   1. https://github.com/apache/iceberg/pull/2303 
   2. https://github.com/apache/iceberg/pull/2216

##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

Review comment:
       I'm grateful that the  ExpireSnapshotsAction refactoring work has been considered to work for different compute engines, such as  flink.  In this way,  we flink don't have to abstract the common logic of actions between flink and spark and we could just extend those BaseXXAction to implement our flink own actions.  That's really important ! 
   
   About the `ExpireSnapshotsAction`,  I think both flink & spark will have the same `Result`.  So maybe we could just use the `ExpireSnapshotsActionResult` directly ( I don't think there will be a reason that flink or spark will extend this `BaseExpireSnapshotsActionResult`, so maybe we could just remove the `Base` prefix).

##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       OK,  seems we will still need to abstract the expireSnapshotAction core logic for flink.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595654742



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -56,44 +36,15 @@
  * require a shuffle so parallelism can be controlled through spark.sql.shuffle.partitions. The expiration is done
  * locally using a direct call to RemoveSnapshots. The snapshot expiration will be fully committed before any deletes
  * are issued. Deletes are still performed locally after retrieving the results from the Spark executors.
+ *
+ * @deprecated since 0.12.0

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592775947



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

Review comment:
       Here, I followed the pattern we have in many classes in `core` like `BaseFile`, `BaseTable`, `BaseRewriteManifests`. Also, there is a class called `ExpireSnapshotsActionResult` in this package in Spark already. We deprecate it with the introduction of this new one.
   
   I agree query engines will most likely use the same result classes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591941006



##########
File path: spark/src/main/java/org/apache/iceberg/actions/Actions.java
##########
@@ -79,7 +80,8 @@ public RewriteDataFilesAction rewriteDataFiles() {
   }
 
   public ExpireSnapshotsAction expireSnapshots() {
-    return new ExpireSnapshotsAction(spark, table);
+    BaseExpireSnapshotsSparkAction delegate = new BaseExpireSnapshotsSparkAction(spark, table);
+    return new ExpireSnapshotsAction(delegate);

Review comment:
       Why do we use a deprecated class(ExpireSnapshotsAction) here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595633624



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       This class will be used even after refactoring. It may be moved but to another package, though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591957092



##########
File path: spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java
##########
@@ -46,10 +48,11 @@
 
 import static org.apache.iceberg.MetadataTableType.ALL_MANIFESTS;
 
-abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {
+public abstract class BaseSparkAction<ThisT, R> implements Action<ThisT, R> {

Review comment:
       Do we need it to be public? Looks like there is scope change needed in this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591986153



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;

Review comment:
       The package here is important. I am going to add other actions here too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r593048843



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       I like this idea a lot more. On reflection I can definitely see sharing a base class across engines could be an issue. Sharing through composition and util classes makes a lot of sense to me




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591941006



##########
File path: spark/src/main/java/org/apache/iceberg/actions/Actions.java
##########
@@ -79,7 +80,8 @@ public RewriteDataFilesAction rewriteDataFiles() {
   }
 
   public ExpireSnapshotsAction expireSnapshots() {
-    return new ExpireSnapshotsAction(spark, table);
+    BaseExpireSnapshotsSparkAction delegate = new BaseExpireSnapshotsSparkAction(spark, table);
+    return new ExpireSnapshotsAction(delegate);

Review comment:
       Are we still use ExpireSnapshotsAction since it is deprecated in this PR?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#issuecomment-796249577


   @RussellSpitzer @rdblue @rymurr @shardulm94 @karuppayya @flyrain, could you please review?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595600204



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {

Review comment:
       What about making `ExpireSnapshots.Result` a class instead of an interface? Then we would always have the same default implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r592777262



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseExpireSnapshotsActionResult.java
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+public class BaseExpireSnapshotsActionResult implements ExpireSnapshots.Result {
+
+  private final long deletedDataFilesCount;

Review comment:
       It would be easy to extend the interface and this implementation to include other counts once we support them.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591953966



##########
File path: spark/src/main/java/org/apache/iceberg/actions/Actions.java
##########
@@ -79,7 +80,8 @@ public RewriteDataFilesAction rewriteDataFiles() {
   }
 
   public ExpireSnapshotsAction expireSnapshots() {
-    return new ExpireSnapshotsAction(spark, table);
+    BaseExpireSnapshotsSparkAction delegate = new BaseExpireSnapshotsSparkAction(spark, table);
+    return new ExpireSnapshotsAction(delegate);

Review comment:
       This is a user-facing API so we cannot break it without deprecating first.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r591969881



##########
File path: spark/src/main/java/org/apache/iceberg/actions/Actions.java
##########
@@ -79,7 +80,8 @@ public RewriteDataFilesAction rewriteDataFiles() {
   }
 
   public ExpireSnapshotsAction expireSnapshots() {
-    return new ExpireSnapshotsAction(spark, table);
+    BaseExpireSnapshotsSparkAction delegate = new BaseExpireSnapshotsSparkAction(spark, table);
+    return new ExpireSnapshotsAction(delegate);

Review comment:
       Not sure about the Iceberg deprecating process. Here is my understanding, we need to create a new method with new return type for this user-facing API, which can be used by user moving forward. Meanwhile we mark this method deprecated so that user won't use it anymore.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#discussion_r595602499



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseExpireSnapshotsSparkAction.java
##########
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.actions;
+
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.function.Consumer;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.actions.BaseExpireSnapshotsActionResult;
+import org.apache.iceberg.actions.BaseSparkAction;
+import org.apache.iceberg.actions.ExpireSnapshots;
+import org.apache.iceberg.exceptions.NotFoundException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.iceberg.util.Tasks;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+import static org.apache.iceberg.TableProperties.GC_ENABLED_DEFAULT;
+
+/**
+ * An action that performs the same operation as {@link org.apache.iceberg.ExpireSnapshots} but uses Spark
+ * to determine the delta in files between the pre and post-expiration table metadata. All of the same
+ * restrictions of {@link org.apache.iceberg.ExpireSnapshots} also apply to this action.
+ * <p>
+ * This action first leverages {@link org.apache.iceberg.ExpireSnapshots} to expire snapshots and then
+ * uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets
+ * that contain all manifest and data files before and after the expiration. The snapshot expiration
+ * will be fully committed before any deletes are issued.
+ * <p>
+ * This operation performs a shuffle so the parallelism can be controlled through 'spark.sql.shuffle.partitions'.
+ * <p>
+ * Deletes are still performed locally after retrieving the results from the Spark executors.
+ */
+@SuppressWarnings("UnnecessaryAnonymousClass")
+public class BaseExpireSnapshotsSparkAction

Review comment:
       +1 for utility classes rather than using the class hierarchy.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2314: Spark: Refactor action for expiring snapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2314:
URL: https://github.com/apache/iceberg/pull/2314#issuecomment-800674083


   Overall, this looks good to me. It might be a good idea to make the `Result` a class rather than an interface to avoid multiple implementations, but I think that's minor.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org