You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wecharyu (via GitHub)" <gi...@apache.org> on 2023/11/11 14:22:24 UTC

[PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

wecharyu opened a new pull request, #43766:
URL: https://github.com/apache/spark/pull/43766

   ### What changes were proposed in this pull request?
   Support drop multiple partitions in batch in `HiveShim`, including:
   1. add a config `spark.sql.dropPartitionInBatch.size` to control the batch size of drop partitions.
   2. Implement the `dropPartitions()` for different hive versions
   
   
   ### Why are the changes needed?
   To improve the performance of `dropPartitions()`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, user can use the new config `spark.sql.dropPartitionInBatch.size` to control the drop batch size.
   
   
   ### How was this patch tested?
   Passing all existing tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1814222407

   Seems the test failed related to this pr?
   
   https://github.com/wecharyu/spark/actions/runs/6848414843/job/18618598690
   
   <img width="1146" alt="image" src="https://github.com/apache/spark/assets/1475305/ce38a02a-4486-4183-8d5e-0c40544ad0cb">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1868711153

   LGTM. cc @wangyum @yaooqinn @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1809473083

   What will happen after this change if the user uses the built-in Hive 2.3.9(or upcoming 2.3.10) client to access the lower version of HMS?
   
   Although there is neither guarantee nor test coverage, in terms of all APIs Spark used, there is a fact that the current built-in Hive 2.3.9 client works well with HMS 2.0+, and the upcoming Hive 2.3.10 works well with HMS 1.2+


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "wecharyu (via GitHub)" <gi...@apache.org>.
wecharyu commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1823982491

   ping @pan3793 @LuciferYang , pls take a look again at your convenience.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "wecharyu (via GitHub)" <gi...@apache.org>.
wecharyu commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1816592919

   @LuciferYang `DDLSemanticAnalyzer#makeBinaryPredicate` is private before Hive 2.2, I replace it by reflection invoke. 
   Also the `Hive.dropPartitions()` api will encounter JDK incompatible issue like #25414, I have disable some Hive 3.0+ tests for "dropPartitions` and "createPartitions".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "wecharyu (via GitHub)" <gi...@apache.org>.
wecharyu commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1807840793

   cc @pan3793 and @LuciferYang


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1391927089


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala:
##########
@@ -535,18 +542,78 @@ private[client] class Shim_v2_0 extends Shim with Logging {
     alterPartitionsMethod.invoke(hive, tableName, newParts)
   }
 
-  override def dropPartition(
+  override def dropPartitions(
       hive: Hive,
       dbName: String,
       tableName: String,
-      part: JList[String],
+      specs: Seq[TablePartitionSpec],
       deleteData: Boolean,
+      ignoreIfNotExists: Boolean,
       purge: Boolean): Unit = {
     val dropOptions = new PartitionDropOptions
     dropOptions.deleteData(deleteData)
     dropOptions.purgeData(purge)
+    dropOptions.ifExists(ignoreIfNotExists)
     recordHiveCall()

Review Comment:
   The `recordHiveCall()` must be present before each **HMS RPC call** (you need to check the underlayer implementation of each HiveShim.xxx method to figure out whether it's a RPC call or in-process function invocation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43766: [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive
URL: https://github.com/apache/spark/pull/43766


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1435936724


##########
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala:
##########
@@ -501,46 +501,52 @@ class HiveClientSuite(version: String, allVersions: Seq[String])
   }
 
   test("dropPartitions") {
-    val spec = Map("key1" -> "1", "key2" -> "3")
-    val versionsWithoutPurge =
-      if (allVersions.contains("1.2")) allVersions.takeWhile(_ != "1.2") else Nil
-    // Similar to dropTable; try with purge set, and if it fails, make sure we're running
-    // with a version that is older than the minimum (1.2 in this case).
-    try {
-      client.dropPartitions("default", "src_part", Seq(spec), ignoreIfNotExists = true,
-        purge = true, retainData = false)
-      assert(!versionsWithoutPurge.contains(version))
-    } catch {
-      case _: UnsupportedOperationException =>
-        assert(versionsWithoutPurge.contains(version))
+    // Since Hive 3.0(HIVE-19383), we can not run `Hive.dropPartitions` with JDK 11.
+    if (version != "3.0" && version != "3.1") {
+      val spec = Map("key1" -> "1", "key2" -> "3")
+      val versionsWithoutPurge =
+        if (allVersions.contains("1.2")) allVersions.takeWhile(_ != "1.2") else Nil

Review Comment:
   `allVersions.contains("1.2")` is always `false` now since Spark 4.0 only supports Hive 2.0.0 and above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1391927089


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala:
##########
@@ -535,18 +542,78 @@ private[client] class Shim_v2_0 extends Shim with Logging {
     alterPartitionsMethod.invoke(hive, tableName, newParts)
   }
 
-  override def dropPartition(
+  override def dropPartitions(
       hive: Hive,
       dbName: String,
       tableName: String,
-      part: JList[String],
+      specs: Seq[TablePartitionSpec],
       deleteData: Boolean,
+      ignoreIfNotExists: Boolean,
       purge: Boolean): Unit = {
     val dropOptions = new PartitionDropOptions
     dropOptions.deleteData(deleteData)
     dropOptions.purgeData(purge)
+    dropOptions.ifExists(ignoreIfNotExists)
     recordHiveCall()

Review Comment:
   The `recordHiveCall()` must be present before each **HMS RPC call** (you need to check the underlayer implementation of each Hive.xxx method to figure out whether it's a RPC call or in-process function invocation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "wecharyu (via GitHub)" <gi...@apache.org>.
wecharyu commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-1815860421

   > What will happen after this change if the user uses the built-in Hive 2.3.9(or upcoming 2.3.10) client to access the lower version of HMS?
   
   This patch also works well for HMS 2.0+, because the `drop_partitions_req` api was introduced in HMS 0.13, HIVE-6256.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1391927089


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala:
##########
@@ -535,18 +542,78 @@ private[client] class Shim_v2_0 extends Shim with Logging {
     alterPartitionsMethod.invoke(hive, tableName, newParts)
   }
 
-  override def dropPartition(
+  override def dropPartitions(
       hive: Hive,
       dbName: String,
       tableName: String,
-      part: JList[String],
+      specs: Seq[TablePartitionSpec],
       deleteData: Boolean,
+      ignoreIfNotExists: Boolean,
       purge: Boolean): Unit = {
     val dropOptions = new PartitionDropOptions
     dropOptions.deleteData(deleteData)
     dropOptions.purgeData(purge)
+    dropOptions.ifExists(ignoreIfNotExists)
     recordHiveCall()

Review Comment:
   The `recordHiveCall()` must be present before each **HMS RPC call** (you need to check the underlying implementation of each Hive.xxx method to figure out whether it's a RPC call or in-process function invocation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43766:
URL: https://github.com/apache/spark/pull/43766#issuecomment-2035843071

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "wecharyu (via GitHub)" <gi...@apache.org>.
wecharyu commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1396791716


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala:
##########
@@ -535,18 +542,78 @@ private[client] class Shim_v2_0 extends Shim with Logging {
     alterPartitionsMethod.invoke(hive, tableName, newParts)
   }
 
-  override def dropPartition(
+  override def dropPartitions(
       hive: Hive,
       dbName: String,
       tableName: String,
-      part: JList[String],
+      specs: Seq[TablePartitionSpec],
       deleteData: Boolean,
+      ignoreIfNotExists: Boolean,
       purge: Boolean): Unit = {
     val dropOptions = new PartitionDropOptions
     dropOptions.deleteData(deleteData)
     dropOptions.purgeData(purge)
+    dropOptions.ifExists(ignoreIfNotExists)
     recordHiveCall()
-    hive.dropPartition(dbName, tableName, part, dropOptions)

Review Comment:
   Add a new switch named `spark.sql.dropPartitionInBatch.enabled`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "wecharyu (via GitHub)" <gi...@apache.org>.
wecharyu commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1395465609


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala:
##########
@@ -535,18 +542,78 @@ private[client] class Shim_v2_0 extends Shim with Logging {
     alterPartitionsMethod.invoke(hive, tableName, newParts)
   }
 
-  override def dropPartition(
+  override def dropPartitions(
       hive: Hive,
       dbName: String,
       tableName: String,
-      part: JList[String],
+      specs: Seq[TablePartitionSpec],
       deleteData: Boolean,
+      ignoreIfNotExists: Boolean,
       purge: Boolean): Unit = {
     val dropOptions = new PartitionDropOptions
     dropOptions.deleteData(deleteData)
     dropOptions.purgeData(purge)
+    dropOptions.ifExists(ignoreIfNotExists)
     recordHiveCall()

Review Comment:
   Noted, will double check and update it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45893][HIVE] Support drop multiple partitions in batch for hive [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43766:
URL: https://github.com/apache/spark/pull/43766#discussion_r1391928256


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala:
##########
@@ -535,18 +542,78 @@ private[client] class Shim_v2_0 extends Shim with Logging {
     alterPartitionsMethod.invoke(hive, tableName, newParts)
   }
 
-  override def dropPartition(
+  override def dropPartitions(
       hive: Hive,
       dbName: String,
       tableName: String,
-      part: JList[String],
+      specs: Seq[TablePartitionSpec],
       deleteData: Boolean,
+      ignoreIfNotExists: Boolean,
       purge: Boolean): Unit = {
     val dropOptions = new PartitionDropOptions
     dropOptions.deleteData(deleteData)
     dropOptions.purgeData(purge)
+    dropOptions.ifExists(ignoreIfNotExists)
     recordHiveCall()
-    hive.dropPartition(dbName, tableName, part, dropOptions)

Review Comment:
   a switch is needed to allow user to restore the previous behavior



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org