Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/11 03:31:56 UTC

[GitHub] [spark] wangyum opened a new pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

wangyum opened a new pull request #30325:
URL: https://github.com/apache/spark/pull/30325


   ### What changes were proposed in this pull request?
   
   It seems the Hive metastore rewrites `In`/`InSet` predicates to `or` expressions when pruning Hive partitions. This causes a stack overflow in the Hive metastore when the predicate has many values.
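To illustrate why the number of values matters, here is a minimal sketch that assumes the `or` rewrite nests one level per value. `Expr`, `EqualTo`, `Or`, `expandIn`, and `leftDepth` are hypothetical stand-ins for this example, not Hive or Spark classes, and the actual shape of the metastore's expression tree is not confirmed by this thread.

```scala
object OrChainDepth {
  // Hypothetical expression tree; not Hive's or Spark's real classes.
  sealed trait Expr
  final case class EqualTo(column: String, value: Int) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Expand `column IN (values)` into a left-nested chain of Or nodes.
  def expandIn(column: String, values: Seq[Int]): Expr = {
    val leaves: Seq[Expr] = values.map(v => EqualTo(column, v))
    leaves.reduceLeft[Expr]((acc, e) => Or(acc, e))
  }

  // Walk the left spine iteratively so the measurement itself cannot overflow.
  @annotation.tailrec
  def leftDepth(expr: Expr, acc: Int = 0): Int = expr match {
    case Or(left, _) => leftDepth(left, acc + 1)
    case _           => acc
  }

  // Nesting depth for an IN list of 10000 values.
  val depth: Int = leftDepth(expandIn("dt", 1 to 10000))
}
```

Under that assumption the nesting depth grows linearly with the number of values, so any recursive visitor over the tree needs one stack frame per value.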
   
   This PR rewrites an `InSet` predicate to `GreaterThanOrEqual` the minimum value and `LessThanOrEqual` the maximum value when pruning Hive partitions, to avoid the Hive metastore stack overflow.
   
   From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000.
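The rewrite described above can be sketched as follows. This is a simplified illustration, not the code in this PR: `Pred` and its case classes are hypothetical stand-ins for Catalyst expressions, values are plain `Int`s, and applying the rewrite only when the set size exceeds the threshold is an assumption based on the config description.

```scala
object InSetRewrite {
  // Hypothetical stand-ins for Catalyst expressions; not Spark's real classes.
  sealed trait Pred
  final case class InSet(column: String, values: Set[Int]) extends Pred
  final case class GreaterThanOrEqual(column: String, value: Int) extends Pred
  final case class LessThanOrEqual(column: String, value: Int) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred

  // Replace a large InSet with the range [min, max]. The range matches a
  // superset of the original values, so it is only safe as a pre-filter.
  def toMetastorePredicate(pred: Pred, threshold: Int): Pred = pred match {
    case InSet(column, values) if values.size > threshold =>
      And(GreaterThanOrEqual(column, values.min), LessThanOrEqual(column, values.max))
    case other => other
  }

  val large: Pred = InSet("dt", Set(20201103, 20201101, 20201110))
  // Three values exceed a threshold of 2, so the predicate becomes a range.
  val rewritten: Pred = toMetastorePredicate(large, threshold = 2)
  // Three values do not exceed a threshold of 10, so the predicate is kept.
  val untouched: Pred = toMetastorePredicate(large, threshold = 10)
}
```

The rewritten predicate has constant size regardless of how many values the set holds, at the cost of letting extra partitions through the metastore filter.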
   
   ### Why are the changes needed?
   
   Avoid Hive metastore stack overflow when an `InSet` predicate has many values.
   This matters especially for dynamic partition pruning (DPP), which may generate many values.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Manual test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521124097



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.
+      case InSet(child, values)

Review comment:
       This is safe because [HiveExternalCatalog.listPartitionsByFilter](https://github.com/apache/spark/blob/7ddc547d1bdd2e8a2d0e9a69976200166472e41a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L1261-L1286) still returns the correct result: after the Hive metastore prunes the partitions, we also prune them on the client side using the original predicates:
   https://github.com/apache/spark/blob/7ddc547d1bdd2e8a2d0e9a69976200166472e41a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L1285
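The two-stage pruning argued for here can be sketched with plain collections. This is an illustration only: the partition values and names below are hypothetical, and the real code operates on catalog partitions rather than integers.

```scala
object TwoStagePruning {
  // Hypothetical partition values for a single integer partition column.
  val allPartitions: Seq[Int] = Seq(1, 2, 3, 4, 5, 6, 7, 8)
  // Values from the original InSet predicate.
  val wanted: Set[Int] = Set(2, 5, 7)

  // Stage 1 (metastore side): the coarse range filter min <= p <= max.
  val fromMetastore: Seq[Int] =
    allPartitions.filter(p => p >= wanted.min && p <= wanted.max)

  // Stage 2 (client side): re-apply the original predicate to the candidates.
  val finalPartitions: Seq[Int] = fromMetastore.filter(wanted.contains)
}
```

Because every wanted value lies inside the range, the first stage can only return extra partitions, never drop a needed one, and the second stage removes the extras.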
   






[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r522963666



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -815,6 +815,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =
+    buildStaticConf("spark.sql.hive.metastorePartitionPruningInSetThreshold")
+      .doc("The threshold of set size for InSet predicate when pruning partitions through Hive " +
+        "Metastore. Larger values may cause Hive metastore stack overflow.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The value of metastorePartitionPruningInSetThreshold must be positive")
+      .createWithDefault(Int.MaxValue)

Review comment:
       shall we give a reasonable default value?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725272352


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130901/
   Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725413718








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725127654


   **[Test build #130901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130901/testReport)** for PR 30325 at commit [`7ddc547`](https://github.com/apache/spark/commit/7ddc547d1bdd2e8a2d0e9a69976200166472e41a).




[GitHub] [spark] wangyum commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725271692


   retest this please.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725193336








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-726897284


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35679/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-727039141








[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725261010








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725413718








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728734485


   **[Test build #131186 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131186/testReport)** for PR 30325 at commit [`231cd6f`](https://github.com/apache/spark/commit/231cd6f15f3fa71d677aae7170e1e404b82cef2a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728938245


   thanks, merging to master!




[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725349329


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35527/
   




[GitHub] [spark] SparkQA removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725231145








[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r522963861



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -815,6 +815,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =
+    buildStaticConf("spark.sql.hive.metastorePartitionPruningInSetThreshold")

Review comment:
       shall this be an internal conf?






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728663205


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35789/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728663219


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35789/
   Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728642372


   **[Test build #131186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131186/testReport)** for PR 30325 at commit [`231cd6f`](https://github.com/apache/spark/commit/231cd6f15f3fa71d677aae7170e1e404b82cef2a).




[GitHub] [spark] maropu commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521126955



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.
+      case InSet(child, values)

Review comment:
       Ah, ok.






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725231145


   **[Test build #130907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130907/testReport)** for PR 30325 at commit [`f9d018c`](https://github.com/apache/spark/commit/f9d018c28a860c65c5ba47bd28f2a23a3b5d2be3).




[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r524198995



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.

Review comment:
       `HiveExternalCatalog.prunePartitionsByFilter` will prune the partitions returned by hive client again. It doesn't seem to matter where we handle this special logic. Since it's hive specific, I think we should put it as close to hive client as possible.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725271763








[GitHub] [spark] wangyum commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521130075



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##########
@@ -54,6 +54,15 @@ object StaticSQLConf {
     .transform(_.toLowerCase(Locale.ROOT))
     .createWithDefault("global_temp")
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =

Review comment:
       Yes, for 2 reasons:
   1. This parameter should be set according to your own Hive Metastore and does not need to be modified frequently.
   2. All SQL configs in `HiveExternalCatalog` are static configs, e.g. `SCHEMA_STRING_LENGTH_THRESHOLD` and `DEBUG_MODE`.
   
   Of course, we can make this parameter a runtime config if needed.






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725335087


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35527/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r524196258



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.

Review comment:
       Or put it where we generate the partition pruning HQL string, as @maropu suggested.






[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725193336








[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728663215








[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725271757








[GitHub] [spark] SparkQA removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725274176


   **[Test build #130922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130922/testReport)** for PR 30325 at commit [`f9d018c`](https://github.com/apache/spark/commit/f9d018c28a860c65c5ba47bd28f2a23a3b5d2be3).




[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728735429








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728735429








[GitHub] [spark] maropu commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521127915



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##########
@@ -54,6 +54,15 @@ object StaticSQLConf {
     .transform(_.toLowerCase(Locale.ROOT))
     .createWithDefault("global_temp")
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =

Review comment:
       Why is this a static config rather than a runtime config?






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725193327


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35507/
   




[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725260995


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35513/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-726918455








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725252148


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35513/
   




[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725187023


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35507/
   




[GitHub] [spark] cloud-fan closed pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #30325:
URL: https://github.com/apache/spark/pull/30325


   




[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728642372


   **[Test build #131186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131186/testReport)** for PR 30325 at commit [`231cd6f`](https://github.com/apache/spark/commit/231cd6f15f3fa71d677aae7170e1e404b82cef2a).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725349344








[GitHub] [spark] wangyum commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r523073946



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -815,6 +815,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =
+    buildStaticConf("spark.sql.hive.metastorePartitionPruningInSetThreshold")
+      .doc("The threshold of set size for InSet predicate when pruning partitions through Hive " +
+        "Metastore. Larger values may cause Hive metastore stack overflow.")
+      .version("3.1.0")
+      .intConf
+      .checkValue(_ > 0, "The value of metastorePartitionPruningInSetThreshold must be positive")
+      .createWithDefault(Int.MaxValue)

Review comment:
       Set it to 1000 because the default value of `hive.exec.max.dynamic.partitions` is also 1000.






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728655607


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35789/
   




[GitHub] [spark] maropu commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521115188



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.
+      case InSet(child, values)

Review comment:
       Is there any reason to rewrite the Catalyst expressions instead of generating the predicate string directly on the HiveShim side? Generating it directly would avoid creating new expression objects just to rewrite the predicates.
   https://github.com/apache/spark/blob/5197c5d2e7648d75def3e159e0d2aa3e20117105/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L722-L724 
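
For illustration only (this is not the PR's code): a minimal, self-contained sketch of the idea under discussion, namely emitting the metastore filter string directly and widening an oversized value set to a min/max range. The object name `InSetFilterSketch`, the method `toFilter`, its signature, and the exact filter syntax are all assumptions made for this example; only the general approach (a chain of `or` terms for small sets, `>=` min and `<=` max beyond the threshold) comes from the PR description.

```scala
// Hypothetical helper, NOT Spark's HiveShim implementation.
// Names, signature, and filter syntax are assumptions for illustration.
object InSetFilterSketch {

  /** Builds a metastore-style filter string for `col IN (values)`. */
  def toFilter(col: String, values: Seq[Int], threshold: Int): Option[String] = {
    if (values.isEmpty) {
      // Nothing to push down.
      None
    } else if (values.size > threshold) {
      // Too many values: widen to a range so the filter stays small.
      // The range can match extra partitions, so the original predicate
      // still has to be applied afterwards.
      Some(s"($col >= ${values.min} and $col <= ${values.max})")
    } else {
      // Small set: one equality term per value, joined with `or`.
      Some(values.distinct.sorted.map(v => s"$col = $v").mkString("(", " or ", ")"))
    }
  }
}
```

Because the range form is a superset of the original set, a sketch like this trades pruning precision for a bounded filter size; it does not change the final query result as long as the exact predicate is re-evaluated on the returned partitions.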






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-726869625


   **[Test build #131076 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131076/testReport)** for PR 30325 at commit [`a97468b`](https://github.com/apache/spark/commit/a97468b72bc21001fa388e8013c4c9e4414b61ff).




[GitHub] [spark] wangyum commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r524841692



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.

Review comment:
       I moved it to HiveShim.scala#L741-L750:
   https://github.com/apache/spark/blob/231cd6f15f3fa71d677aae7170e1e404b82cef2a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L741-L750






[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725272345








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-728663215


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-727039141








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-727038372


   **[Test build #131076 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131076/testReport)** for PR 30325 at commit [`a97468b`](https://github.com/apache/spark/commit/a97468b72bc21001fa388e8013c4c9e4414b61ff).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725274176


   **[Test build #130922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130922/testReport)** for PR 30325 at commit [`f9d018c`](https://github.com/apache/spark/commit/f9d018c28a860c65c5ba47bd28f2a23a3b5d2be3).




[GitHub] [spark] maropu commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521127475



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##########
@@ -54,6 +54,15 @@ object StaticSQLConf {
     .transform(_.toLowerCase(Locale.ROOT))
     .createWithDefault("global_temp")
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =
+    buildStaticConf("spark.sql.hive.metastorePartitionPruningInSetThreshold")
+      .doc("The threshold of set size for InSet predicate when pruning partitions through Hive" +

Review comment:
       nit: add a space at the end of this string; otherwise the concatenated doc text reads `HiveMetastore` instead of `Hive Metastore`.
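
For illustration only: a tiny sketch of why the trailing space matters when a doc string is split across concatenated literals. The object name `DocStringSketch` and its fields are hypothetical; the string fragments are shortened from the diff in this thread.

```scala
// Hypothetical demo object; only the string fragments come from the diff.
object DocStringSketch {
  // No trailing space before the line break: the two words run together.
  val broken: String = "pruning partitions through Hive" + "Metastore."

  // Trailing space on the first literal keeps the words separate.
  val fixed: String = "pruning partitions through Hive " + "Metastore."
}
```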






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725271757


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-726869625


   **[Test build #131076 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131076/testReport)** for PR 30325 at commit [`a97468b`](https://github.com/apache/spark/commit/a97468b72bc21001fa388e8013c4c9e4414b61ff).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725261010








[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r522963377



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -815,6 +815,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =
+    buildStaticConf("spark.sql.hive.metastorePartitionPruningInSetThreshold")
+      .doc("The threshold of set size for InSet predicate when pruning partitions through Hive " +
+        "Metastore. Larger values may cause Hive metastore stack overflow.")

Review comment:
       let's mention what happens if we exceed the size threshold.






[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r524193466



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -815,6 +815,18 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =

Review comment:
       A static conf should be defined in `StaticSQLConf`.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-726918455








[GitHub] [spark] wangyum commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r524845159



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.

Review comment:
       I have a small concern: we don’t use `HiveClient.getPartitionsByFilter` in any other code path, but I’m not sure whether any users call it directly.
   
   ![image](https://user-images.githubusercontent.com/5399861/99322759-9c326880-28ab-11eb-88fb-d0e5dfb52533.png)






[GitHub] [spark] AmplabJenkins commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725349344








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725271314








[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-726918428


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35679/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r524194934



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.

Review comment:
       shall we put this in `HiveClientImpl`?
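
For readers following the thread, the rewrite in this hunk can be sketched in plain Scala as below. This is only an illustration: `Pred`, the case classes, and `InSetRewrite.rewrite` are simplified stand-ins invented here, not Spark's Catalyst expressions or the code in the PR, and the `threshold` parameter stands for the value of `spark.sql.hive.metastorePartitionPruningInSetThreshold`.

```scala
// Simplified stand-ins for Catalyst expressions (hypothetical, not Spark's classes).
sealed trait Pred
case class InSet(column: String, values: Set[Int]) extends Pred
case class GreaterThanOrEqual(column: String, value: Int) extends Pred
case class LessThanOrEqual(column: String, value: Int) extends Pred
case class And(left: Pred, right: Pred) extends Pred

object InSetRewrite {
  // Replace an InSet with more than `threshold` values by a [min, max] range.
  // Smaller sets and all other predicates are returned unchanged.
  def rewrite(pred: Pred, threshold: Int): Pred = pred match {
    case InSet(col, values) if values.nonEmpty && values.size > threshold =>
      And(GreaterThanOrEqual(col, values.min), LessThanOrEqual(col, values.max))
    case other => other
  }
}
```

The range covers every value in the original set, so it can match extra partitions but never drops one that the `InSet` would have matched; the exact predicate still has to be applied to the returned partitions afterwards.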






[GitHub] [spark] maropu commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521126955



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##########
@@ -1267,9 +1267,19 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val catalogTable = restoreTableMetadata(rawTable)
 
     val partColNameMap = buildLowerCasePartColNameMap(catalogTable)
+    val hivePredicates = predicates.map {
+      // Avoid Hive metastore stack overflow.
+      case InSet(child, values)

Review comment:
       Ah, ok. Thanks.






[GitHub] [spark] SparkQA commented on pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30325:
URL: https://github.com/apache/spark/pull/30325#issuecomment-725412738


   **[Test build #130922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130922/testReport)** for PR 30325 at commit [`f9d018c`](https://github.com/apache/spark/commit/f9d018c28a860c65c5ba47bd28f2a23a3b5d2be3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] maropu commented on a change in pull request #30325: [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30325:
URL: https://github.com/apache/spark/pull/30325#discussion_r521135541



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##########
@@ -54,6 +54,15 @@ object StaticSQLConf {
     .transform(_.toLowerCase(Locale.ROOT))
     .createWithDefault("global_temp")
 
+  val HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD =

Review comment:
       Hm, I see. I think users might lower the value at runtime right after they see the exception, so IMO it is useful for users to be able to update the value at runtime.
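
The trade-off being discussed can be illustrated with a toy model. Everything below (`ToyConf`, `ToyConfDemo`, the initial value `"1000"`) is hypothetical and is not Spark's `SQLConf` or `StaticSQLConf`; it only mirrors the behaviour that a static conf is fixed for the lifetime of the application while a runtime conf can be changed per session.

```scala
import scala.collection.mutable

// Toy conf registry (hypothetical); keys listed in `staticKeys` cannot be changed.
class ToyConf(staticKeys: Set[String], initial: Map[String, String]) {
  private val settings = mutable.Map(initial.toSeq: _*)

  def get(key: String): String = settings(key)

  // Runtime keys are updated in place; static keys are rejected with an exception.
  def set(key: String, value: String): Unit = {
    if (staticKeys.contains(key)) {
      throw new IllegalArgumentException(s"Cannot modify the value of a static config: $key")
    }
    settings(key) = value
  }
}

object ToyConfDemo {
  val key = "spark.sql.hive.metastorePartitionPruningInSetThreshold"

  // Returns true when the update is accepted, false when the key is static.
  def tryLower(conf: ToyConf, newValue: String): Boolean =
    try { conf.set(key, newValue); true }
    catch { case _: IllegalArgumentException => false }
}
```

With the threshold registered as a runtime conf, a user who hits the metastore stack overflow could lower it and retry in the same session; as a static conf, the same change would require restarting the application.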



