Posted to reviews@spark.apache.org by kmanamcheri <gi...@git.apache.org> on 2018/10/02 18:28:08 UTC

[GitHub] spark pull request #22614: HiveClient.getPartitionsByFilter should not throw...

GitHub user kmanamcheri opened a pull request:

    https://github.com/apache/spark/pull/22614

    HiveClient.getPartitionsByFilter should not throw an exception if HMS retries directSql

    ## What changes were proposed in this pull request?
    
    When using partition filter pushdown to HMS, Spark should expect a MetaException from HMS if
    partition filtering is not supported and should call getAllPartitions instead. HMS is expected to
    throw a MetaException even if directSql is enabled.
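    The catch-and-fallback behavior described above can be sketched as follows. This is a hypothetical simplification, not the actual HiveShim code: the `MetaException` class and the two partition-fetching functions are stand-ins for Hive's real API, which Spark invokes reflectively.

    ```scala
    object FallbackSketch {
      // Stand-in for org.apache.hadoop.hive.metastore.api.MetaException.
      final class MetaException(msg: String) extends Exception(msg)

      // Hypothetical stand-ins for the two HMS calls named above.
      def getPartitionsByFilter(filter: String): Seq[String] =
        throw new MetaException("filter pushdown not supported")

      def getAllPartitions(): Seq[String] = Seq("ds=20170101", "ds=20170102")

      // The proposed behavior: expect MetaException and fall back to
      // fetching all partitions instead of failing the query.
      def partitionsWithFallback(filter: String): Seq[String] =
        try getPartitionsByFilter(filter)
        catch { case _: MetaException => getAllPartitions() }
    }
    ```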
    
    ## How was this patch tested?
    
    Unit tests on the Spark SQL component were run successfully.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kmanamcheri/spark SPARK-25561

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22614.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22614
    
----
commit dddffcae8824e72d614fd6202e7fc562c490098b
Author: Karthik Manamcheri <ka...@...>
Date:   2018-10-02T16:11:20Z

    HiveShim should expect MetaException from HMS in all cases

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #96999 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96999/testReport)** for PR 22614 at commit [`cb0577b`](https://github.com/apache/spark/commit/cb0577bef7058401b354f7695c4319d73c7156e6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    @gatorsmile I have added the config option and an additional test.
    
    Here's the new behavior:
    - Setting spark.sql.metastorePartitionPruningFallback to 'false' will ALWAYS throw an exception if partition pushdown fails (i.e., Hive throws an exception). This is suggested for queries where you want to fail fast and you know you have a large number of partitions.
    - Setting spark.sql.metastorePartitionPruningFallback to 'true' (the default) will ALWAYS catch the exception from Hive and retry by fetching all partitions. To be helpful to users, Spark will also read the directSql config value from Hive and log a clear message about possible next steps.
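    In pseudocode form, the two settings behave like this minimal sketch (a simplification with stand-in types and function parameters, not the actual Spark code; `fallbackEnabled` models the spark.sql.metastorePartitionPruningFallback flag):

    ```scala
    object PruningFallbackSketch {
      // Stand-in for Hive's MetaException.
      final class MetaException(msg: String) extends Exception(msg)

      def partitions(fallbackEnabled: Boolean)(
          byFilter: () => Seq[String], all: () => Seq[String]): Seq[String] =
        try byFilter()
        catch {
          // 'true': swallow the metastore failure and fetch everything.
          case _: MetaException if fallbackEnabled => all()
          // 'false': the guard fails, the exception propagates, and the
          // query fails fast.
        }
    }
    ```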
    
    @dongjoon-hyun @mallman @vanzin If these look good, can we move on this to merge? Thanks a lot for all the comments and discussions.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by rezasafi <gi...@git.apache.org>.
Github user rezasafi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222349989
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    Hive has a config "hive.metastore.limit.partition.request" that can limit the number of partitions requested from HMS, so I think there is no need for a new config on the Spark side.
    Also, since direct SQL is a best-effort approach, simply failing when direct SQL is enabled is not good.



---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97126/
    Test PASSed.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223139874
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -87,6 +90,18 @@ class HiveClientSuite(version: String)
         assert(filteredPartitions.size == testPartitionCount)
       }
     
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK.key -> "false") {
    --- End diff --
    
    Ok will do.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by rezasafi <gi...@git.apache.org>.
Github user rezasafi commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    @tejasapatil @kmanamcheri Any update? Thank you very much in advance.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r224639756
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    @kmanamcheri : Let's do this:
    - We should prefer doing `getPartitionsByFilterMethod()`. If it fails, we retry with an increasing delay between retries.
    - If retries are exhausted, we could fetch all the partitions of the table. Some people might not want this, so let's control it with a conf flag. For those who don't want it, the query could fail at this point.
    
    What do you think ?
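    Sketched out, the proposal might look like this. This is a hypothetical illustration only (the attempt count, delays, and function parameters are made-up stand-ins, not anything in Spark or this PR):

    ```scala
    object RetrySketch {
      // Retry op up to `attempts` times, doubling the delay between tries.
      def withRetries[T](attempts: Int, delayMs: Long)(op: () => T): T =
        try op()
        catch {
          case _: Exception if attempts > 1 =>
            Thread.sleep(delayMs)
            withRetries(attempts - 1, delayMs * 2)(op)
        }

      // After retries are exhausted, fall back to fetching all partitions
      // only when the (hypothetical) conf flag allows it.
      def partitions(fallbackEnabled: Boolean)(
          byFilter: () => Seq[String], all: () => Seq[String]): Seq[String] =
        try withRetries(3, 1L)(byFilter)
        catch { case _: Exception if fallbackEnabled => all() }
    }
    ```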


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223462348
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -79,12 +82,30 @@ class HiveClientSuite(version: String)
         client = init(true)
       }
     
    -  test(s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false") {
    -    val client = init(false)
    -    val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    -      Seq(attr("ds") === 20170101))
    +  test(s"getPartitionsByFilter returns all partitions when $partPruningFallbackKey=true") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK_ENABLED.key -> "true",
    +        SQLConf.HIVE_METASTORE_PARTITION_PRUNING.key -> "true") {
    +      val client = init(false)
    +      // tryDirectSql = false and a non-string partition filter will always fail. This condition
    +      // is used to test if the fallback works
    +      val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    +        Seq(attr("ds") === 20170101))
     
    -    assert(filteredPartitions.size == testPartitionCount)
    +      assert(filteredPartitions.size == testPartitionCount)
    +    }
    +  }
    +
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    --- End diff --
    
    Shorter: "getPartitionsByFilter fails if metastore call fails and $partPruningFallbackKey=false"


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223415835
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -754,26 +755,38 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
               tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
    --- End diff --
    
    To me your revised comment suggests that if we try to filter on a non-string partition column we can expect the call to `getPartitionsByFilter` to fail. That's not true.
    
    Although I wrote the original comment, I can see how making an assumption about Hive's behavior when calling this method is rather specious and unwise. I suggest we just remove this comment entirely. To simply state that
    
    ```
    Hive may throw an exception when calling this method in some circumstances.
    ```
    
    only states the obvious—of course any method call may throw an exception in some circumstances.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Looks ok to me based on discussion in the bug. Will leave here to see if others have any comments.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222190138
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    To be honest, I do not think we should issue a warning message and call getAllPartitions. When the number of partitions is huge, getAllPartitions will be super super slow. 


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223469446
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -79,12 +82,30 @@ class HiveClientSuite(version: String)
         client = init(true)
       }
     
    -  test(s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false") {
    -    val client = init(false)
    -    val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    -      Seq(attr("ds") === 20170101))
    +  test(s"getPartitionsByFilter returns all partitions when $partPruningFallbackKey=true") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK_ENABLED.key -> "true",
    +        SQLConf.HIVE_METASTORE_PARTITION_PRUNING.key -> "true") {
    +      val client = init(false)
    +      // tryDirectSql = false and a non-string partition filter will always fail. This condition
    +      // is used to test if the fallback works
    +      val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    +        Seq(attr("ds") === 20170101))
     
    -    assert(filteredPartitions.size == testPartitionCount)
    +      assert(filteredPartitions.size == testPartitionCount)
    +    }
    +  }
    +
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    --- End diff --
    
    Ok I changed the test description to be more accurate of what we are testing.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #96899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96899/testReport)** for PR 22614 at commit [`2ad9cf4`](https://github.com/apache/spark/commit/2ad9cf4923c19127efbdeef3fe6e3cd3e3f728ff).


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #96899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96899/testReport)** for PR 22614 at commit [`2ad9cf4`](https://github.com/apache/spark/commit/2ad9cf4923c19127efbdeef3fe6e3cd3e3f728ff).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223148534
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -87,6 +90,18 @@ class HiveClientSuite(version: String)
         assert(filteredPartitions.size == testPartitionCount)
       }
     
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK.key -> "false") {
    --- End diff --
    
    Done.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #97126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97126/testReport)** for PR 22614 at commit [`01e2123`](https://github.com/apache/spark/commit/01e2123a3ddd0cf0c15d5e79c2beafad821a2cf6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222348674
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    > sometimes failing is better than spending several hours to get all partitions. Shall we add a config to switch the behavior?
    
    I think that's what `${SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS}` is for. @kmanamcheri, what happens if you set this to false?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223117511
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
    +    buildConf("spark.sql.hive.metastorePartitionPruningFallback")
    --- End diff --
    
    Should we use `spark.sql.legacy` prefix like [SPARK-19724](https://github.com/apache/spark/pull/22515), @rxin and @cloud-fan ?


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    @mallman @cloud-fan @ericl @rezasafi 


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222340141
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    Sometimes failing is better than spending several hours getting all partitions. Shall we add a config to switch the behavior?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223441744
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -79,12 +82,30 @@ class HiveClientSuite(version: String)
         client = init(true)
       }
     
    -  test(s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false") {
    -    val client = init(false)
    -    val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    -      Seq(attr("ds") === 20170101))
    +  test(s"getPartitionsByFilter returns all partitions when $partPruningFallbackKey=true") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK_ENABLED.key -> "true",
    +        SQLConf.HIVE_METASTORE_PARTITION_PRUNING.key -> "true") {
    +      val client = init(false)
    +      // tryDirectSql = false and a non-string partition filter will always fail. This condition
    +      // is used to test if the fallback works
    +      val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    +        Seq(attr("ds") === 20170101))
     
    -    assert(filteredPartitions.size == testPartitionCount)
    +      assert(filteredPartitions.size == testPartitionCount)
    +    }
    +  }
    +
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    --- End diff --
    
    The test name states that `getPartitionsByFilter` should throw an exception if partition pruning fallback is disabled. But that's not right. I think we need an accurate name for this and the previous test. Perhaps it should include a mention that the underlying call to the metastore throws an exception. How about
    
    ```
    s"getPartitionsByFilter should throw an exception if the underlying call to the
    metastore throws an exception and $partPruningFallbackKey=false"
    ```
    
    ?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222135430
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    @kmanamcheri. Could you show a different, more accurate warning message based on the `tryDirectSql` value here?


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Yes. Let us add a conf for controlling the fallback. Please also add the test cases for verifying it. Thanks!


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222359233
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    @mallman I haven't tried using that config option. If I am understanding [the documentation for HIVE_MANAGE_FILESOURCE_PARTITIONS](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L547) correctly, if I set that value to false, partitions will not be stored in HMS. That sounds like it is addressing a different issue, no?
    
    If that's the suggested way to deal with unsupported partition filters, then this code should always fail if getPartitionsByFilter fails, no? Why even have a fallback (as we do currently)? SPARK-17992 seems to say that Spark should handle certain cases of partition pushdown failures (such as HMS ORM mode). My argument is that this handling should be expanded to apply even when hive.metastore.try.direct.sql is set to true.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    > a better and safer solution by introducing exponential backoff with retries
    
    I'm confused about that suggestion. What is being retried? The user query? The HMS call? What is changing in between the retries that gives any hope that a future call will succeed? And if all the retries fail, do you fall back like this change is doing, or what?


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    The PR description and title may need to change accordingly. Can you update it?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222345462
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    I think the original warning message is more accurate. Direct sql mode isn't just about performance. It's also about enhanced capability, e.g. supporting filtering on non-string type columns. As the original comment states, setting the direct sql config value to true may resolve a problem around metastore-side partition filtering.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97014/
    Test PASSed.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223461273
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,45 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    +        val shouldFallback = SQLConf.get.metastorePartitionPruningFallback
             val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
             // We should get this config value from the metaStore. otherwise hit SPARK-18681.
             // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
             // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
             val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
               tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
    -          // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    -            logWarning("Caught Hive MetaException attempting to get partition metadata by " +
    -              "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    -            // HiveShim clients are expected to handle a superset of the requested partitions
    -            getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              tryDirectSql =>
    -            throw new RuntimeException("Caught Hive MetaException attempting to get partition " +
    -              "metadata by filter from Hive. You can set the Spark configuration setting " +
    -              s"${SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key} to false to work around this " +
    -              "problem, however this will result in degraded performance. Please report a bug: " +
    -              "https://issues.apache.org/jira/browse/SPARK", ex)
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    +            if (shouldFallback) {
    +              if (!tryDirectSql) {
    --- End diff --
    
    Not really your fault, but getting the value of this config requires a round-trip to the HMS, so it would be good to only do that here and avoid that extra remote call in the normal case.
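
    The deferred-lookup idea can be illustrated with a small self-contained sketch. Here `fetchFromMetastore` is a stand-in for the remote `hive.getMSC.getConfigValue` round-trip; wrapping it in a `lazy val` means it only runs if the fallback path actually reads it:

    ```scala
    // Sketch only: fetchFromMetastore stands in for the remote
    // hive.getMSC.getConfigValue call to the metastore.
    var remoteCalls = 0
    def fetchFromMetastore(): Boolean = { remoteCalls += 1; true }

    // A lazy val defers the remote call until the catch block
    // (the fallback path) actually forces it.
    lazy val tryDirectSql: Boolean = fetchFromMetastore()

    // Normal path: the lazy val is never forced, so no remote call is made.
    assert(remoteCalls == 0)

    // Fallback path: the first (and only) access triggers one remote call.
    assert(tryDirectSql)
    assert(remoteCalls == 1)
    ```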


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96868/
    Test PASSed.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96899/
    Test PASSed.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222373627
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    One option if we want to get all fancy is to add a configurable timeout in the fallback case - assuming it's possible to cancel an ongoing call (run in a separate thread + interrupt maybe?).
    
    My main concern with the fallback, really, isn't the slowness, but that in the case where it would be slow (= too many partitions), the HMS might just run itself out of memory trying to serve the request.
    
    Reza mentions the Hive config, which I think is the right thing for the HMS admin to do, since it avoids apps DoS'ing the server. I'm not sure what the behavior is there, but I hope it fails the call if there are too many partitions (instead of returning a subset). IMO that config seems to cover all the concerns here, assuming the call will fail when you have too many partitions, no?
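
    The "separate thread + interrupt" idea above can be sketched as follows; `callWithTimeout` and `timeoutMs` are hypothetical names for illustration, not Spark or Hive APIs:

    ```scala
    import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

    // Hypothetical helper: run `body` on a worker thread and cancel it
    // (via interrupt) if it exceeds the configurable timeout.
    def callWithTimeout[T](timeoutMs: Long)(body: => T): T = {
      val executor = Executors.newSingleThreadExecutor()
      try {
        val future = executor.submit(new Callable[T] { override def call(): T = body })
        try {
          future.get(timeoutMs, TimeUnit.MILLISECONDS)
        } catch {
          case _: TimeoutException =>
            future.cancel(true) // interrupts the worker thread
            throw new RuntimeException(s"Call timed out after $timeoutMs ms")
        }
      } finally {
        executor.shutdownNow()
      }
    }

    // A fast call completes normally; a stuck metastore call would be cancelled.
    assert(callWithTimeout(1000) { 21 * 2 } == 42)
    ```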


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223429117
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -79,12 +82,30 @@ class HiveClientSuite(version: String)
         client = init(true)
       }
     
    -  test(s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false") {
    -    val client = init(false)
    -    val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    -      Seq(attr("ds") === 20170101))
    +  test(s"getPartitionsByFilter returns all partitions when $partPruningFallbackKey=true") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK_ENABLED.key -> "true",
    +        SQLConf.HIVE_METASTORE_PARTITION_PRUNING.key -> "true") {
    +      val client = init(false)
    +      // tryDirectSql = false and a non-string partition filter will always fail. This condition
    +      // is used to test if the fallback works
    +      val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    +        Seq(attr("ds") === 20170101))
     
    -    assert(filteredPartitions.size == testPartitionCount)
    +      assert(filteredPartitions.size == testPartitionCount)
    +    }
    +  }
    +
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    --- End diff --
    
    Hmm.. The behavior does not depend on the value of tryDirectSqlKey, though. It depends solely on whether pruning is enabled and whether pruning.fallback is enabled.
    
    The reason we set tryDirectSqlKey to false is to create a consistent way for pruning to fail. Setting tryDirectSql to false on its own does not throw an exception.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222123856
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    Also, it's not blindly calling that API, right? It was already being called before when direct SQL was disabled. In the other case, it was just throwing an exception. So now, instead of erroring out, it will work, just more slowly than expected.
    
    Unless there's some retry at a higher layer that I'm not aware of.
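
    For readers unfamiliar with why the catch clause in the diff matches on `ex.getCause`: reflective `Method.invoke` wraps any exception thrown by the invoked method in an `InvocationTargetException`. A minimal standalone illustration (the class and exception type here are stand-ins, not Hive types):

    ```scala
    import java.lang.reflect.InvocationTargetException

    class FailingClient {
      // Stand-in for Hive's getPartitionsByFilter throwing a MetaException.
      def getPartitions(): Unit = throw new IllegalStateException("metastore error")
    }

    val method = classOf[FailingClient].getMethod("getPartitions")
    val wrapped =
      try { method.invoke(new FailingClient); None }
      catch { case ex: InvocationTargetException => Some(ex) }

    // The original exception is only reachable via getCause, which is why
    // HiveShim's guard tests ex.getCause.isInstanceOf[MetaException].
    assert(wrapped.exists(_.getCause.isInstanceOf[IllegalStateException]))
    ```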


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    @viirya I have updated the title and description. 


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223137994
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -87,6 +90,18 @@ class HiveClientSuite(version: String)
         assert(filteredPartitions.size == testPartitionCount)
       }
     
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK.key -> "false") {
    --- End diff --
    
    Also need to set `HIVE_METASTORE_PARTITION_PRUNING.key` to `true`


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Also, https://github.com/apache/spark/pull/22614#discussion_r223172392 proposes a better and safer solution by introducing `exponential backoff with retries`. @kmanamcheri Could you update your PR with this suggestion?
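
    For reference, the suggested "exponential backoff with retries" could be sketched as below. The parameter names are illustrative, not Spark configs; note too that for a deterministic MetaException (e.g. an unsupported filter expression), no number of retries would change the outcome:

    ```scala
    // Retry `body` up to `maxRetries` times, doubling the delay each attempt.
    def retryWithBackoff[T](maxRetries: Int, baseDelayMs: Long, attempt: Int = 0)(body: => T): T =
      try body catch {
        case _: Exception if attempt < maxRetries =>
          Thread.sleep(baseDelayMs << attempt) // baseDelayMs, then 2x, 4x, ...
          retryWithBackoff(maxRetries, baseDelayMs, attempt + 1)(body)
      }

    // A transient failure succeeds after a few attempts.
    var calls = 0
    val result = retryWithBackoff(3, 1L) { calls += 1; if (calls < 3) sys.error("flaky"); "ok" }
    assert(result == "ok" && calls == 3)
    ```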


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    > Let us add a conf to control it? Failing fast is better than hanging. If users want to get all partitions, they can change the conf by themselves.
    
    @gatorsmile We already have a config option "spark.sql.hive.metastorePartitionPruning". If that is set to false, we will never push down partition filters to HMS. I will add "spark.sql.hive.metastorePartitionPruningFallback", which, in addition to the previous one, controls the fallback behavior. Irrespective of the value of Hive direct SQL, if we enable the pruning fallback, we will catch the exception and fall back to fetching all partitions. Does this sound like a reasonable compromise, @mallman?
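
    The interplay of the two settings described above would look roughly like this; the fallback key is this PR's proposal, and its final name may differ from what was eventually merged:

    ```scala
    // Pushdown on, with fallback: a MetaException triggers getAllPartitions.
    spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
    spark.conf.set("spark.sql.hive.metastorePartitionPruningFallback", "true")

    // Pushdown on, fail fast: a MetaException propagates to the user query.
    spark.conf.set("spark.sql.hive.metastorePartitionPruningFallback", "false")
    ```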


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222122400
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    We should not blindly call getAllPartitions. This will be super slow. We should do some retries. It depends on the errors we got. 


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223498018
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    Could you review the newer changes I have made? Basically, yes, I agree that fetching all partitions is going to be bad, and hence we'll leave it up to the user. They can disable fetching all the partitions by setting "spark.sql.hive.metastorePartitionPruning.fallback.enabled" to false. In that case, we'll never fall back. If it is set to "true", then we will. As simple as that.
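    To make the proposal concrete, here is a minimal sketch of the config-gated fallback decision (illustrative only: `partitionsFor` and its signature are simplified stand-ins, not the actual `HiveShim` code):

    ```scala
    // Simplified model of the proposed behavior: try the filtered fetch first;
    // on failure, either fall back to fetching all partitions or rethrow,
    // depending on the fallback flag. Names are illustrative, not Spark's.
    def partitionsFor(
        byFilter: () => Seq[String],
        allPartitions: () => Seq[String],
        fallbackEnabled: Boolean): Seq[String] = {
      try {
        byFilter()
      } catch {
        // When the guard is false the pattern does not match, so the
        // exception propagates to the caller (fail fast).
        case _: RuntimeException if fallbackEnabled =>
          // Callers are expected to handle a superset of the requested partitions.
          allPartitions()
      }
    }
    ```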
    
    I don't completely understand "exponential backoff with retries". Do you do this at the HMS level or at the query level? If HMS filter pushdown fails once, does that mean it will succeed in the future? Maybe a future improvement would be to replace the boolean "fallback-enabled"/"fallback-disabled" setting with multiple fallback levels with retry timing, etc. Thoughts, @tejasapatil?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223148506
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
    +    buildConf("spark.sql.hive.metastorePartitionPruningFallback")
    +      .doc("When true, enable fallback to fetch all partitions if Hive metastore partition " +
    +           "push down fails. This is applicable only if partition pruning is enabled (see " +
    +           s" ${HIVE_METASTORE_PARTITION_PRUNING.key}). Enabling this may degrade performance " +
    +           "if there are a large number of partitions." )
    +      .booleanConf
    +      .createWithDefault(true)
    --- End diff --
    
    Done.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223139984
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
    +    buildConf("spark.sql.hive.metastorePartitionPruningFallback")
    --- End diff --
    
    spark.sql.hive.metastorePartitionPruning.fallback.enabled


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223172392
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    @gatorsmile: Sorry for the late reply. We had seen issues with this in the past and resorted to doing exponential backoff with retries. Fetching all the partitions is going to be bad in a prod setting... even if it makes it through, the underlying problem, if left unnoticed, is bad for system health.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223136013
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
    +    buildConf("spark.sql.hive.metastorePartitionPruningFallback")
    +      .doc("When true, enable fallback to fetch all partitions if Hive metastore partition " +
    +           "push down fails. This is applicable only if partition pruning is enabled (see " +
    +           s" ${HIVE_METASTORE_PARTITION_PRUNING.key}). Enabling this may degrade performance " +
    +           "if there are a large number of partitions." )
    +      .booleanConf
    +      .createWithDefault(true)
    --- End diff --
    
    If we set that to false, SPARK-17992 will change the default behavior. Is that ok?
    
    So if we set this flag to false by default, then an exception from HMS will always be re-thrown (direct SQL or not). Just want to make sure that we understand that the default behavior will change.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222140323
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    Good idea. Yes that would be better. I'll add that.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223473324
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,45 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    +        val shouldFallback = SQLConf.get.metastorePartitionPruningFallback
             val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
             // We should get this config value from the metaStore. otherwise hit SPARK-18681.
             // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
             // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
             val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
               tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
    -          // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    -            logWarning("Caught Hive MetaException attempting to get partition metadata by " +
    -              "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    -            // HiveShim clients are expected to handle a superset of the requested partitions
    -            getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              tryDirectSql =>
    -            throw new RuntimeException("Caught Hive MetaException attempting to get partition " +
    -              "metadata by filter from Hive. You can set the Spark configuration setting " +
    -              s"${SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key} to false to work around this " +
    -              "problem, however this will result in degraded performance. Please report a bug: " +
    -              "https://issues.apache.org/jira/browse/SPARK", ex)
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    +            if (shouldFallback) {
    +              if (!tryDirectSql) {
    --- End diff --
    
    Good idea. Done.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #97119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97119/testReport)** for PR 22614 at commit [`544b2ad`](https://github.com/apache/spark/commit/544b2ad6e5c21d5b1cc503fe068e44afa6973f1d).


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    > Based on my understanding, the FB team's solution is to retry the following command multiple times:
    > 
    > ```
    > getPartitionsByFilterMethod.invoke(hive, table, filter).asInstanceOf[JArrayList[Partition]]
    > ```
    
    @gatorsmile Hmm, my understanding was different. I thought they were retrying the fetchAllPartitions method. Maybe @tejasapatil can clarify here?
    
    > This really depends on what the actual errors that fail `getPartitionsByFilterMethod` are. When many concurrent users share the same metastore, `exponential backoff with retries` is very reasonable, since most errors might be caused by timeouts or similar issues.
    
    Doesn't this apply to every other HMS API as well? If so, shouldn't we build a complete solution in HiveShim to do `exponential backoff with retries` on every single HMS call?
    
    > If it still fails, I would suggest failing fast, or deciding based on the conf value of `spark.sql.hive.metastorePartitionPruning.fallback.enabled`.
    
    Ok I agree. 
    
    I think we need clarification from @tejasapatil on which call they retry.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Based on my understanding, the FB team's solution is to retry the following command multiple times:
    ```
    getPartitionsByFilterMethod.invoke(hive, table, filter).asInstanceOf[JArrayList[Partition]]
    ```
    
    This really depends on what the actual errors that fail `getPartitionsByFilterMethod` are. When many concurrent users share the same metastore, `exponential backoff with retries` is very reasonable, since most errors might be caused by timeouts or similar issues.
    
    If it still fails, I would suggest failing fast, or deciding based on the conf value of `spark.sql.hive.metastorePartitionPruning.fallback.enabled`.
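
    As a side note, the "exponential backoff with retries" strategy being discussed can be sketched generically as follows (an illustration only; the attempt count and delay values are assumptions, not values from Spark or any production deployment):

    ```scala
    // Generic retry helper with exponential backoff. Each failed attempt
    // doubles the wait before the next try; the last failure is rethrown
    // once the attempt budget is exhausted.
    def withRetries[T](maxAttempts: Int, initialDelayMs: Long)(op: () => T): T = {
      var delayMs = initialDelayMs
      var attempt = 1
      while (true) {
        try {
          return op()
        } catch {
          case e: Exception =>
            if (attempt >= maxAttempts) throw e // budget exhausted: fail fast
            Thread.sleep(delayMs)
            delayMs *= 2 // exponential backoff
            attempt += 1
        }
      }
      throw new IllegalStateException("unreachable")
    }
    ```

    A call site wrapping the metastore fetch (hypothetical, for illustration) would then look like `withRetries(maxAttempts = 3, initialDelayMs = 100)(() => getPartitionsByFilterMethod.invoke(hive, table, filter))`, and would fail fast, or fall back per the proposed config, only after the retries are exhausted.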


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #96868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96868/testReport)** for PR 22614 at commit [`dddffca`](https://github.com/apache/spark/commit/dddffcae8824e72d614fd6202e7fc562c490098b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #97119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97119/testReport)** for PR 22614 at commit [`544b2ad`](https://github.com/apache/spark/commit/544b2ad6e5c21d5b1cc503fe068e44afa6973f1d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223459567
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK_ENABLED =
    +    buildConf("spark.sql.hive.metastorePartitionPruning.fallback.enabled")
    +      .doc("When true, enable fallback to fetch all partitions if Hive metastore partition " +
    +           "push down fails. This is applicable only if partition pruning is enabled (see " +
    +           s" ${HIVE_METASTORE_PARTITION_PRUNING.key}). Enabling this may degrade performance " +
    +           "if there are a large number of partitions." )
    --- End diff --
    
    "if there is"


---



[GitHub] spark issue #22614: [SPARK-25561] [SQL] HiveClient.getPartitionsByFilter sho...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223419868
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -754,26 +755,38 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
               tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
    --- End diff --
    
    Done @mallman 


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223418766
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -754,26 +755,38 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
               tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
    --- End diff --
    
    > of course any method call may throw an exception in some circumstances.

    :+1: Yes, you are right. I agree with removing comments that might mislead future developers. I'll rephrase this.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96999/
    Test PASSed.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #96999 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96999/testReport)** for PR 22614 at commit [`cb0577b`](https://github.com/apache/spark/commit/cb0577bef7058401b354f7695c4319d73c7156e6).


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223422030
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,45 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    +        val shouldFallback = SQLConf.get.metastorePartitionPruningFallback
             val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
             // We should get this config value from the metaStore. otherwise hit SPARK-18681.
             // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
             // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
             val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
               tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
    -          // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    -            logWarning("Caught Hive MetaException attempting to get partition metadata by " +
    -              "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    -            // HiveShim clients are expected to handle a superset of the requested partitions
    -            getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              tryDirectSql =>
    -            throw new RuntimeException("Caught Hive MetaException attempting to get partition " +
    -              "metadata by filter from Hive. You can set the Spark configuration setting " +
    -              s"${SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key} to false to work around this " +
    -              "problem, however this will result in degraded performance. Please report a bug: " +
    -              "https://issues.apache.org/jira/browse/SPARK", ex)
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    +            if (shouldFallback) {
    +              if (!tryDirectSql) {
    +                logWarning("Caught Hive MetaException attempting to get partition metadata by " +
    +                  "filter from Hive. Falling back to fetching all partition metadata, which will " +
    +                  "degrade performance. Modifying your Hive metastore configuration to set " +
    +                  s"${tryDirectSqlConfVar.varname} to true may resolve this problem.")
    +              } else {
    +                logWarning("Caught Hive MetaException attempting to get partition metadata " +
    +                  "by filter from Hive. Hive metastore's direct SQL feature has been enabled, " +
    +                  "but it is an optimistic optimization and not guaranteed to work. Falling back " +
    +                  "to fetching all partition metadata, which will degrade performance (for the " +
    +                  "current query). If you see this error consistently, you can set the Spark " +
    +                  s"configuration setting ${SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key} to " +
    +                  "false as a work around, however this will result in degraded performance. " +
    +                  "Please report a bug to Hive stating that direct SQL is failing consistently " +
    +                  "for the specified query: https://issues.apache.org/jira/browse/HIVE")
    --- End diff --
    
    I think we should remove the suggestion to file a Hive project bug. Even with the direct SQL configuration setting enabled, there are valid metastore deployments for which it will be ignored. For example, my understanding is that if the metastore uses MongoDB for its underlying storage, the direct SQL configuration setting will be ignored. That means a failure here is not necessarily a bug in Hive's direct SQL.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #96868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96868/testReport)** for PR 22614 at commit [`dddffca`](https://github.com/apache/spark/commit/dddffcae8824e72d614fd6202e7fc562c490098b).


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by rezasafi <gi...@git.apache.org>.
Github user rezasafi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222381843
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    Yes, if the query tries to fetch more partitions than "hive.metastore.limit.partition.request" allows, the query will fail. Using that Hive config, users can decide whether they want to fetch all the partitions or not. I also think that config covers all the concerns.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #97014 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97014/testReport)** for PR 22614 at commit [`f42bbec`](https://github.com/apache/spark/commit/f42bbec8d7ba23cca77f2bf83230ad2e2ceafeb9).


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222122774
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    ping @srinathshankar @ericl 


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    ok to test


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #97014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97014/testReport)** for PR 22614 at commit [`f42bbec`](https://github.com/apache/spark/commit/f42bbec8d7ba23cca77f2bf83230ad2e2ceafeb9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97119/
    Test PASSed.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    As @tejasapatil suggested above, this fallback should not be enabled in a production setting. It could also impact system health (e.g., other concurrent queries against the same Hive metastore could be blocked). Thus, I would suggest making the new conf internal and making the configuration description clearer regarding the impact of this fallback.


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223122115
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
    +    buildConf("spark.sql.hive.metastorePartitionPruningFallback")
    --- End diff --
    
    What is the reasoning for marking this as legacy?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223134564
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -544,6 +544,15 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
    +    buildConf("spark.sql.hive.metastorePartitionPruningFallback")
    +      .doc("When true, enable fallback to fetch all partitions if Hive metastore partition " +
    +           "push down fails. This is applicable only if partition pruning is enabled (see " +
    +           s" ${HIVE_METASTORE_PARTITION_PRUNING.key}). Enabling this may degrade performance " +
    +           "if there are a large number of partitions." )
    +      .booleanConf
    +      .createWithDefault(true)
    --- End diff --
    
    By default, this should be false
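    
    For concreteness, the change being suggested to the quoted conf might look like this. This is only a sketch: the `.internal()` marker and the exact doc wording are assumptions based on the review discussion, not part of the PR:
    
    ```scala
    val HIVE_METASTORE_PARTITION_PRUNING_FALLBACK =
      buildConf("spark.sql.hive.metastorePartitionPruningFallback")
        .internal()  // assumption: marked internal, per the review discussion
        .doc("When true, fall back to fetching all partitions if Hive metastore " +
             "partition push down fails. Enabling this may severely degrade " +
             "performance and affect metastore health for tables with many " +
             "partitions.")
        .booleanConf
        .createWithDefault(false)  // opt-in, as the reviewer suggests
    ```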


---



[GitHub] spark issue #22614: [SPARK-25561] [SQL] HiveClient.getPartitionsByFilter sho...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222123426
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    @gatorsmile From HMS side, the error is always the same "MetaException" and there is no way to tell apart a direct SQL error from an error of "not supported" (unfortunately!). How do you propose we address this?


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Let us add a conf to control it? Failing fast is better than hanging. If users want to get all partitions, they can change the conf by themselves. 


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222372452
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    @dongjoon-hyun @mallman I have updated the log messages to be more descriptive and helpful for the user to indicate what they should try doing. Does that help?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222134301
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
    --- End diff --
    
    cc @sameeragarwal @tejasapatil  Could you share what FB does for the retry?  


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222348679
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
             try {
               // Hive may throw an exception when calling this method in some circumstances, such as
    -          // when filtering on a non-string partition column when the hive config key
    -          // hive.metastore.try.direct.sql is false
    +          // when filtering on a non-string partition column.
               getPartitionsByFilterMethod.invoke(hive, table, filter)
                 .asInstanceOf[JArrayList[Partition]]
             } catch {
    -          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] &&
    -              !tryDirectSql =>
    +          case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
                 logWarning("Caught Hive MetaException attempting to get partition metadata by " +
                   "filter from Hive. Falling back to fetching all partition metadata, which will " +
    -              "degrade performance. Modifying your Hive metastore configuration to set " +
    -              s"${tryDirectSqlConfVar.varname} to true may resolve this problem.", ex)
    +              "degrade performance. Enable direct SQL mode in hive metastore to attempt " +
    +              "to improve performance. However, Hive's direct SQL mode is an optimistic " +
    +              "optimization and does not guarantee improved performance.")
    --- End diff --
    
    @mallman The key point to note here is that setting direct SQL on HMS "may" resolve the problem. It is not guaranteed. HMS only optimistically optimizes this. If direct SQL on HMS fails, it will fall back on ORM and then fail again. Spark's behavior should not be inconsistent depending on the HMS config value. 
    
    My suggested fix would still call getPartitionsByFilter and if that fails, will call getAllPartitions. We won't be calling getAllPartitions in all cases. It is a fallback mechanism. 
    
    @gatorsmile hmm.. why not? We know it might be slow and hence the warning. Maybe the warning message should read that this could be slow depending on the number of partitions since partition push-down to HMS failed.
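    
    The fallback mechanism described above can be sketched roughly as follows. This is illustrative only, not the actual HiveShim code; the method name, signature, and the `fallbackEnabled` flag are assumptions:
    
    ```scala
    import java.lang.reflect.InvocationTargetException
    import org.apache.hadoop.hive.metastore.api.MetaException  // Hive client dependency
    
    // Try partition filter push-down first; fetch all partitions only as a
    // conf-gated fallback when the metastore rejects the filter.
    def getPartitionsWithFallback[T](
        getByFilter: () => Seq[T],
        getAllPartitions: () => Seq[T],
        fallbackEnabled: Boolean): Seq[T] = {
      try {
        getByFilter()
      } catch {
        case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] =>
          if (fallbackEnabled) {
            // Push-down failed (e.g. non-string partition column, or direct SQL
            // unavailable); degrade gracefully for this query only.
            getAllPartitions()
          } else {
            // Fail fast so the user can decide whether to enable the fallback.
            throw ex
          }
      }
    }
    ```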


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    **[Test build #97126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97126/testReport)** for PR 22614 at commit [`01e2123`](https://github.com/apache/spark/commit/01e2123a3ddd0cf0c15d5e79c2beafad821a2cf6).


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] Implement a new config to control par...

Posted by kmanamcheri <gi...@git.apache.org>.
Github user kmanamcheri commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    @gatorsmile, @tejasapatil was reviewing the code before I added the new config option. I have asked him to review the new code. Let's see what his thoughts are on that. I have also asked him for clarification on what he means by exponential backoff with retries.
    
    I want to take a step back and revisit [SPARK-17992](https://issues.apache.org/jira/browse/SPARK-17992) and in particular [one of the comments](https://github.com/apache/spark/pull/15673#issuecomment-257120666) from @ericl 
    
    > For large tables, the degraded performance should be considered a bug as well.
    >
    > How about this.
    >
    > If direct sql is disabled, log a warning about degraded performance with this flag and fall back to fetching all partitions.
    > If direct sql is enabled, crash with a message suggesting to disable filesource partition management and report a bug.
    > That way, we will know if there are cases where metastore pruning fails with direct sql enabled.
    
    It looks like a compromise was reached where we don't fall back to fetching all partitions in every case (only for a subset of cases). My suggested fix is a cleaner way of approaching this through a SQLConf instead of inspecting the Hive config. 
    
    Thoughts @mallman @ericl 


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223425011
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -79,12 +82,30 @@ class HiveClientSuite(version: String)
         client = init(true)
       }
     
    -  test(s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false") {
    -    val client = init(false)
    -    val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    -      Seq(attr("ds") === 20170101))
    +  test(s"getPartitionsByFilter returns all partitions when $partPruningFallbackKey=true") {
    +    withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_FALLBACK_ENABLED.key -> "true",
    +        SQLConf.HIVE_METASTORE_PARTITION_PRUNING.key -> "true") {
    +      val client = init(false)
    +      // tryDirectSql = false and a non-string partition filter will always fail. This condition
    +      // is used to test if the fallback works
    +      val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    +        Seq(attr("ds") === 20170101))
     
    -    assert(filteredPartitions.size == testPartitionCount)
    +      assert(filteredPartitions.size == testPartitionCount)
    +    }
    +  }
    +
    +  test(s"getPartitionsByFilter should throw an exception if $partPruningFallbackKey=false") {
    --- End diff --
    
    Change test name to
    
    ```
    s"getPartitionsByFilter should throw an exception when $tryDirectSqlKey=false and $partPruningFallbackKey=false"
    ```
    
    ?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r223424625
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientSuite.scala ---
    @@ -79,12 +82,30 @@ class HiveClientSuite(version: String)
         client = init(true)
       }
     
    -  test(s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false") {
    -    val client = init(false)
    -    val filteredPartitions = client.getPartitionsByFilter(client.getTable("default", "test"),
    -      Seq(attr("ds") === 20170101))
    +  test(s"getPartitionsByFilter returns all partitions when $partPruningFallbackKey=true") {
    --- End diff --
    
    Change test name to
    
    ```
    s"getPartitionsByFilter returns all partitions when $tryDirectSqlKey=false and $partPruningFallbackKey=true"
    ```
    
    ?


---



[GitHub] spark pull request #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilt...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22614#discussion_r222134090
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
    @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
             getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]]
           } else {
             logDebug(s"Hive metastore filter is '$filter'.")
    -        val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL
    -        // We should get this config value from the metaStore. otherwise hit SPARK-18681.
    -        // To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by:
    -        // val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean
    -        val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname,
    -          tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean
    --- End diff --
    
    Ping @wangyum, too.


---



[GitHub] spark issue #22614: [SPARK-25561][SQL] HiveClient.getPartitionsByFilter shou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22614
  
    Merged build finished. Test PASSed.


---
