Posted to issues@spark.apache.org by "Alex Ivanov (JIRA)" <ji...@apache.org> on 2018/11/02 05:18:00 UTC

[jira] [Commented] (SPARK-25925) Spark 2.3.1 retrieves all partitions from Hive Metastore by default

    [ https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672577#comment-16672577 ] 

Alex Ivanov commented on SPARK-25925:
-------------------------------------

[~budde], [~michael], can you please share your thoughts?

> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
>                 Key: SPARK-25925
>                 URL: https://issues.apache.org/jira/browse/SPARK-25925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alex Ivanov
>            Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
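> As a sanity check, the effective value can be read back at runtime (a minimal sketch, assuming an active SparkSession named *spark*):
> {code:java}
> // Returns "INFER_AND_SAVE" under the defaults shown above
> spark.conf.get("spark.sql.hive.caseSensitiveInferenceMode"){code}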
> While the first two properties are fine, the last one has an unfortunate side effect. I realize it is set to INFER_AND_SAVE for a reason, namely https://issues.apache.org/jira/browse/SPARK-19611; however, it causes a problem of its own.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The schema inference causes all of the table's partitions to be retrieved from the Hive Metastore. This is a problem because even running *explain* on a simple query against a table with thousands of partitions appears to hang, and the cause is very difficult to debug.
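> A minimal sketch that reproduces the symptom (the database, table, and partition column names are hypothetical; any Parquet-backed Hive table with thousands of partitions will do):
> {code:java}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("InferAndSaveRepro")
>   .enableHiveSupport()
>   .getOrCreate()
>
> // With caseSensitiveInferenceMode=INFER_AND_SAVE, the first access to the
> // table triggers schema inference, which fetches every partition from the
> // Metastore, so even this simple explain can hang for a long time.
> spark.sql("EXPLAIN SELECT * FROM my_db.my_events WHERE dt = '2018-11-01'").show(false){code}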
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> see that it works, and call it a day, thereby forgoing the benefits of Spark's native Parquet support. In our experience, that workaround causes significant slow-downs on at least some queries.
> This Jira is mostly meant to document the issue, even if it cannot be addressed, so that people who inevitably run into this behavior can find the resolution. That resolution is to change the parameter to *NEVER_INFER*, provided there are no Parquet-Hive schema compatibility issues, i.e. the entire schema is in lower-case.
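> For completeness, a sketch of that mitigation (assuming, as above, that the schema is entirely lower-case, so case-sensitive inference is not needed). In _spark-defaults.conf_:
> {code:java}
> spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER{code}
> or per application, when building the session:
> {code:java}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("NeverInferExample")
>   .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
>   .enableHiveSupport()
>   .getOrCreate(){code}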


