Posted to issues@spark.apache.org by "Alex Ivanov (JIRA)" <ji...@apache.org> on 2018/11/03 00:06:00 UTC

[jira] [Comment Edited] (SPARK-25925) Spark 2.3.1 retrieves all partitions from Hive Metastore by default

    [ https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673842#comment-16673842 ] 

Alex Ivanov edited comment on SPARK-25925 at 11/3/18 12:05 AM:
---------------------------------------------------------------

Thank you for the clarification, [~budde]. This all makes sense, and it seems like the lesser of the two evils, i.e. correctness over performance.

Perhaps this can be a suitable documentation change. Right now the only mention of *spark.sql.hive.caseSensitiveInferenceMode* in the [Spark SQL Programming Guide|https://spark.apache.org/docs/latest/sql-programming-guide.html] is in the [Upgrading From Spark SQL 2.1 to 2.2|https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-21-to-22] section. If this information were also provided in the [Hive metastore Parquet table conversion|https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion] section, it would be much clearer to users that they should consider setting this property to *NEVER_INFER* if they don't have a mixed-case Parquet schema.
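For what it's worth, the added note could be as small as this (a sketch; the value descriptions paraphrase my understanding of the three modes):

{code:java}
// spark.sql.hive.caseSensitiveInferenceMode accepts three values:
//   INFER_AND_SAVE (default) - infer the case-sensitive schema from the
//                              underlying data files and save it back to
//                              the table properties
//   INFER_ONLY               - infer the schema on access, never save it
//   NEVER_INFER              - trust the (lower-cased) metastore schema
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
{code}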

Would you be OK with that change?



> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
>                 Key: SPARK-25925
>                 URL: https://issues.apache.org/jira/browse/SPARK-25925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alex Ivanov
>            Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate side-effect. I realize it's set to *INFER_AND_SAVE* for a reason, namely https://issues.apache.org/jira/browse/SPARK-19611; however, it also causes an issue.
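> For context, the correctness problem that *INFER_AND_SAVE* guards against looks roughly like this (a hypothetical sketch; the table, column, and path names are made up). Parquet field names are case-sensitive, but the Hive metastore lower-cases them, so without inference a mixed-case Parquet column may not resolve:
> {code:java}
> import spark.implicits._
>
> // Write Parquet data with a mixed-case column name.
> Seq((1L, "a")).toDF("userId", "value").write.parquet("/warehouse/events_raw")
>
> // The metastore stores the column lower-cased as 'userid'.
> spark.sql("""CREATE EXTERNAL TABLE events (userid BIGINT, value STRING)
>              STORED AS PARQUET LOCATION '/warehouse/events_raw'""")
>
> // With NEVER_INFER, 'userid' may not match the Parquet field 'userId',
> // and the column can read back as NULLs.
> spark.sql("SELECT userid FROM events").show()
> {code}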
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all of the table's partitions to be retrieved from the Hive Metastore. This is a problem because even running *explain* on a simple query against a table with thousands of partitions seems to hang, and the behavior is very difficult to debug.
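> A rough way to observe this (a sketch; the table and partition column are hypothetical, and it assumes a Hive Parquet table with thousands of partitions):
> {code:java}
> spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
>
> // Schema inference fetches all partitions from the metastore on first
> // access, so even this plan-only statement can appear to hang.
> spark.sql("EXPLAIN SELECT count(*) FROM events WHERE dt = '2018-11-01'").show(false)
> {code}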
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> see that it works, and call it a day, thereby forgoing the benefits of Spark's native Parquet support. In our experience, that causes significant slow-downs on at least some queries.
> This Jira is mostly meant to document the issue, even if it cannot be addressed, so that people who inevitably run into this behavior can find the resolution: change the parameter to *NEVER_INFER*, provided there are no Parquet-Hive schema compatibility concerns, i.e. the entire schema is lower-case.
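> One way to sanity-check that precondition before flipping the setting (a sketch; the path is hypothetical):
> {code:java}
> // Confirm the on-disk Parquet schema is entirely lower-case, so that
> // skipping inference (NEVER_INFER) cannot introduce case-mismatch issues.
> val fields = spark.read.parquet("/warehouse/events_raw").schema.fieldNames
> require(fields.forall(n => n == n.toLowerCase),
>   s"Mixed-case columns found: ${fields.mkString(", ")}")
>
> spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
> {code}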


