Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2017/10/22 18:07:04 UTC

[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/19552

    [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

    ## What changes were proposed in this pull request?
    
    In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical issue by default. 
    
    - [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611) made `INFER_AND_SAVE` the default in 2.2.0 because Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files.
    
      > This situation will occur for any Hive table that wasn't created by Spark or that was created prior to Spark 2.1.0. If a user attempts to run a query over such a table containing a case-sensitive field name in the query projection or in the query filter, the query will return 0 results in every case.
    
    - However, [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306) reports that this also corrupts the Hive Metastore schema by removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing the owner. These are undesirable side effects. Hive Metastore is a shared resource and Spark should not corrupt it by default.
    
    - Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS at least look okay there. However, we still need to figure out the issue of changing owners. Also, we cannot backport the bucketing patch into `branch-2.2`. We need to verify this option with more tests before releasing 2.3.0.
    
    This PR proposes to restore the default of that option to `NEVER_INFER`, as it was before Spark 2.2.0. Users can still opt in to `INFER_AND_SAVE` at their own risk.
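    
    As a rough illustration of that opt-in, the sketch below sets the option explicitly when building a session. Only the option name and the value `INFER_AND_SAVE` come from this PR; the object name and application name are placeholders, and the snippet assumes a Spark build with Hive support on the classpath.
    
    ```scala
    import org.apache.spark.sql.SparkSession

    // Sketch only: opt in to INFER_AND_SAVE explicitly instead of relying on the default.
    // "InferenceModeExample" and "inference-mode-example" are hypothetical names.
    object InferenceModeExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("inference-mode-example")
          .enableHiveSupport()
          .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
          .getOrCreate()

        // ... run queries against Hive tables here ...

        spark.stop()
      }
    }
    ```
    
    Pinning the value per application keeps the behavior independent of whichever default a given Spark release ships.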
    
    ## How was this patch tested?
    
    Pass the existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-22329

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19552.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19552
    
----
commit a256627dbc2772e69cd0f9f2aa43b384165e3657
Author: Dongjoon Hyun <do...@apache.org>
Date:   2017-10-22T17:59:15Z

    [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19552#discussion_r146385329
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -388,7 +388,7 @@ object SQLConf {
         .stringConf
         .transform(_.toUpperCase(Locale.ROOT))
         .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
    -    .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
    +    .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
    --- End diff --
    
    We can improve the documentation instead of changing the default. 
    
    If my understanding is right, this occurs only when Spark SQL tries to read a table created by other systems.
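    
    For readers unfamiliar with the builder chain in the diff above, here is a minimal standalone sketch of what it appears to do: upper-case the supplied value with `Locale.ROOT`, check it against the allowed mode names, and fall back to a default when nothing is set. `InferenceMode`, `InferenceModeConfSketch`, and `resolve` are hypothetical stand-ins rather than Spark's classes, and `INFER_ONLY` is included as the third mode even though this thread only discusses the other two. Returning an `Either` is a simplification chosen for this sketch.
    
    ```scala
    import java.util.Locale

    // Hypothetical stand-in for HiveCaseSensitiveInferenceMode.
    object InferenceMode extends Enumeration {
      val INFER_AND_SAVE = Value("INFER_AND_SAVE")
      val INFER_ONLY = Value("INFER_ONLY")
      val NEVER_INFER = Value("NEVER_INFER")
    }

    object InferenceModeConfSketch {
      // Default proposed by this pull request.
      val default: String = InferenceMode.NEVER_INFER.toString

      // Allowed values, mirroring checkValues(values.map(_.toString)).
      private val validValues: Set[String] =
        InferenceMode.values.iterator.map(_.toString).toSet

      // Mirrors transform(_.toUpperCase(Locale.ROOT)) followed by checkValues(...).
      def resolve(userValue: Option[String]): Either[String, String] =
        userValue match {
          case None => Right(default)
          case Some(raw) =>
            val normalized = raw.toUpperCase(Locale.ROOT)
            if (validValues.contains(normalized)) Right(normalized)
            else Left(s"Invalid mode '$raw'; expected one of " +
              validValues.toSeq.sorted.mkString(", "))
        }
    }
    ```
    
    Under this sketch, changing the default touches only the `default` line, which matches the one-line nature of the diff.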


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    I'm closing this since #19622 has been merged.


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    Thank you for the review, @gatorsmile and @budde.


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    **[Test build #82961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82961/testReport)** for PR 19552 at commit [`a256627`](https://github.com/apache/spark/commit/a256627dbc2772e69cd0f9f2aa43b384165e3657).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82961/


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    **[Test build #82961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82961/testReport)** for PR 19552 at commit [`a256627`](https://github.com/apache/spark/commit/a256627dbc2772e69cd0f9f2aa43b384165e3657).


---



[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun closed the pull request at:

    https://github.com/apache/spark/pull/19552


---



[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19552#discussion_r146416338
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -388,7 +388,7 @@ object SQLConf {
         .stringConf
         .transform(_.toUpperCase(Locale.ROOT))
         .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
    -    .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
    +    .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
    --- End diff --
    
    ```INFER_AND_SAVE``` was introduced to fix the issues presented in [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611) that broke any table without the Spark-embedded table schema. This would break any table not created with Spark 2.0 or above, so it included tables created by older versions of Spark SQL (this was the situation we ran into).
    
    Some issues with how this affects other Hive table properties were uncovered in [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306). These problems are resolved by falling back to the previous default of ```NEVER_INFER``` that was used prior to Spark 2.2.0. This means that, out of the box, Spark still won't be compatible with Hive tables backed by case-sensitive data files that weren't created by Spark SQL 2.0 or above, but it will avoid mangling existing Hive table properties. This is meant as a short-term fix until I can go back and debug/resolve the conflicts that are occurring.
    
    I think these issues have highlighted how brittle it is to rely on Spark-specific Hive table properties, especially since it's impossible to predict how other frameworks will use the table properties themselves. Still, I don't think there's any better way of doing this, and we may just have to deal with conflicts like this as they arise.


---



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19552
  
    Ping, @budde .
    You can override this PR whenever you're ready.


---
