You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by seancxmao <gi...@git.apache.org> on 2018/09/05 15:11:18 UTC

[GitHub] spark pull request #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must b...

GitHub user seancxmao opened a pull request:

    https://github.com/apache/spark/pull/22343

    [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the conversion

    ## What changes were proposed in this pull request?
    parquet data source tables and hive parquet tables have different behaviors about parquet field resolution. So, when `spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent behaviors. The differences are:
    * Whether respect spark.sql.caseSensitive. Without #22148, both data source tables and hive tables do NOT respect spark.sql.caseSensitive. However data source tables always do case-sensitive parquet field resolution, while hive tables always do case-insensitive parquet field resolution no matter whether spark.sql.caseSensitive is set to true or false. #22148 let data source tables respect spark.sql.caseSensitive while hive serde table behavior is not changed.
    * How to resolve ambiguity in case-insensitive mode. Without #22148, data source tables do case-sensitive resolution and return columns with the corresponding letter cases, while hive tables always return the first matched column ignoring cases. #22148 let data source tables throw exception when there is ambiguity while hive table behavior is not changed.
    
    This PR aims to make behaviors consistent when converting hive table to data source table.
    * The behavior must be consistent to do the conversion, so we skip the conversion in case-sensitive mode because hive parquet table always do case-insensitive field resolution.
    * When converting hive parquet table to parquet data source, we switch the duplicated fields resolution mode to ask parquet data source to pick the first matched field - the same behavior as hive parquet table - to keep behaviors consistent.
    
    ## How was this patch tested?
    Unit tests added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/seancxmao/spark SPARK-25132-CONSISTENCY

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22343.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22343
    
----
commit 95673cdacb1da65d7ac45c212a365e167ae0d713
Author: seancxmao <se...@...>
Date:   2018-09-04T08:00:31Z

    [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the conversion

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Could you close this PR and JIRA, @seancxmao ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22343#discussion_r216218261
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
    @@ -69,12 +69,25 @@ class ParquetOptions(
         .get(MERGE_SCHEMA)
         .map(_.toBoolean)
         .getOrElse(sqlConf.isParquetSchemaMergingEnabled)
    +
    +  /**
    +   * How to resolve duplicated field names. By default, parquet data source fails when hitting
    +   * duplicated field names in case-insensitive mode. When converting hive parquet table to parquet
    +   * data source, we need to ask parquet data source to pick the first matched field - the same
    +   * behavior as hive parquet table - to keep behaviors consistent.
    +   */
    +  val duplicatedFieldsResolutionMode: String = {
    +    parameters.getOrElse(DUPLICATED_FIELDS_RESOLUTION_MODE,
    --- End diff --
    
    I agree this is a little unusual. Usually we have a SQL config first, then we create an option for it if necessary. In this case, we are not adding a config/option from user's requirement, but we need it for an internal optimization.
    
    If we can I would suggest we make it an internal option. But anyway we shouldn't rush to add a SQL config, until we get requirement from users.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Hi, @dongjoon-hyun
    When we find duplicated field names in the case of convertMetastoreXXX, we have 2 options
    (1) raise exception as parquet data source. To most of end users, they do not know the difference between hive parquet table and parquet data source. If the conversion leads to different behaviors, they may be confused, and in some cases even lead to tricky data issues silently.
    (2) Adjust behaviors of parquet data source to keep behaviors consistent. This seems more friendly to end users, and avoid any potential issues introduced by the conversion.
    
    BTW, for parquet data source which is not converted from hive parquet table, we raise exception when there is ambiguity, sine this is more intuitive and reasonable.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22343#discussion_r216204114
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
    @@ -69,12 +69,25 @@ class ParquetOptions(
         .get(MERGE_SCHEMA)
         .map(_.toBoolean)
         .getOrElse(sqlConf.isParquetSchemaMergingEnabled)
    +
    +  /**
    +   * How to resolve duplicated field names. By default, parquet data source fails when hitting
    +   * duplicated field names in case-insensitive mode. When converting hive parquet table to parquet
    +   * data source, we need to ask parquet data source to pick the first matched field - the same
    +   * behavior as hive parquet table - to keep behaviors consistent.
    +   */
    +  val duplicatedFieldsResolutionMode: String = {
    +    parameters.getOrElse(DUPLICATED_FIELDS_RESOLUTION_MODE,
    --- End diff --
    
    whether we have a SQL config for it or not, we must define an option here. The conversion happens per-query, so we must have a per-query option to switch the behavior, instead of a per-session SQL config.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    I agree that correctness is more important. If we should not make behaviors consistent when do the convertion, I will close this PR. @cloud-fan @gatorsmile what do you think?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    What I asked was the following, wasn't it? 
    > In case-insensitive mode, when converting hive parquet table to parquet data source, we switch the duplicated fields resolution mode to ask parquet data source to pick the first matched field - the same behavior as hive parquet table - to keep behaviors consistent.
    
    Spark should not pick up the first matched field in any cases because it's considered as a correctness behavior in previous PR which is backported to `branch-2.3` https://github.com/apache/spark/pull/22183. I don't think we need to follow incorrect Hive behavior.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22343#discussion_r216212552
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala ---
    @@ -1390,7 +1395,11 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
         catalystSchema = new StructType(),
     
         expectedSchema = ParquetSchemaConverter.EMPTY_MESSAGE,
    -    caseSensitive = true)
    +    conf = {
    +      val conf = new Configuration()
    +      conf.setBoolean(SQLConf.CASE_SENSITIVE.key, true)
    --- End diff --
    
    @cloud-fan There is no default value for `spark.sql.caseSensitive` in Configuration. Let me explain in more details below.
    
    This is one of the overloaded methods of `testSchemaClipping`. I tried to give this `testSchemaClipping` method a default conf, however Scalac complains that 
    > in class ParquetSchemaSuite, multiple overloaded alternatives of testSchemaClipping define default arguments
    
    https://github.com/apache/spark/blob/95673cdacb1da65d7ac45c212a365e167ae0d713/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L1014-L1019
    
    https://github.com/apache/spark/blob/95673cdacb1da65d7ac45c212a365e167ae0d713/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L1028-L1033
    
    It seems a little confusing, because these two methods have different parameter types. After a brief investigation, I found Scala compiler simply disallows overloaded methods with default arguments even when these methods have different parameter types.
    
    https://stackoverflow.com/questions/4652095/why-does-the-scala-compiler-disallow-overloaded-methods-with-default-arguments


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Thank YOU for your PR and open discussion on this, @seancxmao . Let's see in another PRs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Sure, close this PR. Thank you all for your time and insights.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Could we see this as a behavior change? We can add a legacy conf (e.g. `spark.sql.hive.legacy.convertMetastoreParquet`, may be defined in HiveUtils) to enable users to revert back to the previous behavior for backward compatibility. If this legacy conf is set to true, behaviors will be reverted both in case-sensitive and case-insensitive mode.
    
    |caseSensitive|legacy behavior|new behavior|
    |-|-|-|
    |true|convert anyway|skip conversion, log warning message|
    |false|convert, fail if there's ambiguity|convert, first match if there's ambiguity|


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao closed the pull request at:

    https://github.com/apache/spark/pull/22343


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    @seancxmao, mind fixing the PR title BTW? For instance, looks unclear which behaviour you mean in the PR title.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    @dongjoon-hyun does the orc conversion need the same fix?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22343#discussion_r216216422
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
    @@ -69,12 +69,25 @@ class ParquetOptions(
         .get(MERGE_SCHEMA)
         .map(_.toBoolean)
         .getOrElse(sqlConf.isParquetSchemaMergingEnabled)
    +
    +  /**
    +   * How to resolve duplicated field names. By default, parquet data source fails when hitting
    +   * duplicated field names in case-insensitive mode. When converting hive parquet table to parquet
    +   * data source, we need to ask parquet data source to pick the first matched field - the same
    +   * behavior as hive parquet table - to keep behaviors consistent.
    +   */
    +  val duplicatedFieldsResolutionMode: String = {
    +    parameters.getOrElse(DUPLICATED_FIELDS_RESOLUTION_MODE,
    --- End diff --
    
    The conversion itself happens per query but my impression is that the different values don't usually happen in per-query. I mean, I was wondering if users want to set this query by query.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    **[Test build #95864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)** for PR 22343 at commit [`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

Posted by seancxmao <gi...@git.apache.org>.

Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    @dongjoon-hyun It is a little complicated. There has been a discussion about this in #22184. Below are some key comments from @cloud-fan and @gatorsmile, just FYI.
    
    * https://github.com/apache/spark/pull/22184#discussion_r212834477
    * https://github.com/apache/spark/pull/22184#issuecomment-416885728


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22343
  
    Thank you for pinging me. I'll take a look tomorrow, @cloud-fan .
    
    BTW, @seancxmao . Can we handle this `convertMetastoreXXX` case in a new JIRA issue? `The behavior must be consistent to do the conversion` doesn't look good to me because it's not complete as a single patch title.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org