Posted to reviews@spark.apache.org by seancxmao <gi...@git.apache.org> on 2018/08/22 09:31:12 UTC

[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

GitHub user seancxmao opened a pull request:

    https://github.com/apache/spark/pull/22184

    [SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field resolution when reading from Parquet

    ## What changes were proposed in this pull request?
    #22148 introduces a behavior change. We need to document it in the migration guide.
    
    ## How was this patch tested?
    N/A


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/seancxmao/spark SPARK-25132-DOC

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22184
    
----
commit eae8a3c98f146765d25bbf529421ce3c7a92639b
Author: seancxmao <se...@...>
Date:   2018-08-22T09:17:55Z

    [SPARK-25132][SQL][DOC] Case-insensitive field resolution when reading from Parquet

----


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212405373
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    Following your advice, I did a thorough comparison between data source tables and hive serde tables.
    
    Parquet data and tables are created via the following code:
    
    ```scala
    val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
    spark.conf.set("spark.sql.caseSensitive", true)
    data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")
    ```

    ```sql
    CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
    CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
    CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
    CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
    ```
    `spark.sql.hive.convertMetastoreParquet` is set to false:
    
    ```
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
    ```
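
    Each cell in the comparison tables below comes from a probe of this form (a sketch; the selected column and the table vary per row):

    ```scala
    // Example probe for row 14 of the first table: caseSensitive=false, table
    // columns (a, b, c), select `b`. The data source table returns null while the
    // hive serde table returns the values of Parquet column `B`.
    spark.conf.set("spark.sql.caseSensitive", false)
    spark.sql("SELECT b FROM parquet_data_source_lower").show()
    spark.sql("SELECT b FROM parquet_hive_serde_lower").show()
    ```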
    
    Below are the comparison results both without #22148 and with #22148.
    
    The comparison result without #22148:
    
    |no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|resolved by SPARK-25132|
    |---|---|---|---|---|---|---|---|
    |1|true|a, b, c|a|a|a|Y| |
    |2| | |b|null|B|NG| |
    |3| | |c|c|c|Y| |
    |4| | |A|AnalysisException|AnalysisException|Y| |
    |5| | |B|AnalysisException|AnalysisException|Y| |
    |6| | |C|AnalysisException|AnalysisException|Y| |
    |7| |A, B, C|a|AnalysisException|AnalysisException|Y| |
    |8| | |b|AnalysisException|AnalysisException|Y| |
    |9| | |c|AnalysisException|AnalysisException|Y| |
    |10| | |A|null|a|NG| |
    |11| | |B|B|B|Y| |
    |12| | |C|C|c|NG| |
    |13|false|a, b, c|a|a|a|Y| |
    |14| | |b|null|B|NG|Y|
    |15| | |c|c|c|Y| |
    |16| | |A|a|a|Y| |
    |17| | |B|null|B|NG|Y|
    |18| | |C|c|c|Y| |
    |19| |A, B, C|a|null|a|NG|Y|
    |20| | |b|B|B|Y| |
    |21| | |c|C|c|NG| |
    |22| | |A|null|a|NG|Y|
    |23| | |B|B|B|Y| |
    |24| | |C|C|c|NG| |
    
    The comparison result with #22148 applied:
    
    |no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|introduced by SPARK-25132|
    |---|---|---|---|---|---|---|---|
    |1|true|a, b, c|a|a|a|Y| |
    |2| | |b|null|B|NG| |
    |3| | |c|c|c|Y| |
    |4| | |A|AnalysisException|AnalysisException|Y| |
    |5| | |B|AnalysisException|AnalysisException|Y| |
    |6| | |C|AnalysisException|AnalysisException|Y| |
    |7| |A, B, C|a|AnalysisException|AnalysisException|Y| |
    |8| | |b|AnalysisException|AnalysisException|Y| |
    |9| | |c|AnalysisException|AnalysisException|Y| |
    |10| | |A|null|a|NG| |
    |11| | |B|B|B|Y| |
    |12| | |C|C|c|NG| |
    |13|false|a, b, c|a|a|a|Y| |
    |14| | |b|B|B|Y| |
    |15| | |c|RuntimeException|c|NG|Y|
    |16| | |A|a|a|Y| |
    |17| | |B|B|B|Y| |
    |18| | |C|RuntimeException|c|NG|Y|
    |19| |A, B, C|a|a|a|Y| |
    |20| | |b|B|B|Y| |
    |21| | |c|RuntimeException|c|NG| |
    |22| | |A|a|a|Y| |
    |23| | |B|B|B|Y| |
    |24| | |C|RuntimeException|c|NG| |
    
    We can see that data source tables and hive serde tables have two major differences in Parquet field resolution:
    
    * Whether they respect `spark.sql.caseSensitive`. Without #22148, neither data source tables nor hive serde tables respect `spark.sql.caseSensitive`: data source tables always do case-sensitive Parquet field resolution, while hive serde tables always do case-insensitive resolution, no matter how `spark.sql.caseSensitive` is set. #22148 makes data source tables respect `spark.sql.caseSensitive`; hive serde table behavior is unchanged.
    * How ambiguity is resolved in case-insensitive mode. Without #22148, data source tables do case-sensitive resolution and return the column with the matching letter case, while hive serde tables always return the lower-case column. #22148 makes data source tables throw an exception when there is ambiguity (see the sketch after this list); hive serde table behavior is unchanged.
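
    To make the ambiguity case concrete, here is a minimal probe against the tables created above (rows 15/18/21/24 of the second table):

    ```scala
    // With #22148 and spark.sql.caseSensitive=false, table column `c` matches both
    // Parquet columns `c` and `C`, so the data source reader throws instead of
    // silently picking one; the hive serde reader still returns the lower-case
    // Parquet column `c`.
    spark.conf.set("spark.sql.caseSensitive", false)
    spark.sql("SELECT c FROM parquet_data_source_lower").show()  // RuntimeException
    spark.sql("SELECT c FROM parquet_hive_serde_lower").show()   // values of Parquet column `c`
    ```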
    
    WRT parquet field resolution, shall we make hive serde table behavior consistent with data source table behavior? What do you think?


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213426538
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    Making 1, 2 consistent is enough. : )


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212533706
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    We should respect `spark.sql.caseSensitive` in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior. 


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @HyukjinKwon Thank you for your comments. Yes, this behavior change is only valid when upgrading from Spark 2.3 to 2.4. I will do it.


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212849852
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    We rely on the hive parquet serde to read hive parquet tables, and I don't think we are able to change it. The only way I can think of to make it consistent between data source table and hive table is to make sure `spark.sql.hive.convertMetastoreParquet` always work.


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212662840
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    First, we should not change the behavior of hive tables. It inherits many behaviors from Hive and let's keep it as it was.
    
    Second, why we treat it as a behavior change? I think it's a bug that we don't respect `spark.sql.caseSensitive` in field resolution. In general we should not add a config to restore a bug.
    
    I don't think this document is helpful. It explains a subtle and unreasonable behavior to users, which IMO just make them confused.


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213581563
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to false, so it uses the Hive serde reader instead of the Spark reader; sorry if that is confusing here.
    I guess you mean 1 and 3 :). I understand now.
    
    If we are not going to backport the PR to 2.3, should I close SPARK-25206 as well?


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    > if spark.sql.hive.convertMetastoreParquet and spark.sql.caseSensitive are both set to true, we throw an exception
    
    I'd like to just skip the conversion and log a warning message to say why.
    
    > ... which is not consistent
    
    I think it's OK. In the end, they are different data sources and can define their own behaviors.
    
    But you do have a point about `spark.sql.hive.convertMetastoreParquet`: the behavior must be consistent to do the conversion. My proposal is that the parquet data source should provide an option (not a SQL conf) to switch the behavior when hitting duplicate field names in case-insensitive mode. When converting a hive parquet table to the parquet data source, set that option and ask the parquet data source to pick the first matched field.
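
    A rough sketch of what such a per-query option could look like; the option name `onDuplicateFields` is illustrative, not an existing Spark API:

    ```scala
    // Hypothetical per-query option on the parquet data source. When Spark converts
    // a hive parquet table, it would set this option so the reader picks the first
    // matched field instead of failing, matching the hive serde behavior.
    spark.read
      .format("parquet")
      .option("onDuplicateFields", "FIRST_MATCH")  // hypothetical option name
      .load("/user/hive/warehouse/parquet_data")
    ```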


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213569148
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    https://github.com/apache/spark/pull/22184#discussion_r212405373 already shows they are not consistent, right?


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213135626
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    For Hive tables, column resolution is always case insensitive. However, when `spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent behaviors when they use the native parquet reader to resolve columns in case-sensitive mode. We still introduce behavior changes. Better error messages sound good enough, instead of disabling `spark.sql.hive.convertMetastoreParquet` when the mode is case sensitive.  cc @cloud-fan


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @srowen Sorry for the late reply! I'd like to close this PR and file a new one since our SQL doc has changed a lot. Thank you all for your comments and time!


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @seancxmao is this PR still live?


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213386126
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    > For Spark native parquet tables that were created by us, this is a bug fix because the previous work does not respect spark.sql.caseSensitive; for the parquet tables created by Hive, the field resolution should be consistent no matter whether it is using our reader or Hive parquet reader. 
    
    @gatorsmile, I need to confirm with you: regarding consistency, we have several kinds of tables.
    
    1. parquet table created by Spark (using parquet) read by Spark reader
    2. parquet table created by Spark (using hive) read by Spark reader
    3. parquet table created by Spark (using hive) read by Hive reader
    4. parquet table created by Hive read by Spark reader
    5. parquet table created by Hive read by Hive reader
    
    Do you want all of them to be consistent? Or is making 2, 3, 4, 5 consistent enough?


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213426988
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    BTW, the parquet table could be generated by our `DataFrameWriter`. Thus, the physical schema and the logical schema could still have different letter cases.
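
    For example, this is exactly how the test data earlier in this thread was produced:

    ```scala
    // DataFrameWriter preserves the letter case of DataFrame column names in the
    // Parquet footer, so a metastore schema declared as (a, b, c) can differ in
    // case from the physical schema (a, B, c, C).
    spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
      .write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")
    ```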


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao closed the pull request at:

    https://github.com/apache/spark/pull/22184


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    **[Test build #95262 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95262/testReport)** for PR 22184 at commit [`eae8a3c`](https://github.com/apache/spark/commit/eae8a3c98f146765d25bbf529421ce3c7a92639b).


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @cloud-fan OK, I will do it. 
    
    Just to confirm: when reading from a hive parquet table, if `spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both set to true, we throw an exception to tell users they should not do this because it could lead to inconsistent results. Is my understanding correct?
    
    Another thing to confirm: when there is ambiguity in case-insensitive mode, the native parquet reader throws an exception while the hive serde reader returns the first matched field, which is not consistent. Is that OK?



---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212834530
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    In general, my suggestion is to respect `spark.sql.caseSensitive` for both readers. Is that technically possible?


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    **[Test build #95262 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95262/testReport)** for PR 22184 at commit [`eae8a3c`](https://github.com/apache/spark/commit/eae8a3c98f146765d25bbf529421ce3c7a92639b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    Since we are not going to backport the PR to 2.3, do we still need this migration guide?
    
    Strictly speaking, we do have a behavior change here: hive tables are always case-insensitive, and we should not read a hive parquet table with the native parquet reader if Spark is in case-sensitive mode. @seancxmao can you send a follow-up PR to do it? Thanks!


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    See `ParquetOptions`. An option can be specified per query, while a SQL conf is per session.
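
    For illustration, here is the difference with a real parquet option that `ParquetOptions` handles (`mergeSchema`):

    ```scala
    // Per-session: a SQL conf applies to every subsequent query in the session.
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    // Per-query: a data source option applies only to this read.
    spark.read.option("mergeSchema", "true").parquet("/user/hive/warehouse/parquet_data")
    ```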


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @gatorsmile Could you kindly help trigger Jenkins and review?


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212834477
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    @cloud-fan We need to keep the behaviors consistent no matter whether we use the Hive serde reader or our native parquet reader. In the PR https://github.com/apache/spark/pull/22148, we already introduced a change for hive tables when `spark.sql.hive.convertMetastoreParquet` is set to true, right?
    
    For Spark native parquet tables that were created by us, this is a bug fix, because the previous behavior does not respect `spark.sql.caseSensitive`; for parquet tables created by Hive, the field resolution should be consistent no matter whether it uses our reader or the Hive parquet reader. Most end users do not know the difference between the Hive serde reader and the native parquet reader.


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212006137
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    This is a behavior change. I am not sure whether we should backport it to 2.3.2. How about sending a note to the dev mailing list? 
    
    BTW, this only affects data source tables. What about hive serde tables? Are they consistent?
    
    Could you add a test case? Create a table with syntax like `CREATE TABLE ... STORED AS PARQUET`. You also need to turn off `spark.sql.hive.convertMetastoreParquet` in the test case.
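
    A minimal sketch of such a test, assuming the `SQLTestUtils`/`TestHiveSingleton` helpers used in Spark's Hive suites (the table name and data are illustrative):

    ```scala
    test("SPARK-25132: field resolution for parquet hive serde tables") {
      withTempDir { dir =>
        withSQLConf("spark.sql.hive.convertMetastoreParquet" -> "false",
                    "spark.sql.caseSensitive" -> "false") {
          withTable("t") {
            // physical Parquet column is upper-case `A`; metastore column is lower-case `a`
            spark.range(5).selectExpr("id AS A").write.mode("overwrite").parquet(dir.getCanonicalPath)
            sql(s"CREATE TABLE t (a LONG) STORED AS PARQUET LOCATION '${dir.getCanonicalPath}'")
            checkAnswer(sql("SELECT a FROM t"), (0 until 5).map(i => Row(i.toLong)))
          }
        }
      }
    }
    ```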


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95262/


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @cloud-fan I've just sent a PR (#22343) for this.


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213519348
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    @gatorsmile I think 1 and 2 are always consistent. They both use the Spark reader. Am I wrong?
    > parquet table created by Spark (using parquet) read by Spark reader
    > parquet table created by Spark (using hive) read by Spark reader


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @cloud-fan @gatorsmile I think the old `Upgrading From Spark SQL 2.3.1 to 2.3.2 and above` section is not needed since we do not backport SPARK-25132 to branch-2.3. I'm wondering whether we need `Upgrading From Spark SQL 2.3 to 2.4 and above`. What do you think?


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r213020789
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    As a followup to cloud-fan's point, I did a deep dive into the read path of parquet hive serde tables. Following is a rough invocation chain:
    
    ```
    org.apache.spark.sql.hive.execution.HiveTableScanExec
    org.apache.spark.sql.hive.HadoopTableReader (extends org.apache.spark.sql.hive.TableReader)
    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat (extends org.apache.hadoop.mapred.FileInputFormat)
    org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper (extends org.apache.hadoop.mapred.RecordReader)
    parquet.hadoop.ParquetRecordReader
    parquet.hadoop.InternalParquetRecordReader
    org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport (extends parquet.hadoop.api.ReadSupport)
    ```
    
    Finally, `DataWritableReadSupport#getFieldTypeIgnoreCase` is invoked. 
    
    https://github.com/JoshRosen/hive/blob/release-1.2.1-spark2/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L79-L95
    
    This is why parquet hive serde tables always do case-insensitive field resolution. However, this is a class inside `org.spark-project.hive:hive-exec:1.2.1.spark2`.
    
    I also found the related Hive JIRA ticket:
    [HIVE-7554: Parquet Hive should resolve column names in case insensitive manner](https://issues.apache.org/jira/browse/HIVE-7554)
    
    BTW:
    * org.apache.hadoop.hive.ql = org.spark-project.hive:hive-exec:1.2.1.spark2
    * parquet.hadoop = com.twitter:parquet-hadoop-bundle:1.6.0
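
    For reference, the effect of that method, paraphrased in Scala (illustrative only, not the hive-exec source):

    ```scala
    // DataWritableReadSupport#getFieldTypeIgnoreCase matches a requested field name
    // against the file schema ignoring case and returns the first hit, which is why
    // hive serde tables silently pick one column when names collide.
    def fieldTypeIgnoreCase(fileSchema: Seq[(String, String)], name: String): Option[String] =
      fileSchema.collectFirst { case (n, t) if n.equalsIgnoreCase(name) => t }
    ```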


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    ok to test


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212533857
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    Could you add a test case for the one you did?


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    @seancxmao, so this behaviour-change description is only valid when we upgrade Spark 2.3 to 2.4? Then we can add it in `Upgrading From Spark SQL 2.3 to 2.4`.


---



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22184#discussion_r212894532
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
       - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
     
    +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
    +
    +  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched.
    --- End diff --
    
    As a followup, I also investigated ORC. Below are some results, just FYI.
    
    * https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185
    * https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593194&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593194


---



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

Posted by seancxmao <gi...@git.apache.org>.
Github user seancxmao commented on the issue:

    https://github.com/apache/spark/pull/22184
  
    > My proposal is that the parquet data source should provide an option (not a SQL conf) to ...
    
    You mentioned this option is not a SQL conf. Could you give me some advice on where this option should be defined? I had thought to define it in SQLConf as something like `spark.sql.parquet.onDuplicatedFields` = FAIL | FIRST_MATCH, since I see a bunch of options starting with `spark.sql.parquet` in SQLConf.
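
    A sketch of the two candidate homes for this switch; both names are hypothetical and neither exists in Spark:

    ```scala
    val path = "/user/hive/warehouse/parquet_data"
    // 1) Session-wide SQL conf, as suggested above:
    spark.conf.set("spark.sql.parquet.onDuplicatedFields", "FIRST_MATCH")  // hypothetical
    // 2) Per-query data source option, as cloud-fan proposes (see ParquetOptions):
    spark.read.option("onDuplicatedFields", "FIRST_MATCH").parquet(path)   // hypothetical
    ```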


---
