Posted to reviews@spark.apache.org by budde <gi...@git.apache.org> on 2017/02/03 22:41:01 UTC

[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

GitHub user budde opened a pull request:

    https://github.com/apache/spark/pull/16797

    [SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution

    ## What changes were proposed in this pull request?
    
    **Summary**
    - Add spark.sql.parquet.caseInsensitiveResolution config option
    - Add caseInsensitive option to ParquetReadSupport.clipParquetType
    - Add ParquetIOSuite test
    - Disable Parquet filter push-down when using case-insensitive field resolution
    
    
    **Details**
    
    [*Copied from SPARK-19455*](https://issues.apache.org/jira/browse/SPARK-19455)
    
    [SPARK-16980](https://issues.apache.org/jira/browse/SPARK-16980) removed the schema inference from the HiveMetastoreCatalog class when converting a MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in favor of simply using the schema returned by the metastore. This is an optimization, as the underlying file statuses no longer need to be resolved until after the partition pruning step, significantly reducing the number of files to be touched in some cases. The downside is that the data schema used may no longer match the underlying file schema for case-sensitive formats such as Parquet.
    
    This change initially included a [patch to ParquetReadSupport](https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284) that attempted to remedy this conflict by using a case-insensitive fallback mapping when resolving field names during the schema clipping step. [SPARK-18333](https://issues.apache.org/jira/browse/SPARK-18333) later removed this patch after [SPARK-17183](https://issues.apache.org/jira/browse/SPARK-17183) added support for embedding a case-sensitive schema as a Hive Metastore table property. AFAIK the assumption here was that the data schema obtained from the Metastore table property will be case-sensitive and should match the Parquet schema exactly.
    
    The problem arises when dealing with Parquet-backed tables for which this schema has not been embedded as a table attribute and for which the underlying files contain case-sensitive field names. This will happen for any Hive table that was not created by Spark or was created by a Spark version prior to 2.1.0. We've seen Spark SQL return no results for any query containing a case-sensitive field name against such tables.
    
    The change we're proposing is to introduce a configuration parameter that will re-enable case-insensitive field name resolution in ParquetReadSupport. This option will also disable filter push-down for Parquet, since the filter predicate constructed by Spark SQL contains the lower-cased field names and Parquet will return 0 records when filtering them against a case-sensitive column name. I was hoping to find a way to construct the filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the Configuration object passed to this class to the underlying InternalParquetRecordReader class.
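
    For illustration, the new option could be declared alongside the existing `spark.sql.parquet.*` entries in SQLConf. This is a hedged sketch following the `SQLConfigBuilder` pattern visible in the review diffs later in this thread, not the patch's exact code:

    ```scala
    // Sketch only: the key name comes from the summary above; everything else
    // mirrors how other boolean options are declared in SQLConf.
    val PARQUET_CASE_INSENSITIVE_RESOLUTION =
      SQLConfigBuilder("spark.sql.parquet.caseInsensitiveResolution")
        .doc("When true, resolve Parquet fields against the catalog schema " +
          "case-insensitively. Filter push-down is disabled in this mode.")
        .booleanConf
        .createWithDefault(false)
    ```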
    
    ## How was this patch tested?
    
    This patch re-introduces a unit test in ParquetSchemaSuite.scala verifying that case-insensitive schema clipping behaves as expected. It also introduces a ParquetIOSuite unit test that constructs a case-insensitive catalog table and ensures case-sensitive Parquet data can still be queried.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/budde/spark SPARK-19455

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16797
    
----
commit 5426271946419a9defb59bb84575501bc8296578
Author: Budde <bu...@amazon.com>
Date:   2017-02-02T07:34:15Z

    [SPARK-19455][SQL] Add option for case-insensitive Parquet field resolution
    
    - Add spark.sql.parquet.caseInsensitiveResolution config option
    - Add caseInsensitive option to ParquetReadSupport.clipParquetType
    - Add ParquetIOSuite test
    - Disable Parquet filter push-down when using case-insensitive field resolution

----


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)** for PR 16797 at commit [`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Relevant part of [Jenkins output](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/console) for SparkR tests:
    
    ```
    Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
    error in evaluating the argument 'object' in selecting a method for function 'summary': Error: object 'kmeansModel' not found
    ```
    
    Doesn't appear to be related to this change. I'll investigate and see if I can reproduce it locally.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    retest this please


[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16797#discussion_r99455204
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -249,10 +249,18 @@ object SQLConf {
     
       val PARQUET_VECTORIZED_READER_ENABLED =
         SQLConfigBuilder("spark.sql.parquet.enableVectorizedReader")
    -      .doc("Enables vectorized parquet decoding.")
    +      .doc("Ingnores case sensitivity differences in field names while resolving Parquet columns " +
    --- End diff --
    
    Why change this doc?


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    We are waiting for https://github.com/apache/spark/pull/16799. 


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    >> Like you said, users can still create a hive table with mixed-case-schema parquet/orc files, by hive or other systems like presto. This table is readable for hive, and for Spark prior to 2.1, because of the runtime schema inference. But this is not intentional, and Spark should not support it as the data file schema and table schema mismatch.
    >
    > I will continue to argue strongly against reducing the number of use cases Spark SQL supports out of the box. While a migration command can offer a helpful optimization, I don't think it is acceptable as the only option for the reasons I've detailed here.
    >
    > Simply put, I think relying on the presence of Spark-specific key/value pairs in the table properties in order for Spark SQL to function properly and assuming that Spark (or Spark users) can easily alter those properties to add the table schema is too brittle for large-scale production use.
    
    I would have to agree with @budde in this case. In versions of Spark prior to 2.1, an effort was made to reconcile metastore and file format case mismatches using the method `ParquetFileFormat.mergeMetastoreParquetSchema`. The doc comment for that method is here: https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719. I don't see anything there that suggests this was a "hack" or was intended to be removed in a later version. It seems we've simply broken compatibility with a certain class of Hive tables in Spark 2.1.
    
    Schema inference is very expensive, and doing it at query time on large tables was painful in versions prior to Spark 2.1 because all metadata files were read. But it seems some people were using it nonetheless and found it useful. At least in Spark 2.1, only the files for partitions read in a query will be read for schema inference. That would significantly enhance the schema inference performance at query time for partitioned tables.
    
    Incidentally, what happens when a program outside of Spark (such as Hive) updates the Hive metastore schema of a table with the embedded Spark SQL schema? Does Spark detect that change and update the embedded schema? Does it have to redo the schema inference across all files in the table?


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical?
    
    Can we write such schema (conflicting columns after lower-casing) into metastore?



[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16797#discussion_r99457604
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -268,13 +292,23 @@ private[parquet] object ParquetReadSupport {
        * @return A list of clipped [[GroupType]] fields, which can be empty.
        */
       private def clipParquetGroupFields(
    -      parquetRecord: GroupType, structType: StructType): Seq[Type] = {
    -    val parquetFieldMap = parquetRecord.getFields.asScala.map(f => f.getName -> f).toMap
    +      parquetRecord: GroupType,
    +      structType: StructType,
    +      caseInsensitive: Boolean): Seq[Type] = {
    +    val parquetFieldMap = {
    +      val pairs = parquetRecord.getFields.asScala.map(f => f.getName -> f)
    +      implicit val ordering = if (caseInsensitive) {
    +        Ordering.by[String, String](_.toLowerCase)
    +      } else {
    +        Ordering.String
    +      }
    +      TreeMap(pairs: _*)
    --- End diff --
    
    If we don't need sort, I think standard Map should be enough.
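
    A minimal sketch of that suggestion, with the key lower-cased once up front (assuming no ordering is actually needed):

    ```scala
    // Plain Map variant: normalize keys when building the map. Lookups must
    // lower-case the catalog field name the same way when caseInsensitive is set.
    val parquetFieldMap = parquetRecord.getFields.asScala.map { f =>
      val key = if (caseInsensitive) f.getName.toLowerCase else f.getName
      key -> f
    }.toMap
    ```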


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72343/
    Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    The proposal to restore schema inference with finer-grained control over when it is performed sounds reasonable to me. The case I'm most interested in is turning off schema inference entirely, because we do not use parquet files with upper-case characters in their column names.
    
    BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical?


[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by budde <gi...@git.apache.org>.
Github user budde closed the pull request at:

    https://github.com/apache/spark/pull/16797


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Thanks for all the feedback on this PR, folks. I'm going to close this PR/JIRA and open new ones for enabling configurable schema inference as a fallback. I'll ping each of you who has been active in this discussion on the new PR once it is open.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    I agree that bringing back schema inference would be cleaner. One problem with doing something parquet-specific is that this would need to be repeated with each file format, e.g. orc, csv, json, unless we made case sensitivity an API option as noted above.
    
    Per @viirya 's suggestion, it seems ideal to only do schema inference for older tables, and provide some command to update the table schema to preserve the case. This way we don't regress performance on 2.1 datasets or have to worry about schemas changing during optimization.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    For better user experience, we should automatically infer the schema and write it back to the metastore if there is no case-sensitive table schema in the metastore. This has the cost of detecting when schema inference is needed and complicates the table read code path.
    
    If this is only a compatibility issue, I think it's fair to ask the cluster maintainers to run some commands after upgrading the Spark cluster. Even if there are a lot of tables, it's easy to write a script to automate it.
    
    > I wouldn't make the assumption that Spark is being used to create and maintain the tables in the Hive Metastore that Spark is querying against. We're currently using Spark to add and update metastore tables in our use case, but I don't think Spark should make any assumptions about how the table was created or what properties may be set.
    
    Actually we do have this assumption... When we create a table with Spark, we will put a lot of Spark-specific table properties. When we read a table, if the Spark-specific table properties are there, we will treat it differently. If there are no Spark-specific table properties, we assume this table was created by hive (not by external systems like Presto), so the schema of parquet files should be all lower-cased.
    
    
    Another proposal is to make the parquet reader case-insensitive, so that we can solve this problem without schema inference. But the problem is, Spark can be configured to be case-sensitive, so it's possible to write such a schema (conflicting columns after lower-casing) into the metastore. I think this proposal is the best if we can make Spark totally case-insensitive.
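
    As a minimal sketch of the conflict described above (the output path is a placeholder):

    ```scala
    // With case sensitivity enabled, two columns that collide once lower-cased
    // are accepted, so such a schema can end up persisted.
    spark.conf.set("spark.sql.caseSensitive", "true")
    val df = Seq((1, 2)).toDF("field", "FIELD") // distinct names only when case-sensitive
    df.write.parquet("/tmp/colliding_schema")   // placeholder path
    ```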


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Looks like SparkR unit tests have been failing for all or most PRs after [this commit](https://github.com/apache/spark/commit/48aafeda7db879491ed36fff89d59ca7ec3136fa).


[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16797#discussion_r99455967
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -249,10 +249,18 @@ object SQLConf {
     
       val PARQUET_VECTORIZED_READER_ENABLED =
         SQLConfigBuilder("spark.sql.parquet.enableVectorizedReader")
    -      .doc("Enables vectorized parquet decoding.")
    +      .doc("Ingnores case sensitivity differences in field names while resolving Parquet columns " +
    --- End diff --
    
    My bad. Don't know how that got in. Fixing.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > Can we write such schema (conflicting columns after lower-casing) into metastore?
    
    I think the scenario here would be that the metastore contains a single lower-case column name that could resolve to multiple case-sensitive column names in the underlying Parquet file. This could've happened via the user manually executing a ```CREATE TABLE...``` query with an explicit schema. Since the metastore itself isn't really defining expected behavior in this case, I think we can just consider this undefined behavior and return the first field that matches alphabetically.
    
    I don't think this is very likely to be a legitimate use case, but it's good to point out edge cases.
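
    As an illustrative check (not part of this patch), such collisions could be detected before resolution:

    ```scala
    import scala.collection.JavaConverters._
    import org.apache.parquet.schema.GroupType

    // Group Parquet field names by their lower-cased form; any group with more
    // than one member is ambiguous under case-insensitive resolution.
    def ambiguousFields(parquetRecord: GroupType): Map[String, Seq[String]] =
      parquetRecord.getFields.asScala.map(_.getName)
        .groupBy(_.toLowerCase)
        .filter { case (_, names) => names.size > 1 }
    ```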


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Pinging @ericl, @cloud-fan and @davies, committers who have all reviewed or submitted changes related to this.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    @budde Spark does support mixed-case-schema tables, and it always has. It's because we write the table schema to the metastore case-preserving, via table properties. When we read a table, we get the schema from the metastore and assume it's the schema of the table data files. So the data file schema must match the table schema, or Spark will fail; it has always been this way.
    
    However, there is one exception. There are 2 kinds of tables in Spark: data source tables and hive serde tables (we have different SQL syntax to create them). Data source tables are totally managed by Spark; we read/write data files directly and only use the hive metastore as a persistence layer, which means data source tables are not compatible with hive: hive can't read/write them.
    
    Hive serde tables should be compatible with hive, and we use the hive api to read/write them. For any table, as long as hive can read it, Spark can read it. However, the exception is, for parquet and orc formats, we will read data files directly, as an optimization (reading via the hive api is slow). Before Spark 2.1, we saved the schema to the hive metastore directly, which means the schema was lower-cased. W.r.t. this, ideally we should not support mixed-case-schema parquet/orc data files for this kind of table, or the data schema will mismatch the table schema. But we supported it, with the cost of runtime schema inference.
    
    This problem was solved in Spark 2.1 by writing the table schema to the metastore case-preserving for hive serde tables. Now we can say that the data schema must match the table schema, or Spark should fail.
    
    Then comes this problem: for parquet/orc format hive serde tables created by Spark prior to 2.1, the data file schema may not match the table schema, but we still need to support it for compatibility.
    
    That's why I prefer the migration command approach; it keeps the concept clean: data schema must match table schema.
    
    Like you said, users can still create a hive table with mixed-case-schema parquet/orc files, by hive or other systems like presto. This table is readable for hive, and for Spark prior to 2.1, because of the runtime schema inference. But this is not intentional, and Spark should not support it as the data file schema and table schema mismatch. We can make the migration command cover this case too.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Is it purely a compatibility issue? It seems like the only problem is when we write out mixed-case-schema parquet files directly, create an external table pointing to these files with Spark prior to 2.1, and then read this table with Spark 2.1+.
    
    For tables in hive, as long as hive can read it, Spark should be able to read it too.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)** for PR 16797 at commit [`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    I'll double check, but I don't think ```spark.sql.hive.manageFilesourcePartitions=false``` would solve this issue since we're still deriving the file relation's dataSchema parameter from the schema of MetastoreRelation. The call to ```fileFormat.inferSchema()``` has been removed entirely.
    
    If Spark SQL is set on using a table property to store the case-sensitive schema, then I think having a way to backfill this property for existing < 2.1 tables, as well as tables not created or managed by Spark, will be a necessity. If the cleanest way to deal with this case sensitivity problem is to bring back schema inference, then I think a good option would be to introduce a configuration param to indicate whether or not an inferred schema should be written back to the table as a property.
    
    We could also introduce another config param that allows a user to bypass schema inference even if a case-sensitive schema can't be read from the table properties. This could be helpful for users who would like to query external Hive tables that aren't managed by Spark and that they know aren't backed by files containing case-sensitive field names.
    
    This would basically allow us to support the following use cases:
    
    1) The MetastoreRelation is able to read a case-sensitive schema from the table properties. No inference is necessary.
    2) The MetastoreRelation can't read a case-sensitive schema from the table properties. A case-sensitive schema is inferred and, if configured, written back as a table property.
    3) The MetastoreRelation can't read a case-sensitive schema from the table properties. The user knows the underlying data files don't contain case-sensitive field names and has explicitly set a config param to skip the inference step.
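
    A sketch of that decision flow; the helper names (`schemaFromTableProperties`, `inferSchemaFromFiles`, `saveSchemaAsTableProperty`) and the two flags are illustrations, not this patch's API:

    ```scala
    // Case 1: a case-sensitive schema is stored in the table properties.
    // Case 2: no stored schema; infer one and optionally write it back.
    // Case 3: no stored schema; the user opted out of inference.
    val dataSchema: StructType = schemaFromTableProperties(table) match {
      case Some(caseSensitiveSchema) =>
        caseSensitiveSchema
      case None if !skipInference =>
        val inferred = inferSchemaFromFiles(table)
        if (writeBackInferredSchema) saveSchemaAsTableProperty(table, inferred)
        inferred
      case None =>
        table.schema // fall back to the (lower-cased) metastore schema
    }
    ```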


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72326 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72326/testReport)** for PR 16797 at commit [`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by mallman <gi...@git.apache.org>.
Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    BTW @budde, given that this represents a regression in behavior from previous versions of Spark, I think it is too generous of you to label the Jira issue as an "improvement" instead of a "bug". I would support you changing the type to "bug" if you want to.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)** for PR 16797 at commit [`fd4c444`](https://github.com/apache/spark/commit/fd4c4440e34fedf1a5f08850da8b203c71d9071c).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)** for PR 16797 at commit [`a993928`](https://github.com/apache/spark/commit/a993928baba66e6ab6e592094012acdb4ce428ff).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by ericl <gi...@git.apache.org>.
Github user ericl commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > I'll double check, but I don't think spark.sql.hive.manageFilesourcePartitions=false would solve this issue since we're still deriving the file relation's dataSchema parameter from the schema of MetastoreRelation. The call to fileFormat.inferSchema() has been removed entirely.
    
    Makes sense. I guess that would have to be restored as an option.
    
    > If Spark SQL is set on using a table property to store the case-sensitive schema, then I think having a way to backfill this property for existing < 2.1 tables, as well as tables not created or managed by Spark, will be a necessity. If the cleanest way to deal with this case sensitivity problem is to bring back schema inference, then I think a good option would be to introduce a configuration param to indicate whether or not an inferred schema should be written back to the table as a property.
    
    Yeah, the table property was added in 2.1 (see HiveExternalCatalog:getSchemaFromTableProperties).
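
    For reference, a rough sketch of how such a split-property schema could be reassembled; the property key names here are assumptions for illustration, not taken from this thread:

    ```scala
    import org.apache.spark.sql.types.{DataType, StructType}

    // Rebuild the schema JSON from numbered property parts, if present.
    def schemaFromProps(props: Map[String, String]): Option[StructType] =
      props.get("spark.sql.sources.schema.numParts").map { n =>
        val json = (0 until n.toInt)
          .map(i => props(s"spark.sql.sources.schema.part.$i"))
          .mkString
        DataType.fromJson(json).asInstanceOf[StructType]
      }
    ```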
    
    > We could also introduce another config param that allows a user to bypass schema inference even if a case-sensitive schema can't be read from the table properties. This could be helpful for users who would like to query external Hive tables that aren't managed by Spark and that they know aren't backed by files containing case-sensitive field names.
    > 
    > This would basically allow us to support the following use cases:
    > 
    > The MetastoreRelation is able to read a case-sensitive schema from the table properties. No inference is necessary.
    > The MetastoreRelation can't read a case-sensitive schema from the table properties. A case-sensitive schema is inferred and, if configured, written back as a table property.
    > The MetastoreRelation can't read a case-sensitive schema from the table properties. The user knows the underlying data files don't contain case-sensitive field names and has explicitly set a config param to skip the inference step.
    
    Should we roll these behaviors into one flag?
    e.g. `spark.sql.mixedCaseSchemaSupport` could have a few modes: INFER_IF_NEEDED (default), NEVER_INFER (the third use case above), and FAIL_FAST (instead of falling back to inference, throw an exception).
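
    One hedged shape for that flag (the mode names come from this comment; the dispatch and helpers are illustrative):

    ```scala
    object MixedCaseSchemaMode extends Enumeration {
      val INFER_IF_NEEDED, NEVER_INFER, FAIL_FAST = Value
    }

    // Hypothetical dispatch when no case-sensitive schema is stored.
    def resolveDataSchema(mode: MixedCaseSchemaMode.Value): StructType = mode match {
      case MixedCaseSchemaMode.INFER_IF_NEEDED => inferSchemaFromFiles(table)
      case MixedCaseSchemaMode.NEVER_INFER     => table.schema
      case MixedCaseSchemaMode.FAIL_FAST       =>
        throw new IllegalStateException("No case-sensitive schema in table properties")
    }
    ```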


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    OK, if we think we should support tables created by hive (or other systems) even when the data schema mismatches the table schema (but matches once lower-cased), I'm OK with falling back to schema inference when reading.
    
    > Incidentally, what happens when a program outside of Spark (such as Hive) updates the Hive metastore schema of a table with the embedded Spark SQL schema?
    
    I think we will have a bug. But this is not a regression, right?


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    How about we add a new SQL command to refresh the table schema in the metastore by inferring the schema from the data files? This is a compatibility issue, and we should have provided a way for users to migrate before the 2.1 release. I think this approach is much simpler than adding a flag.

    For tables created by hive, as hive is a case-insensitive system, will the parquet files have mixed-case schemas?


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72347 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72347/testReport)** for PR 16797 at commit [`a993928`](https://github.com/apache/spark/commit/a993928baba66e6ab6e592094012acdb4ce428ff).


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > How about we add a new SQL command to refresh the table schema in the metastore by inferring the schema from the data files? This is a compatibility issue, and we should have provided a way for users to migrate before the 2.1 release. I think this approach is much simpler than adding a flag.
    
    While I think introducing a command for inferring and storing the table's case-sensitive schema as a property would be a welcome addition, I think requiring this property to be present in order for Spark SQL to function properly with case-sensitive data files could really restrict the settings Spark SQL can be used in.
    
    If a user wanted to use Spark SQL to query over an existing warehouse containing hundreds or even thousands of tables, under the suggested approach a Spark job would have to be run to infer the schema of each and every table. Even though file formats such as Parquet store their schemas as metadata, there could still be millions of files to inspect across the warehouse. A less amenable format like JSON might require scanning all the data in the warehouse.
    
    This also doesn't cover the use case @viirya pointed out where the user may not have write access to the metastore they are querying against. In this case, the user would have to rely on the warehouse administrator to create the Spark schema property for every table they wish to query.
    
    > For tables created by hive, as hive is a case-insensitive system, will the parquet files have mixed-case schemas?
    
    I think the Hive Metastore has become a bit of an open standard for maintaining a data warehouse catalog since so many tools integrate with it. I wouldn't assume that the underlying data pointed to by an external metastore was created or managed by Hive itself. For example, we maintain a Hive Metastore that catalogs case-sensitive files written by our Spark-based ETL pipeline, which parses case classes from string data and writes them as Parquet.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > Is it purely a compatibility issue? It seems like the only problem is when we write out mixed-case-schema parquet files directly, create an external table pointing to these files with Spark prior to 2.1, and then read this table with Spark 2.1+.
    
    Fundamentally, I wouldn't make the assumption that Spark is being used to create and maintain the tables in the Hive Metastore that Spark is querying against. We're currently using Spark to add and update metastore tables in our use case, but I don't think Spark should make any assumptions about how the table was created or what properties may be set.
    
    In regard to the underlying issue, we've been using Spark in production for over two years and have several petabytes of case-sensitive Parquet data we've both written and queried using Spark. As of Spark 2.1, we are no longer able to use Spark to query any of this data as any query containing a case-sensitive field name will return 0 results. I would argue this is a compatibility regression.
    
    > For tables in hive, as long as hive can read it, Spark should be able to read it too.
    
    In our case, other Hive-compatible query engines like Presto don't have a problem with case-sensitive Parquet files. I haven't tried Hive itself in a long time, but as far as I remember we didn't have a problem there either.


[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > I'll double check, but I don't think spark.sql.hive.manageFilesourcePartitions=false would solve this issue since we're still deriving the file relation's dataSchema parameter from the schema of MetastoreRelation. The call to fileFormat.inferSchema() has been removed entirely.
    
    I think it is correct. `spark.sql.hive.manageFilesourcePartitions` won't control whether to infer the schema or not.
    
    The approach @ericl suggested (`spark.sql.hive.mixedCaseSchemaSupport`) sounds good to me.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    @cloud-fan:
    
    > Spark does support mixed-case-schema tables, and it always has. It's because we write the table schema to the metastore case-preserving, via table properties.
    
    Spark prior to 2.1 supported *any* case-sensitive table, regardless of what table properties are set. Spark 2.1 supports these tables if and only if Spark 2.1 was used to create them and embedded the schema as a metadata property.
    
    > So the data file schema must match the table schema, or Spark will fail; it has always been this way.
    
    This is absolutely not how it's always been. Spark would infer the schema from the source files and use that schema when constructing a logical relation. We've been relying on this behavior for years.
    
    > For any table, as long as hive can read it, Spark can read it.
    
    I've double checked this and Hive can query against tables backed by case-sensitive Parquet files. Spark 2.1 is currently the only Hive-compatible query engine I'm familiar with that won't support this usecase.
    
    > But we supported it, with the cost of runtime schema inference.
    
    My argument is that it should be possible to fall back to this level of support if the properties aren't present.
    
    > This problem was solved in Spark 2.1 by writing the table schema to the metastore case-preserving for hive serde tables. Now we can say that the data schema must match the table schema, or Spark should fail.
    
    Spark does not explicitly fail in this case. It falls back to the downcased metastore schema, which will silently fail and return 0 results if a case-sensitive field name is used in your projection or filter predicate.
    
    > That's why I prefer the migration command approach; it keeps the concept clean: data schema must match table schema.
    
    This links Spark upgrades to potentially-costly data migrations. From an end-user perspective, prior to 2.1 you could simply point Spark SQL to an external Hive metastore and query any data in it. Now you have to make sure the table has been migrated to the appropriate version of Spark or your queries may silently return incorrect results.
    
    The migration approach also assumes that Spark has write access to the Hive metastore it is querying against. If you have read-only access to a metastore administered by another team or organization you are at their mercy to run migrations on your behalf against the latest version of Spark in order to allow you to query their tables from Spark. I think anybody who's ever found themselves in a similar situation can attest that it's never good to be beholden to someone else to enable a feature that only matters to you.
    
    And again, in some cases migrating all tables in a large Hive warehouse could be an extremely expensive operation that potentially touches petabytes of data.
    
    > Like you said, users can still create a Hive table with mixed-case-schema parquet/orc files, by Hive or other systems like Presto. This table is readable for Hive, and for Spark prior to 2.1, because of the runtime schema inference. But this is not intentional, and Spark should not support it, as the data file schema and table schema mismatch. We can make the migration command cover this case too.
    
    I will continue to argue strongly against reducing the number of use cases Spark SQL supports out of the box. While a migration command can offer a helpful optimization, I don't think it is acceptable as the only option, for the reasons I've detailed here.
    
    Simply put, I think relying on the presence of Spark-specific key/value pairs in the table properties in order for Spark SQL to function properly and assuming that Spark (or Spark users) can easily alter those properties to add the table schema is too brittle for large-scale production use.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72343/testReport)** for PR 16797 at commit [`5426271`](https://github.com/apache/spark/commit/5426271946419a9defb59bb84575501bc8296578).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    **[Test build #72352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72352/testReport)** for PR 16797 at commit [`fd4c444`](https://github.com/apache/spark/commit/fd4c4440e34fedf1a5f08850da8b203c71d9071c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    If the use case @budde suggested, where we want to infer the schema but not attempt to write it back as a property, makes sense, then the new SQL command approach might not work for it. In that case, though, it would still be necessary to ask someone who is permitted to migrate and update the table property.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Bringing back schema inference is certainly a much cleaner option, although I imagine doing this in the old manner would negate the performance improvements brought by #14690 for any non-Spark 2.1 dataset.
    
    Ideally, I think we would infer the schema only from the pruned partition list for tables we can't read a case-sensitive schema for. Unless I'm mistaken, this would have to happen during optimization of the logical plan, after the PruneFileSourcePartitions rule has been applied. My thought is that we could write a rule that passes the pruned file list to the file format's inferSchema() method to replace the HadoopFsRelation's dataSchema with the result. I'm not very familiar with Catalyst though, so I'm not sure if changing the relation's schema during optimization will cause problems.
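    
    Something like this is what I have in mind, strictly as a rough sketch (`needsInferredSchema` is a hypothetical predicate, and reconciling the relation's output attributes is glossed over entirely):
    
    ```scala
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
    
    object InferDataSchemaAfterPruning extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case rel @ LogicalRelation(fsRelation: HadoopFsRelation, _, _)
            if needsInferredSchema(rel) =>
          // After PruneFileSourcePartitions has run, the FileIndex should only
          // report the surviving files, keeping inference relatively cheap.
          val files = fsRelation.location.listFiles(Nil).flatMap(_.files)
          fsRelation.fileFormat
            .inferSchema(fsRelation.sparkSession, fsRelation.options, files)
            .map { dataSchema =>
              rel.copy(relation =
                fsRelation.copy(dataSchema = dataSchema)(fsRelation.sparkSession))
            }
            .getOrElse(rel)
      }
    
      // Hypothetical: true only when no case-sensitive schema can be read from
      // the table properties. Left unimplemented in this sketch.
      private def needsInferredSchema(rel: LogicalRelation): Boolean = ???
    }
    ```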
    
    There is [an open PR to add support for case-insensitive schemas to Parquet](https://github.com/apache/parquet-mr/pull/210), which would be helpful here: it would provide a way to avoid schema inference when your Parquet files have case-sensitive fields but you don't care about case sensitivity when querying. Unfortunately, the PR seems to have been more or less abandoned.
    
    Pinging @mallman, the author of #14690, to see if he has any input on this.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > Should we roll these behaviors into one flag? e.g. ```spark.sql.hive.mixedCaseSchemaSupport```
    
    That sounds reasonable to me. The only thing I wonder about is whether there's any use case where we want to infer the schema but not attempt to write it back as a property, say if the external metastore doesn't permit table property updates from the user. We can always just log the failure, but this could be noisy for users expecting this behavior by default. This could be solved by adding an INFER_WITHOUT_SAVING mode, as sketched below.
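    
    Roughly what I'm picturing for the modes (all names here are placeholders, not a committed API):
    
    ```scala
    // Placeholder sketch of the inference-mode setting discussed above; none
    // of these names are final.
    object SchemaInferenceMode extends Enumeration {
      val INFER_AND_SAVE,       // infer from files, write back as a table property
          INFER_WITHOUT_SAVING, // infer every time, never touch table properties
          NEVER_INFER = Value   // 2.1.0 behavior: trust the metastore schema as-is
    }
    ```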
    
    I'll leave the PR open for now so we can hear and discuss @mallman's input, but if we're all on board with this approach I'll eventually close this out in favor of a new PR adding configurable schema inference behavior.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > For better user experience, we should automatically infer the schema and write it back to the metastore, if there is no case-sensitive table schema in the metastore. This has the cost of detecting the need for schema inference, and complicates the table read code path.
    
    Totally agree. I think the default behavior should be to infer and backfill a case-sensitive schema into the table properties if one isn't already there. An option should also be provided to disable all inference and just fall back to the case-insensitive metastore schema if none is found (i.e. the current behavior in 2.1.0).
    
    > If this is only a compatibility issue, I think it's fair to ask the cluster maintainers to run some commands after upgrading the Spark cluster. Even if there are a lot of tables, it's easy to write a script to automate it.
    
    I don't think this is fair. For one, as I've mentioned, in some cases Spark may not be the tool being used to maintain the metastore. This will now require the warehouse admins to set up a Spark cluster and run these migration commands on every table with case-sensitive underlying data if they'd like those tables to be accessible from Spark. As a second point, while writing an automation script may be trivial, the execution costs aren't, especially if the data is stored in a format like JSON where each and every record in the table must be read in order to infer the schema.
    
    > If there are no Spark-specific table properties, we assume the table was created by Hive (not by external systems like Presto), so the schema of the parquet files should be all lowercased.
    
    This isn't an assumption made by Spark prior to 2.1.0, whether this was an explicit decision or not. All I'm asking for is a way to configure Spark to continue supporting a use case it has supported for years.
    
    Also, in our case, the table was created by Spark, not Presto. Presto is just an example of another execution engine we've put in front of our warehouse that hasn't had a problem with the underlying Parquet data being case-sensitive. We just used an older version of Spark to create the tables. I would think long and hard about whether requiring warehouse admins to run potentially-costly migrations between Spark versions to update table metadata is preferable to offering a way to remain backwards-compatible with the old behavior.
    
    Again, I think introducing a mechanism to migrate the table properties is a good idea. I just don't think it should be the only option.
    
    > Another proposal is to make the parquet reader case-insensitive, so that we can solve this problem without schema inference. But the problem is, Spark can be configured to be case-sensitive, so it's possible to write such a schema (conflicting columns after lower-casing) into the metastore. I think this proposal is the best if we can totally make Spark case-insensitive.
    
    I don't think this would be a bad option if it could be enabled at the Parquet level, but it seems their work toward enabling case-insensitive file access has stalled. As @ericl pointed out above, handling this at the ParquetReadSupport level may make the situation better for Parquet, but the behavior won't be consistent across file formats like ORC or JSON.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical?
    
    I can take a look at how Spark handled this prior to 2.1, although I'm not sure whether the behavior we'll see there was the result of a conscious decision or is simply "undefined" behavior.
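    
    For anyone who wants to poke at this in the meantime, here's a quick probe sketch (the path is made up, and I haven't verified what the read side does):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("dup-cols-probe")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    
    // Write a Parquet file whose column names collide after lower-casing.
    // Case sensitivity must be enabled for the analyzer to accept both columns.
    spark.conf.set("spark.sql.caseSensitive", "true")
    Seq((1, 2)).toDF("field", "FIELD").write.parquet("/tmp/dup_cols_probe")
    
    // Reading it back with case-insensitive resolution is where any
    // "undefined" behavior would show up.
    spark.conf.set("spark.sql.caseSensitive", "false")
    spark.read.parquet("/tmp/dup_cols_probe").printSchema()
    ```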



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72326/
    Test FAILed.



[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16797#discussion_r99456138
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -268,13 +292,23 @@ private[parquet] object ParquetReadSupport {
        * @return A list of clipped [[GroupType]] fields, which can be empty.
        */
       private def clipParquetGroupFields(
    -      parquetRecord: GroupType, structType: StructType): Seq[Type] = {
    -    val parquetFieldMap = parquetRecord.getFields.asScala.map(f => f.getName -> f).toMap
    +      parquetRecord: GroupType,
    +      structType: StructType,
    +      caseInsensitive: Boolean): Seq[Type] = {
    +    val parquetFieldMap = {
    +      val pairs = parquetRecord.getFields.asScala.map(f => f.getName -> f)
    +      implicit val ordering = if (caseInsensitive) {
    +        Ordering.by[String, String](_.toLowerCase)
    +      } else {
    +        Ordering.String
    +      }
    +      TreeMap(pairs: _*)
    --- End diff --
    
    The implicit ordering val determines whether the TreeMap behaves in a case-sensitive manner or not. I can rework this to use standard Maps if you feel that's more appropriate or clearer.



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72347/
    Test PASSed.



[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16797#discussion_r99455758
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -268,13 +292,23 @@ private[parquet] object ParquetReadSupport {
        * @return A list of clipped [[GroupType]] fields, which can be empty.
        */
       private def clipParquetGroupFields(
    -      parquetRecord: GroupType, structType: StructType): Seq[Type] = {
    -    val parquetFieldMap = parquetRecord.getFields.asScala.map(f => f.getName -> f).toMap
    +      parquetRecord: GroupType,
    +      structType: StructType,
    +      caseInsensitive: Boolean): Seq[Type] = {
    +    val parquetFieldMap = {
    +      val pairs = parquetRecord.getFields.asScala.map(f => f.getName -> f)
    +      implicit val ordering = if (caseInsensitive) {
    +        Ordering.by[String, String](_.toLowerCase)
    +      } else {
    +        Ordering.String
    +      }
    +      TreeMap(pairs: _*)
    --- End diff --
    
    Any reason we need to use a TreeMap here? Do we need to sort the fields?



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72352/
    Test PASSed.



[GitHub] spark pull request #16797: [SPARK-19455][SQL] Add option for case-insensitiv...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16797#discussion_r99458106
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -268,13 +292,23 @@ private[parquet] object ParquetReadSupport {
        * @return A list of clipped [[GroupType]] fields, which can be empty.
        */
       private def clipParquetGroupFields(
    -      parquetRecord: GroupType, structType: StructType): Seq[Type] = {
    -    val parquetFieldMap = parquetRecord.getFields.asScala.map(f => f.getName -> f).toMap
    +      parquetRecord: GroupType,
    +      structType: StructType,
    +      caseInsensitive: Boolean): Seq[Type] = {
    +    val parquetFieldMap = {
    +      val pairs = parquetRecord.getFields.asScala.map(f => f.getName -> f)
    +      implicit val ordering = if (caseInsensitive) {
    +        Ordering.by[String, String](_.toLowerCase)
    +      } else {
    +        Ordering.String
    +      }
    +      TreeMap(pairs: _*)
    --- End diff --
    
    In the current implementation, we exploit the TreeMap's key ordering: when case-insensitive resolution is configured, keys are compared by their downcased form, so lookups match regardless of case.
    
    We only use the values in the map at one point though, so I'll just roll it back to using a standard Map and add an if/else for downcasing the key when creating/accessing the map.
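    
    Concretely, the rolled-back version would look something like this (a standalone sketch, not the exact patch):
    
    ```scala
    import scala.collection.JavaConverters._
    import org.apache.parquet.schema.{GroupType, Type}
    
    // Build the lookup keyed by the (optionally downcased) Parquet field name.
    // Note: if two field names collide after downcasing, toMap keeps the last one.
    def parquetFieldMap(parquetRecord: GroupType, caseInsensitive: Boolean): Map[String, Type] = {
      val fields = parquetRecord.getFields.asScala
      if (caseInsensitive) {
        fields.map(f => f.getName.toLowerCase -> f).toMap
      } else {
        fields.map(f => f.getName -> f).toMap
      }
    }
    
    // Lookups must apply the same normalization to the requested field name.
    def lookupField(fieldMap: Map[String, Type], name: String, caseInsensitive: Boolean): Option[Type] =
      fieldMap.get(if (caseInsensitive) name.toLowerCase else name)
    ```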



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    Shall we just infer the schema from the files whenever we access a Hive table that was not created by Spark, or was created by a version prior to 2.1.0, i.e., any table for which the schema has not been embedded as a table attribute?



[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

Posted by budde <gi...@git.apache.org>.
Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    @mallman The Parquet schema merging methods take me back to #5214 :)
    
    I haven't been following the changes here very closely, but I would guess use of this method was replaced by the previously-used call to ```ParquetFileFormat.inferSchema()```. I think it's important to point out, though, that this functionality was explicitly added to support case sensitivity differences.
    
    In regard to the JIRA, I'll either modify it or replace it with a new one for bringing back (configurable) inference, and I can mark it as a "bug" at that point.
    


