Posted to issues@spark.apache.org by "Chenxiao Mao (JIRA)" <ji...@apache.org> on 2018/08/28 16:21:00 UTC

[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC

    [ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595225#comment-16595225 ] 

Chenxiao Mao commented on SPARK-25175:
--------------------------------------

After a deep dive into the ORC file read paths (data source native, data source hive, hive serde), I realized that this is a little complicated. I'm not sure whether it's technically possible to make all three read paths consistent with respect to case sensitivity, because we rely on the hive InputFormat/SerDe, which we might not be able to change.

Please also see [~cloud_fan]'s comment on Parquet: [https://github.com/apache/spark/pull/22184/files#r212849852]

So I changed the title of this Jira to reduce the scope. This ticket aims to make the ORC data source native implementation consistent with the Parquet data source. The gap is that field resolution should fail when there is ambiguity in case-insensitive mode when reading from ORC. Does that make sense?

As for duplicate fields with different letter cases, we don't have real use cases; they exist only for testing purposes.
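The resolution rule being proposed can be sketched in plain Python (this is an illustration, not Spark's actual Scala implementation; the function name and error message are made up): in case-sensitive mode require an exact match, and in case-insensitive mode resolve a unique case-insensitive match but fail on ambiguity, matching what SPARK-25132 did for Parquet.

```python
def resolve_field(requested, physical_fields, case_sensitive):
    """Resolve `requested` against an ORC file's physical field names.

    Illustrative sketch of the proposed rule; returns the matched
    physical name, None if absent, and raises on ambiguity in
    case-insensitive mode.
    """
    if case_sensitive:
        return requested if requested in physical_fields else None
    matches = [f for f in physical_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        # The behavior this ticket proposes: ambiguity is an error
        # rather than silently picking the first match.
        raise RuntimeError(
            'Found duplicate field(s) "%s": %s in case-insensitive mode'
            % (requested, matches))
    return matches[0] if matches else None

# An ORC file written with duplicate fields differing only in case:
fields = ["a", "A", "b"]
print(resolve_field("b", fields, case_sensitive=False))  # "b"
print(resolve_field("A", fields, case_sensitive=True))   # "A"
# resolve_field("a", fields, case_sensitive=False) raises RuntimeError
```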

 

> Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25175
>                 URL: https://issues.apache.org/jira/browse/SPARK-25175
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Chenxiao Mao
>            Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar, but not identical, issues. Spark has two OrcFileFormat implementations.
>  * Since SPARK-2883, Spark has supported ORC inside the sql/hive module, with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat returns the first matched field rather than failing the read operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If an ORC data file has more fields than the table schema, we simply cannot read hive serde tables. If it does not, hive serde tables always resolve fields by ordinal rather than by name.
> Both the ORC data source hive impl and hive serde tables rely on the hive orc InputFormat/SerDe to read tables. I'm not sure whether we can change the underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl consistent with Parquet data source.
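The by-ordinal vs. by-name distinction the description draws for hive serde tables can be sketched as follows (plain Python with made-up field data, not Spark or Hive code): ordinal resolution silently attaches values to the wrong columns when the file's field order differs from the table schema, while name-based resolution either finds the right field or returns nothing.

```python
# Physical ORC file whose fields are in a different order than the table.
file_fields = ["c", "a", "b"]
file_row = {"c": 3, "a": 1, "b": 2}
table_schema = ["a", "b", "c"]

def read_by_ordinal(schema, fields, row):
    # The i-th table column reads the i-th file field, regardless of
    # its name -- silently wrong when the order differs.
    return {col: row[fields[i]] for i, col in enumerate(schema)}

def read_by_name(schema, row):
    # Name-based resolution pairs each column with the same-named field.
    return {col: row.get(col) for col in schema}

print(read_by_ordinal(table_schema, file_fields, file_row))
# {'a': 3, 'b': 1, 'c': 2} -- values attached to the wrong columns
print(read_by_name(table_schema, file_row))
# {'a': 1, 'b': 2, 'c': 3}
```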



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org