Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/07/03 00:28:06 UTC

[jira] [Updated] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

     [ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-8501:
------------------------------
    Target Version/s: 1.4.1, 1.5.0  (was: 1.5.0, 1.4.2)

> ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-8501
>                 URL: https://issues.apache.org/jira/browse/SPARK-8501
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: Hive 0.13.1
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Critical
>
> Not sure whether this should be considered a bug in the ORC version bundled with Hive 0.13.1: for an ORC file containing zero rows, the schema written in its footer contains zero fields (i.e. {{struct<>}}).
> To reproduce this issue, first produce an empty ORC file.  Copy the data file {{sql/hive/src/test/resources/data/files/kv1.txt}} from the Spark code repo to {{/tmp/kv1.txt}} (just a randomly picked, simple test data file), then run the following lines in the Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/000000_0
> Structure for /user/hive/warehouse_hive13/bar/000000_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive13/bar/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to store table schemas.  But for users who read raw data files with Spark SQL 1.4.0 and no Hive metastore, it causes problems, because the ORC data source currently picks an arbitrary part-file (whichever comes first) for schema discovery.
> Expected behavior could be:
> # Try the part-files one by one until we find one with a non-empty schema.
> # Throw an {{AnalysisException}} if no such part-file can be found.
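
The proposed fallback could be sketched roughly as below. This is a minimal, self-contained sketch, not the actual data source code: `discoverSchema` and `readFooterSchema` are hypothetical names, and `readFooterSchema` stands in for reading an ORC footer and returning its field names (an empty `Seq` models a {{struct<>}} footer).

```scala
// Hypothetical sketch of the proposed ORC schema discovery fallback:
// scan part-files in order and take the first non-empty footer schema.
def discoverSchema(
    files: Seq[String],
    readFooterSchema: String => Seq[String]): Seq[String] =
  files.iterator
    .map(readFooterSchema)          // read each footer lazily
    .find(_.nonEmpty)               // skip files written as struct<>
    .getOrElse(
      // Spark SQL would throw AnalysisException here; a plain
      // exception keeps this sketch dependency-free.
      throw new IllegalArgumentException(
        "No ORC part-file with a non-empty schema was found"))
```

Because the iterator is lazy, footers are only read until the first usable schema is found, so the common case (a non-empty first part-file) stays as cheap as the current behavior.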



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
