Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/08/24 10:42:45 UTC
[jira] [Comment Edited] (SPARK-10177) Parquet support interprets timestamp values differently from Hive 0.14.0+

    [ https://issues.apache.org/jira/browse/SPARK-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708957#comment-14708957 ] 

Cheng Lian edited comment on SPARK-10177 at 8/24/15 8:41 AM:
-------------------------------------------------------------

[~davies] I'm not sure whether this is a regression introduced by SPARK-8307. I saw [this PR comment of yours|https://github.com/apache/spark/pull/6759#discussion_r32387873]:
{quote}
I had verified this using the sample parquet file in SPARK-4768, it can read by exact the same value back (with timzone difference).
{quote}
Is the timezone difference expected? Separately, I inspected the Parquet file {{5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq}} attached to SPARK-4768 with {{parquet-tools}}. It doesn't seem to contain a proper timestamp value (note the {{<null>}} in the {{parquet-dump}} output):
{noformat}
$ parquet-schema 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq
message schema {
  optional binary dummy;
  optional int96 timestamp1;
}



$ parquet-meta 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq
file:        file:/Users/lian/Desktop/5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq
creator:     impala version 2.0.0-cdh5 (build ecf30af0b4d6e56ea80297df2189367ada6b7da7)

file schema: schema
------------------------------------------------------------------------------------------------------------
dummy:       OPTIONAL BINARY R:0 D:1
timestamp1:  OPTIONAL INT96 R:0 D:1

row group 1: RC:1 TS:96 OFFSET:4
------------------------------------------------------------------------------------------------------------
dummy:        BINARY SNAPPY DO:4 FPO:33 SZ:57/53/0.93 VC:1 ENC:PLAIN,PLAIN_DICTIONARY,RLE
timestamp1:   INT96 SNAPPY DO:93 FPO:107 SZ:39/36/0.92 VC:1 ENC:PLAIN,PLAIN_DICTIONARY,RLE



$ parquet-dump 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq
row group 0
------------------------------------------------------------------------------------------------------------
dummy:       BINARY SNAPPY DO:4 FPO:33 SZ:57/53/0.93 VC:1 ENC:PLAIN_DICTIONARY,RLE,PLAIN
timestamp1:  INT96 SNAPPY DO:93 FPO:107 SZ:39/36/0.92 VC:1 ENC:PLAIN_DICTIONARY,RLE,PLAIN

    dummy TV=1 RL=0 DL=1 DS:      1 DE:PLAIN_DICTIONARY
    --------------------------------------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:9 VC:1

    timestamp1 TV=1 RL=0 DL=1 DS: 0 DE:PLAIN_DICTIONARY
    --------------------------------------------------------------------------------------------------------
    page 0:                        DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:6 VC:1

BINARY dummy
------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:test row 4

INT96 timestamp1
------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:0 V:<null>
{noformat}
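
For readers unfamiliar with the {{INT96}} column above, here is a minimal Python sketch of how such a timestamp is commonly decoded. It assumes the layout used by Impala and Hive writers: 12 little-endian bytes, with 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, and days starting at midnight. The function name and the use of {{None}} for a null cell are illustrative only, not taken from Spark or {{parquet-tools}}. In Parquet's encoding, a definition level below the column's maximum ({{D:0}} against {{DL=1}} in the dump) marks the value as null, which the sketch models by passing {{None}}.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import Optional

# Julian Day Number of 1970-01-01, assuming days begin at midnight (an assumption).
JULIAN_DAY_OF_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_int96_timestamp(raw: Optional[bytes]) -> Optional[datetime]:
    """Decode a 12-byte INT96 timestamp into a UTC datetime (hypothetical helper).

    None stands for a null cell, i.e. definition level below the maximum.
    """
    if raw is None:
        return None
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # Little-endian: 8 bytes of nanoseconds within the day, then 4 bytes of Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH
    # datetime only has microsecond precision, so sub-microsecond nanos are dropped.
    return EPOCH + timedelta(days=days_since_epoch, microseconds=nanos_of_day // 1000)


# 2457024 is the Julian Day Number of 2015-01-01.
sample = struct.pack("<qi", 0, 2457024)
print(decode_int96_timestamp(sample))
print(decode_int96_timestamp(None))
```

A reader that gets a non-null 12-byte value can pass it straight to the helper; a null cell such as the one in the dump never reaches the byte-decoding step.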



> Parquet support interprets timestamp values differently from Hive 0.14.0+
> -------------------------------------------------------------------------
>
>                 Key: SPARK-10177
>                 URL: https://issues.apache.org/jira/browse/SPARK-10177
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Blocker
>         Attachments: 000000_0
>
>
> Run the following SQL under Hive 0.14.0+ (tested against 0.14.0 and 1.2.1):
> {code:sql}
> CREATE TABLE ts_test STORED AS PARQUET
> AS SELECT CAST("2015-01-01 00:00:00" AS TIMESTAMP);
> {code}
> Then read the Parquet file generated by Hive with Spark SQL. The timestamp comes back as 12:00:00 rather than the 00:00:00 that was written:
> {noformat}
> scala> sqlContext.read.parquet("hdfs://localhost:9000/user/hive/warehouse_hive14/ts_test").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([2015-01-01 12:00:00.0])
> {noformat}
> This issue can be easily reproduced with [this test case in PR #8392|https://github.com/apache/spark/pull/8392/files#diff-1e55698cc579cbae676f827a89c2dc2eR116].
> Spark 1.4.1 works as expected in this case.
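
The description above does not state the cause of the 12-hour gap. One hypothesis consistent with it, offered only as an assumption, is a disagreement about where a Julian day begins: astronomical Julian days start at noon, while a writer may count them from midnight. The Python sketch below shows that decoding the same (Julian day, nanoseconds of day) pair under the two conventions gives results exactly 12 hours apart. Both function names are hypothetical, and time zones are ignored by working in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Julian Day Number of 1970-01-01.
JULIAN_DAY_OF_EPOCH = 2440588


def from_julian_midnight(julian_day: int, nanos_of_day: int) -> datetime:
    """Treat the stored day as starting at 00:00 (assumed writer convention)."""
    return EPOCH + timedelta(
        days=julian_day - JULIAN_DAY_OF_EPOCH,
        microseconds=nanos_of_day // 1000,
    )


def from_julian_noon(julian_day: int, nanos_of_day: int) -> datetime:
    """Treat the stored day as starting at 12:00, as astronomical Julian days do."""
    return from_julian_midnight(julian_day, nanos_of_day) + timedelta(hours=12)


# 2015-01-01 00:00:00 stored as (Julian day 2457024, 0 nanoseconds into the day).
stored = (2457024, 0)
print(from_julian_midnight(*stored))  # 2015-01-01 00:00:00+00:00
print(from_julian_noon(*stored))      # 2015-01-01 12:00:00+00:00
```

If this hypothesis holds, the written and read values would differ by a constant half day regardless of the timestamp, which is one way to check it against other rows.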


