Posted to issues@drill.apache.org by "Jason Altekruse (JIRA)" <ji...@apache.org> on 2016/01/26 01:28:39 UTC

[jira] [Comment Edited] (DRILL-4203) Parquet File : Date is stored wrongly

    [ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116281#comment-15116281 ] 

Jason Altekruse edited comment on DRILL-4203 at 1/26/16 12:28 AM:
------------------------------------------------------------------

[~zfong] That is correct. The only extra complexity is an option I have added that lets users turn off auto-correction for any files that are not certain to have been created by Drill.

By default, we check the file-level created-by metadata. If it shows a version of Drill released after the fix, no correction happens, regardless of the option's setting. Similarly, if a file carries a Drill version string indicating the data was written before the fix, we always correct the data, regardless of the option's setting.

The only complicated case is when there is not enough metadata to determine whether the file was written by Drill. In this case we check the values in the file: either the file-level min/max statistics when the reader is initialized or, when the file lacks min/max statistics (as a pre-1.0 Drill file does), the values in individual data pages, which defers detection until those pages are actually read. Checks at both of these levels can be disabled by the option.
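
The decision order described above can be sketched roughly as follows. This is only an illustration in Python, not Drill's actual (Java) implementation: the function name `should_correct`, the `auto_correct` flag, the `Writer` categories, the `SUSPICIOUS_DAYS` cutoff and the choice to test the smallest value are all hypothetical, picked for the example.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class Writer(Enum):
    """What the file-level metadata tells us about who wrote the file."""
    DRILL_BEFORE_FIX = 1  # Drill version string that predates the fix
    DRILL_AFTER_FIX = 2   # created-by metadata shows a Drill version with the fix
    UNKNOWN = 3           # not enough metadata to decide


# Hypothetical cutoff in days since 1970-01-01; anything beyond it is
# assumed to be a shifted (corrupt) date rather than real data.
SUSPICIOUS_DAYS = 2_000_000


def should_correct(
    writer: Writer,
    auto_correct: bool,
    file_min_max: Optional[Tuple[int, int]] = None,
    page_values: Optional[Sequence[int]] = None,
) -> bool:
    """Return True if date values read from the file should be shifted back."""
    # Metadata is conclusive: the option is ignored in both directions.
    if writer is Writer.DRILL_AFTER_FIX:
        return False
    if writer is Writer.DRILL_BEFORE_FIX:
        return True

    # Ambiguous file: the user may switch off value-based detection entirely.
    if not auto_correct:
        return False

    # Prefer file-level statistics, available when the reader is initialized.
    if file_min_max is not None:
        return file_min_max[0] > SUSPICIOUS_DAYS

    # No statistics: fall back to values seen while reading a data page.
    if page_values:
        return min(page_values) > SUSPICIOUS_DAYS

    return False
```

For example, `should_correct(Writer.UNKNOWN, True, file_min_max=(4881176, 4881200))` returns `True`, while the same call with `auto_correct=False` returns `False`.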

The bug shifted the dates by a very large amount, putting them thousands of years into the future. Auto-correction is therefore a low-risk default, as it is extremely unlikely that users will have created a database full of dates in this range. That said, the option is included to cover any such cases.
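
To give a sense of the magnitude, here is a back-of-the-envelope check in Python using the value from the issue description (epoch_date = 4881176 where 0 was expected). That this number is exactly twice 2440588, the Julian Day Number of 1970-01-01, is plain arithmetic. Treating it as a constant shift applied to every date, and the `correct` helper built on that, are assumptions for illustration and not the actual patch.

```python
STORED = 4881176               # value parquet-tools reported for 1970-01-01
EXPECTED = 0                   # days since the Unix epoch for 1970-01-01
JULIAN_DAY_OF_EPOCH = 2440588  # Julian Day Number of 1970-01-01

# Size of the error observed in the reported file.
shift = STORED - EXPECTED
assert shift == 2 * JULIAN_DAY_OF_EPOCH

# datetime.date stops at year 9999, so estimate the calendar year by hand
# using the mean Gregorian year length.
years_off = shift / 365.2425
approx_year = 1970 + years_off
print(f"off by about {years_off:.0f} years, i.e. around year {approx_year:.0f}")


def correct(days: int) -> int:
    """Hypothetical helper: undo the shift, assuming it is constant."""
    return days - shift
```

Under that assumption, `correct(4881176)` gives back 0, the expected value for 1970-01-01.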


> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Jason Altekruse
>            Priority: Critical
>
> Hello,
> I have a problem when I use Spark to read Parquet files produced by Drill: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> |  name  | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.
> Metadata:
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet 
> creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:       drill.version = 1.4.0 
> file schema: root 
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}


