Posted to issues@drill.apache.org by "Vitalii Diravka (JIRA)" <ji...@apache.org> on 2017/03/03 09:00:54 UTC

[jira] [Updated] (DRILL-4203) Parquet File : Date is stored wrongly

     [ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalii Diravka updated DRILL-4203:
-----------------------------------
    Description: 
Hello,

I have some problems when I try to read Parquet files produced by Drill with Spark: all dates are corrupted.

I think the problem comes from Drill :)

{code}
cat /tmp/date_parquet.csv 
Epoch,1970-01-01
{code}

{code}
0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+--------+-------------+
|  name  | epoch_date  |
+--------+-------------+
| Epoch  | 1970-01-01  |
+--------+-------------+
{code}

{code}
0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
{code}

When I read the file with parquet-tools, I found:
{code}
java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
name = Epoch
epoch_date = 4881176
{code}

According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.
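For reference, the offset seen above is exactly twice the Julian Day Number of the Unix epoch: 1970-01-01 is Julian day 2440588, and 2 * 2440588 = 4881176. A minimal Python sketch of that arithmetic (constant and function names here are illustrative, not Drill's actual identifiers):

{code}
# Julian Day Number of the Unix epoch, 1970-01-01.
JULIAN_DAY_EPOCH = 2440588

# The corrupted on-disk value is the correct days-since-epoch count
# shifted up by twice the epoch's Julian Day Number.
def corrupted_value(days_since_epoch):
    return days_since_epoch + 2 * JULIAN_DAY_EPOCH

def repaired_value(corrupted):
    return corrupted - 2 * JULIAN_DAY_EPOCH

print(corrupted_value(0))       # 4881176, matching the parquet-tools output
print(repaired_value(4881176))  # 0, i.e. 1970-01-01
{code}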

Meta : 
{code}
java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
file:        file:/tmp/buggy_parquet/0_0_0.parquet 
creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
extra:       drill.version = 1.4.0 

file schema: root 
--------------------------------------------------------------------------------
name:        OPTIONAL BINARY O:UTF8 R:0 D:1
epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1

row group 1: RC:1 TS:93 OFFSET:4 
--------------------------------------------------------------------------------
name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
{code}




Implementation:

After the fix, Drill can automatically detect date corruption in Parquet files 
and convert the values to the correct ones.
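One way to sketch that auto-detection in Python (illustrative only, not Drill's actual implementation): a stored day count that would land past roughly the year 5000 is assumed to be corrupted and is shifted back by 2 * 2440588, i.e. twice the Julian Day Number of 1970-01-01:

{code}
import datetime

# Twice the Julian Day Number of 1970-01-01; affected files store
# dates shifted up by this amount.
CORRUPTION_SHIFT = 2 * 2440588  # = 4881176

# Day counts beyond ~year 5000 are treated as corrupted rather than
# genuine (hence the opt-out for users who really need such dates).
PLAUSIBILITY_LIMIT = (datetime.date(5000, 1, 1) - datetime.date(1970, 1, 1)).days

def auto_correct(days_since_epoch, enabled=True):
    if enabled and days_since_epoch > PLAUSIBILITY_LIMIT:
        return days_since_epoch - CORRUPTION_SHIFT
    return days_since_epoch

print(auto_correct(4881176))  # 0 -> 1970-01-01
print(auto_correct(10957))    # 10957 -> 2000-01-01, left untouched
{code}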

Since a user might legitimately want to work with dates beyond the year 5000,
an option is included to turn off the auto-correction.
Use of this option is expected to be extremely rare, but it is included for
completeness.
To disable the auto-correction, set the Parquet format config in the storage plugin settings, for example:
{code}
  "formats": {
    "parquet": {
      "type": "parquet",
      "autoCorrectCorruptDates": false
    }
  }
{code}
Alternatively, you can set the option per query with a table function:
{code}
select l_shipdate, l_commitdate from table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem` 
(type => 'parquet', autoCorrectCorruptDates => false)) limit 1;
{code}




  was:
Hello,

I have some problems when I try to read Parquet files produced by Drill with Spark: all dates are corrupted.

I think the problem comes from Drill :)

{code}
cat /tmp/date_parquet.csv 
Epoch,1970-01-01
{code}

{code}
0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+--------+-------------+
|  name  | epoch_date  |
+--------+-------------+
| Epoch  | 1970-01-01  |
+--------+-------------+
{code}

{code}
0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
{code}

When I read the file with parquet-tools, I found:
{code}
java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
name = Epoch
epoch_date = 4881176
{code}

According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.

Meta : 
{code}
java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
file:        file:/tmp/buggy_parquet/0_0_0.parquet 
creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
extra:       drill.version = 1.4.0 

file schema: root 
--------------------------------------------------------------------------------
name:        OPTIONAL BINARY O:UTF8 R:0 D:1
epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1

row group 1: RC:1 TS:93 OFFSET:4 
--------------------------------------------------------------------------------
name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
{code}




Implementation:

After the fix, Drill can automatically detect date corruption in Parquet files 
and convert the values to the correct ones.

Since a user might legitimately want to work with dates beyond the year 5000,
an option is included to turn off the auto-correction.
Use of this option is expected to be extremely rare, but it is included for
completeness.
To disable the auto-correction, set the Parquet format config in the storage plugin settings, for example:
{code}
  "formats": {
    "parquet": {
      "type": "parquet",
      "autoCorrectCorruptDates": false
    }
  }
{code}
Alternatively, you can set the option per query with a table function:
{code}
select l_shipdate, l_commitdate from table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem` 
(type => 'parquet', autoCorrectCorruptDates => false)) limit 1;
{code}

After the fix, new files generated by Drill will carry an "is.date.correct=true" extra property in the Parquet 
metadata, which indicates that the file cannot contain corrupted date values.




> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Vitalii Diravka
>            Priority: Critical
>              Labels: doc-impacting
>             Fix For: 1.9.0
>
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> |  name  | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.
> Meta : 
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet 
> creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:       drill.version = 1.4.0 
> file schema: root 
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}
> Implementation:
> After the fix, Drill can automatically detect date corruption in Parquet files 
> and convert the values to the correct ones.
> Since a user might legitimately want to work with dates beyond the year 5000,
> an option is included to turn off the auto-correction.
> Use of this option is expected to be extremely rare, but it is included for
> completeness.
> To disable the auto-correction, set the Parquet format config in the storage plugin settings, for example:
> {code}
>   "formats": {
>     "parquet": {
>       "type": "parquet",
>       "autoCorrectCorruptDates": false
>     }
>   }
> {code}
> Alternatively, you can set the option per query with a table function:
> {code}
> select l_shipdate, l_commitdate from table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem` 
> (type => 'parquet', autoCorrectCorruptDates => false)) limit 1;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)