Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/09/22 13:48:20 UTC

[jira] [Commented] (DRILL-4203) Parquet File : Date is stored wrongly

    [ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513342#comment-15513342 ] 

ASF GitHub Bot commented on DRILL-4203:
---------------------------------------

GitHub user vdiravka opened a pull request:

    https://github.com/apache/drill/pull/595

    DRILL-4203: Parquet File. Date is stored wrongly

    Drill was writing non-standard dates into parquet files for all releases
    before this commit. The values have been read correctly by Drill, but
    external tools like Spark reading the files will see corrupted values for
    all dates that have been written by Drill.
    
    This change corrects the behavior of the Drill parquet writer to correctly
    store dates in the format given in the parquet specification.
    
    To maintain compatibility with old files, the parquet reader code has been
    updated to check for the old format and automatically shift the
    corrupted values into corrected ones.
    
    The test cases included here should ensure that all files produced by
    historical versions of Drill will continue to return the same values
    they had in previous releases. For compatibility with external tools, any
    old files with corrupted dates can be re-written using the CREATE TABLE AS
    command, because the writer now produces only specification-compliant
    values, even when the data is read from older corrupt files. One new extra
    field, "is.date.correct = true", is included in the parquet metadata of
    the files and in the Drill metadata cache files.
    
    While the old behavior was a consistent shift into a range unlikely to be
    used in a modern database (over 10,000 years in the future), these are
    still valid date values. In the case where these may have been written
    into files intentionally, and we cannot be certain from the metadata if
    Drill produced the files, an option is included to turn off the auto-correction.
    Use of this option is assumed to be extremely unlikely, but it is included for
    completeness.
    
    One small fix in ParquetGroupScan accommodates changes in master that
    changed when metadata is read.
    
    Added new tests, with old and new parquet (binary) files, for bugs revealed
    by the regression suite, and updated the metadata cache files accordingly.
    
    Removed unnecessary double conversion of value with Julian day.
    
    Added the ability to correct corrupted dates for parquet files that have an
    old second-version metadata cache file as well.
    
    Fix DrillVersionInfo to make it provide a valid version number even during
    the unit tests. This is now a build-time generated class, rather than one
    that looks on the classpath for META-INF files. (This pattern
    for file generation with parameters passed from the POM files
    was borrowed from parquet-mr.)
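
As a rough illustration of the reader-side auto-correction described above, the sketch below shifts day counts that look corrupted back into the specification's range. It is not Drill's code: the function name, the threshold (any date after the year 5000) and the way the "is.date.correct" field is consulted are assumptions made for this example. The shift of 2 * 2440588 days is inferred from the value 4881176 that the issue reports for 1970-01-01.

```python
from datetime import date

# Julian day number of 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
# Assumed shift, inferred from the value 4881176 reported for 1970-01-01.
ASSUMED_SHIFT_DAYS = 2 * JULIAN_DAY_OF_UNIX_EPOCH
# Hypothetical cutoff: day counts past 5000-01-01 are treated as corrupted.
ASSUMED_THRESHOLD_DAYS = (date(5000, 1, 1) - date(1970, 1, 1)).days


def correct_date(days_since_epoch, extra_metadata=None, auto_correct=True):
    """Return a spec-style day count, shifting values that look corrupted.

    extra_metadata stands in for the file's key/value metadata; treating
    "is.date.correct" == "true" as "no correction needed" is an assumption.
    """
    extra_metadata = extra_metadata or {}
    if extra_metadata.get("is.date.correct") == "true":
        return days_since_epoch
    if auto_correct and days_since_epoch > ASSUMED_THRESHOLD_DAYS:
        return days_since_epoch - ASSUMED_SHIFT_DAYS
    return days_since_epoch


print(correct_date(4881176))                      # old file, corrected
print(correct_date(4881176, auto_correct=False))  # auto-correction turned off
```

The auto_correct flag mirrors the opt-out mentioned in the description: with it disabled, a far-future value that was written intentionally is returned unchanged.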

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vdiravka/drill DRILL-4203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #595
    
----
commit 6f816742d773a1696b5329472c2465a79e35140c
Author: Vitalii Diravka <vi...@gmail.com>
Date:   2016-09-22T13:44:37Z

    DRILL-4203: Parquet File. Date is stored wrongly
    

----


> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Vitalii Diravka
>            Priority: Critical
>
> Hello,
> I have some problems when I try to read parquet files produced by Drill with Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> |  name  | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.
> Meta:
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet 
> creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:       drill.version = 1.4.0 
> file schema: root 
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}
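
The numbers in the report above can be checked with a short calculation. The Parquet DATE logical type stores the number of days since the Unix epoch, so 1970-01-01 should be stored as 0. The sketch below (helper names are illustrative only) shows that the observed value, 4881176, is exactly twice the Julian day number of 1970-01-01. That is consistent with a Julian day offset being applied twice, but this reading is an inference from the reported value, not something taken from Drill's source.

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)


def parquet_date_value(d):
    """Days since the Unix epoch, as the Parquet DATE logical type defines it."""
    return (d - UNIX_EPOCH).days


def julian_day_number(d):
    """Julian day number of a proleptic Gregorian calendar date."""
    # date.toordinal() counts days from 0001-01-01 (ordinal 1);
    # adding 1721425 converts that ordinal to a Julian day number.
    return d.toordinal() + 1721425


expected = parquet_date_value(UNIX_EPOCH)  # value the specification calls for
observed = 4881176                         # value shown by parquet-tools in the report
offset = observed - expected

print(expected)                                        # 0
print(julian_day_number(UNIX_EPOCH))                   # 2440588
print(offset == 2 * julian_day_number(UNIX_EPOCH))     # True
```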


