Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/09/22 13:48:20 UTC
[jira] [Commented] (DRILL-4203) Parquet File : Date is stored wrongly
[ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513342#comment-15513342 ]
ASF GitHub Bot commented on DRILL-4203:
---------------------------------------
GitHub user vdiravka opened a pull request:
https://github.com/apache/drill/pull/595
DRILL-4203: Parquet File. Date is stored wrongly
Drill was writing non-standard dates into parquet files for all releases
before this commit. The values have been read correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.
This change fixes the Drill parquet writer so that it stores dates in the
format given in the parquet specification.
To maintain compatibility with old files, the parquet reader code has been
updated to check for the old format and automatically shift the
corrupted values into corrected ones.
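As a rough illustration of the shift described above (a sketch, not the code in this pull request): the issue reports that 1970-01-01 was stored as 4881176 where the parquet specification expects 0, which points to a constant offset of 4881176 days, that is, twice the Julian day number of the Unix epoch (2440588). The detection threshold and the names `CORRUPT_DATE_SHIFT`, `DETECTION_THRESHOLD`, `correct_date` and `to_date` are assumptions made for this example only.

```python
from datetime import date, timedelta

# Julian day number of 1970-01-01. The value reported in the issue
# (4881176 for the Unix epoch) is exactly twice this number.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
CORRUPT_DATE_SHIFT = 2 * JULIAN_DAY_OF_UNIX_EPOCH  # 4881176 days

# Hypothetical cut-off: days-since-epoch values past year 5000 are treated
# as old-format values. The actual reader may use a different rule.
DETECTION_THRESHOLD = (date(5000, 1, 1) - date(1970, 1, 1)).days


def correct_date(days_since_epoch: int) -> int:
    """Return a spec-compliant days-since-epoch value."""
    if days_since_epoch > DETECTION_THRESHOLD:
        # Looks like an old-format value: undo the constant shift.
        return days_since_epoch - CORRUPT_DATE_SHIFT
    # Already in the specification's format: leave untouched.
    return days_since_epoch


def to_date(days_since_epoch: int) -> date:
    """Convert days since the Unix epoch to a calendar date."""
    return date(1970, 1, 1) + timedelta(days=days_since_epoch)


print(to_date(correct_date(4881176)))  # value taken from the issue report
print(to_date(correct_date(0)))        # a value that is already correct
```

Both calls print 1970-01-01: the first value is shifted back, the second passes through unchanged.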
The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values
they had in previous releases. For compatibility with external tools, any
old files with corrupted dates can be re-written using the CREATE TABLE AS
command, because the writer now produces only specification-compliant
values, even when the data is read from older corrupt files. One new extra
field, "is.date.correct = true", is included in the parquet metadata of
files and in Drill metadata cache files.
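One way a reader could use that metadata is sketched below. This is an assumption about the decision logic, not the logic in this pull request: the function name `classify_date_encoding` and the three result labels are hypothetical, and only the keys "is.date.correct" and "drill.version" come from this thread.

```python
from typing import Mapping


def classify_date_encoding(extra: Mapping[str, str]) -> str:
    """Classify a file's DATE encoding from its key/value metadata.

    Returns "correct", "corrupt", or "unknown" (hypothetical labels).
    """
    # Files written after the fix carry the explicit marker.
    if extra.get("is.date.correct", "").strip().lower() == "true":
        return "correct"
    # Written by Drill but without the marker: assume the old format.
    if "drill.version" in extra:
        return "corrupt"
    # No hint in the metadata: the values themselves must be inspected.
    return "unknown"


print(classify_date_encoding({"drill.version": "1.4.0"}))
print(classify_date_encoding({"is.date.correct": "true"}))
print(classify_date_encoding({}))
```

The "unknown" case is the one where the metadata does not say whether Drill produced the file, so any correction has to rely on the stored values alone.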
The old behavior consistently shifted dates into a range unlikely to be
used in a modern database (over 10,000 years in the future), but these are
still valid date values. Where such values may have been written
into files intentionally, and we cannot be certain from the metadata whether
Drill produced the files, an option is included to turn off the auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included for
completeness.
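To put the "over 10,000 years" figure into numbers, here is a small back-of-the-envelope calculation. It assumes the 4881176-day offset observed in the issue report and the average Gregorian year length of 365.2425 days; the names used are illustrative only.

```python
# Offset observed in the issue report: 1970-01-01 was stored as 4881176
# instead of 0 (days since the Unix epoch).
OBSERVED_SHIFT_DAYS = 4881176
DAYS_PER_GREGORIAN_YEAR = 365.2425  # average length of a Gregorian year


def approximate_year(days_since_epoch: int) -> float:
    """Approximate calendar year for a days-since-epoch value."""
    return 1970 + days_since_epoch / DAYS_PER_GREGORIAN_YEAR


shift_in_years = OBSERVED_SHIFT_DAYS / DAYS_PER_GREGORIAN_YEAR
print(f"shift is about {shift_in_years:.0f} years")
print(f"1970-01-01 lands near year {approximate_year(OBSERVED_SHIFT_DAYS):.0f}")
```

The shift works out to roughly 13,364 years, so a shifted 1970-01-01 falls around the year 15334. That is far outside typical data but still a representable date, which is why the correction can be switched off.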
One small fix in ParquetGroupScan accommodates changes in master that altered
when metadata is read.
Added new tests, using old and new parquet (binary) files, for bugs revealed
by the regression suite, and updated the metadata cache files accordingly.
Removed unnecessary double conversion of value with Julian day.
Added the ability to correct corrupted dates for parquet files that have an
old, second-version metadata cache file as well.
Fix DrillVersionInfo to make it provide a valid version number even during
the unit tests. This is now a build-time generated class, rather than one
that looks on the classpath for META-INF files. (This pattern
for file generation with parameters passed from the POM files
was borrowed from parquet-mr.)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vdiravka/drill DRILL-4203
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/595.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #595
----
commit 6f816742d773a1696b5329472c2465a79e35140c
Author: Vitalii Diravka <vi...@gmail.com>
Date: 2016-09-22T13:44:37Z
DRILL-4203: Parquet File. Date is stored wrongly
----
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Vitalii Diravka
> Priority: Critical
>
> Hello,
> I have some problems when I try to read parquet files produced by Drill with Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> | name | epoch_date |
> +--------+-------------+
> | Epoch | 1970-01-01 |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.
> Meta :
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file: file:/tmp/buggy_parquet/0_0_0.parquet
> creator: parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra: drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name: OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}