Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2016/07/04 18:05:11 UTC

[jira] [Updated] (DRILL-4763) Parquet file with DATE logical type produces wrong results for simple SELECT

     [ https://issues.apache.org/jira/browse/DRILL-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-4763:
-------------------------------
    Attachment: date.parquet

Parquet file created with the schema and values described in the bug. The first and last values are of primary interest; the others simply probe other interesting cases.

> Parquet file with DATE logical type produces wrong results for simple SELECT
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-4763
>                 URL: https://issues.apache.org/jira/browse/DRILL-4763
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.6.0
>            Reporter: Paul Rogers
>         Attachments: date.parquet
>
>
> Created a simple Parquet file with the following schema:
> message test { required int32 index; required int32 value (DATE); required int32 raw; }
> That is, a file with an int32 storage type and a DATE logical type. Then, created a number of test values:
> 0 (which should be interpreted as 1970-01-01) and
> (int) (System.currentTimeMillis() / (24*60*60*1000) ), which should be interpreted as the number of days between 1970-01-01 and today.
> According to the Parquet spec (https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md), Parquet dates are expressed as "the number of days from the Unix epoch, 1 January 1970."
> Java timestamps are expressed as "measured in milliseconds, between the current time and midnight, January 1, 1970 UTC."
> There is ambiguity here: Parquet dates are presumably local times not absolute times, so the math above will actually tell us the date in London right now, but that's close enough.
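The day-count arithmetic and the time zone caveat above can be sketched as follows. This is only an illustration, not code from the report or from Drill: the class and method names (EpochDays, daysFromMillis, daysInZone) are hypothetical, and it assumes Java 8's java.time is available. For times after the epoch, Math.floorDiv gives the same result as the plain division used in the report.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class EpochDays {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Whole days elapsed since 1970-01-01T00:00:00Z, i.e. the day number of the UTC date.
    static int daysFromMillis(long epochMillis) {
        return (int) Math.floorDiv(epochMillis, MILLIS_PER_DAY);
    }

    // Day number of the calendar date observed in the given zone at that instant.
    static int daysInZone(long epochMillis, ZoneId zone) {
        return (int) Instant.ofEpochMilli(epochMillis)
                .atZone(zone)
                .toLocalDate()
                .toEpochDay();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // The two values differ whenever the local date is not the UTC date.
        System.out.println("UTC day number:   " + daysFromMillis(now));
        System.out.println("Local day number: " + daysInZone(now, ZoneId.systemDefault()));
    }
}
```

For most of the day the two numbers agree, which is why the report treats the UTC-based value as close enough for a test file.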
> Generate the local file to date.parquet. Query it with:
> SELECT * from `local`.`root`.`date.parquet`;
> The results are incorrect:
> index value raw
> 1	-11395-10-18T00:00:00.000-07:52:58	0
> Here, we have a value of 0. The displayed date is decidedly not 1970-01-01T00:00:00. We actually have many problems:
> 1. The date is far off.
> 2. The output shows a time. But the Parquet DATE format explicitly does NOT include a time, so it makes no sense to include one.
> 3. The output attempts to show a time zone, but a time zone of -07:52:58, while close to PST, is not right (there is no time zone that is off by 7:52:58 from UTC).
> 4. The data has no time zone. Parquet DATE is explicitly a local date, so it is impossible to know the relationship between that date and UTC.
> The correct output (in ISO format) would be: 1970-01-01
> The last line should be today's date, but instead is:
> 6	-11348-04-20T00:00:00.000-07:52:58	16986
> Expected:
> 2016-07-04
> Note that all the information needed to produce the right result is available to Drill:
> 1. The DATE annotation defines the meaning of the signed 32-bit integer.
> 2. Given the starting point and duration in days, the conversion to Drill's own internal date format is unambiguous.
> 3. The DATE annotation says that the date is local, so Drill should not attempt to convert to UTC. (That is, a Java Date object can't be used, instead a Joda/Java 8 LocalDate is necessary.)
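As a minimal sketch of the conversion the three points above describe, the following decodes the two values of primary interest with Java 8's LocalDate. It is not Drill's reader code: the class and method names (ParquetDateDecode, fromParquetDate) are hypothetical, and it assumes only what the Parquet spec quoted in the report states, namely that the int32 is a count of days from 1970-01-01.

```java
import java.time.LocalDate;

public class ParquetDateDecode {
    // Interpret a Parquet DATE value (days since 1970-01-01) as a calendar date
    // with no time of day and no time zone attached.
    static LocalDate fromParquetDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        // The raw values from the first and last rows of the test file.
        System.out.println(fromParquetDate(0));     // prints 1970-01-01
        System.out.println(fromParquetDate(16986)); // prints 2016-07-04
    }
}
```

Because LocalDate carries no zone, no UTC conversion takes place, and its default string form is the ISO date the report expects.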


