You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Jiang Wu (JIRA)" <ji...@apache.org> on 2018/05/01 01:27:00 UTC

[jira] [Commented] (DRILL-6242) Output format for nested date, time, timestamp values in an object hierarchy

    [ https://issues.apache.org/jira/browse/DRILL-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459369#comment-16459369 ] 

Jiang Wu commented on DRILL-6242:
---------------------------------

The resulting changes make use of java.time.Local<Date|Time|DateTime> as the Java object representation of values from respective Drill vector types: Date, Time, Timestamp.  With this change, accessing Drill date time vectors will return non time zone specific values.

Below lists out the behavior of Drill with examples to illustrate how the time values from various data sources (storage plugin, inline functions) are handled.  This represents existing behavior. 

For each data source, the example shows the original value in the data source, how such value is interpreted and converted into a value for the Drill Timestamp vector, how the value from the Timestamp vector is read.  And how the client of Drill can reproduce the original value from the LocalDateTime returned from the Drill Timestamp vector.  Any value that is different from the original value is highlighted in red.

A Timestamp represents an instant in time and in theory should not be timezone dependent.  We can interpret a Timestamp being made of 3 parts: date part, time part, and time zone/offset.  Based on the time zone/offset, the date part and the time part can be different for the same Timestamp instance.

*1. Date source: extended JSON file, TO_TIMESTAMP(), and CAST to TIMESTAMP.*  Any time zone associated with the original time value is ignored.  This means all timestamps are treated as though they are from the Drill server's local time zone.  E.g.

 
{code:java}
select case when t1 = t2 then 1 else 0 end 
from (
  select TO_TIMESTAMP('2015-03-30 20:49:59.0 UTC', 'YYYY-MM-dd HH:mm:ss.s z') as t1,
         TO_TIMESTAMP('2015-03-30 20:49:59.0 PST', 'YYYY-MM-dd HH:mm:ss.s z') as t2
  from (values(1))
)
{code}
 

returns

 
{code:java}
+---------+
| EXPR$0  |
+---------+
| 1       |
+---------+{code}
 

*2. Date source: parquet timestamp.* Treat date-part and time-part as though they are in the Drill server time zone.  Timestamp value is represented as a long in Parquet data source.  Produce a \{date, time, UTC} from the Timestamp, but then ignore the time zone.  The result is a data part and time part with the same values as seen from the UTC time zone.  Example:

 
|| ||Parquet Timestamp value||Write to Drill Timestamp Vector||Read from Drill Timestamp Vector||How to get back the original value||
|Actual value|123456789012|123456789012|{color:#FF0000}1973-11-29T21:33:09.012{color}
{color:#FF0000}(LocalDateTime){color}|"1973-11-29T21:33:09.012".atZone(OffsetZone.UTC).toInstant()|
|Interpretation in Drill Server Time Zone|1973-11-29T21:33:09Z|1973-11-29T21:33:09Z|{color:#FF0000}1973-11-29T21:33:09PST{color}|1973-11-29T21:33:09Z|

 

 

 
*3. Date Source: parquet int96 as timestamp.* Generate the date-part and time-part in the Drill server time zone that represent the same instant as the timestamp.  Produce a \{date1, time1, UTC} from the Parquet int96 value.  Convert this to another \{date2, time2, Drill Time Zone} representation.  Strip out the Drill Time Zone and replace with UTC resulting in a \{date2, time2, UTC} and store in vector.  Example:
 
|| ||Parquet int96 as Timestamp||Write to Drill Timestamp Vector||Read from Drill Timestamp Vector||How to get back the original value||
|Actual value|1312196153000|{color:#FF0000}1312170953000{color}|{color:#FF0000}2011-08-01T03:55:53{color}
{color:#FF0000}(LocalDateTime){color}|"2011-08-01T03:55:53".atZone(ZoneOffset.systemDefault()).toInstant()|
|Interpretation in Drill Server Time Zone|2011-08-01T10:55:53Z
 
which is the same as:
 
2011-08-01T03:55:53 PDT|{color:#FF0000}2011-08-01T03:55:53Z{color}|{color:#FF0000}2011-08-01T03:55:53 PDT{color}|2011-08-01T10:55:53Z|

*4. Date Source: BSON.* Same as Parquet int96 as timestamp type.  Preserve the correct time.  Produce a \{date1, time1, UTC} from the Parquet int96 value.  Convert this to another \{date2, time2, Drill Time Zone} representation.  Strip out the Drill Time Zone and replace with UTC resulting in a \{date2, time2, UTC} and store in vector.  Example:
|| ||BSON DateTime (long)||Write to Drill Timestamp Vector||Read from Drill Timestamp Vector||How to get back the original value||
|Actual Value|5262729712|{color:#d04437}5233929712{color}|{color:#d04437}1970-03-02T13:52:09.712{color}
{color:#d04437}(LocalDateTime){color}|"1970-03-02T13:52:09.712".atZone(ZoneOffset.systemDefault()).toInstant()|
|Interpretation in Drill Server Time Zone|1970-03-02T21:52:09Z
 
which is the same as:
 
1970-03-02T13:52:09 PST|{color:#d04437}1970-03-02T13:52:09.712Z{color}|{color:#d04437}1970-03-02T13:52:09.712 PST{color}
 
|1970-03-02T21:52:09Z|

 

 

> Output format for nested date, time, timestamp values in an object hierarchy
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-6242
>                 URL: https://issues.apache.org/jira/browse/DRILL-6242
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.12.0
>            Reporter: Jiang Wu
>            Assignee: Jiang Wu
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Some storages (mapr db, mongo db, etc.) have hierarchical objects that contain nested fields of date, time, timestamp types.  When a query returns these objects, the output format for the nested date, time, timestamp, are showing the internal object (org.joda.time.DateTime), rather than the logical data value.
> For example.  Suppose in MongoDB, we have a single object that looks like this:
> {code:java}
> > db.test.findOne();
> {
>     "_id" : ObjectId("5aa8487d470dd39a635a12f5"),
>     "name" : "orange",
>     "context" : {
>         "date" : ISODate("2018-03-13T21:52:54.940Z"),
>         "user" : "jack"
>     }
> }
> {code}
> Then connect Drill to the above MongoDB storage, and run the following query within Drill:
> {code:java}
> > select t.context.`date`, t.context from test t; 
> +--------+---------+ 
> | EXPR$0 | context | 
> +--------+---------+ 
> | 2018-03-13 | {"date":{"dayOfYear":72,"year":2018,"dayOfMonth":13,"dayOfWeek":2,"era":1,"millisOfDay":78774940,"weekOfWeekyear":11,"weekyear":2018,"monthOfYear":3,"yearOfEra":2018,"yearOfCentury":18,"centuryOfEra":20,"millisOfSecond":940,"secondOfMinute":54,"secondOfDay":78774,"minuteOfHour":52,"minuteOfDay":1312,"hourOfDay":21,"zone":{"fixed":true,"id":"UTC"},"millis":1520977974940,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"afterNow":false,"beforeNow":true,"equalNow":false},"user":"jack"} |
> {code}
> We can see that from the above output, when the date field is retrieved as a top level column, Drill outputs a logical date value.  But when the same field is within an object hierarchy, Drill outputs the internal object used to hold the date value.
> The expected output is the same display for whether the date field is shown as a top level column or when it is within an object hierarchy:
> {code:java}
> > select t.context.`date`, t.context from test t; 
> +--------+---------+ 
> | EXPR$0 | context | 
> +--------+---------+ 
> | 2018-03-13 | {"date":"2018-03-13","user":"jack"} |
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)