You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by vdiravka <gi...@git.apache.org> on 2016/11/17 19:55:04 UTC

[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

GitHub user vdiravka opened a pull request:

    https://github.com/apache/drill/pull/656

    DRILL-5034: Select timestamp from hive generated parquet always return in UTC

    - TIMESTAMP_IMPALA function is reverted to retaine local timezone
    - TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted
    - Retain local timezone for the INT96 timestamp values in the parquet files while
      PARQUET_READER_INT96_AS_TIMESTAMP option is on

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vdiravka/drill DRILL-5034

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/656.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #656
    
----
commit fa4029c493a25eccd6de0deadbaf3d3d749eafbe
Author: Vitalii Diravka <vi...@gmail.com>
Date:   2016-11-14T21:13:28Z

    DRILL-5034: Select timestamp from hive generated parquet always return in UTC
    - TIMESTAMP_IMPALA function is reverted to retaine local timezone
    - TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted
    - Retain local timezone for the INT96 timestamp values in the parquet files while
      PARQUET_READER_INT96_AS_TIMESTAMP option is on

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/656#discussion_r102753732
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -969,12 +969,15 @@ public void testInt96TimeStampValueWidth() throws Exception {
         try {
           testBuilder()
               .ordered()
    -          .sqlQuery("select c, d from cp.`parquet/data.snappy.parquet` where d = '2015-07-18 13:52:51'")
    +          .sqlQuery("select c, d from cp.`parquet/data.snappy.parquet` " +
    +              "where `a` is not null and `c` is not null and `d` is not null")
    --- End diff --
    
    Done. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/656#discussion_r98358344
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
    @@ -323,18 +323,28 @@ public static DateCorruptionStatus checkForCorruptDateValuesInStatistics(Parquet
        * @param binaryTimeStampValue
        *          hive, impala timestamp values with nanoseconds precision
        *          are stored in parquet Binary as INT96 (12 constant bytes)
    -   *
    +   * @param retainLocalTimezone
    +   *          parquet files don't keep local timeZone according to the
    +   *          <a href="https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#timestamp">Parquet spec</a>,
    +   *          but some tools (hive, for example) retain local timezone for parquet files by default
    +   *          Note: Impala doesn't retain local timezone by default
        * @return  Unix Timestamp - the number of milliseconds since January 1, 1970, 00:00:00 GMT
        *          represented by @param binaryTimeStampValue .
        */
    -    public static long getDateTimeValueFromBinary(Binary binaryTimeStampValue) {
    +    public static long getDateTimeValueFromBinary(Binary binaryTimeStampValue, boolean retainLocalTimezone) {
           // This method represents binaryTimeStampValue as ByteBuffer, where timestamp is stored as sum of
           // julian day number (32-bit) and nanos of day (64-bit)
           NanoTime nt = NanoTime.fromBinary(binaryTimeStampValue);
           int julianDay = nt.getJulianDay();
           long nanosOfDay = nt.getTimeOfDayNanos();
    -      return (julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH) * DateTimeConstants.MILLIS_PER_DAY
    +      long dateTime = (julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH) * DateTimeConstants.MILLIS_PER_DAY
               + nanosOfDay / NANOS_PER_MILLISECOND;
    +      if (retainLocalTimezone) {
    +        return new org.joda.time.DateTime(dateTime, org.joda.time.chrono.JulianChronology.getInstance())
    +            .withZoneRetainFields(org.joda.time.DateTimeZone.UTC).getMillis();
    --- End diff --
    
    `withZoneRetainFields` method calculates the difference between local timezone and UTC (parameter of that method) and returns original dateTime with a shift of that difference. This approach is used frequently in drill code.
    But thinking a little more on this I decided that it is possible to use more simpler statement, without creating DateTime object. 
    `DateTimeZone.getDefault().convertUTCToLocal(dateTime)`. I think it's more clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #656: DRILL-5034: Select timestamp from hive generated parquet a...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/656
  
    @bitblender 
    I've disabled one test, which relies on particular timezone (separate commit) and rebased the branch to the master version. Could you please review?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #656: DRILL-5034: Select timestamp from hive generated parquet a...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/656
  
    I've rebased the branch to the latest master version.
    @bitblender Could you please review?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/656#discussion_r98070065
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
    @@ -323,18 +323,28 @@ public static DateCorruptionStatus checkForCorruptDateValuesInStatistics(Parquet
        * @param binaryTimeStampValue
        *          hive, impala timestamp values with nanoseconds precision
        *          are stored in parquet Binary as INT96 (12 constant bytes)
    -   *
    +   * @param retainLocalTimezone
    +   *          parquet files don't keep local timeZone according to the
    +   *          <a href="https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#timestamp">Parquet spec</a>,
    +   *          but some tools (hive, for example) retain local timezone for parquet files by default
    +   *          Note: Impala doesn't retain local timezone by default
        * @return  Unix Timestamp - the number of milliseconds since January 1, 1970, 00:00:00 GMT
        *          represented by @param binaryTimeStampValue .
        */
    -    public static long getDateTimeValueFromBinary(Binary binaryTimeStampValue) {
    +    public static long getDateTimeValueFromBinary(Binary binaryTimeStampValue, boolean retainLocalTimezone) {
           // This method represents binaryTimeStampValue as ByteBuffer, where timestamp is stored as sum of
           // julian day number (32-bit) and nanos of day (64-bit)
           NanoTime nt = NanoTime.fromBinary(binaryTimeStampValue);
           int julianDay = nt.getJulianDay();
           long nanosOfDay = nt.getTimeOfDayNanos();
    -      return (julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH) * DateTimeConstants.MILLIS_PER_DAY
    +      long dateTime = (julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH) * DateTimeConstants.MILLIS_PER_DAY
               + nanosOfDay / NANOS_PER_MILLISECOND;
    +      if (retainLocalTimezone) {
    +        return new org.joda.time.DateTime(dateTime, org.joda.time.chrono.JulianChronology.getInstance())
    +            .withZoneRetainFields(org.joda.time.DateTimeZone.UTC).getMillis();
    --- End diff --
    
    Trying to understand this: Why are you calling .withZoneRetainFields(org.joda.time.DateTimeZone.UTC) if retainLocalTimezone is true ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #656: DRILL-5034: Select timestamp from hive generated parquet a...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/656
  
    @bitblender 
    The branch is rebased to the master version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/656


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #656: DRILL-5034: Select timestamp from hive generated parquet a...

Posted by vdiravka <gi...@git.apache.org>.
Github user vdiravka commented on the issue:

    https://github.com/apache/drill/pull/656
  
    @bitblender Parquet files with INT96 TIMESTAMP values usually are generated with hive. Those TIMESTAMP values represent the local timezone of the host where the data was written. To read that values hive considers the local timezone (in fact shift between local and UTC timezones is adding to the timestamp values).
    The aim of this patch to make the same behaviour like in HIVE while reading parquet INT96 TIMESTAMP vals.
    As the result in different timezones we have different data after query or in other words this test depends from the local timezone and `baselineValues` for different timezones will be different. 
    
    However I applied the logic of converting timestamps to the local timezone for the test's `baselineValues`.
    So now this test works on the every timezone properly.
    
    The last commit is updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/656#discussion_r102732719
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java ---
    @@ -969,12 +969,15 @@ public void testInt96TimeStampValueWidth() throws Exception {
         try {
           testBuilder()
               .ordered()
    -          .sqlQuery("select c, d from cp.`parquet/data.snappy.parquet` where d = '2015-07-18 13:52:51'")
    +          .sqlQuery("select c, d from cp.`parquet/data.snappy.parquet` " +
    +              "where `a` is not null and `c` is not null and `d` is not null")
    --- End diff --
    
    If you have several baseline values and indicate that output is ordered `order by` clause must be present in query to ensure that result will always be returned in expected order, otherwise use `unOrdered()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---