Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2016/01/03 01:36:54 UTC

Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/
-----------------------------------------------------------

(Updated Jan. 3, 2016, 12:36 a.m.)


Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.


Changes
-------

Rebase latest changes from master.


Summary (updated)
-----------------

HIVE-12767: Implement table property to address Parquet int96 timestamp bug


Bugs: HIVE-12767
    https://issues.apache.org/jira/browse/HIVE-12767


Repository: hive-git


Description
-------

The following exit criteria are addressed in this patch:

* Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.

* Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.

* New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
  * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
  * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 591c0ab097f1f384e287876fc25fd24c09006dfb 
  data/files/impala_int96_timestamp.parq PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ac0ecd9c31c0e84b8c4628c86b0fdab7d1842f28 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java PRE-CREATION 
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/41821/diff/


Testing
-------

Added unit and q-tests:
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q


Thanks,

Sergio Pena


Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/
-----------------------------------------------------------

(Updated Feb. 2, 2016, 7:10 p.m.)


Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.


Changes
-------

I forgot to add data/files/hive2.0_int96_timestamp-bug.parq


Bugs: HIVE-12767
    https://issues.apache.org/jira/browse/HIVE-12767


Repository: hive-git


Description
-------

The following exit criteria are addressed in this patch:

* Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.

* Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.

* New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
  * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
  * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.

To set the timezone table property, use this:
  create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
  
To set UTC as the default timezone table property on newly created tables, use this:
  set parquet.mr.int96.enable.utc.write.zone=true;
  create table tbl2 (ts timestamp) stored as parquet;


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java bfd88f82ee86740f60baed122ca050dd542d637c 
  data/files/hive2.0_int96_timestamp-bug.parq PRE-CREATION 
  data/files/impala_int96_timestamp.parq PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 2e4591313f38d9bd1bfe1b54968c835fdcd53717 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 8c880c39823a9279e32ee191d9f250ea0014888f 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
  ql/src/test/queries/clientnegative/parquet_int96_timestamp_errors.q PRE-CREATION 
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
  ql/src/test/results/clientnegative/parquet_int96_timestamp_errors.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/41821/diff/


Testing
-------

Added unit and q-tests:
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q


Thanks,

Sergio Pena


Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/
-----------------------------------------------------------

(Updated Jan. 28, 2016, 9:01 p.m.)


Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.


Bugs: HIVE-12767
    https://issues.apache.org/jira/browse/HIVE-12767


Repository: hive-git


Description
-------

The following exit criteria are addressed in this patch:

* Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.

* Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.

* New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
  * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
  * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.

To set the timezone table property, use this:
  create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
  
To set UTC as the default timezone table property on newly created tables, use this:
  set parquet.mr.int96.enable.utc.write.zone=true;
  create table tbl2 (ts timestamp) stored as parquet;


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java bfd88f82ee86740f60baed122ca050dd542d637c 
  data/files/impala_int96_timestamp.parq PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 2e4591313f38d9bd1bfe1b54968c835fdcd53717 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 8c880c39823a9279e32ee191d9f250ea0014888f 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
  ql/src/test/queries/clientnegative/parquet_int96_timestamp_errors.q PRE-CREATION 
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
  ql/src/test/results/clientnegative/parquet_int96_timestamp_errors.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/41821/diff/


Testing
-------

Added unit and q-tests:
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q


Thanks,

Sergio Pena


Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

Posted by Sergio Pena <se...@cloudera.com>.

> On Jan. 22, 2016, 11:15 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java, line 53
> > <https://reviews.apache.org/r/41821/diff/3/?file=1195960#file1195960line53>
> >
> >     This is a little odd. Normally, I would consider GMT/UTC to be the calendar that applies no adjustment. But what is happening in this code is that the UTC millisecond value is being constructed using values in that zone and then a new timestamp is created from it that is in the local zone. So local->local produces no adjustment, while GMT->local does.
> >     
> >     I was thinking that GMT would always be used to create the milliseconds, then the calendar would be used to adjust the value by some amount. UTC by 0, EST by -5, and PST by -8.
> >     
> >     I think it may be easier to have a method that does a single conversion to UTC in milliseconds. Then, adjust the value using a static offset like the one from `TimeZone.getOffset(utc_ms)`. That would result in a bit less work done here and I think would make the adjustment easier to reason about.

I can fix this in a follow-up JIRA, to keep the current code as-is without introducing extra bugs or tests.


> On Jan. 22, 2016, 11:15 p.m., Ryan Blue wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java, line 232
> > <https://reviews.apache.org/r/41821/diff/3/?file=1195963#file1195963line232>
> >
> >     It looks like this tests that the value written is the result of `getNanoTime(Timestamp,Calendar).toBinary()`, but doesn't test what that value is or should be. That is probably the intent, but I wanted to make sure.

It is testing that the Binary form of the Timestamp passed to addBinary() is the same as the one found in the ArrayWritable.

This is the ArrayWritable that Hive creates when writing to Parquet:
    ArrayWritable hiveRecord = createGroup(
      createTimestamp(Timestamp.valueOf("2016-01-01 01:01:01"))
    );
    
DataWritableWriter then takes the Timestamp value and converts it to its Binary form. Here we convert the same Timestamp to Binary
to check that the value passed to addBinary() matches.
    startMessage();
      startField("ts", 0);
        addBinary(NanoTimeUtils.getNanoTime(Timestamp.valueOf("2016-01-01 01:01:01"),
            Calendar.getInstance(TimeZone.getTimeZone("CST"))).toBinary());
      endField("ts", 0);
    endMessage();


> On Jan. 22, 2016, 11:15 p.m., Ryan Blue wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java, line 214
> > <https://reviews.apache.org/r/41821/diff/3/?file=1195966#file1195966line214>
> >
> >     This is a great test for round trip, but I think we should also have a test with a set of corrupt values from a file written with 5.5 in America/Los_Angeles (since that's convenient).

I added the negative tests to parquet_int96_timestamp_errors.q


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/#review115731
-----------------------------------------------------------


On Jan. 13, 2016, 8:36 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41821/
> -----------------------------------------------------------
> 
> (Updated Jan. 13, 2016, 8:36 p.m.)
> 
> 
> Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.
> 
> 
> Bugs: HIVE-12767
>     https://issues.apache.org/jira/browse/HIVE-12767
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following exit criteria is addressed in this patch:
> 
> * Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.
> 
> * Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.
> 
> * New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
>   * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
>   * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.
> 
> To set the timezone table property, use this:
>   create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
>   
> To set UTC as default timezone table property on new tables created, use this: 
>   set parquet.mr.int96.enable.utc.write.zone=true;
>   create table tbl2 (ts timestamp) stored as parquet;
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1bcdc5f49e1a4a0f357842e88cf5fd359685b5ef 
>   data/files/impala_int96_timestamp.parq PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java deec8bba45c130c5dfdc482522c0825a71af9d2c 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41821/diff/
> 
> 
> Testing
> -------
> 
> Added unit and q-tests:
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>


Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

Posted by Ryan Blue <bl...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/#review115731
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java (line 139)
<https://reviews.apache.org/r/41821/#comment176776>

    One of the test cases listed in the scope doc is an unknown zone ID, like "Antarctica/South_Pole", that isn't valid. In that case the table should not be readable because the correct zone offset can't be determined.
    
    `TimeZone.getTimeZone(String)` will return GMT when the zone id isn't recognized, so I'm concerned that this will do the wrong thing and use GMT in some cases where the zone should be rejected.
    
    See https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getTimeZone(java.lang.String)
    You might need to make a set of available IDs using https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getAvailableIDs()
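
    For illustration, a minimal sketch of that kind of validation using plain java.util.TimeZone (the helper name and error message are made up, not part of the patch):

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.Set;
        import java.util.TimeZone;

        public final class TimeZoneValidation {
          // TimeZone.getAvailableIDs() lists every ID this JVM recognizes; any
          // other string silently falls back to GMT in TimeZone.getTimeZone(String).
          private static final Set<String> VALID_IDS =
              new HashSet<String>(Arrays.asList(TimeZone.getAvailableIDs()));

          public static TimeZone validateZone(String id) {
            if (!VALID_IDS.contains(id)) {
              throw new IllegalArgumentException(
                  "Invalid time zone id in parquet.mr.int96.write.zone: " + id);
            }
            return TimeZone.getTimeZone(id);
          }
        }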



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java (line 168)
<https://reviews.apache.org/r/41821/#comment176778>

    It looks like this removes support for `HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION.varname`. I think we should maintain support for that configuration setting since customers may already be using it and we don't want to change behavior. If it is set to true, it should cause the calendar to use UTC (so no adjustment is made). However, the new table property should override the Hive setting. So I think the check should be in the Strings.isNullOrEmpty case.
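
    A rough sketch of that precedence, with a hypothetical helper (the real converter code and config lookup differ):

        import java.util.Calendar;
        import java.util.TimeZone;
        import com.google.common.base.Strings;

        // Pick the calendar used to interpret int96 values on read.
        static Calendar chooseReadCalendar(String tableZoneProperty, boolean skipConversion) {
          if (Strings.isNullOrEmpty(tableZoneProperty)) {
            // No table property: honor the existing Hive setting. When
            // hive.parquet.timestamp.skip.conversion is true, use UTC so that
            // no adjustment is applied; otherwise fall back to the local zone.
            return skipConversion
                ? Calendar.getInstance(TimeZone.getTimeZone("UTC"))
                : Calendar.getInstance();
          }
          // Table property set: it overrides the Hive setting.
          return Calendar.getInstance(TimeZone.getTimeZone(tableZoneProperty));
        }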



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java (line 57)
<https://reviews.apache.org/r/41821/#comment176779>

    Nit: We use UTC elsewhere. I believe that they are identical for our purposes, but I want to reduce confusion and would recommend UTC over GMT.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java (line 211)
<https://reviews.apache.org/r/41821/#comment176783>

    The file metadata is also passed in [via InitContext](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L173). If the file has the write zone property set, then it should be validated against the configured write zone (if set). When present, the file's value should override the table value and there should be a warning if the table value doesn't match the file value. That covers cases where files are moved from one table to another.
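
    Roughly what that reconciliation could look like (the metadata key name and the logging are illustrative, not the patch's actual code):

        import java.util.Map;
        import org.slf4j.Logger;

        // Prefer the zone recorded in the file's key/value metadata over the
        // table property, warning when the two disagree.
        static String resolveWriteZone(Map<String, String> fileKeyValueMetadata,
                                       String tableZone, Logger log) {
          String fileZone = fileKeyValueMetadata.get("writer.time.zone");  // hypothetical key
          if (fileZone == null) {
            return tableZone;
          }
          if (tableZone != null && !tableZone.equals(fileZone)) {
            log.warn("Parquet file declares write zone " + fileZone
                + " but the table property says " + tableZone + "; using the file value");
          }
          return fileZone;
        }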



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java (line 161)
<https://reviews.apache.org/r/41821/#comment176785>

    Shouldn't this use UTC, which will apply an offset of 0?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java (line 165)
<https://reviews.apache.org/r/41821/#comment176787>

    When will the zone property be set before this method? Is this how the table property is passed in?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java (line 21)
<https://reviews.apache.org/r/41821/#comment176788>

    I think calling this the default write zone is not quite correct. It is the default zone when the Hive config property is set, but the current zone is the default otherwise. What about calling this the PARQUET_INT96_NO_ADJUSTMENT_ZONE?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java (line 53)
<https://reviews.apache.org/r/41821/#comment176792>

    This is a little odd. Normally, I would consider GMT/UTC to be the calendar that applies no adjustment. But what is happening in this code is that the UTC millisecond value is being constructed using values in that zone and then a new timestamp is created from it that is in the local zone. So local->local produces no adjustment, while GMT->local does.
    
    I was thinking that GMT would always be used to create the milliseconds, then the calendar would be used to adjust the value by some amount. UTC by 0, EST by -5, and PST by -8.
    
    I think it may be easier to have a method that does a single conversion to UTC in milliseconds. Then, adjust the value using a static offset like the one from `TimeZone.getOffset(utc_ms)`. That would result in a bit less work done here and I think would make the adjustment easier to reason about.
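
    As a rough illustration of that suggestion (a sketch only, not the patch's code; whether the offset is added or subtracted depends on the read/write direction):

        import java.sql.Timestamp;
        import java.util.TimeZone;

        // Single conversion: treat the decoded int96 value as UTC milliseconds,
        // then shift by the writer zone's offset at that instant.
        static Timestamp adjustFromUtc(long utcMillis, TimeZone writerZone) {
          // getOffset(utcMillis) is 0 for UTC, -5h for EST, -8h for PST
          // (standard time; DST changes the value for zones that observe it).
          int offsetMillis = writerZone.getOffset(utcMillis);
          return new Timestamp(utcMillis - offsetMillis);
        }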



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java (line 35)
<https://reviews.apache.org/r/41821/#comment176981>

    What is the purpose of this property? Is this just a way to pass the time zone? Why not use the property used elsewhere?



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java (line 227)
<https://reviews.apache.org/r/41821/#comment176983>

    It looks like this tests that the value written is the result of `getNanoTime(Timestamp,Calendar).toBinary()`, but doesn't test what that value is or should be. That is probably the intent, but I wanted to make sure.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java (line 214)
<https://reviews.apache.org/r/41821/#comment176985>

    This is a great test for round trip, but I think we should also have a test with a set of corrupt values from a file written with 5.5 in America/Los_Angeles (since that's convenient).
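
    One possible shape for such a test, sketched under the assumption that NanoTimeUtils exposes a getTimestamp(NanoTime, Calendar) counterpart to getNanoTime (a real corrupt-data test would instead hard-code the known-bad values read from the 5.5 file):

        import java.sql.Timestamp;
        import java.util.Calendar;
        import java.util.TimeZone;
        import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTime;
        import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils;
        import org.junit.Assert;
        import org.junit.Test;

        public class TestLosAngelesWrittenTimestamps {
          @Test
          public void testRoundTripWithWriterZone() {
            Calendar la = Calendar.getInstance(TimeZone.getTimeZone("America/Los_Angeles"));
            Timestamp original = Timestamp.valueOf("2016-01-01 01:01:01");

            // Serialize with an America/Los_Angeles calendar (roughly what a writer
            // running in that zone effectively did), then read it back declaring the
            // same writer zone: the original value should come back unchanged.
            NanoTime written = NanoTimeUtils.getNanoTime(original, la);
            Assert.assertEquals(original, NanoTimeUtils.getTimestamp(written, la));
          }
        }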


- Ryan Blue


On Jan. 13, 2016, 12:36 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41821/
> -----------------------------------------------------------
> 
> (Updated Jan. 13, 2016, 12:36 p.m.)
> 
> 
> Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.
> 
> 
> Bugs: HIVE-12767
>     https://issues.apache.org/jira/browse/HIVE-12767
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following exit criteria is addressed in this patch:
> 
> * Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.
> 
> * Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.
> 
> * New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
>   * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
>   * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.
> 
> To set the timezone table property, use this:
>   create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
>   
> To set UTC as default timezone table property on new tables created, use this: 
>   set parquet.mr.int96.enable.utc.write.zone=true;
>   create table tbl2 (ts timestamp) stored as parquet;
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1bcdc5f49e1a4a0f357842e88cf5fd359685b5ef 
>   data/files/impala_int96_timestamp.parq PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java deec8bba45c130c5dfdc482522c0825a71af9d2c 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41821/diff/
> 
> 
> Testing
> -------
> 
> Added unit and q-tests:
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>


Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/
-----------------------------------------------------------

(Updated Jan. 13, 2016, 8:36 p.m.)


Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.


Bugs: HIVE-12767
    https://issues.apache.org/jira/browse/HIVE-12767


Repository: hive-git


Description
-------

The following exit criteria are addressed in this patch:

* Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.

* Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.

* New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
  * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
  * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.

To set the timezone table property, use this:
  create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
  
To set UTC as the default timezone table property on newly created tables, use this:
  set parquet.mr.int96.enable.utc.write.zone=true;
  create table tbl2 (ts timestamp) stored as parquet;


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1bcdc5f49e1a4a0f357842e88cf5fd359685b5ef 
  data/files/impala_int96_timestamp.parq PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java deec8bba45c130c5dfdc482522c0825a71af9d2c 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java PRE-CREATION 
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/41821/diff/


Testing
-------

Added unit and q-tests:
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q


Thanks,

Sergio Pena


Re: Review Request 41821: HIVE-12767: Implement table property to address Parquet int96 timestamp bug

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/
-----------------------------------------------------------

(Updated Jan. 3, 2016, 12:38 a.m.)


Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and Szehon Ho.


Bugs: HIVE-12767
    https://issues.apache.org/jira/browse/HIVE-12767


Repository: hive-git


Description (updated)
-------

The following exit criteria are addressed in this patch:

* Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.

* Hive will write Parquet int96 timestamps using a time zone adjustment from the same table property, if set, or using the local time zone if it is absent. This keeps the data in the table consistent.

* New tables created by Hive will set the table property to UTC if the global option to set the property for new tables is enabled.
  * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set the property unless the global setting to do so is enabled.
  * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the property of the table that is copied.

To set the timezone table property, use this:
  create table tbl1 (ts timestamp) stored as parquet tblproperties ('parquet.mr.int96.write.zone'='PST');
  
To set UTC as the default timezone table property on newly created tables, use this:
  set parquet.mr.int96.enable.utc.write.zone=true;
  create table tbl2 (ts timestamp) stored as parquet;


Diffs
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 591c0ab097f1f384e287876fc25fd24c09006dfb 
  data/files/impala_int96_timestamp.parq PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ac0ecd9c31c0e84b8c4628c86b0fdab7d1842f28 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java bfb48a987ce89a373f3da63c9162546c6eda43a9 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java ec0dd818f688ab92feb46be4fb6040ede5ac756a 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 53f3b72b790d87a75a7cd1d77d8f011c29c41188 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java 74a1a82047613189678716f765bfaa9ac39b7618 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java aace48ee7d145d199163286d21e4ee7694140d6f 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 69272dc41dbc5fe29ab4c98e730b591c28f3a297 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 70491390ba2b90f32ef9963be7b19e57672241f3 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java PRE-CREATION 
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
  ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/41821/diff/


Testing
-------

Added unit and q-tests:
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
  ql/src/test/queries/clientpositive/parquet_int96_timestamp.q


Thanks,

Sergio Pena