You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2017/05/02 17:27:47 UTC

Re: Review Request 58501: HIVE-16469: Parquet timestamp table property is not always taken into account

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58501/#review173610
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
Lines 333 (patched)
<https://reviews.apache.org/r/58501/#comment246597>

    If getNextSplits() already sets the table property if it is not set, why are we doing it again here?



ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java
Lines 262 (patched)
<https://reviews.apache.org/r/58501/#comment246598>

    I've seen this check a few times on the code. Shouldn't be good to create a static method that wraps this check? like ParquetHiveSerDe.isParquetTable(table)?



ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java
Line 72 (original), 72 (patched)
<https://reviews.apache.org/r/58501/#comment246603>

    How does this work? I don't understand this change.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java
Lines 72 (patched)
<https://reviews.apache.org/r/58501/#comment246601>

    Can this new code be wrapped in another method?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
Line 181 (original), 181 (patched)
<https://reviews.apache.org/r/58501/#comment246600>

    Is this compatible with old parquet tables? if the property is not set, then the validateTimeZone    might fail, right? If so, do we want to fail reading tables that do not have a property set?
    
    Something else to consider, if a user sets a timezone improperly in a different tool or something      happened that we got an invalid timezone, then do we want to fail when reading those files? Just      wondering this scenario, no need to fix it right away.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java
Lines 32 (patched)
<https://reviews.apache.org/r/58501/#comment246595>

    Could you write the information about parameters?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java
Lines 35 (patched)
<https://reviews.apache.org/r/58501/#comment246596>

    Why is Map<?, ?> used instead of Map<String, String>? Aren't all table properties key, value string pairs?
    
    Also, the ensureTablePropertySet() name seems not related to what we want to do. I thought it was going to throw an exception if the property was not set, but it is setting the value on the JobConf. Should we use a different name, such as setParquetTimeZoneIfNotSet(),      setParquetTimeZoneIfAbsent() or something like that helps us understand quickly without looking at the javadoc.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java
Lines 168 (patched)
<https://reviews.apache.org/r/58501/#comment246594>

    Shouldn't we throw an IllegalArgumentException here? The same for line 174?
    
    I think it makes more sense to use the above exception when arguments are not valid.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/AbstractTestParquetDirect.java
Line 24 (original), 24 (patched)
<https://reviews.apache.org/r/58501/#comment246599>

    Keep the standard here. Let's import each class instead of all.


- Sergio Pena


On April 20, 2017, 2:11 p.m., Barna Zsombor Klara wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58501/
> -----------------------------------------------------------
> 
> (Updated April 20, 2017, 2:11 p.m.)
> 
> 
> Review request for hive, Sergio Pena and Zoltan Ivanfi.
> 
> 
> Bugs: HIVE-16469
>     https://issues.apache.org/jira/browse/HIVE-16469
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16469: Parquet timestamp table property is not always taken into account
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 917e565f28b2c9aaea18033ea3b6b20fa41fcd0a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java 004bb2f60299a0635b8f9ca7649ead00b8e16d08 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java 9c3a664b9aea2d6e050ffe2d7626127827dbc52a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java 1bd4db7805689ae1f91921ffbb5ff7da59f4bf60 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java f4fadbb61bf45f62945700284c0b050f0984b696 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java 2954601ce5bb25905cdb29ca0ca4551c2ca12b95 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java b339cc4347eea143dca2f6d98f9aa7777afdc427 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java dbd6fb3d0bc8c753abf86e99b52377617f248b5a 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/AbstractTestParquetDirect.java c81499a91c84af3ba33f335506c1c44e7085f13d 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRowGroupFilter.java bf363f32a3ac0a4d790e2925d802c6e210adfb4b 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java f2d79cf9d215e9a6e2a5e88cfc78378be860fd1f 
>   ql/src/test/queries/clientnegative/parquet_int96_alter_invalid_timezone.q PRE-CREATION 
>   ql/src/test/queries/clientnegative/parquet_int96_create_invalid_timezone.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q 6eadd1b0a3313cbba7a798890b802baae302749e 
>   ql/src/test/results/clientnegative/parquet_int96_alter_invalid_timezone.q.out PRE-CREATION 
>   ql/src/test/results/clientnegative/parquet_int96_create_invalid_timezone.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out b9a3664458a83f1856e4bc59eba5d56665df61cc 
>   ql/src/test/results/clientpositive/spark/parquet_int96_timestamp.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58501/diff/3/
> 
> 
> Testing
> -------
> 
> Added qtests for the following cases:
> - order by clause
> - selfjoin
> - calling UDFs with the timestamp values
> - where clause with a constant cast as timestamp
> - test for HoS
> - implicit and explicit timestamp conversions in insert clause
> 
> Tested manually but no qtests:
> - join between 3 tables all parquet but with different/no timezone property
> - subselect in from/where clauses
> - exists / union / no exists
> 
> 
> Thanks,
> 
> Barna Zsombor Klara
> 
>


Re: Review Request 58501: HIVE-16469: Parquet timestamp table property is not always taken into account

Posted by Barna Zsombor Klara <zs...@cloudera.com>.

> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java
> > Line 72 (original), 72 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695509#file1695509line72>
> >
> >     How does this work? I don't understand this change.

The user.timezone system property is used to set the default timezone of the JVM. If this is set on the HS2 instance then we need to propagate it to the child VM spawned by a local task or timestamps read by the local task will be incorrect.


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
> > Line 181 (original), 181 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695511#file1695511line181>
> >
> >     Is this compatible with old parquet tables? if the property is not set, then the validateTimeZone    might fail, right? If so, do we want to fail reading tables that do not have a property set?
> >     
> >     Something else to consider, if a user sets a timezone improperly in a different tool or something      happened that we got an invalid timezone, then do we want to fail when reading those files? Just      wondering this scenario, no need to fix it right away.

At this point the timezone property had to be set by ParquetTableUtils#setParquetTimeZoneIfAbsent either from the table properties or using the default value TimeZone#getDefault. The core problem is that I found it very difficult to make sure that <every> execution path will check the table property.
- The FetchOperator works when we have a local task, but the MapRedParquetInputFormat does not (MapWork is null). 
- The FetchOperator will not work with a complex query or an order by clause, but the InputFormat should work in this case. 
- For statistics gathering only the StatNoJobTask is executed.
I wanted to make sure that if we have an execution path I forgot about, then we should rather fail than to read incorrect timestamp values silently.
Similarly in my opinion if the timezone value is incorrect (because it was set by another tool) then we should fail instead of reading illadjusted values.


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java
> > Lines 35 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695512#file1695512line35>
> >
> >     Why is Map<?, ?> used instead of Map<String, String>? Aren't all table properties key, value string pairs?
> >     
> >     Also, the ensureTablePropertySet() name seems not related to what we want to do. I thought it was going to throw an exception if the property was not set, but it is setting the value on the JobConf. Should we use a different name, such as setParquetTimeZoneIfNotSet(),      setParquetTimeZoneIfAbsent() or something like that helps us understand quickly without looking at the javadoc.

We are calling this method with Properties objects (i.e. from the FetchOperator) and using Map<String, String> objects (i.e. from the StatsNoJobTask) and the common ancestor for these two is the Map<?,?>. While it is true that the table properties can only be Strings so the Properties should only contain String pairs I wanted to avoid the explicit cast.


- Barna Zsombor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58501/#review173610
-----------------------------------------------------------


On May 3, 2017, 12:59 p.m., Barna Zsombor Klara wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58501/
> -----------------------------------------------------------
> 
> (Updated May 3, 2017, 12:59 p.m.)
> 
> 
> Review request for hive, Sergio Pena and Zoltan Ivanfi.
> 
> 
> Bugs: HIVE-16469
>     https://issues.apache.org/jira/browse/HIVE-16469
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16469: Parquet timestamp table property is not always taken into account
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 757b7fc0eaa39c956014aa446ab1b07fc4abf8d3 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java 13750cdc34711d22f2adf2f483a6773ad05fb8d2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java 9c3a664b9aea2d6e050ffe2d7626127827dbc52a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java 1bd4db7805689ae1f91921ffbb5ff7da59f4bf60 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java f4fadbb61bf45f62945700284c0b050f0984b696 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java 2954601ce5bb25905cdb29ca0ca4551c2ca12b95 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 6413c5add6db2e8c9298285b15dba33ee74379a8 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java b339cc4347eea143dca2f6d98f9aa7777afdc427 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java dbd6fb3d0bc8c753abf86e99b52377617f248b5a 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/AbstractTestParquetDirect.java c81499a91c84af3ba33f335506c1c44e7085f13d 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRowGroupFilter.java bf363f32a3ac0a4d790e2925d802c6e210adfb4b 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java f2d79cf9d215e9a6e2a5e88cfc78378be860fd1f 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestNanoTimeUtils.java 1e10dbf18742524982606f1e6c6d447d683b2dc3 
>   ql/src/test/queries/clientnegative/parquet_int96_alter_invalid_timezone.q PRE-CREATION 
>   ql/src/test/queries/clientnegative/parquet_int96_create_invalid_timezone.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q 6eadd1b0a3313cbba7a798890b802baae302749e 
>   ql/src/test/results/clientnegative/parquet_int96_alter_invalid_timezone.q.out PRE-CREATION 
>   ql/src/test/results/clientnegative/parquet_int96_create_invalid_timezone.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out b9a3664458a83f1856e4bc59eba5d56665df61cc 
>   ql/src/test/results/clientpositive/spark/parquet_int96_timestamp.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58501/diff/4/
> 
> 
> Testing
> -------
> 
> Added qtests for the following cases:
> - order by clause
> - selfjoin
> - calling UDFs with the timestamp values
> - where clause with a constant cast as timestamp
> - test for HoS
> - implicit and explicit timestamp conversions in insert clause
> 
> Tested manually but no qtests:
> - join between 3 tables all parquet but with different/no timezone property
> - subselect in from/where clauses
> - exists / union / no exists
> 
> 
> Thanks,
> 
> Barna Zsombor Klara
> 
>