Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/17 12:07:56 UTC

[GitHub] [hive] zabetak opened a new pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

zabetak opened a new pull request #2282:
URL: https://github.com/apache/hive/pull/2282


   ### What changes were proposed in this pull request?
       
   1. Add new read/write config properties to control legacy zone conversions in Parquet.
   2. Deprecate the hive.parquet.timestamp.legacy.conversion.enabled property since it is not clear whether it applies to conversions during read or during write.
   3. Use the file metadata and the new properties to choose between the new and old conversion rules.
   4. Update existing tests to remove usages of the now-deprecated hive.parquet.timestamp.legacy.conversion.enabled property.
   5. Simplify NanoTimeUtils#getTimestamp & NanoTimeUtils#getNanoTime by removing the 'skipConversion' parameter.
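
The metadata-driven selection in (3) can be sketched as below. This is an illustrative sketch, not the actual Hive code: the class, constants, and method names are hypothetical stand-ins for the logic that lives in `DataWritableReadSupport`.

```java
import java.util.Map;

class LegacyConversionResolver {
  // Hypothetical keys mirroring the markers the writer puts in the Parquet footer.
  static final String WRITER_ZONE_CONVERSION_LEGACY = "writer.zone.conversion.legacy";
  static final String WRITER_TIMEZONE = "writer.time.zone";

  /**
   * Decide whether to apply the legacy zone conversion when reading a file.
   * fileMeta is the key/value metadata from the Parquet footer;
   * readerDefault is the value of the read-side config property.
   */
  static boolean useLegacyConversion(Map<String, String> fileMeta, boolean readerDefault) {
    if (fileMeta.containsKey(WRITER_ZONE_CONVERSION_LEGACY)) {
      // The writer recorded how it converted timestamps: read the same way.
      return Boolean.parseBoolean(fileMeta.get(WRITER_ZONE_CONVERSION_LEGACY));
    }
    if (fileMeta.containsKey(WRITER_TIMEZONE)) {
      // A writer timezone without the legacy marker implies the new rules.
      return false;
    }
    // No useful metadata: fall back to the reader-side property.
    return readerDefault;
  }
}
```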
   
   ### Why are the changes needed?
   1. Give end users the ability to write backward-compatible timestamps in Parquet files so that the files can be read correctly by older versions.
   2. Improve the readability of the NanoTimeUtils APIs.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   1. Add a timestamp read/write compatibility test with Hive2 Parquet APIs (`TestParquetTimestampsHive2Compatibility`).
   2. Add a qtest writing timestamps in Parquet using legacy zone conversions (`parquet_int96_legacy_compatibility_timestamp.q`).
   ```
   mvn test -Dtest=*Timestamp*
   cd itests/qtest
   mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile_regex=".*timestamp.*" -Dtest.output.overwrite
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
jcamachor commented on pull request #2282:
URL: https://github.com/apache/hive/pull/2282#issuecomment-848834410


   @klcopp , could you take a look too? Thanks



[GitHub] [hive] zabetak commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r633480041



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
##########
@@ -523,10 +532,30 @@ private static MessageType getRequestedPrunedSchema(
           configuration, HiveConf.ConfVars.HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT)));
     }
 
-    String legacyConversion = ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED.varname;
-    if (!metadata.containsKey(legacyConversion)) {
-      metadata.put(legacyConversion, String.valueOf(HiveConf.getBoolVar(
-          configuration, HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED)));
+    if (!metadata.containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
+      final String legacyConversion;
+      if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
+        // If there is meta about the legacy conversion then the file should be read in the same way it was written. 
+        legacyConversion = keyValueMetaData.get(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY);
+      } else if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_TIMEZONE)) {
+        // If there is no meta about the legacy conversion but there is meta about the timezone then we can infer the
+        // file was written with the new rules.
+        legacyConversion = "false";
+      } else {

Review comment:
       This `if` block makes life a bit easier for users on (3.1.2, 3.2.0) since it automatically determines the appropriate conversion. It looks a bit odd though, so we could possibly remove it and require users on those versions to set the respective property accordingly. I would prefer keeping the code uniform over trying to cover edge cases.





[GitHub] [hive] zabetak commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r640520780



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
##########
@@ -536,7 +542,8 @@ public void write(Object value) {
         Long int64value = ParquetTimestampUtils.getInt64(ts, timeUnit);
         recordConsumer.addLong(int64value);

Review comment:
       You're right, just ignore my previous comment.





[GitHub] [hive] zabetak commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r640553855



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
##########
@@ -304,6 +305,14 @@ public static Boolean getWriterDateProleptic(Map<String, String> metadata) {
     return null;
   }
 
+  public static Boolean getWriterLegacyConversion(Map<String, String> metadata) {

Review comment:
       The method is not that useful after all. I removed it in https://github.com/apache/hive/pull/2282/commits/c9d13c4f87aeb5f442a399374935c4affc2804cc.





[GitHub] [hive] jcamachor merged pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
jcamachor merged pull request #2282:
URL: https://github.com/apache/hive/pull/2282


   



[GitHub] [hive] zabetak commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r633499434



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
##########
@@ -536,7 +542,8 @@ public void write(Object value) {
         Long int64value = ParquetTimestampUtils.getInt64(ts, timeUnit);
         recordConsumer.addLong(int64value);

Review comment:
       The fact that we do not perform/control legacy conversion when we store timestamps in the INT64 type can create problems if we end up comparing timestamps stored as INT96 with timestamps stored as INT64. Shall we try to make the new property (`hive.parquet.timestamp.write.legacy.conversion.enabled`) also affect the INT64 storage type?
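
To make the concern concrete: the INT64 path interprets the wall-clock value as UTC and stores epoch microseconds, with no zone-conversion step at all. The helper below is a hypothetical sketch of that behavior, not the actual `ParquetTimestampUtils.getInt64` code.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class Int64TimestampSketch {
  // Interpret the wall-clock timestamp as UTC and return epoch microseconds.
  // No writer-zone conversion is applied, unlike the INT96 legacy path.
  static long toInt64Micros(LocalDateTime ts) {
    long seconds = ts.toEpochSecond(ZoneOffset.UTC);
    return seconds * 1_000_000L + ts.getNano() / 1_000L;
  }
}
```

A value written this way is zone-independent, which is exactly why it can disagree with an INT96 value that went through a writer-zone legacy conversion.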





[GitHub] [hive] zabetak commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r639979235



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2197,9 +2197,14 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT("hive.parquet.date.proleptic.gregorian.default", false,
       "This value controls whether date type in Parquet files was written using the hybrid or proleptic\n" +
       "calendar. Hybrid is the default."),
-    HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED("hive.parquet.timestamp.legacy.conversion.enabled", true,

Review comment:
       I explored various options (e.g., a property validator, an alternative property name) but there does not seem to be a reliable way to throw an exception when the old property is used explicitly by the client.
   
   When property validation is on we have some [additional checks](https://github.com/apache/hive/blob/95ab0c05ab68284355d431e352fe59a3e2dd9d6c/ql/src/java/org/apache/hadoop/hive/ql/processors/SetProcessor.java#L254) but even then I think there are holes (e.g., a property set in `hive-site.xml`).
   
   If we want to protect users who might set this property explicitly, despite it being clearly noted as for debugging purposes only, then I guess we need to retain the old property name.
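
Retaining the old name could look like the fallback below. This is a sketch under assumptions: `hive.parquet.timestamp.read.legacy.conversion.enabled` is a hypothetical new read-side name, and the resolution order shown is one possible choice, not necessarily what the final patch does.

```java
import java.util.Properties;

class LegacyPropertyFallback {
  static final String OLD_PROP = "hive.parquet.timestamp.legacy.conversion.enabled";
  // Hypothetical new read-side property name.
  static final String NEW_READ_PROP = "hive.parquet.timestamp.read.legacy.conversion.enabled";

  // An explicitly set new property wins; otherwise honor the old property;
  // otherwise use the configured default.
  static boolean resolve(Properties conf, boolean defaultValue) {
    if (conf.containsKey(NEW_READ_PROP)) {
      return Boolean.parseBoolean(conf.getProperty(NEW_READ_PROP));
    }
    if (conf.containsKey(OLD_PROP)) {
      return Boolean.parseBoolean(conf.getProperty(OLD_PROP));
    }
    return defaultValue;
  }
}
```

This keeps explicit settings of the old property effective, which avoids silently changing results for users who set it despite the debugging-only note.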





[GitHub] [hive] zabetak commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r640553285



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2197,9 +2197,14 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT("hive.parquet.date.proleptic.gregorian.default", false,
       "This value controls whether date type in Parquet files was written using the hybrid or proleptic\n" +
       "calendar. Hybrid is the default."),
-    HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED("hive.parquet.timestamp.legacy.conversion.enabled", true,

Review comment:
       I restored the old property in https://github.com/apache/hive/pull/2282/commits/d24ac6815de37bf3966cee28805b594c6d48c114.





[GitHub] [hive] jcamachor commented on a change in pull request #2282: HIVE-25104: Backward incompatible timestamp serialization in Parquet for certain timezones

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r638140142



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2197,9 +2197,14 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT("hive.parquet.date.proleptic.gregorian.default", false,
       "This value controls whether date type in Parquet files was written using the hybrid or proleptic\n" +
       "calendar. Hybrid is the default."),
-    HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED("hive.parquet.timestamp.legacy.conversion.enabled", true,

Review comment:
       Although this was marked as being for debugging purposes only, if we are deprecating this property, shouldn't we try to fail when it is set explicitly? Otherwise, we may be silently changing results for users.
   
   If capturing the set command for a specific property proves complicated, another option could be to reuse the same name for one of the new properties (even if the names then do not contain `read`/`write`).

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
##########
@@ -523,10 +532,30 @@ private static MessageType getRequestedPrunedSchema(
           configuration, HiveConf.ConfVars.HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT)));
     }
 
-    String legacyConversion = ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED.varname;
-    if (!metadata.containsKey(legacyConversion)) {
-      metadata.put(legacyConversion, String.valueOf(HiveConf.getBoolVar(
-          configuration, HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED)));
+    if (!metadata.containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
+      final String legacyConversion;
+      if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
+        // If there is meta about the legacy conversion then the file should be read in the same way it was written. 
+        legacyConversion = keyValueMetaData.get(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY);
+      } else if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_TIMEZONE)) {
+        // If there is no meta about the legacy conversion but there is meta about the timezone then we can infer the
+        // file was written with the new rules.
+        legacyConversion = "false";
+      } else {

Review comment:
       @zabetak , I guess you mean this block?
   `else if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_TIMEZONE))`
   
       The question is whether the default value of the config property will give the desired results for those users. I believe it will not, i.e., if the value is not there, we assume the file was written applying conversions with the old APIs? Many users fall into that bucket (3.1.2, 3.2.0), so maybe it makes sense to keep this clause to preserve behavior, as you have done.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
##########
@@ -536,7 +542,8 @@ public void write(Object value) {
         Long int64value = ParquetTimestampUtils.getInt64(ts, timeUnit);
         recordConsumer.addLong(int64value);

Review comment:
       Isn't a timestamp in INT64 Parquet already stored as UTC? I think we tried to keep all these conversions away from the new type.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
##########
@@ -304,6 +305,14 @@ public static Boolean getWriterDateProleptic(Map<String, String> metadata) {
     return null;
   }
 
+  public static Boolean getWriterLegacyConversion(Map<String, String> metadata) {

Review comment:
       `getWriterLegacyConversion` -> `getWriterTimeZoneLegacyConversion`?




