You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/10 12:30:21 UTC
[GitHub] [hudi] xushiyan commented on a diff in pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3

xushiyan commented on code in PR #3391:
URL: https://github.com/apache/hudi/pull/3391#discussion_r991105482


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -753,9 +753,10 @@ private FlinkOptions() {
   public static final ConfigOption<Boolean> HIVE_SYNC_SUPPORT_TIMESTAMP = ConfigOptions
       .key("hive_sync.support_timestamp")
       .booleanType()
-      .defaultValue(true)
-      .withDescription("INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type.\n"
-          + "Disabled by default for backward compatibility.");
+      .defaultValue(false)
+      .withDescription("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. "
+          + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. "
+          + "Previous versions keep being disabled by default.");

Review Comment:
   needs update



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetInputFormat.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.avro;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.apache.parquet.hadoop.util.ContextUtil;
+
+import java.io.IOException;
+
+/**
+ * Replace default HoodieParquetInputFormat to HoodieAvroParquetInputFormat so that
+ * we can use HoodieAvroParquetReader. It is only used when `hoodie.datasource.hive_sync.support_timestamp` is true
+ * or hive table contains timestamp/int column.
+ */
+public class HoodieAvroParquetInputFormat extends ParquetInputFormat<ArrayWritable> {

Review Comment:
   the purpose is to support timestamp in hive and converting to avro is just the internal logic, we shouldn't name it avro. Better call it `HoodieTimestampAwareParquetInputFormat` or something. people don't need to know how internally timestamp is being recognized.



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -109,4 +109,22 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
         .collect(Collectors.toList());
   }
 
+  /**
+   * if schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields
+   * We expect 3 cases to use parquet-avro reader to read timestamp column:
+   *  1. read columns contain timestamp type
+   *  2. no read columns and exists original columns contain timestamp type
+   *  3. no read columns and no original columns, but avro schema contains type
+   */
+  public static boolean supportTimestamp(Configuration conf) {
+    List<String> reads = Arrays.asList(getReadColumnNames(conf));
+    if (reads.isEmpty()) {
+      return getIOColumnTypes(conf).contains("timestamp");
+    }
+    List<String> names = getIOColumns(conf);
+    List<String> types = getIOColumnTypes(conf);
+    return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> reads.contains(names.get(i)))

Review Comment:
   not quite sure about `types.isEmpty()` being true then support timestamp?



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java:
##########
@@ -333,7 +333,8 @@ public class FlinkStreamerConfig extends Configuration {
   public Boolean hiveSyncSkipRoSuffix = false;
 
   @Parameter(names = {"--hive-sync-support-timestamp"}, description = "INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type.\n"
-      + "Disabled by default for backward compatibility.")
+      + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. "
+      + "Previous versions keep being disabled by default.")

Review Comment:
   needs update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org