Posted to commits@hudi.apache.org by fo...@apache.org on 2023/01/04 03:32:10 UTC

[hudi] branch release-0.12.1 updated (a5978cd230 -> 78fe5c73a4)

This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a change to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git


    from a5978cd230 [MINOR] Update release version to reflect published version  0.12.1
     new ac0d1d81a4 [MINOR] Adapt to tianqiong spark
     new 23412b2bee [MINOR] Add Zhiyan metrics reporter
     new d785b41f01 fix cherry pick err
     new ab5b4fa780 fix the bug, log file will not roll over to a new file
     new 553bb9eab4 [HUDI-4475] fix create table with not exists hoodie properties file
     new c1ceb628e5 [MINOR] fix Invalid value for YearOfEra
     new 1d029e668b add 'backup_invalid_parquet' procedure
     new 6dbe53e623 fix zhiyan reporter for metadata
     new f80900d91c [MINOR] Adapt to tianqiong spark
     new 9c94e388fe adapt tspark changes: backport 3.3 VectorizedParquetReader related code to 3.1
     new e45564102b fix file not exists for getFileSize
     new 895f260983 opt procedure backup_invalid_parquet
     new 4e66857849 fix RowDataProjection with project and projectAsValues's NPE
     new 8d692f38c1 [HUDI-5041] Fix lock metric register confict error (#6968)
     new ee07cc6a3b Remove proxy
     new 8ba01dc70a [HUDI-2624] Implement Non Index type for HUDI
     new 97ce2b7f7b temp_view_support (#6990)
     new 90c09053da [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (#7091)
     new ecd39e3ad7 add log to print scanInternal's logFilePath
     new 5f6d6ae42d remove hudi-kafka-connect module
     new 3c364bdf72 [MINOR] add integrity check of merged parquet file for HoodieMergeHandle.
     new 738e2cce8f [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table (#6741)
     new 7d6654c1d0 [HUDI-5178] Add Call show_table_properties for spark sql (#7161)
     new e95a9f56ae [HUDI-4526] Improve spillableMapBasePath when disk directory is full (#6284)
     new 9fbf3b920d [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
     new f02fef936b fix none index partition format
     new 005e913403 [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
     new 619b7504ca Reduce the scope and duration of holding checkpoint lock in stream read
     new f7fe437faf Fix tauth issue (merge request !102)
     new 00c3443cb4 optimize schema settings
     new ab5deef087 Merge branch 'optimize_schema_settings' into 'release-0.12.1' (merge request !108)
     new c070e0963a [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
     new b1f204fa55 Merge branch 'release-0.12.1' of https://git.woa.com/data-lake-technology/hudi into release-0.12.1
     new f2256ec94c exclude hudi-kafka-connect & add some api to support FLIP-27 source
     new 0cf7d3dac8 fix database default error
     new bddf061a79 [HUDI-5223] Partial failover for flink (#7208)
     new adc8aa6ebd remove ZhiyanReporter's report print
     new 0d71705ec2 [MINOR] add integrity check of parquet file for HoodieRowDataParquetWriter.
     new e0faa0bbe1 [HUDI-5350] Fix oom cause compaction event lost problem (#7408)
     new 5488c951e4 [HUDI-5314] add call help procedure (#7361)
     new 4fe2aec44a fix read log not exist
     new 79abc24265 improve checkstyle
     new d0b3b36e96 check parquet file does not exist
     new 4f005ea5d4 improve DropHoodieTableCommand
     new 78fe5c73a4 [HUDI-3572] support DAY_ROLLING strategy in ClusteringPlanPartitionFilterMode (#4966)

The 45 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 dev/settings.xml                                   | 247 +++++++++++++
 dev/tencent-install.sh                             | 158 ++++++++
 dev/tencent-release.sh                             | 155 ++++++++
 hudi-cli/pom.xml                                   |   4 +-
 hudi-client/hudi-client-common/pom.xml             |   7 +
 .../apache/hudi/async/AsyncPostEventService.java   |  93 +++++
 .../apache/hudi/client/BaseHoodieWriteClient.java  |  28 +-
 .../lock/metrics/HoodieLockMetrics.java            |  19 +-
 .../org/apache/hudi/config/HoodieIndexConfig.java  |  43 +++
 .../org/apache/hudi/config/HoodieMemoryConfig.java |   9 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  | 108 +++++-
 .../hudi/config/metrics/HoodieMetricsConfig.java   |  23 +-
 .../config/metrics/HoodieMetricsZhiyanConfig.java  | 143 +++++++
 .../java/org/apache/hudi/index/HoodieIndex.java    |   2 +-
 .../java/org/apache/hudi/io/HoodieMergeHandle.java |  23 +-
 .../src/main/java/org/apache/hudi/io/IOUtils.java  |  14 +
 .../apache/hudi/keygen/EmptyAvroKeyGenerator.java  |  70 ++++
 .../keygen/TimestampBasedAvroKeyGenerator.java     |   4 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |  11 +
 .../org/apache/hudi/metrics/HoodieMetrics.java     | 120 +++++-
 .../main/java/org/apache/hudi/metrics/Metrics.java |   1 +
 .../hudi/metrics/MetricsReporterFactory.java       |   4 +
 .../apache/hudi/metrics/MetricsReporterType.java   |   2 +-
 .../hudi/metrics/zhiyan/ZhiyanHttpClient.java      | 129 +++++++
 .../hudi/metrics/zhiyan/ZhiyanMetricsReporter.java |  60 +++
 .../apache/hudi/metrics/zhiyan/ZhiyanReporter.java | 169 +++++++++
 .../cluster/ClusteringPlanPartitionFilter.java     |  23 ++
 .../cluster/ClusteringPlanPartitionFilterMode.java |   3 +-
 .../hudi/table/action/commit/BucketInfo.java       |   4 +
 .../hudi/table/action/commit/BucketType.java       |   2 +-
 .../java/org/apache/hudi/tdbank/TDBankClient.java  | 103 ++++++
 .../java/org/apache/hudi/tdbank/TdbankConfig.java  |  82 +++++
 .../hudi/tdbank/TdbankHoodieMetricsEvent.java      | 110 ++++++
 .../apache/hudi/client/HoodieFlinkWriteClient.java |   6 +
 .../apache/hudi/index/FlinkHoodieIndexFactory.java |   2 +
 .../apache/hudi/index/FlinkHoodieNonIndex.java}    |  47 ++-
 .../java/org/apache/hudi/io/FlinkMergeHandle.java  |   8 +-
 .../io/storage/row/HoodieRowDataParquetWriter.java |   4 +
 hudi-client/hudi-spark-client/pom.xml              |   4 +-
 .../apache/hudi/client/SparkRDDWriteClient.java    |  22 ++
 .../apache/hudi/index/SparkHoodieIndexFactory.java |   3 +
 .../hudi/index/nonindex/SparkHoodieNonIndex.java}  |  55 ++-
 .../hudi/io/storage/row/HoodieRowCreateHandle.java |   5 +-
 ...eteKeyGenerator.java => EmptyKeyGenerator.java} |  53 ++-
 .../commit/BaseSparkCommitActionExecutor.java      |  17 +
 .../table/action/commit/UpsertPartitioner.java     |  35 +-
 .../TestSparkClusteringPlanPartitionFilter.java    |  29 ++
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  28 ++
 .../java/org/apache/hudi/common/fs/FSUtils.java    |  16 +-
 .../org/apache/hudi/common/model/FileSlice.java    |  13 +
 .../org/apache/hudi/common/model/HoodieKey.java    |   2 +
 .../hudi/common/table/HoodieTableConfig.java       |   6 +-
 .../table/log/AbstractHoodieLogRecordReader.java   |  41 ++-
 .../common/table/log/HoodieLogFormatReader.java    |   8 +-
 .../common/table/log/HoodieLogFormatWriter.java    |   9 +-
 .../table/log/HoodieMergedLogRecordScanner.java    |  12 +-
 .../table/timeline/HoodieActiveTimeline.java       |  18 +
 .../hudi/common/table/timeline/TimelineUtils.java  |   2 +-
 .../org/apache/hudi/common/util/DateTimeUtils.java |   8 +
 .../org/apache/hudi/common/util/FileIOUtils.java   |  36 ++
 .../metadata/FileSystemBackedTableMetadata.java    |   2 +
 hudi-examples/hudi-examples-spark/pom.xml          |   4 +-
 .../apache/hudi/configuration/FlinkOptions.java    |  22 +-
 .../apache/hudi/configuration/OptionsResolver.java |   4 +
 .../org/apache/hudi/sink/StreamWriteFunction.java  |  38 +-
 .../hudi/sink/StreamWriteOperatorCoordinator.java  |  39 ++
 .../hudi/sink/append/AppendWriteFunction.java      |   2 +-
 .../hudi/sink/bulk/BulkInsertWriteFunction.java    |  10 +-
 .../sink/common/AbstractStreamWriteFunction.java   |  23 +-
 .../hudi/sink/common/AbstractWriteFunction.java    | 103 ++++++
 .../hudi/sink/common/AbstractWriteOperator.java    |   9 +
 .../apache/hudi/sink/event/WriteMetadataEvent.java |  31 +-
 .../sink/nonindex/NonIndexStreamWriteFunction.java | 265 +++++++++++++
 .../NonIndexStreamWriteOperator.java}              |  12 +-
 .../apache/hudi/sink/utils/NonThrownExecutor.java  |   8 +-
 .../java/org/apache/hudi/sink/utils/Pipelines.java |   7 +
 .../hudi/source/StreamReadMonitoringFunction.java  |  33 +-
 .../apache/hudi/streamer/FlinkStreamerConfig.java  |   4 +
 .../apache/hudi/streamer/HoodieFlinkStreamer.java  |   4 +-
 .../org/apache/hudi/table/HoodieTableFactory.java  |  21 ++
 .../org/apache/hudi/table/HoodieTableSource.java   |  42 ++-
 .../table/format/mor/MergeOnReadInputFormat.java   |  29 ++
 .../table/format/mor/MergeOnReadInputSplit.java    |   8 +-
 .../java/org/apache/hudi/util/DataTypeUtils.java   | 141 +++++++
 .../java/org/apache/hudi/util/HoodiePipeline.java  |  14 +
 .../org/apache/hudi/util/RowDataProjection.java    |  21 +-
 .../apache/hudi/sink/ITTestDataStreamWrite.java    |  35 ++
 .../sink/TestWriteFunctionEventTimeExtract.java    | 232 ++++++++++++
 .../org/apache/hudi/sink/TestWriteMergeOnRead.java |  54 +++
 .../hudi/sink/utils/InsertFunctionWrapper.java     |   6 +
 .../sink/utils/StreamWriteFunctionWrapper.java     |  24 +-
 .../apache/hudi/sink/utils/TestDataTypeUtils.java  |  39 +-
 .../hudi/sink/utils/TestFunctionWrapper.java       |   6 +
 .../org/apache/hudi/sink/utils/TestWriteBase.java  |  48 +++
 .../test/java/org/apache/hudi/utils/TestData.java  |  34 ++
 .../hudi/utils/source/ContinuousFileSource.java    |   5 +
 .../realtime/AbstractRealtimeRecordReader.java     |  72 +++-
 .../realtime/HoodieHFileRealtimeInputFormat.java   |   2 +-
 .../realtime/HoodieParquetRealtimeInputFormat.java |  14 +-
 .../realtime/RealtimeCompactedRecordReader.java    |  25 +-
 .../hudi/hadoop/utils/HiveAvroSerializer.java      | 409 +++++++++++++++++++++
 .../utils/HoodieRealtimeInputFormatUtils.java      |  19 +-
 .../utils/HoodieRealtimeRecordReaderUtils.java     |   5 +
 .../hudi/hadoop/utils/TestHiveAvroSerializer.java  | 148 ++++++++
 hudi-integ-test/pom.xml                            |   4 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml    |  12 +-
 .../main/java/org/apache/hudi/DataSourceUtils.java |  15 +-
 .../scala/org/apache/hudi/HoodieCLIUtils.scala     |   2 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |  12 +-
 .../spark/sql/hudi/HoodieSqlCommonUtils.scala      |   6 +-
 .../AlterHoodieTableAddColumnsCommand.scala        |   1 +
 .../sql/hudi/command/DropHoodieTableCommand.scala  |  13 +-
 hudi-spark-datasource/hudi-spark/pom.xml           |  12 +-
 .../hudi/spark/sql/parser/HoodieSqlCommon.g4       |   6 +-
 .../hudi/command/MergeIntoHoodieTableCommand.scala |   3 +-
 ...e.scala => BackupInvalidParquetProcedure.scala} |  37 +-
 .../hudi/command/procedures/BaseProcedure.scala    |   4 +-
 ...ToTableProcedure.scala => CopyToTempView.scala} |  60 ++-
 .../hudi/command/procedures/HelpProcedure.scala    | 125 +++++++
 .../hudi/command/procedures/HoodieProcedures.scala |   9 +
 ...cala => ShowCommitExtraMetadataProcedure.scala} |  64 ++--
 ...re.scala => ShowTablePropertiesProcedure.scala} |  40 +-
 .../sql/parser/HoodieSqlCommonAstBuilder.scala     |  21 +-
 .../java/org/apache/hudi/TestDataSourceUtils.java  |   2 +-
 .../org/apache/hudi/TestHoodieSparkSqlWriter.scala |   5 +-
 .../test/scala/org/apache/hudi/TestNonIndex.scala  | 110 ++++++
 ...ala => TestBackupInvalidParquetProcedure.scala} |  24 +-
 .../sql/hudi/procedure/TestCommitsProcedure.scala  |  54 ++-
 .../procedure/TestCopyToTempViewProcedure.scala    | 168 +++++++++
 .../sql/hudi/procedure/TestHelpProcedure.scala     |  84 +++++
 ...cala => TestShowTablePropertiesProcedure.scala} |  15 +-
 hudi-spark-datasource/hudi-spark2/pom.xml          |  12 +-
 .../org/apache/hudi/internal/DefaultSource.java    |   1 +
 hudi-spark-datasource/hudi-spark3-common/pom.xml   |   2 +-
 .../apache/hudi/spark3/internal/DefaultSource.java |   4 +-
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml      |   2 +-
 .../datasources/Spark31NestedSchemaPruning.scala   |  24 +-
 .../parquet/Spark31HoodieParquetFileFormat.scala   |  20 +-
 .../hudi/command/Spark31AlterTableCommand.scala    |   2 +-
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml      |   6 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml      |   6 +-
 hudi-sync/hudi-hive-sync/pom.xml                   |   4 +-
 hudi-utilities/pom.xml                             |  10 +-
 .../org/apache/hudi/utilities/UtilHelpers.java     |  38 +-
 packaging/hudi-integ-test-bundle/pom.xml           |   8 +-
 pom.xml                                            | 130 ++++---
 146 files changed, 5170 insertions(+), 542 deletions(-)
 create mode 100644 dev/settings.xml
 create mode 100644 dev/tencent-install.sh
 create mode 100644 dev/tencent-release.sh
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/AsyncPostEventService.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsZhiyanConfig.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanHttpClient.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TDBankClient.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankConfig.java
 create mode 100644 hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankHoodieMetricsEvent.java
 copy hudi-client/{hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/JavaUpsertCommitActionExecutor.java => hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieNonIndex.java} (52%)
 copy hudi-client/{hudi-flink-client/src/main/java/org/apache/hudi/index/state/FlinkInMemoryStateIndex.java => hudi-spark-client/src/main/java/org/apache/hudi/index/nonindex/SparkHoodieNonIndex.java} (51%)
 copy hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/{GlobalDeleteKeyGenerator.java => EmptyKeyGenerator.java} (61%)
 create mode 100644 hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteFunction.java
 copy hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/{bucket/BucketStreamWriteOperator.java => nonindex/NonIndexStreamWriteOperator.java} (74%)
 create mode 100644 hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteFunctionEventTimeExtract.java
 copy hudi-cli/src/test/java/org/apache/hudi/cli/TestSparkUtil.java => hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestDataTypeUtils.java (50%)
 create mode 100644 hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
 create mode 100644 hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/utils/TestHiveAvroSerializer.java
 copy hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/{ShowInvalidParquetProcedure.scala => BackupInvalidParquetProcedure.scala} (66%)
 copy hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/{CopyToTableProcedure.scala => CopyToTempView.scala} (74%)
 create mode 100644 hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HelpProcedure.scala
 copy hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/{ShowCommitWriteStatsProcedure.scala => ShowCommitExtraMetadataProcedure.scala} (69%)
 copy hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/{ShowSavepointsProcedure.scala => ShowTablePropertiesProcedure.scala} (58%)
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestNonIndex.scala
 copy hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/{TestShowInvalidParquetProcedure.scala => TestBackupInvalidParquetProcedure.scala} (78%)
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCopyToTempViewProcedure.scala
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestHelpProcedure.scala
 copy hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/{TestShowFsPathDetailProcedure.scala => TestShowTablePropertiesProcedure.scala} (72%)


[hudi] 07/45: add 'backup_invalid_parquet' procedure

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 1d029e668bde07f764d3781d51b6c18c6fc025e1
Author: jiimmyzhan <ji...@tencent.com>
AuthorDate: Wed Aug 24 22:27:11 2022 +0800

    add 'backup_invalid_parquet' procedure
---
 .../procedures/BackupInvalidParquetProcedure.scala | 89 ++++++++++++++++++++++
 .../hudi/command/procedures/HoodieProcedures.scala |  1 +
 .../TestBackupInvalidParquetProcedure.scala        | 83 ++++++++++++++++++++
 3 files changed, 173 insertions(+)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala
new file mode 100644
index 0000000000..fbbb1247fa
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.SerializableConfiguration
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.ParquetFileReader
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util.function.Supplier
+
+class BackupInvalidParquetProcedure extends BaseProcedure with ProcedureBuilder {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "path", DataTypes.StringType, None)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("backup_path", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("invalid_parquet_size", DataTypes.LongType, nullable = true, Metadata.empty))
+  )
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val srcPath = getArgValueOrDefault(args, PARAMETERS(0)).get.asInstanceOf[String]
+    val backupPath = new Path(srcPath, ".backup").toString
+    val fs = FSUtils.getFs(backupPath, jsc.hadoopConfiguration())
+    fs.mkdirs(new Path(backupPath))
+
+    val partitionPaths: java.util.List[String] = FSUtils.getAllPartitionPaths(new HoodieSparkEngineContext(jsc), srcPath, false, false)
+    val javaRdd: JavaRDD[String] = jsc.parallelize(partitionPaths, partitionPaths.size())
+    val serHadoopConf = new SerializableConfiguration(jsc.hadoopConfiguration())
+    val invalidParquetCount = javaRdd.rdd.map(part => {
+      val fs = FSUtils.getFs(new Path(srcPath), serHadoopConf.get())
+      FSUtils.getAllDataFilesInPartition(fs, FSUtils.getPartitionPath(srcPath, part))
+    }).flatMap(_.toList)
+      .filter(status => {
+        val filePath = status.getPath
+        var isInvalid = false
+        if (filePath.toString.endsWith(".parquet")) {
+          try ParquetFileReader.readFooter(serHadoopConf.get(), filePath, SKIP_ROW_GROUPS).getFileMetaData catch {
+            case e: Exception =>
+              isInvalid = e.getMessage.contains("is not a Parquet file")
+              filePath.getFileSystem(serHadoopConf.get()).rename(filePath, new Path(backupPath, filePath.getName))
+          }
+        }
+        isInvalid
+      })
+      .count()
+    Seq(Row(backupPath, invalidParquetCount))
+  }
+
+  override def build = new BackupInvalidParquetProcedure()
+}
+
+object BackupInvalidParquetProcedure {
+  val NAME = "backup_invalid_parquet"
+
+  def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] {
+    override def get(): ProcedureBuilder = new BackupInvalidParquetProcedure()
+  }
+}
+
+
+
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index b2bbec8489..0917c2b70e 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -80,6 +80,7 @@ object HoodieProcedures {
       ,(ValidateHoodieSyncProcedure.NAME, ValidateHoodieSyncProcedure.builder)
       ,(ShowInvalidParquetProcedure.NAME, ShowInvalidParquetProcedure.builder)
       ,(HiveSyncProcedure.NAME, HiveSyncProcedure.builder)
+      ,(BackupInvalidParquetProcedure.NAME, BackupInvalidParquetProcedure.builder)
     )
   }
 }
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestBackupInvalidParquetProcedure.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestBackupInvalidParquetProcedure.scala
new file mode 100644
index 0000000000..2e54f40fb3
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestBackupInvalidParquetProcedure.scala
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.procedure
+
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.common.fs.FSUtils
+
+class TestBackupInvalidParquetProcedure extends HoodieSparkProcedureTestBase {
+  test("Test Call backup_invalid_parquet Procedure") {
+    withTempDir { tmp =>
+      val tableName = generateTableName
+      val basePath = s"${tmp.getCanonicalPath}/$tableName"
+      // create table
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  name string,
+           |  price double,
+           |  ts long
+           |) using hudi
+           | partitioned by (ts)
+           | location '$basePath'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  preCombineField = 'ts'
+           | )
+       """.stripMargin)
+      // insert data to table
+      spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
+      spark.sql(s"insert into $tableName select 2, 'a2', 20, 1500")
+
+      // Check required fields
+      checkExceptionContain(s"""call backup_invalid_parquet(limit => 10)""")(
+        s"Argument: path is required")
+
+      val fs = FSUtils.getFs(basePath, spark.sparkContext.hadoopConfiguration)
+      val invalidPath1 = new Path(basePath, "ts=1000/1.parquet")
+      val out1 = fs.create(invalidPath1)
+      out1.write(1)
+      out1.close()
+
+      val invalidPath2 = new Path(basePath, "ts=1500/2.parquet")
+      val out2 = fs.create(invalidPath2)
+      out2.write(1)
+      out2.close()
+
+      val result1 = spark.sql(
+        s"""call show_invalid_parquet(path => '$basePath')""".stripMargin).collect()
+      assertResult(2) {
+        result1.length
+      }
+
+      val result2 = spark.sql(
+        s"""call backup_invalid_parquet(path => '$basePath')""".stripMargin).collect()
+      assertResult(1) {
+        result2.length
+      }
+
+      val result3 = spark.sql(
+        s"""call show_invalid_parquet(path => '$basePath')""".stripMargin).collect()
+      assertResult(0) {
+        result3.length
+      }
+
+    }
+  }
+}
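
For reference, a minimal way to invoke the new procedure from a Spark session with the Hudi SQL extensions enabled is sketched below (the table path is a placeholder): it moves unreadable .parquet files into a ".backup" directory under the given path and returns one row with the backup location and the number of files moved.

    // usage sketch, assuming `spark` is a SparkSession configured for Hudi
    val rows = spark.sql("call backup_invalid_parquet(path => '/warehouse/hudi/my_table')").collect()
    // rows(0) is Row(backup_path, invalid_parquet_size)
    println(rows(0))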


[hudi] 31/45: [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c070e0963a08264ace559333baa5b96780ddc8df
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Tue Nov 29 23:33:22 2022 +0800

    [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
---
 .../java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java    | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
index 4a3674ec29..f4acc2e83a 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
@@ -511,6 +511,7 @@ public class StreamWriteOperatorCoordinator
     }
     setMinEventTime();
     doCommit(instant, writeResults);
+    resetMinEventTime();
     return true;
   }
 
@@ -532,6 +533,10 @@ public class StreamWriteOperatorCoordinator
     }
   }
 
+  public void resetMinEventTime() {
+    this.minEventTime = Long.MAX_VALUE;
+  }
+
   /**
    * Performs the actual commit action.
    */


[hudi] 28/45: [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 005e913403824fc6d5494bbefe8a370712656782
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Thu Nov 24 13:17:21 2022 +0800

    [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
---
 .../hudi/sink/StreamWriteOperatorCoordinator.java   | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
index 578bb10db5..4a3674ec29 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
@@ -511,28 +511,27 @@ public class StreamWriteOperatorCoordinator
     }
     setMinEventTime();
     doCommit(instant, writeResults);
-    resetMinEventTime();
     return true;
   }
 
   public void setMinEventTime() {
     if (commitEventTimeEnable) {
-      LOG.info("[setMinEventTime] receive event time for current commit: {} ", Arrays.stream(eventBuffer).map(WriteMetadataEvent::getMaxEventTime).map(String::valueOf)
-          .collect(Collectors.joining(", ")));
-      this.minEventTime = Arrays.stream(eventBuffer)
+      List<Long> eventTimes = Arrays.stream(eventBuffer)
           .filter(Objects::nonNull)
-          .filter(maxEventTime -> maxEventTime.getMaxEventTime() > 0)
           .map(WriteMetadataEvent::getMaxEventTime)
-          .min(Comparator.naturalOrder())
-          .map(aLong -> Math.min(aLong, this.minEventTime)).orElse(Long.MAX_VALUE);
+          .filter(maxEventTime -> maxEventTime > 0)
+          .collect(Collectors.toList());
+
+      if (!eventTimes.isEmpty()) {
+        LOG.info("[setMinEventTime] receive event time for current commit: {} ",
+            eventTimes.stream().map(String::valueOf).collect(Collectors.joining(", ")));
+        this.minEventTime = eventTimes.stream().min(Comparator.naturalOrder())
+            .map(aLong -> Math.min(aLong, this.minEventTime)).orElse(Long.MAX_VALUE);
+      }
       LOG.info("[setMinEventTime] minEventTime: {} ", this.minEventTime);
     }
   }
 
-  public void resetMinEventTime() {
-    this.minEventTime = Long.MAX_VALUE;
-  }
-
   /**
    * Performs the actual commit action.
    */
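
Taken together, the [HUDI-5095] patches in this series track the smallest positive max-event-time reported by the write tasks for the in-flight commit and clear it once the commit succeeds. A condensed sketch of that bookkeeping (names follow StreamWriteOperatorCoordinator, but the types are simplified and this is not the actual class):

    // smallest positive event time seen for the in-flight commit
    var minEventTime: Long = Long.MaxValue

    // called while collecting WriteMetadataEvent payloads before committing
    def setMinEventTime(maxEventTimes: Seq[Long]): Unit = {
      val positives = maxEventTimes.filter(_ > 0)
      if (positives.nonEmpty) minEventTime = math.min(positives.min, minEventTime)
    }

    // called after doCommit() succeeds so the next checkpoint starts fresh
    def resetMinEventTime(): Unit = minEventTime = Long.MaxValue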


[hudi] 40/45: [HUDI-5314] add call help procedure (#7361)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5488c951e41446a02df0f3d1060b3ac985f19370
Author: 苏承祥 <sc...@aliyun.com>
AuthorDate: Wed Dec 7 20:03:30 2022 +0800

    [HUDI-5314] add call help procedure (#7361)
    
    * add call help procedure
    
    Co-authored-by: 苏承祥 <su...@tuya.com>
    (cherry picked from commit 7dfe960415f8ead645a1fbdb711a14110c3265f2)
---
 .../hudi/spark/sql/parser/HoodieSqlCommon.g4       |   6 +-
 .../hudi/command/procedures/HelpProcedure.scala    | 125 +++++++++++++++++++++
 .../hudi/command/procedures/HoodieProcedures.scala |   5 +
 .../sql/parser/HoodieSqlCommonAstBuilder.scala     |  21 ++--
 .../sql/hudi/procedure/TestCommitsProcedure.scala  |   2 +-
 .../sql/hudi/procedure/TestHelpProcedure.scala     |  84 ++++++++++++++
 6 files changed, 231 insertions(+), 12 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/antlr4/org/apache/hudi/spark/sql/parser/HoodieSqlCommon.g4 b/hudi-spark-datasource/hudi-spark/src/main/antlr4/org/apache/hudi/spark/sql/parser/HoodieSqlCommon.g4
index 8643170f89..8a3106f7a5 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/antlr4/org/apache/hudi/spark/sql/parser/HoodieSqlCommon.g4
+++ b/hudi-spark-datasource/hudi-spark/src/main/antlr4/org/apache/hudi/spark/sql/parser/HoodieSqlCommon.g4
@@ -47,7 +47,7 @@
 
  statement
     : compactionStatement                                                       #compactionCommand
-    | CALL multipartIdentifier '(' (callArgument (',' callArgument)*)? ')'      #call
+    | CALL multipartIdentifier   callArgumentList?    #call
     | CREATE INDEX (IF NOT EXISTS)? identifier ON TABLE?
           tableIdentifier (USING indexType=identifier)?
           LEFT_PAREN columns=multipartIdentifierPropertyList RIGHT_PAREN
@@ -69,6 +69,10 @@
     : (db=IDENTIFIER '.')? table=IDENTIFIER
     ;
 
+ callArgumentList
+    : '(' (callArgument (',' callArgument)*)? ')'
+    ;
+
  callArgument
     : expression                    #positionalArgument
     | identifier '=>' expression    #namedArgument
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HelpProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HelpProcedure.scala
new file mode 100644
index 0000000000..b17d068e81
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HelpProcedure.scala
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util.function.Supplier
+
+class HelpProcedure extends BaseProcedure with ProcedureBuilder with Logging {
+
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.optional(0, "cmd", DataTypes.StringType, None)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("result", DataTypes.StringType, nullable = true, Metadata.empty)
+  ))
+
+
+  /**
+   * Returns the description of this procedure.
+   */
+  override def description: String = s"The procedure help command allows you to view all the commands currently provided, as well as their parameters and output fields."
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+    val line = "\n"
+    val tab = "\t"
+    if (args.map.isEmpty) {
+      val procedures: Map[String, Supplier[ProcedureBuilder]] = HoodieProcedures.procedures()
+      val result = new StringBuilder
+      result.append("synopsis").append(line)
+        .append(tab).append("call [command]([key1]=>[value1],[key2]=>[value2])").append(line)
+      result.append("commands and description").append(line)
+      procedures.keySet.foreach(name => {
+        val builderSupplier: Option[Supplier[ProcedureBuilder]] = procedures.get(name)
+        if (builderSupplier.isDefined) {
+          val procedure: Procedure = builderSupplier.get.get().build
+          result.append(tab)
+            .append(name).append(tab)
+            .append(procedure.description).append(line)
+        }
+      })
+      result.append("You can use 'call help(cmd=>[command])' to view the detailed parameters of the command").append(line)
+      Seq(Row(result.toString()))
+    } else {
+      val cmdOpt: Option[Any] = getArgValueOrDefault(args, PARAMETERS(0))
+      assert(cmdOpt.isDefined, "The cmd parameter is required")
+      val cmd: String = cmdOpt.get.asInstanceOf[String]
+      val procedures: Map[String, Supplier[ProcedureBuilder]] = HoodieProcedures.procedures()
+      val builderSupplier: Option[Supplier[ProcedureBuilder]] = procedures.get(cmd.trim)
+      if (builderSupplier.isEmpty) {
+        throw new HoodieException(s"can not find $cmd command in procedures.")
+      }
+      val procedure: Procedure = builderSupplier.get.get().build
+      val result = new StringBuilder
+
+      result.append("parameters:").append(line)
+      // set parameters header
+      result.append(tab)
+        .append(lengthFormat("param")).append(tab)
+        .append(lengthFormat("type_name")).append(tab)
+        .append(lengthFormat("default_value")).append(tab)
+        .append(lengthFormat("required")).append(line)
+      procedure.parameters.foreach(param => {
+        result.append(tab)
+          .append(lengthFormat(param.name)).append(tab)
+          .append(lengthFormat(param.dataType.typeName)).append(tab)
+          .append(lengthFormat(param.default.toString)).append(tab)
+          .append(lengthFormat(param.required.toString)).append(line)
+      })
+      result.append("outputType:").append(line)
+      // set outputType header
+      result.append(tab)
+        .append(lengthFormat("name")).append(tab)
+        .append(lengthFormat("type_name")).append(tab)
+        .append(lengthFormat("nullable")).append(tab)
+        .append(lengthFormat("metadata")).append(line)
+      procedure.outputType.map(field => {
+        result.append(tab)
+          .append(lengthFormat(field.name)).append(tab)
+          .append(lengthFormat(field.dataType.typeName)).append(tab)
+          .append(lengthFormat(field.nullable.toString)).append(tab)
+          .append(lengthFormat(field.metadata.toString())).append(line)
+      })
+      Seq(Row(result.toString()))
+    }
+  }
+
+  def lengthFormat(string: String): String = {
+    String.format("%-30s", string)
+  }
+
+  override def build = new HelpProcedure()
+}
+
+object HelpProcedure {
+  val NAME = "help"
+
+  def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] {
+    override def get() = new HelpProcedure()
+  }
+}
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index d6131353c5..a2f2816746 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -28,6 +28,10 @@ object HoodieProcedures {
     if (builderSupplier.isDefined) builderSupplier.get.get() else null
   }
 
+  def procedures(): Map[String, Supplier[ProcedureBuilder]] = {
+    BUILDERS
+  }
+
   private def initProcedureBuilders: Map[String, Supplier[ProcedureBuilder]] = {
     Map((RunCompactionProcedure.NAME, RunCompactionProcedure.builder)
       ,(ShowCompactionProcedure.NAME, ShowCompactionProcedure.builder)
@@ -84,6 +88,7 @@ object HoodieProcedures {
       ,(CopyToTempView.NAME, CopyToTempView.builder)
       ,(ShowCommitExtraMetadataProcedure.NAME, ShowCommitExtraMetadataProcedure.builder)
       ,(ShowTablePropertiesProcedure.NAME, ShowTablePropertiesProcedure.builder)
+      ,(HelpProcedure.NAME, HelpProcedure.builder)
     )
   }
 }
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieSqlCommonAstBuilder.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieSqlCommonAstBuilder.scala
index d0e5ed6133..4005ef97e4 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieSqlCommonAstBuilder.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieSqlCommonAstBuilder.scala
@@ -27,7 +27,7 @@ import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedRelation}
 import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
-import org.apache.spark.sql.catalyst.parser.{ParseException, ParserInterface, ParserUtils}
+import org.apache.spark.sql.catalyst.parser.{ParserInterface, ParserUtils}
 import org.apache.spark.sql.catalyst.plans.logical._
 
 import java.util.Locale
@@ -92,13 +92,14 @@ class HoodieSqlCommonAstBuilder(session: SparkSession, delegate: ParserInterface
   }
 
   override def visitCall(ctx: CallContext): LogicalPlan = withOrigin(ctx) {
-    if (ctx.callArgument().isEmpty) {
-      throw new ParseException(s"Procedures arguments is empty", ctx)
+    if (ctx.callArgumentList() == null || ctx.callArgumentList().callArgument() == null || ctx.callArgumentList().callArgument().size() == 0) {
+      val name: Seq[String] = ctx.multipartIdentifier().parts.asScala.map(_.getText)
+      CallCommand(name, Seq())
+    } else {
+      val name: Seq[String] = ctx.multipartIdentifier().parts.asScala.map(_.getText)
+      val args: Seq[CallArgument] = ctx.callArgumentList().callArgument().asScala.map(typedVisit[CallArgument])
+      CallCommand(name, args)
     }
-
-    val name: Seq[String] = ctx.multipartIdentifier().parts.asScala.map(_.getText)
-    val args: Seq[CallArgument] = ctx.callArgument().asScala.map(typedVisit[CallArgument])
-    CallCommand(name, args)
   }
 
   /**
@@ -167,9 +168,9 @@ class HoodieSqlCommonAstBuilder(session: SparkSession, delegate: ParserInterface
     }
 
     val columns = ctx.columns.multipartIdentifierProperty.asScala
-        .map(_.multipartIdentifier).map(typedVisit[Seq[String]]).toSeq
+      .map(_.multipartIdentifier).map(typedVisit[Seq[String]]).toSeq
     val columnsProperties = ctx.columns.multipartIdentifierProperty.asScala
-        .map(x => (Option(x.options).map(visitPropertyKeyValues).getOrElse(Map.empty))).toSeq
+      .map(x => (Option(x.options).map(visitPropertyKeyValues).getOrElse(Map.empty))).toSeq
     val options = Option(ctx.indexOptions).map(visitPropertyKeyValues).getOrElse(Map.empty)
 
     CreateIndex(
@@ -223,7 +224,7 @@ class HoodieSqlCommonAstBuilder(session: SparkSession, delegate: ParserInterface
    * This should be called through [[visitPropertyKeyValues]] or [[visitPropertyKeys]].
    */
   override def visitPropertyList(
-      ctx: PropertyListContext): Map[String, String] = withOrigin(ctx) {
+                                  ctx: PropertyListContext): Map[String, String] = withOrigin(ctx) {
     val properties = ctx.property.asScala.map { property =>
       val key = visitPropertyKey(property.key)
       val value = visitPropertyValue(property.value)
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala
index 03cf26800d..febcd45279 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala
@@ -310,7 +310,7 @@ class TestCommitsProcedure extends HoodieSparkProcedureTestBase {
 
       // Check required fields
       checkExceptionContain(s"""call show_commit_extra_metadata()""")(
-        s"arguments is empty")
+        s"Argument: table is required")
 
       // collect commits for table
       val commits = spark.sql(s"""call show_commits(table => '$tableName', limit => 10)""").collect()
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestHelpProcedure.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestHelpProcedure.scala
new file mode 100644
index 0000000000..2150682f89
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestHelpProcedure.scala
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.procedure
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.hudi.command.procedures.{HoodieProcedures, Procedure, ProcedureBuilder}
+
+import java.util
+import java.util.function.Supplier
+
+class TestHelpProcedure extends HoodieSparkProcedureTestBase {
+
+  test("Test Call help Procedure with no params") {
+    val help: util.List[Row] = spark.sql("call help").collectAsList()
+    assert(help.size() == 1)
+
+    val help2: util.List[Row] = spark.sql("call help()").collectAsList()
+    assert(help2.size() == 1)
+
+    assert(help.get(0).toString().equals(help2.get(0).toString()))
+
+
+    val helpStr: String = help.get(0).toString()
+    val procedures: Map[String, Supplier[ProcedureBuilder]] = HoodieProcedures.procedures()
+
+    // check all procedures
+    procedures.keySet.foreach(name => {
+      // check cmd contains all procedure name
+      assert(helpStr.contains(name))
+      // check cmd contains all procedure description
+      val builderSupplier: Option[Supplier[ProcedureBuilder]] = procedures.get(name)
+      assert(builderSupplier.isDefined)
+      val procedure: Procedure = builderSupplier.get.get().build
+      assert(helpStr.contains(procedure.description))
+    })
+  }
+
+
+  test("Test Call help Procedure with params") {
+
+    // check not valid params
+    checkExceptionContain("call help(not_valid=>true)")("The cmd parameter is required")
+
+    val procedures: Map[String, Supplier[ProcedureBuilder]] = HoodieProcedures.procedures()
+
+    // check all procedures
+    procedures.keySet.foreach(name => {
+      val help: util.List[Row] = spark.sql(s"call help(cmd=>'$name')").collectAsList()
+      assert(help.size() == 1)
+
+      val helpStr: String = help.get(0).toString()
+      val builderSupplier: Option[Supplier[ProcedureBuilder]] = procedures.get(name)
+
+      assert(builderSupplier.isDefined)
+
+      // check result contains params
+      val procedure: Procedure = builderSupplier.get.get().build
+      procedure.parameters.foreach(params => {
+        assert(helpStr.contains(params.name))
+      })
+
+      // check result contains outputType
+      procedure.outputType.foreach(output => {
+        assert(helpStr.contains(output.name))
+      })
+    })
+  }
+
+}
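
A usage sketch, assuming a SparkSession with the Hudi extensions loaded: calling "help" with no arguments lists every registered procedure with its description, while passing cmd prints the parameters and output schema of a single procedure (the procedure name below is just an example).

    // list all procedures, then inspect one of them
    spark.sql("call help").show(truncate = false)
    spark.sql("call help(cmd => 'show_commits')").show(truncate = false)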


[hudi] 35/45: fix database default error

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0cf7d3dac8b5392e9357c386c4c90d49be2a5b0d
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Thu Dec 8 11:20:46 2022 +0800

    fix database default error
---
 .../src/main/java/org/apache/hudi/config/HoodieWriteConfig.java     | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 034519e64a..103123f980 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -46,7 +46,6 @@ import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
 import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
-import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.config.metrics.HoodieMetricsCloudWatchConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsConfig;
@@ -92,6 +91,7 @@ import java.util.UUID;
 import java.util.function.Supplier;
 import java.util.stream.Collectors;
 
+import static org.apache.hudi.common.util.StringUtils.isNullOrEmpty;
 import static org.apache.hudi.config.HoodieCleanConfig.CLEANER_POLICY;
 
 /**
@@ -1710,7 +1710,7 @@ public class HoodieWriteConfig extends HoodieConfig {
 
   public CompressionCodecName getParquetCompressionCodec() {
     String codecName = getString(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME);
-    return CompressionCodecName.fromConf(StringUtils.isNullOrEmpty(codecName) ? null : codecName);
+    return CompressionCodecName.fromConf(isNullOrEmpty(codecName) ? null : codecName);
   }
 
   public boolean parquetDictionaryEnabled() {
@@ -2308,7 +2308,7 @@ public class HoodieWriteConfig extends HoodieConfig {
     }
 
     public Builder withDatabaseName(String dbName) {
-      writeConfig.setValue(DATABASE_NAME, dbName);
+      writeConfig.setValue(DATABASE_NAME, isNullOrEmpty(dbName) ? "default" : dbName);
       return this;
     }
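
With this change, a null or empty database name handed to the write-config builder is normalized to "default" rather than stored as-is. A hedged sketch of the resulting behavior (the other builder calls are illustrative and may not be the minimal required set):

    import org.apache.hudi.config.HoodieWriteConfig

    val cfg = HoodieWriteConfig.newBuilder()
      .withPath("/tmp/hudi_tbl")    // placeholder base path
      .forTable("hudi_tbl")
      .withDatabaseName("")         // empty: stored as "default" after this patch
      .build()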
 


[hudi] 04/45: fix the bug, log file will not roll over to a new file

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ab5b4fa780c12b781509e44f88f7d849a5467e89
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Mon Jun 20 10:21:45 2022 +0800

    fix the bug, log file will not roll over to a new file
---
 .../org/apache/hudi/common/table/log/HoodieLogFormatWriter.java  | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
index 8dbe85efd1..60c124784a 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
@@ -94,7 +94,8 @@ public class HoodieLogFormatWriter implements HoodieLogFormat.Writer {
       Path path = logFile.getPath();
       if (fs.exists(path)) {
         boolean isAppendSupported = StorageSchemes.isAppendSupported(fs.getScheme());
-        if (isAppendSupported) {
+        boolean needRollOverToNewFile = fs.getFileStatus(path).getLen() > sizeThreshold;
+        if (isAppendSupported && !needRollOverToNewFile) {
           LOG.info(logFile + " exists. Appending to existing file");
           try {
             // open the path for append and record the offset
@@ -116,10 +117,12 @@ public class HoodieLogFormatWriter implements HoodieLogFormat.Writer {
             }
           }
         }
-        if (!isAppendSupported) {
+        if (!isAppendSupported || needRollOverToNewFile) {
           rollOver();
           createNewFile();
-          LOG.info("Append not supported.. Rolling over to " + logFile);
+          if (isAppendSupported && needRollOverToNewFile) {
+            LOG.info(String.format("current Log file size > %s roll over to a new log file", sizeThreshold));
+          }
         }
       } else {
         LOG.info(logFile + " does not exist. Create a new file");


[hudi] 02/45: [MINOR] Add Zhiyan metrics reporter

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 23412b2bee1e05231f100f922fe3785c39f2891b
Author: simonssu <si...@tencent.com>
AuthorDate: Wed May 25 21:08:45 2022 +0800

    [MINOR] Add Zhiyan metrics reporter
---
 dev/tencent-release.sh                             |   4 +-
 hudi-client/hudi-client-common/pom.xml             |   7 +
 .../apache/hudi/async/AsyncPostEventService.java   |  93 +++++++++++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  84 ++++++++++
 .../hudi/config/metrics/HoodieMetricsConfig.java   |  23 ++-
 .../config/metrics/HoodieMetricsZhiyanConfig.java  | 143 +++++++++++++++++
 .../org/apache/hudi/metrics/HoodieMetrics.java     | 120 ++++++++++++++-
 .../main/java/org/apache/hudi/metrics/Metrics.java |   1 +
 .../hudi/metrics/MetricsReporterFactory.java       |   4 +
 .../apache/hudi/metrics/MetricsReporterType.java   |   2 +-
 .../hudi/metrics/zhiyan/ZhiyanHttpClient.java      | 129 ++++++++++++++++
 .../hudi/metrics/zhiyan/ZhiyanMetricsReporter.java |  66 ++++++++
 .../apache/hudi/metrics/zhiyan/ZhiyanReporter.java | 170 +++++++++++++++++++++
 .../java/org/apache/hudi/tdbank/TDBankClient.java  | 103 +++++++++++++
 .../java/org/apache/hudi/tdbank/TdbankConfig.java  |  82 ++++++++++
 .../hudi/tdbank/TdbankHoodieMetricsEvent.java      | 110 +++++++++++++
 .../apache/hudi/client/HoodieFlinkWriteClient.java |   6 +
 .../hudi/common/table/HoodieTableConfig.java       |   6 +-
 .../apache/hudi/configuration/FlinkOptions.java    |  40 +++--
 .../apache/hudi/streamer/FlinkStreamerConfig.java  |   4 +
 .../apache/hudi/streamer/HoodieFlinkStreamer.java  |   4 +-
 .../org/apache/hudi/table/HoodieTableFactory.java  |   2 +
 .../main/java/org/apache/hudi/DataSourceUtils.java |  12 +-
 .../scala/org/apache/hudi/HoodieCLIUtils.scala     |   2 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |  12 +-
 .../AlterHoodieTableAddColumnsCommand.scala        |   1 +
 .../hudi/command/MergeIntoHoodieTableCommand.scala |   3 +-
 .../java/org/apache/hudi/TestDataSourceUtils.java  |   2 +-
 .../org/apache/hudi/TestHoodieSparkSqlWriter.scala |   5 +-
 .../org/apache/hudi/internal/DefaultSource.java    |   1 +
 .../apache/hudi/spark3/internal/DefaultSource.java |   4 +-
 .../hudi/command/Spark31AlterTableCommand.scala    |   2 +-
 32 files changed, 1200 insertions(+), 47 deletions(-)

diff --git a/dev/tencent-release.sh b/dev/tencent-release.sh
index 944f497070..b788d62dc7 100644
--- a/dev/tencent-release.sh
+++ b/dev/tencent-release.sh
@@ -116,9 +116,9 @@ function deploy_spark(){
   FLINK_VERSION=$3
 
   if [ ${release_repo} = "Y" ]; then
-    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30"
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
   else
-    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30"
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
   fi
 
 #  INSTALL_OPTIONS="-U -Drat.skip=true -Djacoco.skip=true -Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -DskipTests -s dev/settings.xml -T 2.5C"
diff --git a/hudi-client/hudi-client-common/pom.xml b/hudi-client/hudi-client-common/pom.xml
index 735b62957d..81bf645427 100644
--- a/hudi-client/hudi-client-common/pom.xml
+++ b/hudi-client/hudi-client-common/pom.xml
@@ -72,6 +72,13 @@
       <version>0.2.2</version>
     </dependency>
 
+    <!-- Tdbank -->
+    <dependency>
+      <groupId>com.tencent.tdbank</groupId>
+      <artifactId>TDBusSDK</artifactId>
+      <version>1.2.9</version>
+    </dependency>
+
     <!-- Dropwizard Metrics -->
     <dependency>
       <groupId>io.dropwizard.metrics</groupId>
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/AsyncPostEventService.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/AsyncPostEventService.java
new file mode 100644
index 0000000000..84cf82c913
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/AsyncPostEventService.java
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.async;
+
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.tdbank.TDBankClient;
+import org.apache.hudi.tdbank.TdbankHoodieMetricsEvent;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * Async service to post events to a remote service.
+ */
+public class AsyncPostEventService extends HoodieAsyncService {
+
+  private static final Logger LOG = LoggerFactory.getLogger(AsyncPostEventService.class);
+
+  private final transient ExecutorService executor = Executors.newSingleThreadExecutor();
+  private final LinkedBlockingQueue<TdbankHoodieMetricsEvent> queue;
+  private final TDBankClient client;
+
+  public AsyncPostEventService(HoodieWriteConfig config, LinkedBlockingQueue<TdbankHoodieMetricsEvent> queue) {
+    this.client = new TDBankClient(config.getTdbankTdmAddr(),
+      config.getTdbankTdmPort(), config.getTdbankBid());
+    this.queue = queue;
+  }
+
+  @Override
+  protected Pair<CompletableFuture, ExecutorService> startService() {
+    LOG.info("Start async post event service...");
+    return Pair.of(CompletableFuture.supplyAsync(() -> {
+      sendEvent();
+      return true;
+    }, executor), executor);
+  }
+
+  private void sendEvent() {
+    try {
+      while (!isShutdownRequested()) {
+        TdbankHoodieMetricsEvent event = queue.poll(10, TimeUnit.SECONDS);
+        if (event != null) {
+          client.sendMessage(event);
+        }
+      }
+      LOG.info("Post event service shutdown properly.");
+    } catch (Exception e) {
+      LOG.error("Error when send event to tdbank", e);
+    }
+  }
+
+  // TODO simplify code here across the async package.
+  public static void waitForCompletion(AsyncArchiveService asyncArchiveService) {
+    if (asyncArchiveService != null) {
+      LOG.info("Waiting for async archive service to finish");
+      try {
+        asyncArchiveService.waitForShutdown();
+      } catch (Exception e) {
+        throw new HoodieException("Error waiting for async archive service to finish", e);
+      }
+    }
+  }
+
+  public static void forceShutdown(AsyncArchiveService asyncArchiveService) {
+    if (asyncArchiveService != null) {
+      LOG.info("Shutting down async archive service...");
+      asyncArchiveService.shutdown(true);
+    }
+  }
+}
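
For reference, a minimal usage sketch (not part of the patch) of how this service is driven: a LinkedBlockingQueue of TdbankHoodieMetricsEvent is shared between the producer and the service, mirroring the wiring done in HoodieMetrics further down. The write config instance and the metric value are assumptions for illustration.

    import org.apache.hudi.async.AsyncPostEventService;
    import org.apache.hudi.config.HoodieWriteConfig;
    import org.apache.hudi.tdbank.TdbankHoodieMetricsEvent;

    import java.util.concurrent.LinkedBlockingQueue;

    public class PostEventExample {
      // Sketch only: writeConfig is assumed to carry the hoodie.tdbank.* settings.
      static void startAndPost(HoodieWriteConfig writeConfig) {
        LinkedBlockingQueue<TdbankHoodieMetricsEvent> queue = new LinkedBlockingQueue<>();
        AsyncPostEventService service = new AsyncPostEventService(writeConfig, queue);
        service.start(null); // drains the queue in a background thread, as in HoodieMetrics

        queue.add(TdbankHoodieMetricsEvent.newBuilder()
            .withDBName(writeConfig.getDatabaseName())
            .withTableName(writeConfig.getTableName())
            .withTableType(TdbankHoodieMetricsEvent.EventType.COMMIT)
            .addMetrics("totalRecordsWritten", 1000L) // illustrative value
            .build());
      }
    }
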
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 55979b481b..23bc0ee329 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -54,6 +54,7 @@ import org.apache.hudi.config.metrics.HoodieMetricsDatadogConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsGraphiteConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsJmxConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsPrometheusConfig;
+import org.apache.hudi.config.metrics.HoodieMetricsZhiyanConfig;
 import org.apache.hudi.exception.HoodieNotSupportedException;
 import org.apache.hudi.execution.bulkinsert.BulkInsertSortMode;
 import org.apache.hudi.index.HoodieIndex;
@@ -72,6 +73,7 @@ import org.apache.hudi.table.storage.HoodieStorageLayout;
 import org.apache.hadoop.hbase.io.compress.Compression;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
+import org.apache.hudi.tdbank.TdbankConfig;
 import org.apache.orc.CompressionKind;
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 
@@ -86,6 +88,7 @@ import java.util.List;
 import java.util.Map;
 import java.util.Objects;
 import java.util.Properties;
+import java.util.UUID;
 import java.util.function.Supplier;
 import java.util.stream.Collectors;
 
@@ -108,11 +111,21 @@ public class HoodieWriteConfig extends HoodieConfig {
   // It is here so that both the client and deltastreamer use the same reference
   public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
 
+  public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
+      .key(HoodieTableConfig.DATABASE_NAME.key())
+      .noDefaultValue()
+      .withDocumentation("Database name that will be used for identify table related to different databases.");
+
   public static final ConfigProperty<String> TBL_NAME = ConfigProperty
       .key(HoodieTableConfig.HOODIE_TABLE_NAME_KEY)
       .noDefaultValue()
       .withDocumentation("Table name that will be used for registering with metastores like HMS. Needs to be same across runs.");
 
+  public static final ConfigProperty<String> HOODIE_JOB_ID = ConfigProperty
+      .key("hoodie.job.id")
+      .noDefaultValue()
+      .withDocumentation("JobId use to identify a hoodie job. (e.g A spark job writes data to hoodie table.)");
+
   public static final ConfigProperty<String> PRECOMBINE_FIELD_NAME = ConfigProperty
       .key("hoodie.datasource.write.precombine.field")
       .defaultValue("ts")
@@ -962,6 +975,14 @@ public class HoodieWriteConfig extends HoodieConfig {
         HoodieTableConfig.TYPE, HoodieTableConfig.TYPE.defaultValue().name()).toUpperCase());
   }
 
+  public String getDatabaseName() {
+    return getString(DATABASE_NAME);
+  }
+
+  public String getHoodieJobId() {
+    return getString(HOODIE_JOB_ID);
+  }
+
   public String getPreCombineField() {
     return getString(PRECOMBINE_FIELD_NAME);
   }
@@ -1820,6 +1841,42 @@ public class HoodieWriteConfig extends HoodieConfig {
         HoodieMetricsDatadogConfig.METRIC_TAG_VALUES, ",").split("\\s*,\\s*")).collect(Collectors.toList());
   }
 
+  public int getZhiyanApiTimeoutSeconds() {
+    return getInt(HoodieMetricsZhiyanConfig.API_TIMEOUT_IN_SECONDS);
+  }
+
+  public int getZhiyanReportPeriodSeconds() {
+    return getInt(HoodieMetricsZhiyanConfig.REPORT_PERIOD_SECONDS);
+  }
+
+  public String getZhiyanReportServiceURL() {
+    return getString(HoodieMetricsZhiyanConfig.REPORT_SERVICE_URL);
+  }
+
+  public String getZhiyanReportServicePath() {
+    return getString(HoodieMetricsZhiyanConfig.REPORT_SERVICE_PATH);
+  }
+
+  public String getZhiyanHoodieJobName() {
+    String zhiyanJobName = getString(HoodieMetricsZhiyanConfig.ZHIYAN_JOB_NAME);
+    if (getBoolean(HoodieMetricsZhiyanConfig.ZHIYAN_RANDOM_JOBNAME_SUFFIX)) {
+      if (!zhiyanJobName.isEmpty()) {
+        return zhiyanJobName + "." + UUID.randomUUID();
+      } else {
+        return engineType + "." + UUID.randomUUID();
+      }
+    }
+    return zhiyanJobName;
+  }
+
+  public String getZhiyanAppMask() {
+    return getString(HoodieMetricsZhiyanConfig.ZHIYAN_METRICS_HOODIE_APPMASK);
+  }
+
+  public String getZhiyanSeclvlEnvName() {
+    return getString(HoodieMetricsZhiyanConfig.ZHIYAN_METRICS_HOODIE_SECLVLENNAME);
+  }
+
   public int getCloudWatchReportPeriodSeconds() {
     return getInt(HoodieMetricsCloudWatchConfig.REPORT_PERIOD_SECONDS);
   }
@@ -1872,6 +1929,10 @@ public class HoodieWriteConfig extends HoodieConfig {
     return getStringOrDefault(HoodieMetricsConfig.METRICS_REPORTER_PREFIX);
   }
 
+  public int getMetricEventQueueSize() {
+    return getIntOrDefault(HoodieMetricsConfig.METRICS_EVENT_QUEUE_SIZE);
+  }
+
   /**
    * memory configs.
    */
@@ -2135,6 +2196,21 @@ public class HoodieWriteConfig extends HoodieConfig {
     return metastoreConfig.enableMetastore();
   }
 
+  /**
+   * Tdbank configs.
+   */
+  public String getTdbankTdmAddr() {
+    return getString(TdbankConfig.TDBANK_TDM_ADDR);
+  }
+
+  public int getTdbankTdmPort() {
+    return getInt(TdbankConfig.TDBANK_TDM_PORT);
+  }
+
+  public String getTdbankBid() {
+    return getString(TdbankConfig.TDBANK_BID);
+  }
+
   public static class Builder {
 
     protected final HoodieWriteConfig writeConfig = new HoodieWriteConfig();
@@ -2159,6 +2235,7 @@ public class HoodieWriteConfig extends HoodieConfig {
     private boolean isMetricsJmxConfigSet = false;
     private boolean isMetricsGraphiteConfigSet = false;
     private boolean isLayoutConfigSet = false;
+    private boolean isTdbankConfigSet = false;
 
     public Builder withEngineType(EngineType engineType) {
       this.engineType = engineType;
@@ -2216,6 +2293,11 @@ public class HoodieWriteConfig extends HoodieConfig {
       return this;
     }
 
+    public Builder withDatabaseName(String dbName) {
+      writeConfig.setValue(DATABASE_NAME, dbName);
+      return this;
+    }
+
     public Builder withPreCombineField(String preCombineField) {
       writeConfig.setValue(PRECOMBINE_FIELD_NAME, preCombineField);
       return this;
@@ -2583,6 +2665,8 @@ public class HoodieWriteConfig extends HoodieConfig {
           HoodiePreCommitValidatorConfig.newBuilder().fromProperties(writeConfig.getProps()).build());
       writeConfig.setDefaultOnCondition(!isLayoutConfigSet,
           HoodieLayoutConfig.newBuilder().fromProperties(writeConfig.getProps()).build());
+      writeConfig.setDefaultOnCondition(!isTdbankConfigSet,
+          TdbankConfig.newBuilder().fromProperties(writeConfig.getProps()).build());
       writeConfig.setDefaultValue(TIMELINE_LAYOUT_VERSION_NUM, String.valueOf(TimelineLayoutVersion.CURR_VERSION));
 
       // isLockProviderPropertySet must be fetched before setting defaults of HoodieLockConfig
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
index 957b439051..787819be12 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
@@ -47,14 +47,14 @@ public class HoodieMetricsConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> TURN_METRICS_ON = ConfigProperty
       .key(METRIC_PREFIX + ".on")
-      .defaultValue(false)
+      .defaultValue(true)
       .sinceVersion("0.5.0")
       .withDocumentation("Turn on/off metrics reporting. off by default.");
 
   public static final ConfigProperty<MetricsReporterType> METRICS_REPORTER_TYPE_VALUE = ConfigProperty
       .key(METRIC_PREFIX + ".reporter.type")
-      .defaultValue(MetricsReporterType.GRAPHITE)
-      .sinceVersion("0.5.0")
+      .defaultValue(MetricsReporterType.ZHIYAN)
+      .sinceVersion("0.11.0")
       .withDocumentation("Type of metrics reporter.");
 
   // User defined
@@ -69,10 +69,15 @@ public class HoodieMetricsConfig extends HoodieConfig {
       .defaultValue("")
       .sinceVersion("0.11.0")
       .withInferFunction(cfg -> {
+        StringBuilder sb = new StringBuilder();
+        if (cfg.contains(HoodieTableConfig.DATABASE_NAME)) {
+          sb.append(cfg.getString(HoodieTableConfig.DATABASE_NAME));
+          sb.append(".");
+        }
         if (cfg.contains(HoodieTableConfig.NAME)) {
-          return Option.of(cfg.getString(HoodieTableConfig.NAME));
+          sb.append(cfg.getString(HoodieTableConfig.NAME));
         }
-        return Option.empty();
+        return sb.length() == 0 ? Option.empty() : Option.of(sb.toString());
       })
       .withDocumentation("The prefix given to the metrics names.");
 
@@ -94,6 +99,12 @@ public class HoodieMetricsConfig extends HoodieConfig {
       })
       .withDocumentation("Enable metrics for locking infra. Useful when operating in multiwriter mode");
 
+  public static final ConfigProperty<Integer> METRICS_EVENT_QUEUE_SIZE = ConfigProperty
+      .key(METRIC_PREFIX + ".event.queue.size")
+      .defaultValue(10_000_000)
+      .sinceVersion("0.11.0")
+      .withDocumentation("The prefix given to the metrics names.");
+
   /**
    * @deprecated Use {@link #TURN_METRICS_ON} and its methods instead
    */
@@ -197,6 +208,8 @@ public class HoodieMetricsConfig extends HoodieConfig {
           HoodieMetricsGraphiteConfig.newBuilder().fromProperties(hoodieMetricsConfig.getProps()).build());
       hoodieMetricsConfig.setDefaultOnCondition(reporterType == MetricsReporterType.CLOUDWATCH,
             HoodieMetricsCloudWatchConfig.newBuilder().fromProperties(hoodieMetricsConfig.getProps()).build());
+      hoodieMetricsConfig.setDefaultOnCondition(reporterType == MetricsReporterType.ZHIYAN,
+          HoodieMetricsZhiyanConfig.newBuilder().fromProperties(hoodieMetricsConfig.getProps()).build());
       return hoodieMetricsConfig;
     }
   }
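
As a quick illustration of the keys involved, a minimal sketch (not part of the patch) assuming HoodieMetricsConfig.Builder exposes the same fromProperties entry point as the other metrics config builders shown here; with this patch applied, metrics-on and the ZHIYAN reporter are already the defaults, so the explicit properties are only for clarity.

    import org.apache.hudi.config.metrics.HoodieMetricsConfig;

    import java.util.Properties;

    public class ZhiyanMetricsEnableExample {
      static HoodieMetricsConfig buildMetricsConfig() {
        Properties props = new Properties();
        props.setProperty("hoodie.metrics.on", "true");                   // TURN_METRICS_ON
        props.setProperty("hoodie.metrics.reporter.type", "ZHIYAN");      // METRICS_REPORTER_TYPE_VALUE
        props.setProperty("hoodie.metrics.event.queue.size", "10000000"); // METRICS_EVENT_QUEUE_SIZE
        // build() then pulls in the HoodieMetricsZhiyanConfig defaults because the reporter type is ZHIYAN.
        return HoodieMetricsConfig.newBuilder().fromProperties(props).build();
      }
    }
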
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsZhiyanConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsZhiyanConfig.java
new file mode 100644
index 0000000000..d090b19b2f
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsZhiyanConfig.java
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.config.metrics;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.config.metrics.HoodieMetricsConfig.METRIC_PREFIX;
+
+@ConfigClassProperty(name = "Metrics Configurations for Zhiyan",
+    groupName = ConfigGroups.Names.METRICS,
+    description = "Enables reporting on Hudi metrics using Zhiyan. "
+      + " Hudi publishes metrics on every commit, clean, rollback etc.")
+public class HoodieMetricsZhiyanConfig extends HoodieConfig {
+
+  public static final String ZHIYAN_PREFIX = METRIC_PREFIX + ".zhiyan";
+
+  public static final ConfigProperty<Integer> API_TIMEOUT_IN_SECONDS = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".api.timeout.seconds")
+      .defaultValue(10)
+      .sinceVersion("0.10.0")
+      .withDocumentation("Zhiyan API timeout in seconds. Default to 10.");
+
+  public static final ConfigProperty<Integer> REPORT_PERIOD_SECONDS = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".report.period.seconds")
+      .defaultValue(10)
+      .sinceVersion("0.10.0")
+      .withDocumentation("Zhiyan Report period seconds. Default to 10.");
+
+  public static final ConfigProperty<String> REPORT_SERVICE_URL = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".report.service.url")
+      .defaultValue("http://zhiyan.monitor.access.inner.woa.com:8080")
+      .withDocumentation("Zhiyan Report service url.");
+
+  public static final ConfigProperty<String> REPORT_SERVICE_PATH = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".report.service.path")
+      .defaultValue("/access_v1.http_service/HttpCurveReportRpc")
+      .withDocumentation("Zhiyan Report service path.");
+
+  public static final ConfigProperty<String> ZHIYAN_JOB_NAME = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".job.name")
+      .defaultValue("")
+      .sinceVersion("0.10.0")
+      .withDocumentation("Name of Job using zhiyan metrics reporter.");
+
+  public static final ConfigProperty<Boolean> ZHIYAN_RANDOM_JOBNAME_SUFFIX = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".random.job.name.suffix")
+      .defaultValue(true)
+      .sinceVersion("0.10.0")
+      .withDocumentation("Whether the Zhiyan job name need a random suffix , default true.");
+
+  public static final ConfigProperty<String> ZHIYAN_METRICS_HOODIE_APPMASK = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".hoodie.appmask")
+      .defaultValue("1701_36311_HUDI")
+      .sinceVersion("0.10.0")
+      .withDocumentation("Zhiyan appmask for hudi.");
+
+  public static final ConfigProperty<String> ZHIYAN_METRICS_HOODIE_SECLVLENNAME = ConfigProperty
+      .key(ZHIYAN_PREFIX + ".hoodie.seclvlenname")
+      .defaultValue("hudi_metrics")
+      .sinceVersion("0.10.0")
+      .withDocumentation("Zhiyan seclvlenvname for hudi, default hudi_metrics");
+
+  public static Builder newBuilder() {
+    return new HoodieMetricsZhiyanConfig.Builder();
+  }
+
+  public static class Builder {
+
+    private final HoodieMetricsZhiyanConfig hoodieMetricsZhiyanConfig = new HoodieMetricsZhiyanConfig();
+
+    public HoodieMetricsZhiyanConfig.Builder fromFile(File propertiesFile) throws IOException {
+      try (FileReader reader = new FileReader(propertiesFile)) {
+        this.hoodieMetricsZhiyanConfig.getProps().load(reader);
+        return this;
+      }
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder fromProperties(Properties props) {
+      this.hoodieMetricsZhiyanConfig.getProps().putAll(props);
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder withAppMask(String appMask) {
+      hoodieMetricsZhiyanConfig.setValue(ZHIYAN_METRICS_HOODIE_APPMASK, appMask);
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder withSeclvlEnvName(String seclvlEnvName) {
+      hoodieMetricsZhiyanConfig.setValue(ZHIYAN_METRICS_HOODIE_SECLVLENNAME, seclvlEnvName);
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder withReportServiceUrl(String url) {
+      hoodieMetricsZhiyanConfig.setValue(REPORT_SERVICE_URL, url);
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder withApiTimeout(int apiTimeout) {
+      hoodieMetricsZhiyanConfig.setValue(API_TIMEOUT_IN_SECONDS, String.valueOf(apiTimeout));
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder withJobName(String jobName) {
+      hoodieMetricsZhiyanConfig.setValue(ZHIYAN_JOB_NAME, jobName);
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig.Builder withReportPeriodSeconds(int seconds) {
+      hoodieMetricsZhiyanConfig.setValue(REPORT_PERIOD_SECONDS, String.valueOf(seconds));
+      return this;
+    }
+
+    public HoodieMetricsZhiyanConfig build() {
+      hoodieMetricsZhiyanConfig.setDefaults(HoodieMetricsZhiyanConfig.class.getName());
+      return hoodieMetricsZhiyanConfig;
+    }
+  }
+
+}
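
A minimal sketch (not part of the patch) of assembling this config programmatically; the job name and periods are illustrative, while the app mask, seclvl env name and service URL are the defaults declared above.

    import org.apache.hudi.config.metrics.HoodieMetricsZhiyanConfig;

    public class ZhiyanConfigExample {
      static HoodieMetricsZhiyanConfig build() {
        return HoodieMetricsZhiyanConfig.newBuilder()
            .withJobName("hudi-ingest-job")    // illustrative job name
            .withReportPeriodSeconds(30)
            .withApiTimeout(15)
            .withAppMask("1701_36311_HUDI")    // default app mask
            .withSeclvlEnvName("hudi_metrics") // default metric group
            .withReportServiceUrl("http://zhiyan.monitor.access.inner.woa.com:8080")
            .build();
      }
    }
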
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
index 69ef7917b2..450f741586 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.metrics;
 
+import org.apache.hudi.async.AsyncPostEventService;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.util.Option;
@@ -26,9 +27,13 @@ import org.apache.hudi.config.HoodieWriteConfig;
 
 import com.codahale.metrics.Counter;
 import com.codahale.metrics.Timer;
+import org.apache.hudi.tdbank.TdbankHoodieMetricsEvent;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
+import java.util.Locale;
+import java.util.concurrent.LinkedBlockingQueue;
+
 /**
  * Wrapper for metrics-related operations.
  */
@@ -49,6 +54,8 @@ public class HoodieMetrics {
   private String conflictResolutionFailureCounterName = null;
   private HoodieWriteConfig config;
   private String tableName;
+  // Job id used to identify the job writing to this hoodie table.
+  private String hoodieJobId;
   private Timer rollbackTimer = null;
   private Timer cleanTimer = null;
   private Timer commitTimer = null;
@@ -61,10 +68,31 @@ public class HoodieMetrics {
   private Counter conflictResolutionSuccessCounter = null;
   private Counter conflictResolutionFailureCounter = null;
 
+  public static final String TOTAL_PARTITIONS_WRITTEN = "totalPartitionsWritten";
+  public static final String TOTAL_FILES_INSERT = "totalFilesInsert";
+  public static final String TOTAL_FILES_UPDATE = "totalFilesUpdate";
+  public static final String TOTAL_RECORDS_WRITTEN = "totalRecordsWritten";
+  public static final String TOTAL_UPDATE_RECORDS_WRITTEN = "totalUpdateRecordsWritten";
+  public static final String TOTAL_INSERT_RECORDS_WRITTEN = "totalInsertRecordsWritten";
+  public static final String TOTAL_BYTES_WRITTEN = "totalBytesWritten";
+  public static final String TOTAL_SCAN_TIME = "totalScanTime";
+  public static final String TOTAL_CREATE_TIME = "totalCreateTime";
+  public static final String TOTAL_UPSERT_TIME = "totalUpsertTime";
+  public static final String TOTAL_COMPACTED_RECORDS_UPDATED = "totalCompactedRecordsUpdated";
+  public static final String TOTAL_LOGFILES_COMPACTED = "totalLogFilesCompacted";
+  public static final String TOTAL_LOGFILES_SIZE = "totalLogFilesSize";
+
+  // A queue for buffering metrics events.
+  private final LinkedBlockingQueue<TdbankHoodieMetricsEvent> queue = new LinkedBlockingQueue<>();
+
   public HoodieMetrics(HoodieWriteConfig config) {
     this.config = config;
     this.tableName = config.getTableName();
+    this.hoodieJobId = config.getHoodieJobId();
     if (config.isMetricsOn()) {
+      // start post event service.
+      AsyncPostEventService postEventService = new AsyncPostEventService(config, queue);
+      postEventService.start(null);
       Metrics.init(config);
       this.rollbackTimerName = getMetricsName("timer", HoodieTimeline.ROLLBACK_ACTION);
       this.cleanTimerName = getMetricsName("timer", HoodieTimeline.CLEAN_ACTION);
@@ -165,6 +193,25 @@ public class HoodieMetrics {
     Metrics.registerGauge(getMetricsName(actionType, "totalCompactedRecordsUpdated"), 0);
     Metrics.registerGauge(getMetricsName(actionType, "totalLogFilesCompacted"), 0);
     Metrics.registerGauge(getMetricsName(actionType, "totalLogFilesSize"), 0);
+
+    TdbankHoodieMetricsEvent metricEvent = TdbankHoodieMetricsEvent.newBuilder()
+        .withDBName(config.getDatabaseName())
+        .withTableName(config.getTableName())
+        .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf(actionType.toUpperCase(Locale.ROOT)))
+        .addMetrics("totalPartitionsWritten", 0)
+        .addMetrics("totalFilesUpdate", 0)
+        .addMetrics("totalRecordsWritten", 0)
+        .addMetrics("totalUpdateRecordsWritten", 0)
+        .addMetrics("totalInsertRecordsWritten", 0)
+        .addMetrics("totalBytesWritten", 0)
+        .addMetrics("totalScanTime", 0)
+        .addMetrics("totalCreateTime", 0)
+        .addMetrics("totalUpsertTime", 0)
+        .addMetrics("totalCompactedRecordsUpdated", 0)
+        .addMetrics("totalLogFilesCompacted", 0)
+        .addMetrics("totalLogFilesSize", 0)
+        .build();
+    postEvent(metricEvent);
   }
 
   public void updateCommitMetrics(long commitEpochTimeInMs, long durationInMs, HoodieCommitMetadata metadata,
@@ -197,23 +244,55 @@ public class HoodieMetrics {
       Metrics.registerGauge(getMetricsName(actionType, "totalCompactedRecordsUpdated"), totalCompactedRecordsUpdated);
       Metrics.registerGauge(getMetricsName(actionType, "totalLogFilesCompacted"), totalLogFilesCompacted);
       Metrics.registerGauge(getMetricsName(actionType, "totalLogFilesSize"), totalLogFilesSize);
+
+      TdbankHoodieMetricsEvent metricEvent = TdbankHoodieMetricsEvent.newBuilder()
+          .withDBName(config.getDatabaseName())
+          .withTableName(config.getTableName())
+          .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf(actionType.toUpperCase(Locale.ROOT)))
+          .addMetrics("totalPartitionsWritten", totalPartitionsWritten)
+          .addMetrics("totalFilesUpdate", totalFilesUpdate)
+          .addMetrics("totalFilesInsert", totalFilesInsert)
+          .addMetrics("totalRecordsWritten", totalRecordsWritten)
+          .addMetrics("totalUpdateRecordsWritten", totalUpdateRecordsWritten)
+          .addMetrics("totalInsertRecordsWritten", totalInsertRecordsWritten)
+          .addMetrics("totalBytesWritten", totalBytesWritten)
+          .addMetrics("totalScanTime", totalTimeTakenByScanner)
+          .addMetrics("totalCreateTime", totalTimeTakenForInsert)
+          .addMetrics("totalUpsertTime", totalTimeTakenForUpsert)
+          .addMetrics("totalCompactedRecordsUpdated", totalCompactedRecordsUpdated)
+          .addMetrics("totalLogFilesCompacted", totalLogFilesCompacted)
+          .addMetrics("totalLogFilesSize", totalLogFilesSize)
+          .build();
+      postEvent(metricEvent);
     }
   }
 
   private void updateCommitTimingMetrics(long commitEpochTimeInMs, long durationInMs, HoodieCommitMetadata metadata,
       String actionType) {
     if (config.isMetricsOn()) {
+      TdbankHoodieMetricsEvent.Builder builder = TdbankHoodieMetricsEvent.newBuilder()
+          .withDBName(config.getDatabaseName())
+          .withTableName(config.getTableName())
+          .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf(actionType.toUpperCase(Locale.ROOT)));
       Pair<Option<Long>, Option<Long>> eventTimePairMinMax = metadata.getMinAndMaxEventTime();
       if (eventTimePairMinMax.getLeft().isPresent()) {
         long commitLatencyInMs = commitEpochTimeInMs + durationInMs - eventTimePairMinMax.getLeft().get();
         Metrics.registerGauge(getMetricsName(actionType, "commitLatencyInMs"), commitLatencyInMs);
+        builder = builder.addMetrics("commitLatencyInMs", commitLatencyInMs);
       }
       if (eventTimePairMinMax.getRight().isPresent()) {
         long commitFreshnessInMs = commitEpochTimeInMs + durationInMs - eventTimePairMinMax.getRight().get();
         Metrics.registerGauge(getMetricsName(actionType, "commitFreshnessInMs"), commitFreshnessInMs);
+        builder = builder.addMetrics("commitFreshnessInMs", commitFreshnessInMs);
       }
       Metrics.registerGauge(getMetricsName(actionType, "commitTime"), commitEpochTimeInMs);
       Metrics.registerGauge(getMetricsName(actionType, "duration"), durationInMs);
+
+      TdbankHoodieMetricsEvent event = builder
+          .addMetrics("commitTime", commitEpochTimeInMs)
+          .addMetrics("duration", durationInMs)
+          .build();
+      postEvent(event);
     }
   }
 
@@ -223,6 +302,14 @@ public class HoodieMetrics {
           String.format("Sending rollback metrics (duration=%d, numFilesDeleted=%d)", durationInMs, numFilesDeleted));
       Metrics.registerGauge(getMetricsName("rollback", "duration"), durationInMs);
       Metrics.registerGauge(getMetricsName("rollback", "numFilesDeleted"), numFilesDeleted);
+      TdbankHoodieMetricsEvent event = TdbankHoodieMetricsEvent.newBuilder()
+          .withDBName(config.getDatabaseName())
+          .withTableName(config.getTableName())
+          .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf("rollback".toUpperCase(Locale.ROOT)))
+          .addMetrics("duration", durationInMs)
+          .addMetrics("numFilesDeleted", numFilesDeleted)
+          .build();
+      postEvent(event);
     }
   }
 
@@ -232,6 +319,14 @@ public class HoodieMetrics {
           String.format("Sending clean metrics (duration=%d, numFilesDeleted=%d)", durationInMs, numFilesDeleted));
       Metrics.registerGauge(getMetricsName("clean", "duration"), durationInMs);
       Metrics.registerGauge(getMetricsName("clean", "numFilesDeleted"), numFilesDeleted);
+      TdbankHoodieMetricsEvent event = TdbankHoodieMetricsEvent.newBuilder()
+          .withDBName(config.getDatabaseName())
+          .withTableName(config.getTableName())
+          .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf("clean".toUpperCase(Locale.ROOT)))
+          .addMetrics("duration", durationInMs)
+          .addMetrics("numFilesDeleted", numFilesDeleted)
+          .build();
+      postEvent(event);
     }
   }
 
@@ -241,6 +336,14 @@ public class HoodieMetrics {
           numFilesFinalized));
       Metrics.registerGauge(getMetricsName("finalize", "duration"), durationInMs);
       Metrics.registerGauge(getMetricsName("finalize", "numFilesFinalized"), numFilesFinalized);
+      TdbankHoodieMetricsEvent event = TdbankHoodieMetricsEvent.newBuilder()
+          .withDBName(config.getDatabaseName())
+          .withTableName(config.getTableName())
+          .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf("finalize".toUpperCase(Locale.ROOT)))
+          .addMetrics("duration", durationInMs)
+          .addMetrics("numFilesFinalized", numFilesFinalized)
+          .build();
+      postEvent(event);
     }
   }
 
@@ -248,11 +351,21 @@ public class HoodieMetrics {
     if (config.isMetricsOn()) {
       LOG.info(String.format("Sending index metrics (%s.duration, %d)", action, durationInMs));
       Metrics.registerGauge(getMetricsName("index", String.format("%s.duration", action)), durationInMs);
+      TdbankHoodieMetricsEvent event = TdbankHoodieMetricsEvent.newBuilder()
+          .withDBName(config.getDatabaseName())
+          .withTableName(config.getTableName())
+          .withTableType(TdbankHoodieMetricsEvent.EventType.valueOf("index".toUpperCase(Locale.ROOT)))
+          .addMetrics(String.format("%s.duration", action), durationInMs)
+          .build();
+      postEvent(event);
     }
   }
 
   String getMetricsName(String action, String metric) {
-    return config == null ? null : String.format("%s.%s.%s", config.getMetricReporterMetricsNamePrefix(), action, metric);
+    // When using Zhiyan, skip the metrics name prefix since tags identify each metric.
+    return config == null ? null :
+      config.getMetricsReporterType() == MetricsReporterType.ZHIYAN ? String.format("%s.%s", action, metric) :
+        String.format("%s.%s.%s", config.getMetricReporterMetricsNamePrefix(), action, metric);
   }
 
   /**
@@ -284,4 +397,9 @@ public class HoodieMetrics {
     }
     return counter;
   }
+
+  private void postEvent(TdbankHoodieMetricsEvent event) {
+    LOG.info("Post metrics event to queue, queue size now is " + queue.size());
+    queue.add(event);
+  }
 }
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java
index 8f3e497481..10238a9c92 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java
@@ -120,6 +120,7 @@ public class Metrics {
 
   public static void registerGauge(String metricName, final long value) {
     try {
+      LOG.info("Register Metric Name: " + metricName);
       MetricRegistry registry = Metrics.getInstance().getRegistry();
       HoodieGauge guage = (HoodieGauge) registry.gauge(metricName, () -> new HoodieGauge<>(value));
       guage.setValue(value);
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterFactory.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterFactory.java
index d81e337b28..b67ab63f23 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterFactory.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterFactory.java
@@ -29,6 +29,7 @@ import org.apache.hudi.metrics.prometheus.PrometheusReporter;
 import org.apache.hudi.metrics.prometheus.PushGatewayMetricsReporter;
 
 import com.codahale.metrics.MetricRegistry;
+import org.apache.hudi.metrics.zhiyan.ZhiyanMetricsReporter;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
@@ -81,6 +82,9 @@ public class MetricsReporterFactory {
       case CLOUDWATCH:
         reporter = new CloudWatchMetricsReporter(config, registry);
         break;
+      case ZHIYAN:
+        reporter = new ZhiyanMetricsReporter(config, registry);
+        break;
       default:
         LOG.error("Reporter type[" + type + "] is not supported.");
         break;
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterType.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterType.java
index 3c86001592..29a8097a50 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterType.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsReporterType.java
@@ -22,5 +22,5 @@ package org.apache.hudi.metrics;
  * Types of the reporter supported, hudi also supports user defined reporter.
  */
 public enum MetricsReporterType {
-  GRAPHITE, INMEMORY, JMX, DATADOG, CONSOLE, PROMETHEUS_PUSHGATEWAY, PROMETHEUS, CLOUDWATCH
+  GRAPHITE, INMEMORY, JMX, DATADOG, CONSOLE, PROMETHEUS_PUSHGATEWAY, PROMETHEUS, CLOUDWATCH, ZHIYAN
 }
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanHttpClient.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanHttpClient.java
new file mode 100644
index 0000000000..b358ce182b
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanHttpClient.java
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics.zhiyan;
+
+import com.fasterxml.jackson.annotation.JsonAutoDetect;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.PropertyAccessor;
+import com.fasterxml.jackson.core.JsonParser;
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.http.Consts;
+import org.apache.http.HttpEntity;
+import org.apache.http.HttpException;
+import org.apache.http.HttpStatus;
+import org.apache.http.client.config.RequestConfig;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.client.methods.HttpEntityEnclosingRequestBase;
+import org.apache.http.client.methods.HttpPost;
+import org.apache.http.client.methods.HttpRequestBase;
+import org.apache.http.entity.ContentType;
+import org.apache.http.entity.StringEntity;
+import org.apache.http.impl.client.CloseableHttpClient;
+import org.apache.http.impl.client.HttpClientBuilder;
+import org.apache.http.protocol.BasicHttpContext;
+import org.apache.http.protocol.HttpContext;
+import org.apache.http.util.EntityUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+public class ZhiyanHttpClient {
+
+  private static final Logger LOG = LogManager.getLogger(ZhiyanHttpClient.class);
+  private final CloseableHttpClient httpClient;
+  private final ObjectMapper mapper;
+  private final String serviceUrl;
+  private final String requestPath;
+
+  private static final String JSON_CONTENT_TYPE = "application/json";
+  private static final String CONTENT_TYPE = "Content-Type";
+
+  public ZhiyanHttpClient(String url, String path, int timeoutSeconds) {
+    httpClient = HttpClientBuilder.create()
+      .setDefaultRequestConfig(RequestConfig.custom()
+        .setConnectTimeout(timeoutSeconds * 1000)
+        .setConnectionRequestTimeout(timeoutSeconds * 1000)
+        .setSocketTimeout(timeoutSeconds * 1000).build())
+      .build();
+
+    serviceUrl = url;
+    requestPath = path;
+
+    mapper = new ObjectMapper();
+    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+    mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
+    mapper.configure(JsonParser.Feature.IGNORE_UNDEFINED, true);
+    mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
+    mapper.setVisibility(PropertyAccessor.FIELD, JsonAutoDetect.Visibility.ANY);
+  }
+
+  public <T> String post(T input) throws Exception {
+    HttpPost postReq = new HttpPost(serviceUrl + requestPath);
+    postReq.setHeader(CONTENT_TYPE, JSON_CONTENT_TYPE);
+
+    try {
+      return requestWithEntity(postReq, input);
+    } catch (Exception e) {
+      LOG.warn(String.format("Failed to post to %s, cause by", serviceUrl + requestPath), e);
+      throw e;
+    } finally {
+      postReq.releaseConnection();
+    }
+  }
+
+  private <T> String requestWithEntity(HttpRequestBase request, T input) throws Exception {
+    if (input != null && request instanceof HttpEntityEnclosingRequestBase) {
+      HttpEntity entity = getEntity(input);
+      ((HttpEntityEnclosingRequestBase) request).setEntity(entity);
+    }
+
+    HttpContext httpContext = new BasicHttpContext();
+    try (CloseableHttpResponse response = httpClient.execute(request, httpContext)) {
+      int status = response.getStatusLine().getStatusCode();
+      if (status != HttpStatus.SC_OK && status != HttpStatus.SC_CREATED) {
+        throw new HttpException("Response code is " + status);
+      }
+      HttpEntity resultEntity = response.getEntity();
+      return EntityUtils.toString(resultEntity, Consts.UTF_8);
+    } catch (Exception ex) {
+      LOG.error("Error when request http.", ex);
+      throw ex;
+    }
+  }
+
+  private <T> HttpEntity getEntity(T input) throws JsonProcessingException {
+    HttpEntity entity;
+    if (input instanceof String) {
+      entity = new StringEntity((String) input, ContentType.APPLICATION_JSON);
+    } else if (input instanceof HttpEntity) {
+      return (HttpEntity) input;
+    } else {
+      try {
+        String json = mapper.writeValueAsString(input);
+        entity = new StringEntity(json, ContentType.APPLICATION_JSON);
+      } catch (JsonProcessingException e) {
+        LOG.error(String.format("Error when process %s due to ", input), e);
+        throw e;
+      }
+    }
+    return entity;
+  }
+
+}
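
A minimal sketch (not part of the patch) of the client in isolation; the URL and path are the defaults from HoodieMetricsZhiyanConfig, and the JSON body is a hand-written placeholder rather than a real Zhiyan payload.

    import org.apache.hudi.metrics.zhiyan.ZhiyanHttpClient;

    public class ZhiyanClientExample {
      public static void main(String[] args) throws Exception {
        ZhiyanHttpClient client = new ZhiyanHttpClient(
            "http://zhiyan.monitor.access.inner.woa.com:8080",
            "/access_v1.http_service/HttpCurveReportRpc",
            10 /* timeout in seconds */);
        // Strings are sent as-is with a JSON content type; other objects are serialized with Jackson.
        String response = client.post(
            "{\"app_mark\":\"1701_36311_HUDI\",\"sec_lvl_en_name\":\"hudi_metrics\",\"report_data\":\"[]\"}");
        System.out.println(response);
      }
    }
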
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java
new file mode 100644
index 0000000000..323fe17106
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics.zhiyan;
+
+import com.codahale.metrics.MetricFilter;
+import com.codahale.metrics.MetricRegistry;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metrics.MetricsReporter;
+
+import java.io.Closeable;
+import java.util.concurrent.TimeUnit;
+
+public class ZhiyanMetricsReporter extends MetricsReporter {
+
+  private final ZhiyanReporter reporter;
+  private final int reportPeriodSeconds;
+
+  public ZhiyanMetricsReporter(HoodieWriteConfig config, MetricRegistry registry) {
+    this.reportPeriodSeconds = config.getZhiyanReportPeriodSeconds();
+    ZhiyanHttpClient client = new ZhiyanHttpClient(
+        config.getZhiyanReportServiceURL(),
+        config.getZhiyanReportServicePath(),
+        config.getZhiyanApiTimeoutSeconds());
+    this.reporter = new ZhiyanReporter(registry, MetricFilter.ALL, client,
+        config.getZhiyanHoodieJobName(),
+        config.getTableName(),
+        config.getZhiyanAppMask(),
+        config.getZhiyanSeclvlEnvName());
+  }
+
+  @Override
+  public void start() {
+    reporter.start(reportPeriodSeconds, TimeUnit.SECONDS);
+  }
+
+  @Override
+  public void report() {
+    reporter.report();
+  }
+
+  @Override
+  public Closeable getReporter() {
+    return reporter;
+  }
+
+  @Override
+  public void stop() {
+    reporter.stop();
+  }
+}
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java
new file mode 100644
index 0000000000..4e5d416989
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics.zhiyan;
+
+import com.codahale.metrics.Counter;
+import com.codahale.metrics.Gauge;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Meter;
+import com.codahale.metrics.MetricFilter;
+import com.codahale.metrics.MetricRegistry;
+import com.codahale.metrics.ScheduledReporter;
+import com.codahale.metrics.Timer;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.core.JsonParser;
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.ObjectNode;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import java.util.SortedMap;
+import java.util.concurrent.TimeUnit;
+
+public class ZhiyanReporter extends ScheduledReporter {
+
+  private static final Logger LOG = LoggerFactory.getLogger(ZhiyanReporter.class);
+  private final ZhiyanHttpClient client;
+  private final String jobName;
+  private final String hoodieTableName;
+  private final String appMask;
+  private final String seclvlEnName;
+
+  public ZhiyanReporter(MetricRegistry registry,
+                        MetricFilter filter,
+                        ZhiyanHttpClient client,
+                        String jobName,
+                        String hoodieTableName,
+                        String appMask,
+                        String seclvlEnName) {
+    super(registry, "hudi-zhiyan-reporter", filter, TimeUnit.SECONDS, TimeUnit.SECONDS);
+    this.client = client;
+    this.jobName = jobName;
+    this.hoodieTableName = hoodieTableName;
+    this.appMask = appMask;
+    this.seclvlEnName = seclvlEnName;
+  }
+
+  @Override
+  public void report(SortedMap<String, Gauge> gauges,
+                     SortedMap<String, Counter> counters,
+                     SortedMap<String, Histogram> histograms,
+                     SortedMap<String, Meter> meters,
+                     SortedMap<String, Timer> timers) {
+    final PayloadBuilder builder = new PayloadBuilder()
+        .withAppMask(appMask)
+        .withJobName(jobName)
+        .withSeclvlEnName(seclvlEnName)
+        .withTableName(hoodieTableName);
+
+    long timestamp = System.currentTimeMillis();
+
+    gauges.forEach((metricName, gauge) -> {
+      builder.addGauge(metricName, timestamp, gauge.getValue().toString());
+    });
+
+    String payload = builder.build();
+
+    LOG.info("Payload is:" + payload);
+    try {
+      client.post(payload);
+    } catch (Exception e) {
+      LOG.error("Payload is " + payload);
+      LOG.error("Error when report data to zhiyan", e);
+    }
+  }
+
+  static class PayloadBuilder {
+
+    private static final ObjectMapper MAPPER = new ObjectMapper();
+
+    private final ObjectNode payload;
+
+    private final ArrayNode reportData;
+
+    private String appMark;
+    // Metric group name in Zhiyan.
+    private String seclvlEnName;
+
+    private String jobName;
+
+    private String tableName;
+
+    public PayloadBuilder() {
+      MAPPER.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+      MAPPER.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
+      MAPPER.configure(JsonParser.Feature.IGNORE_UNDEFINED, true);
+      MAPPER.setSerializationInclusion(JsonInclude.Include.NON_NULL);
+      this.payload = MAPPER.createObjectNode();
+      this.reportData = MAPPER.createArrayNode();
+    }
+
+    PayloadBuilder withAppMask(String appMark) {
+      this.appMark = appMark;
+      this.payload.put("app_mark", appMark);
+      return this;
+    }
+
+    PayloadBuilder withJobName(String jobName) {
+      this.jobName = jobName;
+      return this;
+    }
+
+    PayloadBuilder withTableName(String tableName) {
+      this.tableName = tableName;
+      return this;
+    }
+
+    PayloadBuilder withSeclvlEnName(String seclvlEnName) {
+      this.seclvlEnName = seclvlEnName;
+      this.payload.put("sec_lvl_en_name", seclvlEnName);
+      return this;
+    }
+
+    PayloadBuilder addGauge(String metric, long timestamp, String gaugeValue) {
+      ObjectNode tmpData = MAPPER.createObjectNode();
+      tmpData.put("metric", metric);
+      tmpData.put("value", Long.parseLong(gaugeValue));
+      // Tags represent dimensions in Zhiyan.
+      ObjectNode tags = tmpData.objectNode();
+      tags.put("jobName", jobName);
+      tags.put("tableName", tableName);
+      tmpData.set("tags", tags);
+      this.reportData.add(tmpData);
+      return this;
+    }
+
+    PayloadBuilder addHistogram() {
+      return this;
+    }
+
+    PayloadBuilder addCounter() {
+      return this;
+    }
+
+    PayloadBuilder addMeters() {
+      return this;
+    }
+
+    String build() {
+      payload.put("report_data", reportData.toString());
+      return payload.toString();
+    }
+  }
+}
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TDBankClient.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TDBankClient.java
new file mode 100644
index 0000000000..85f7a9b0b9
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TDBankClient.java
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.tdbank;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.tencent.tdbank.busapi.BusClientConfig;
+import com.tencent.tdbank.busapi.DefaultMessageSender;
+import com.tencent.tdbank.busapi.MessageSender;
+import com.tencent.tdbank.busapi.SendResult;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.net.InetAddress;
+import java.util.UUID;
+import java.util.concurrent.TimeUnit;
+
+public class TDBankClient implements Closeable {
+  private static final Logger LOG = LoggerFactory.getLogger(TDBankClient.class);
+  private static final Long TDBANK_SENDER_TIMEOUT_MS =
+      Long.parseLong(System.getProperty("tdbank.sender.timeout-ms", "20000"));
+  private static final ObjectMapper MAPPER = new ObjectMapper();
+  private static final String HUDI_EVENT_TID = "hudi_metric";
+
+  private final String bid;
+  private MessageSender sender;
+  private String tdmAddr;
+  private int tdmPort;
+  private volatile boolean hasInit = false;
+
+  private static final int RETRY_TIMES = 3;
+
+  public TDBankClient(String tdmAddr, int tdmPort, String bid) {
+    this.bid = bid;
+    this.tdmAddr = tdmAddr;
+    this.tdmPort = tdmPort;
+  }
+
+  /**
+   * Send a message to tdbank and return the send result.
+   */
+  public SendResult sendMessage(Object message) throws Exception {
+    init();
+    LOG.info("Send message to tdbank, bid: {}, tid: {}", bid, HUDI_EVENT_TID);
+    int retryTimes = 0;
+    while (retryTimes < RETRY_TIMES) {
+      try {
+        return sender.sendMessage(MAPPER.writeValueAsBytes(message),
+          bid, HUDI_EVENT_TID, 0, UUID.randomUUID().toString(), TDBANK_SENDER_TIMEOUT_MS, TimeUnit.MILLISECONDS);
+      } catch (Exception e) {
+        retryTimes++;
+        LOG.error("Error when send data to tdbank retry " + retryTimes, e);
+      }
+    }
+    return SendResult.UNKOWN_ERROR;
+  }
+
+  @Override
+  public void close() throws IOException {
+    sender.close();
+  }
+
+  private void init() throws Exception {
+    if (!hasInit) {
+      synchronized (this) {
+        if (!hasInit) {
+          try {
+            LOG.info("Init tdbank-client with tdmAddress: {}, tdmPort: {}, bid: {}", tdmAddr, tdmPort, bid);
+            String localhost = InetAddress.getLocalHost().getHostAddress();
+            BusClientConfig clientConfig =
+                new BusClientConfig(localhost, true, tdmAddr, tdmPort, bid, "all");
+            LOG.info("Before sender generated.");
+            sender = new DefaultMessageSender(clientConfig);
+            LOG.info("Successfully init sender.");
+          } catch (Exception e) {
+            LOG.warn("Failed to initialize tdbank client, using mock client instead. "
+                + "Warn: using mock client will ignore all the incoming events", e);
+            throw e;
+          }
+          hasInit = true;
+        }
+      }
+    }
+  }
+
+}
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankConfig.java
new file mode 100644
index 0000000000..60a5e06a45
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankConfig.java
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.tdbank;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+import java.util.Properties;
+
+@Immutable
+@ConfigClassProperty(name = "Tdbank Configs",
+    groupName = ConfigGroups.Names.WRITE_CLIENT,
+    description = "Tdbank configs")
+public class TdbankConfig extends HoodieConfig {
+  public static final ConfigProperty<String> TDBANK_TDM_ADDR = ConfigProperty
+      .key("hoodie.tdbank.tdm.addr")
+      .defaultValue("tl-tdbank-tdmanager.tencent-distribute.com")
+      .withDocumentation("tdbank manager address.");
+
+  public static final ConfigProperty<Integer> TDBANK_TDM_PORT = ConfigProperty
+      .key("hoodie.tdbank.tdm.port")
+      .defaultValue(8099)
+      .withDocumentation("tdbank manager port.");
+
+  public static final ConfigProperty<String> TDBANK_BID = ConfigProperty
+      .key("hoodie.tdbank.tdbank.bid")
+      .defaultValue("b_teg_iceberg_event_tdbank_mq")
+      .withDocumentation("tdbank bid, use iceberg's bid temporarily.");
+
+  public static Builder newBuilder() {
+    return new Builder();
+  }
+
+  public static class Builder {
+    private final TdbankConfig hoodieTdbankConfig = new TdbankConfig();
+
+    public Builder withTDMAddr(String tdmAddr) {
+      hoodieTdbankConfig.setValue(TDBANK_TDM_ADDR, tdmAddr);
+      return this;
+    }
+
+    public Builder fromProperties(Properties props) {
+      hoodieTdbankConfig.setAll(props);
+      return this;
+    }
+
+    public Builder withTDMPort(int tdmPort) {
+      hoodieTdbankConfig.setValue(TDBANK_TDM_PORT, String.valueOf(tdmPort));
+      return this;
+    }
+
+    public Builder withBID(String bid) {
+      hoodieTdbankConfig.setValue(TDBANK_BID, bid);
+      return this;
+    }
+
+    public TdbankConfig build() {
+      hoodieTdbankConfig.setDefaults(TdbankConfig.class.getName());
+      return hoodieTdbankConfig;
+    }
+  }
+
+}
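
A minimal sketch (not part of the patch) showing the builder with the defaults declared above spelled out explicitly; in practice these values come from the hoodie.tdbank.* write config properties.

    import org.apache.hudi.tdbank.TdbankConfig;

    public class TdbankConfigExample {
      static TdbankConfig build() {
        return TdbankConfig.newBuilder()
            .withTDMAddr("tl-tdbank-tdmanager.tencent-distribute.com") // default TDM address
            .withTDMPort(8099)                                         // default TDM port
            .withBID("b_teg_iceberg_event_tdbank_mq")                  // default (borrowed iceberg) bid
            .build();
      }
    }
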
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankHoodieMetricsEvent.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankHoodieMetricsEvent.java
new file mode 100644
index 0000000000..0be78386a1
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/tdbank/TdbankHoodieMetricsEvent.java
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.tdbank;
+
+import java.io.Serializable;
+import java.util.Map;
+import java.util.TreeMap;
+
+public class TdbankHoodieMetricsEvent implements Serializable {
+  private String dbName;
+  private String tableName;
+  private EventType type;
+  private Map<String, Object> metrics;
+
+  private TdbankHoodieMetricsEvent() {
+    this.metrics = new TreeMap<>();
+  }
+
+  public enum EventType {
+    INDEX, CLEAN, FINALIZE, ROLLBACK, COMPACTION, COMMIT, DELTACOMMIT, REPLACECOMMIT
+  }
+
+  public static TdbankHoodieMetricsEvent.Builder newBuilder() {
+    return new Builder();
+  }
+
+  public static class Builder {
+
+    private final TdbankHoodieMetricsEvent hoodieMetricsEvent = new TdbankHoodieMetricsEvent();
+
+    public Builder() {
+    }
+
+    public Builder withDBName(String dbName) {
+      hoodieMetricsEvent.setDbName(dbName);
+      return this;
+    }
+
+    public Builder withTableName(String tableName) {
+      hoodieMetricsEvent.setTableName(tableName);
+      return this;
+    }
+
+    public Builder withTableType(EventType type) {
+      hoodieMetricsEvent.setType(type);
+      return this;
+    }
+
+    public Builder addMetrics(String key, Object value) {
+      hoodieMetricsEvent.addMetrics(key, value);
+      return this;
+    }
+
+    public TdbankHoodieMetricsEvent build() {
+      return hoodieMetricsEvent;
+    }
+  }
+
+  public void setTableName(String tableName) {
+    this.tableName = tableName;
+  }
+
+  public void setType(EventType type) {
+    this.type = type;
+  }
+
+  public void addMetrics(String key, Object value) {
+    this.metrics.put(key, value);
+  }
+
+  public String getTableName() {
+    return tableName;
+  }
+
+  public EventType getType() {
+    return type;
+  }
+
+  public Map<String, Object> getMetrics() {
+    return metrics;
+  }
+
+  public Object getMetric(String key) {
+    return metrics.get(key);
+  }
+
+  public String getDbName() {
+    return dbName;
+  }
+
+  public void setDbName(String dbName) {
+    this.dbName = dbName;
+  }
+}
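
As a usage sketch, the event builder above can be chained as follows; the metric keys and values here are illustrative only, not names used by the reporter:

    import org.apache.hudi.tdbank.TdbankHoodieMetricsEvent;
    import org.apache.hudi.tdbank.TdbankHoodieMetricsEvent.EventType;

    public class MetricsEventExample {
      public static void main(String[] args) {
        // "totalRecordsWritten" and "commitDuration" are made-up metric names.
        TdbankHoodieMetricsEvent event = TdbankHoodieMetricsEvent.newBuilder()
            .withDBName("testdb")
            .withTableName("test")
            .withTableType(EventType.COMMIT)
            .addMetrics("totalRecordsWritten", 1000L)
            .addMetrics("commitDuration", 42L)
            .build();

        System.out.println(event.getDbName() + "." + event.getTableName() + " -> " + event.getMetrics());
      }
    }
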
diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java
index 53a5799508..34198d456c 100644
--- a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java
@@ -122,6 +122,11 @@ public class HoodieFlinkWriteClient<T extends HoodieRecordPayload> extends
   @Override
   public boolean commit(String instantTime, List<WriteStatus> writeStatuses, Option<Map<String, String>> extraMetadata, String commitActionType, Map<String, List<String>> partitionToReplacedFileIds) {
     List<HoodieWriteStat> writeStats = writeStatuses.parallelStream().map(WriteStatus::getStat).collect(Collectors.toList());
+    if (commitActionType.equals(HoodieTimeline.COMMIT_ACTION)) {
+      writeTimer = metrics.getCommitCtx();
+    } else {
+      writeTimer = metrics.getDeltaCommitCtx();
+    }
     return commitStats(instantTime, writeStats, extraMetadata, commitActionType, partitionToReplacedFileIds);
   }
 
@@ -436,6 +441,7 @@ public class HoodieFlinkWriteClient<T extends HoodieRecordPayload> extends
     // only used for metadata table, the compaction happens in single thread
     HoodieWriteMetadata<List<WriteStatus>> compactionMetadata = getHoodieTable().compact(context, compactionInstantTime);
     commitCompaction(compactionInstantTime, compactionMetadata.getCommitMetadata().get(), Option.empty());
+    compactionTimer = metrics.getCompactionCtx();
     return compactionMetadata;
   }
 
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java b/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
index 89d01b53a6..85f970e7ec 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
@@ -90,8 +90,10 @@ public class HoodieTableConfig extends HoodieConfig {
   public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
       .key("hoodie.database.name")
       .noDefaultValue()
-      .withDocumentation("Database name that will be used for incremental query.If different databases have the same table name during incremental query, "
-          + "we can set it to limit the table name under a specific database");
+      .withDocumentation("Database name to identify a table, currently will be used for "
+          + "1. incremental query.If different databases have the same table name during incremental query "
+          + "we can set it to limit the table name under a specific database"
+          + "2. identify a table");
 
   public static final ConfigProperty<String> NAME = ConfigProperty
       .key(HOODIE_TABLE_NAME_KEY)
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index a9e10d3e55..4a298839fb 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -82,6 +82,11 @@ public class FlinkOptions extends HoodieConfig {
   // ------------------------------------------------------------------------
   //  Common Options
   // ------------------------------------------------------------------------
+  public static final ConfigOption<String> DATABASE_NAME = ConfigOptions
+      .key(HoodieWriteConfig.DATABASE_NAME.key())
+      .stringType()
+      .noDefaultValue()
+      .withDescription("Database name to identify tables");
 
   public static final ConfigOption<String> TABLE_NAME = ConfigOptions
       .key(HoodieWriteConfig.TBL_NAME.key())
@@ -411,7 +416,7 @@ public class FlinkOptions extends HoodieConfig {
       .key("write.bucket_assign.tasks")
       .intType()
       .noDefaultValue()
-      .withDescription("Parallelism of tasks that do bucket assign, default same as the write task parallelism");
+      .withDescription("Parallelism of tasks that do bucket assign, default is the parallelism of the execution environment");
 
   public static final ConfigOption<Integer> WRITE_TASKS = ConfigOptions
       .key("write.tasks")
@@ -522,8 +527,8 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<Integer> COMPACTION_TASKS = ConfigOptions
       .key("compaction.tasks")
       .intType()
-      .noDefaultValue()
-      .withDescription("Parallelism of tasks that do actual compaction, default same as the write task parallelism");
+      .defaultValue(4) // default WRITE_TASKS * COMPACTION_DELTA_COMMITS * 0.2 (assumes 5 commits generate one bucket)
+      .withDescription("Parallelism of tasks that do actual compaction, default is 4");
 
   public static final String NUM_COMMITS = "num_commits";
   public static final String TIME_ELAPSED = "time_elapsed";
@@ -580,7 +585,7 @@ public class FlinkOptions extends HoodieConfig {
       .stringType()
       .defaultValue(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name())
       .withDescription("Clean policy to manage the Hudi table. Available option: KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS."
-          + "Default is KEEP_LATEST_COMMITS.");
+          +  "Default is KEEP_LATEST_COMMITS.");
 
   public static final ConfigOption<Integer> CLEAN_RETAIN_COMMITS = ConfigOptions
       .key("clean.retain_commits")
@@ -589,14 +594,6 @@ public class FlinkOptions extends HoodieConfig {
       .withDescription("Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled).\n"
           + "This also directly translates into how much you can incrementally pull on this table, default 30");
 
-  public static final ConfigOption<Integer> CLEAN_RETAIN_HOURS = ConfigOptions
-      .key("clean.retain_hours")
-      .intType()
-      .defaultValue(24)// default 24 hours
-      .withDescription("Number of hours for which commits need to be retained. This config provides a more flexible option as"
-          + "compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group,"
-          + " corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.");
-
   public static final ConfigOption<Integer> CLEAN_RETAIN_FILE_VERSIONS = ConfigOptions
       .key("clean.retain_file_versions")
       .intType()
@@ -660,7 +657,7 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> CLUSTERING_PLAN_PARTITION_FILTER_MODE_NAME = ConfigOptions
       .key("clustering.plan.partition.filter.mode")
       .stringType()
-      .defaultValue(ClusteringPlanPartitionFilterMode.NONE.name())
+      .defaultValue("NONE")
       .withDescription("Partition filter mode used in the creation of clustering plan. Available values are - "
           + "NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate."
           + "RECENT_DAYS: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"
@@ -668,16 +665,16 @@ public class FlinkOptions extends HoodieConfig {
           + "SELECTED_PARTITIONS: keep partitions that are in the specified range ['" + PARTITION_FILTER_BEGIN_PARTITION.key() + "', '"
           + PARTITION_FILTER_END_PARTITION.key() + "'].");
 
-  public static final ConfigOption<Long> CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES = ConfigOptions
+  public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES = ConfigOptions
       .key("clustering.plan.strategy.target.file.max.bytes")
-      .longType()
-      .defaultValue(1024 * 1024 * 1024L) // default 1 GB
+      .intType()
+      .defaultValue(1024 * 1024 * 1024) // default 1 GB
       .withDescription("Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GB");
 
-  public static final ConfigOption<Long> CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT = ConfigOptions
+  public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT = ConfigOptions
       .key("clustering.plan.strategy.small.file.limit")
-      .longType()
-      .defaultValue(600L) // default 600 MB
+      .intType()
+      .defaultValue(600) // default 600 MB
       .withDescription("Files smaller than the size specified here are candidates for clustering, default 600 MB");
 
   public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST = ConfigOptions
@@ -701,7 +698,6 @@ public class FlinkOptions extends HoodieConfig {
   // ------------------------------------------------------------------------
   //  Hive Sync Options
   // ------------------------------------------------------------------------
-
   public static final ConfigOption<Boolean> HIVE_SYNC_ENABLED = ConfigOptions
       .key("hive_sync.enable")
       .booleanType()
@@ -729,8 +725,8 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> HIVE_SYNC_MODE = ConfigOptions
       .key("hive_sync.mode")
       .stringType()
-      .defaultValue(HiveSyncMode.HMS.name())
-      .withDescription("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'hms'");
+      .defaultValue("jdbc")
+      .withDescription("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'jdbc'");
 
   public static final ConfigOption<String> HIVE_SYNC_USERNAME = ConfigOptions
       .key("hive_sync.username")
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
index f022b04ea1..b2f72aed7d 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
@@ -74,6 +74,9 @@ public class FlinkStreamerConfig extends Configuration {
       required = true)
   public String targetBasePath;
 
+  @Parameter(names = {"--target-db"}, description = "Name of target database")
+  public String targetDatabaseName = "";
+
   @Parameter(names = {"--target-table"}, description = "Name of the target table in Hive.", required = true)
   public String targetTableName;
 
@@ -351,6 +354,7 @@ public class FlinkStreamerConfig extends Configuration {
 
     conf.setString(FlinkOptions.PATH, config.targetBasePath);
     conf.setString(FlinkOptions.TABLE_NAME, config.targetTableName);
+    conf.setString(FlinkOptions.DATABASE_NAME, config.targetDatabaseName);
     // copy_on_write works same as COPY_ON_WRITE
     conf.setString(FlinkOptions.TABLE_TYPE, config.tableType.toUpperCase());
     conf.setBoolean(FlinkOptions.INSERT_CLUSTER, config.insertCluster);
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java
index b153b2273c..b08eb570ce 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java
@@ -107,6 +107,8 @@ public class HoodieFlinkStreamer {
       Pipelines.clean(conf, pipeline);
     }
 
-    env.execute(cfg.targetTableName);
+    String jobName = cfg.targetDatabaseName.isEmpty() ? cfg.targetTableName :
+        cfg.targetDatabaseName + "." + cfg.targetTableName;
+    env.execute(jobName);
   }
 }
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
index 1cf66ea343..1718175240 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
@@ -169,6 +169,8 @@ public class HoodieTableFactory implements DynamicTableSourceFactory, DynamicTab
       ObjectIdentifier tablePath,
       CatalogTable table,
       ResolvedSchema schema) {
+    // database name
+    conf.setString(FlinkOptions.DATABASE_NAME.key(), tablePath.getDatabaseName());
     // table name
     conf.setString(FlinkOptions.TABLE_NAME.key(), tablePath.getObjectName());
     // hoodie key about options
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
index ee807f49da..0df3a0c8da 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
@@ -164,8 +164,13 @@ public class DataSourceUtils {
     });
   }
 
+  public static HoodieWriteConfig createHoodieConfig(String schemaStr, String basePath, String tblName,
+                                                     Map<String, String> parameters) {
+    return createHoodieConfig(schemaStr, basePath, "default_db", tblName, parameters);
+  }
+
   public static HoodieWriteConfig createHoodieConfig(String schemaStr, String basePath,
-      String tblName, Map<String, String> parameters) {
+      String dbName, String tblName, Map<String, String> parameters) {
     boolean asyncCompact = Boolean.parseBoolean(parameters.get(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE().key()));
     boolean inlineCompact = !asyncCompact && parameters.get(DataSourceWriteOptions.TABLE_TYPE().key())
         .equals(DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL());
@@ -178,6 +183,7 @@ public class DataSourceUtils {
     }
 
     return builder.forTable(tblName)
+        .withDatabaseName(dbName)
         .withCompactionConfig(HoodieCompactionConfig.newBuilder()
             .withInlineCompaction(inlineCompact).build())
         .withPayloadConfig(HoodiePayloadConfig.newBuilder()
@@ -189,8 +195,8 @@ public class DataSourceUtils {
   }
 
   public static SparkRDDWriteClient createHoodieClient(JavaSparkContext jssc, String schemaStr, String basePath,
-                                                       String tblName, Map<String, String> parameters) {
-    return new SparkRDDWriteClient<>(new HoodieSparkEngineContext(jssc), createHoodieConfig(schemaStr, basePath, tblName, parameters));
+                                                       String dbName, String tblName, Map<String, String> parameters) {
+    return new SparkRDDWriteClient<>(new HoodieSparkEngineContext(jssc), createHoodieConfig(schemaStr, basePath, dbName, tblName, parameters));
   }
 
   public static HoodieWriteResult doWriteOperation(SparkRDDWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
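
The extra dbName argument ultimately flows into the write config builder via withDatabaseName; a minimal sketch, assuming the builder methods referenced in this patch series and placeholder paths/names:

    import org.apache.hudi.config.HoodieWriteConfig;

    public class WriteConfigExample {
      public static void main(String[] args) {
        // Equivalent, in miniature, to what createHoodieConfig(schema, path, dbName, tblName, params) sets up.
        HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
            .withPath("/tmp/hudi/test")   // placeholder base path
            .forTable("test")
            .withDatabaseName("testdb")   // builder method used in the hunk above
            .build();
        System.out.println(config.getTableName());
      }
    }
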
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
index 0d3edd592d..3b1caddb59 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
@@ -47,7 +47,7 @@ object HoodieCLIUtils {
 
     val jsc = new JavaSparkContext(sparkSession.sparkContext)
     DataSourceUtils.createHoodieClient(jsc, schemaStr, basePath,
-      metaClient.getTableConfig.getTableName, finalParameters.asJava)
+      metaClient.getTableConfig.getDatabaseName, metaClient.getTableConfig.getTableName, finalParameters.asJava)
   }
 
   def extractPartitions(clusteringGroups: Seq[HoodieClusteringGroup]): String = {
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index b9ff4c0d1a..61cb7ef961 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -176,6 +176,7 @@ object HoodieSparkSqlWriter {
       // scalastyle:off
       if (hoodieConfig.getBoolean(ENABLE_ROW_WRITER) &&
         operation == WriteOperationType.BULK_INSERT) {
+        parameters.put(HoodieWriteConfig.DATABASE_NAME.key(), databaseName)
         val (success, commitTime: common.util.Option[String]) = bulkInsertAsRow(sqlContext, parameters, df, tblName,
           basePath, path, instantTime, partitionColumns)
         return (success, commitTime, common.util.Option.empty(), common.util.Option.empty(), hoodieWriteClient.orNull, tableConfig)
@@ -197,7 +198,7 @@ object HoodieSparkSqlWriter {
             // Create a HoodieWriteClient & issue the delete.
             val internalSchemaOpt = getLatestTableInternalSchema(fs, basePath, sparkContext)
             val client = hoodieWriteClient.getOrElse(DataSourceUtils.createHoodieClient(jsc,
-              null, path, tblName,
+              null, path, databaseName, tblName,
               mapAsJavaMap(addSchemaEvolutionParameters(parameters, internalSchemaOpt) - HoodieWriteConfig.AUTO_COMMIT_ENABLE.key)))
               .asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]]
 
@@ -228,7 +229,7 @@ object HoodieSparkSqlWriter {
             }
             // Create a HoodieWriteClient & issue the delete.
             val client = hoodieWriteClient.getOrElse(DataSourceUtils.createHoodieClient(jsc,
-              null, path, tblName,
+              null, path, databaseName, tblName,
               mapAsJavaMap(parameters - HoodieWriteConfig.AUTO_COMMIT_ENABLE.key)))
               .asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]]
             // Issue delete partitions
@@ -310,7 +311,7 @@ object HoodieSparkSqlWriter {
             // Create a HoodieWriteClient & issue the write.
 
             val client = hoodieWriteClient.getOrElse(DataSourceUtils.createHoodieClient(jsc, writerDataSchema.toString, path,
-              tblName, mapAsJavaMap(addSchemaEvolutionParameters(parameters, internalSchemaOpt) - HoodieWriteConfig.AUTO_COMMIT_ENABLE.key)
+              databaseName, tblName, mapAsJavaMap(addSchemaEvolutionParameters(parameters, internalSchemaOpt) - HoodieWriteConfig.AUTO_COMMIT_ENABLE.key)
             )).asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]]
 
             if (isAsyncCompactionEnabled(client, tableConfig, parameters, jsc.hadoopConfiguration())) {
@@ -441,6 +442,7 @@ object HoodieSparkSqlWriter {
 
     val (parameters, hoodieConfig) = mergeParamsAndGetHoodieConfig(optParams, tableConfig, mode)
     val tableName = hoodieConfig.getStringOrThrow(HoodieWriteConfig.TBL_NAME, s"'${HoodieWriteConfig.TBL_NAME.key}' must be set.")
+    val databaseName = hoodieConfig.getStringOrDefault(HoodieWriteConfig.DATABASE_NAME, "default")
     val tableType = hoodieConfig.getStringOrDefault(TABLE_TYPE)
     val bootstrapBasePath = hoodieConfig.getStringOrThrow(BASE_PATH,
       s"'${BASE_PATH.key}' is required for '${BOOTSTRAP_OPERATION_OPT_VAL}'" +
@@ -486,6 +488,7 @@ object HoodieSparkSqlWriter {
         HoodieTableMetaClient.withPropertyBuilder()
           .setTableType(HoodieTableType.valueOf(tableType))
           .setTableName(tableName)
+          .setDatabaseName(databaseName)
           .setRecordKeyFields(recordKeyFields)
           .setArchiveLogFolder(archiveLogFolder)
           .setPayloadClassName(hoodieConfig.getStringOrDefault(PAYLOAD_CLASS_NAME))
@@ -506,7 +509,7 @@ object HoodieSparkSqlWriter {
 
       val jsc = new JavaSparkContext(sqlContext.sparkContext)
       val writeClient = hoodieWriteClient.getOrElse(DataSourceUtils.createHoodieClient(jsc,
-        schema, path, tableName, mapAsJavaMap(parameters)))
+        schema, path, databaseName, tableName, mapAsJavaMap(parameters)))
       try {
         writeClient.bootstrap(org.apache.hudi.common.util.Option.empty())
       } finally {
@@ -555,6 +558,7 @@ object HoodieSparkSqlWriter {
     }
     val params: mutable.Map[String, String] = collection.mutable.Map(parameters.toSeq: _*)
     params(HoodieWriteConfig.AVRO_SCHEMA_STRING.key) = schema.toString
+    val dbName = parameters.getOrElse(HoodieWriteConfig.DATABASE_NAME.key(), "default")
     val writeConfig = DataSourceUtils.createHoodieConfig(schema.toString, path, tblName, mapAsJavaMap(params))
     val bulkInsertPartitionerRows: BulkInsertPartitioner[Dataset[Row]] = if (populateMetaFields) {
       val userDefinedBulkInsertPartitionerOpt = DataSourceUtils.createUserDefinedBulkInsertPartitionerWithRows(writeConfig)
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddColumnsCommand.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddColumnsCommand.scala
index 1d65670f6d..69e120c2e3 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddColumnsCommand.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddColumnsCommand.scala
@@ -106,6 +106,7 @@ object AlterHoodieTableAddColumnsCommand {
       jsc,
       schema.toString,
       hoodieCatalogTable.tableLocation,
+      hoodieCatalogTable.table.identifier.database.getOrElse("default"),
       hoodieCatalogTable.tableName,
       HoodieWriterUtils.parametersWithWriteDefaults(hoodieCatalogTable.catalogProperties).asJava
     )
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
index f0394ad379..b098ff3ea4 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
@@ -21,9 +21,9 @@ import org.apache.avro.Schema
 import org.apache.hudi.DataSourceWriteOptions._
 import org.apache.hudi.common.util.StringUtils
 import org.apache.hudi.config.HoodieWriteConfig
-import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME
 import org.apache.hudi.hive.HiveSyncConfigHolder
 import org.apache.hudi.sync.common.HoodieSyncConfig
+import org.apache.hudi.config.HoodieWriteConfig.{DATABASE_NAME, TBL_NAME}
 import org.apache.hudi.{AvroConversionUtils, DataSourceWriteOptions, HoodieSparkSqlWriter, SparkAdapterSupport}
 import org.apache.spark.sql.HoodieCatalystExpressionUtils.MatchCast
 import org.apache.spark.sql._
@@ -530,6 +530,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
         RECORDKEY_FIELD.key -> tableConfig.getRecordKeyFieldProp,
         PRECOMBINE_FIELD.key -> preCombineField,
         TBL_NAME.key -> hoodieCatalogTable.tableName,
+        DATABASE_NAME.key -> targetTableDb,
         PARTITIONPATH_FIELD.key -> tableConfig.getPartitionFieldProp,
         HIVE_STYLE_PARTITIONING.key -> tableConfig.getHiveStylePartitioningEnable,
         URL_ENCODE_PARTITIONING.key -> tableConfig.getUrlEncodePartitioning,
diff --git a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/TestDataSourceUtils.java b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/TestDataSourceUtils.java
index 11f0fc9785..4d0d5aeef2 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/TestDataSourceUtils.java
+++ b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/TestDataSourceUtils.java
@@ -237,7 +237,7 @@ public class TestDataSourceUtils {
               DataSourceWriteOptions.PAYLOAD_CLASS_NAME().defaultValue());
       params.put(pair.left, pair.right.toString());
       HoodieWriteConfig hoodieConfig = DataSourceUtils
-              .createHoodieConfig(avroSchemaString, config.getBasePath(), "test", params);
+              .createHoodieConfig(avroSchemaString, config.getBasePath(), "testdb", "test", params);
       assertEquals(pair.right, hoodieConfig.isAsyncClusteringEnabled());
 
       TypedProperties prop = new TypedProperties();
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
index 4e4fe43ff9..114c3d0f37 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
@@ -62,6 +62,7 @@ class TestHoodieSparkSqlWriter {
   var tempPath: java.nio.file.Path = _
   var tempBootStrapPath: java.nio.file.Path = _
   var hoodieFooTableName = "hoodie_foo_tbl"
+  val hoodieDefaultDBName = "default_db"
   var tempBasePath: String = _
   var commonTableModifier: Map[String, String] = Map()
   case class StringLongTest(uuid: String, ts: Long)
@@ -490,6 +491,7 @@ class TestHoodieSparkSqlWriter {
   @MethodSource(Array("testDatasourceInsert"))
   def testDatasourceInsertForTableTypeBaseFileMetaFields(tableType: String, populateMetaFields: Boolean, baseFileFormat: String): Unit = {
     val hoodieFooTableName = "hoodie_foo_tbl"
+    val hoodieDefaultDBName = "default_db"
     val fooTableModifier = Map("path" -> tempBasePath,
       HoodieWriteConfig.TBL_NAME.key -> hoodieFooTableName,
       HoodieWriteConfig.BASE_FILE_FORMAT.key -> baseFileFormat,
@@ -510,7 +512,7 @@ class TestHoodieSparkSqlWriter {
     val df = spark.createDataFrame(sc.parallelize(recordsSeq), structType)
     initializeMetaClientForBootstrap(fooTableParams, tableType, addBootstrapPath = false, initBasePath = true)
     val client = spy(DataSourceUtils.createHoodieClient(
-      new JavaSparkContext(sc), modifiedSchema.toString, tempBasePath, hoodieFooTableName,
+      new JavaSparkContext(sc), modifiedSchema.toString, tempBasePath, hoodieDefaultDBName, hoodieFooTableName,
       mapAsJavaMap(fooTableParams)).asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]])
 
     HoodieSparkSqlWriter.write(sqlContext, SaveMode.Append, fooTableModifier, df, Option.empty, Option(client))
@@ -571,6 +573,7 @@ class TestHoodieSparkSqlWriter {
         new JavaSparkContext(sc),
         null,
         tempBasePath,
+        hoodieDefaultDBName,
         hoodieFooTableName,
         mapAsJavaMap(fooTableParams)).asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]])
 
diff --git a/hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java b/hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java
index 3b3b8eafb8..5746cacb0b 100644
--- a/hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java
+++ b/hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java
@@ -65,6 +65,7 @@ public class DefaultSource extends BaseDefaultSource implements DataSourceV2,
     String instantTime = options.get(DataSourceInternalWriterHelper.INSTANT_TIME_OPT_KEY).get();
     String path = options.get("path").get();
     String tblName = options.get(HoodieWriteConfig.TBL_NAME.key()).get();
+    String dbName = options.get(HoodieWriteConfig.DATABASE_NAME.key()).get();
     boolean populateMetaFields = options.getBoolean(HoodieTableConfig.POPULATE_META_FIELDS.key(),
         HoodieTableConfig.POPULATE_META_FIELDS.defaultValue());
     Map<String, String> properties = options.asMap();
diff --git a/hudi-spark-datasource/hudi-spark3-common/src/main/java/org/apache/hudi/spark3/internal/DefaultSource.java b/hudi-spark-datasource/hudi-spark3-common/src/main/java/org/apache/hudi/spark3/internal/DefaultSource.java
index ab2f16703b..90ae1e7377 100644
--- a/hudi-spark-datasource/hudi-spark3-common/src/main/java/org/apache/hudi/spark3/internal/DefaultSource.java
+++ b/hudi-spark-datasource/hudi-spark3-common/src/main/java/org/apache/hudi/spark3/internal/DefaultSource.java
@@ -52,6 +52,7 @@ public class DefaultSource extends BaseDefaultSource implements TableProvider {
     String instantTime = properties.get(DataSourceInternalWriterHelper.INSTANT_TIME_OPT_KEY);
     String path = properties.get("path");
     String tblName = properties.get(HoodieWriteConfig.TBL_NAME.key());
+    String dbName = properties.get(HoodieWriteConfig.DATABASE_NAME.key());
     boolean populateMetaFields = Boolean.parseBoolean(properties.getOrDefault(HoodieTableConfig.POPULATE_META_FIELDS.key(),
         Boolean.toString(HoodieTableConfig.POPULATE_META_FIELDS.defaultValue())));
     boolean arePartitionRecordsSorted = Boolean.parseBoolean(properties.getOrDefault(HoodieInternalConfig.BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED,
@@ -61,7 +62,8 @@ public class DefaultSource extends BaseDefaultSource implements TableProvider {
     // Auto set the value of "hoodie.parquet.writelegacyformat.enabled"
     tryOverrideParquetWriteLegacyFormatProperty(newProps, schema);
     // 1st arg to createHoodieConfig is not really required to be set. but passing it anyways.
-    HoodieWriteConfig config = DataSourceUtils.createHoodieConfig(newProps.get(HoodieWriteConfig.AVRO_SCHEMA_STRING.key()), path, tblName, newProps);
+    HoodieWriteConfig config = DataSourceUtils.createHoodieConfig(newProps.get(HoodieWriteConfig.AVRO_SCHEMA_STRING.key()), path,
+        dbName, tblName, newProps);
     return new HoodieDataSourceInternalTable(instantTime, config, schema, getSparkSession(),
         getConfiguration(), newProps, populateMetaFields, arePartitionRecordsSorted);
   }
diff --git a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
index 9a5366b12f..529b5bb49e 100644
--- a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
+++ b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
@@ -217,7 +217,7 @@ object Spark31AlterTableCommand extends Logging {
 
     val jsc = new JavaSparkContext(sparkSession.sparkContext)
     val client = DataSourceUtils.createHoodieClient(jsc, schema.toString,
-      path, table.identifier.table, parametersWithWriteDefaults(table.storage.properties).asJava)
+      path, table.identifier.database.getOrElse("default_db"), table.identifier.table, parametersWithWriteDefaults(table.storage.properties).asJava)
 
     val hadoopConf = sparkSession.sessionState.newHadoopConf()
     val metaClient = HoodieTableMetaClient.builder().setBasePath(path).setConf(hadoopConf).build()


[hudi] 10/45: adapt tspark changes: backport 3.3 VectorizedParquetReader related code to 3.1

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9c94e388fecce8d02388a81f0da769d52fe5319c
Author: shaoxiong.zhan <sh...@gmail.com>
AuthorDate: Fri Sep 16 11:00:33 2022 +0800

    adapt tspark changes: backport 3.3 VectorizedParquetReader related code to 3.1
---
 .../parquet/Spark31HoodieParquetFileFormat.scala     | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetFileFormat.scala b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetFileFormat.scala
index ca41490fc0..712c9c6d3e 100644
--- a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetFileFormat.scala
+++ b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetFileFormat.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources.parquet
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.FileSplit
 import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
 import org.apache.hudi.HoodieSparkUtils
@@ -135,14 +136,11 @@ class Spark31HoodieParquetFileFormat(private val shouldAppendPartitionValues: Bo
       assert(!shouldAppendPartitionValues || file.partitionValues.numFields == partitionSchema.size)
 
       val filePath = new Path(new URI(file.filePath))
-      val split =
-        new org.apache.parquet.hadoop.ParquetInputSplit(
-          filePath,
-          file.start,
-          file.start + file.length,
-          file.length,
-          Array.empty,
-          null)
+      /**
+       * Per https://github.com/apache/spark/pull/29542,
+       * org.apache.hadoop.mapred.FileSplit must be used here.
+       */
+      val split = new FileSplit(filePath, file.start, file.length, Array.empty[String])
 
       val sharedConf = broadcastedHadoopConf.value.value
 
@@ -170,7 +168,11 @@ class Spark31HoodieParquetFileFormat(private val shouldAppendPartitionValues: Bo
       // Try to push down filters when filter push-down is enabled.
       val pushed = if (enableParquetFilterPushDown) {
         val parquetSchema = footerFileMetaData.getSchema
-        val parquetFilters = if (HoodieSparkUtils.gteqSpark3_1_3) {
+        /**
+         * Hard-coded adaptation: tspark backports the Spark 3.3 API to 3.1, so check the constructor arity instead of only the Spark version.
+         */
+        val ctor = classOf[ParquetFilters].getConstructors.head
+        val parquetFilters = if (8.equals(ctor.getParameterCount) || HoodieSparkUtils.gteqSpark3_1_3) {
           createParquetFilters(
             parquetSchema,
             pushDownDate,


[hudi] 20/45: remove hudi-kafka-connect module

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5f6d6ae42d4a4be1396c4cc585c503b1c23e9deb
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Wed Nov 2 14:44:30 2022 +0800

    remove hudi-kafka-connect module
---
 pom.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pom.xml b/pom.xml
index 0adb64838b..01be2c1f89 100644
--- a/pom.xml
+++ b/pom.xml
@@ -58,9 +58,9 @@
     <module>packaging/hudi-trino-bundle</module>
     <module>hudi-examples</module>
     <module>hudi-flink-datasource</module>
-    <module>hudi-kafka-connect</module>
+<!--    <module>hudi-kafka-connect</module>-->
     <module>packaging/hudi-flink-bundle</module>
-    <module>packaging/hudi-kafka-connect-bundle</module>
+<!--    <module>packaging/hudi-kafka-connect-bundle</module>-->
     <module>hudi-tests-common</module>
   </modules>
 


[hudi] 25/45: [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9fbf3b920d24e3e9f517beca10a2f4d892ae2d06
Author: jerryyue <je...@didiglobal.com>
AuthorDate: Fri Oct 28 19:16:46 2022 +0800

    [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
---
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  28 +++
 .../hudi/common/table/timeline/TimelineUtils.java  |   2 +-
 .../org/apache/hudi/common/util/DateTimeUtils.java |   8 +
 .../apache/hudi/configuration/FlinkOptions.java    |   7 +
 .../org/apache/hudi/sink/StreamWriteFunction.java  |   6 +
 .../hudi/sink/StreamWriteOperatorCoordinator.java  |  30 +++
 .../hudi/sink/append/AppendWriteFunction.java      |   2 +-
 .../hudi/sink/bulk/BulkInsertWriteFunction.java    |  10 +-
 .../sink/common/AbstractStreamWriteFunction.java   |  11 +-
 .../hudi/sink/common/AbstractWriteFunction.java    | 103 +++++++++
 .../hudi/sink/common/AbstractWriteOperator.java    |   9 +
 .../apache/hudi/sink/event/WriteMetadataEvent.java |  31 ++-
 .../java/org/apache/hudi/util/DataTypeUtils.java   | 141 +++++++++++++
 .../apache/hudi/sink/ITTestDataStreamWrite.java    |  35 ++++
 .../sink/TestWriteFunctionEventTimeExtract.java    | 232 +++++++++++++++++++++
 .../sink/utils/StreamWriteFunctionWrapper.java     |   5 +-
 .../apache/hudi/sink/utils/TestDataTypeUtils.java  |  45 ++++
 .../hudi/utils/source/ContinuousFileSource.java    |   5 +
 18 files changed, 692 insertions(+), 18 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
index a352e86b96..735940eea4 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
@@ -75,6 +75,7 @@ import java.util.List;
 import java.util.Map;
 import java.util.Deque;
 import java.util.LinkedList;
+import java.util.Objects;
 import java.util.Set;
 import java.util.TimeZone;
 import java.util.stream.Collectors;
@@ -657,6 +658,33 @@ public class HoodieAvroUtils {
     return fieldValue;
   }
 
+  public static Long getNestedFieldValAsLong(GenericRecord record, String fieldName, boolean consistentLogicalTimestampEnabled, Long defaultValue) {
+    GenericRecord valueNode = record;
+    Object fieldValue = valueNode.get(fieldName);
+    Schema fieldSchema = valueNode.getSchema().getField(fieldName).schema();
+    if (fieldSchema.getLogicalType() == LogicalTypes.date()) {
+      return LocalDate.ofEpochDay(Long.parseLong(fieldValue.toString())).toEpochDay();
+    } else if (fieldSchema.getLogicalType() == LogicalTypes.timestampMillis() && consistentLogicalTimestampEnabled) {
+      return new Timestamp(Long.parseLong(fieldValue.toString())).getTime();
+    } else if (fieldSchema.getLogicalType() == LogicalTypes.timestampMicros() && consistentLogicalTimestampEnabled) {
+      return new Timestamp(Long.parseLong(fieldValue.toString()) / 1000).getTime();
+    } else if (fieldSchema.getLogicalType() instanceof LogicalTypes.Decimal) {
+      Decimal dc = (Decimal) fieldSchema.getLogicalType();
+      DecimalConversion decimalConversion = new DecimalConversion();
+      if (fieldSchema.getType() == Schema.Type.FIXED) {
+        return decimalConversion.fromFixed((GenericFixed) fieldValue, fieldSchema,
+          LogicalTypes.decimal(dc.getPrecision(), dc.getScale())).longValue();
+      } else if (fieldSchema.getType() == Schema.Type.BYTES) {
+        ByteBuffer byteBuffer = (ByteBuffer) fieldValue;
+        BigDecimal convertedValue = decimalConversion.fromBytes(byteBuffer, fieldSchema,
+            LogicalTypes.decimal(dc.getPrecision(), dc.getScale()));
+        byteBuffer.rewind();
+        return convertedValue.longValue();
+      }
+    }
+    return Objects.isNull(fieldValue) ? defaultValue : Long.parseLong(fieldValue.toString());
+  }
+
   public static Schema getNullSchema() {
     return Schema.create(Schema.Type.NULL);
   }
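
A self-contained sketch of calling the getNestedFieldValAsLong helper added above; the record schema and the "ts" field name are made up for the example:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hudi.avro.HoodieAvroUtils;

    public class EventTimeExtractExample {
      public static void main(String[] args) {
        // Plain long field; the helper also handles date/timestamp/decimal logical types.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
                + "{\"name\":\"ts\",\"type\":\"long\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("ts", 1666954800000L);

        Long eventTime = HoodieAvroUtils.getNestedFieldValAsLong(record, "ts", true, -1L);
        System.out.println(eventTime); // 1666954800000
      }
    }
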
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
index 75493e7b46..1b8450eecc 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
@@ -173,7 +173,7 @@ public class TimelineUtils {
           HoodieInstant::getTimestamp, instant -> getMetadataValue(metaClient, extraMetadataKey, instant)));
   }
 
-  private static Option<String> getMetadataValue(HoodieTableMetaClient metaClient, String extraMetadataKey, HoodieInstant instant) {
+  public static Option<String> getMetadataValue(HoodieTableMetaClient metaClient, String extraMetadataKey, HoodieInstant instant) {
     try {
       LOG.info("reading checkpoint info for:"  + instant + " key: " + extraMetadataKey);
       HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/DateTimeUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/DateTimeUtils.java
index cf90eff8d6..1b1e845dfc 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/DateTimeUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/DateTimeUtils.java
@@ -22,7 +22,9 @@ package org.apache.hudi.common.util;
 import java.time.Duration;
 import java.time.Instant;
 import java.time.LocalDateTime;
+import java.time.OffsetDateTime;
 import java.time.ZoneId;
+import java.time.ZoneOffset;
 import java.time.format.DateTimeFormatter;
 import java.time.format.DateTimeParseException;
 import java.time.temporal.ChronoUnit;
@@ -39,6 +41,8 @@ public class DateTimeUtils {
   private static final Map<String, ChronoUnit> LABEL_TO_UNIT_MAP =
       Collections.unmodifiableMap(initMap());
 
+  public static final OffsetDateTime EPOCH = Instant.ofEpochSecond(0).atOffset(ZoneOffset.UTC);
+
   /**
    * Converts provided microseconds (from epoch) to {@link Instant}
    */
@@ -172,6 +176,10 @@ public class DateTimeUtils {
         .format(dtf);
   }
 
+  public static long millisFromTimestamp(LocalDateTime dateTime) {
+    return ChronoUnit.MILLIS.between(EPOCH, dateTime.atOffset(ZoneOffset.UTC));
+  }
+
   /**
    * Enum which defines time unit, mostly used to parse value from configuration file.
    */
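
A quick sketch of the millisFromTimestamp helper added above; the timestamp is arbitrary and the helper interprets the LocalDateTime as UTC:

    import org.apache.hudi.common.util.DateTimeUtils;

    import java.time.LocalDateTime;

    public class MillisFromTimestampExample {
      public static void main(String[] args) {
        LocalDateTime dt = LocalDateTime.of(2022, 10, 28, 11, 0, 0);
        // Millis between 1970-01-01T00:00Z and the given date-time taken as UTC.
        long millis = DateTimeUtils.millisFromTimestamp(dt);
        System.out.println(millis); // 1666954800000
      }
    }
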
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index 31c8b554c0..aa1e3297bd 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -24,6 +24,7 @@ import org.apache.hudi.common.config.ConfigGroups;
 import org.apache.hudi.common.config.HoodieConfig;
 import org.apache.hudi.common.model.EventTimeAvroPayload;
 import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodiePayloadProps;
 import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.config.HoodieIndexConfig;
@@ -255,6 +256,12 @@ public class FlinkOptions extends HoodieConfig {
       .defaultValue(60)// default 1 minute
       .withDescription("Check interval for streaming read of SECOND, default 1 minute");
 
+  public static final ConfigOption<String> EVENT_TIME_FIELD = ConfigOptions
+      .key(HoodiePayloadProps.PAYLOAD_EVENT_TIME_FIELD_PROP_KEY)
+      .stringType()
+      .noDefaultValue()
+      .withDescription("event time field name for flink");
+
   // this option is experimental
   public static final ConfigOption<Boolean> READ_STREAMING_SKIP_COMPACT = ConfigOptions
       .key("read.streaming.skip_compaction")
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
index a0f994f04a..bd1b1e68f1 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
@@ -446,6 +446,7 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
         .writeStatus(writeStatus)
         .lastBatch(false)
         .endInput(false)
+        .maxEventTime(this.currentTimeStamp)
         .build();
 
     this.eventGateway.sendEventToCoordinator(event);
@@ -482,14 +483,19 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
       LOG.info("No data to write in subtask [{}] for instant [{}]", taskID, currentInstant);
       writeStatus = Collections.emptyList();
     }
+
     final WriteMetadataEvent event = WriteMetadataEvent.builder()
         .taskID(taskID)
         .instantTime(currentInstant)
         .writeStatus(writeStatus)
         .lastBatch(true)
         .endInput(endInput)
+        .maxEventTime(this.currentTimeStamp)
         .build();
 
+    LOG.info("Write MetadataEvent in subtask [{}] for instant [{}] maxEventTime [{}]",
+        taskID, currentInstant, this.currentTimeStamp);
+
     this.eventGateway.sendEventToCoordinator(event);
     this.buckets.clear();
     this.tracer.reset();
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
index 670748b90f..578bb10db5 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
@@ -56,6 +56,7 @@ import java.io.Serializable;
 import java.util.Arrays;
 import java.util.Collection;
 import java.util.Collections;
+import java.util.Comparator;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Locale;
@@ -147,6 +148,10 @@ public class StreamWriteOperatorCoordinator
    */
   private transient TableState tableState;
 
+  private transient Long minEventTime = Long.MAX_VALUE;
+
+  private final boolean commitEventTimeEnable;
+
   /**
    * The checkpoint metadata.
    */
@@ -165,6 +170,7 @@ public class StreamWriteOperatorCoordinator
     this.context = context;
     this.parallelism = context.currentParallelism();
     this.hiveConf = new SerializableConfiguration(HadoopConfigurations.getHiveConf(conf));
+    this.commitEventTimeEnable = Objects.nonNull(conf.get(FlinkOptions.EVENT_TIME_FIELD));
   }
 
   @Override
@@ -503,10 +509,30 @@ public class StreamWriteOperatorCoordinator
       sendCommitAckEvents(checkpointId);
       return false;
     }
+    setMinEventTime();
     doCommit(instant, writeResults);
+    resetMinEventTime();
     return true;
   }
 
+  public void setMinEventTime() {
+    if (commitEventTimeEnable) {
+      LOG.info("[setMinEventTime] receive event time for current commit: {} ", Arrays.stream(eventBuffer).map(WriteMetadataEvent::getMaxEventTime).map(String::valueOf)
+          .collect(Collectors.joining(", ")));
+      this.minEventTime = Arrays.stream(eventBuffer)
+          .filter(Objects::nonNull)
+          .filter(event -> event.getMaxEventTime() > 0)
+          .map(WriteMetadataEvent::getMaxEventTime)
+          .min(Comparator.naturalOrder())
+          .map(aLong -> Math.min(aLong, this.minEventTime)).orElse(Long.MAX_VALUE);
+      LOG.info("[setMinEventTime] minEventTime: {} ", this.minEventTime);
+    }
+  }
+
+  public void resetMinEventTime() {
+    this.minEventTime = Long.MAX_VALUE;
+  }
+
   /**
    * Performs the actual commit action.
    */
@@ -519,6 +545,10 @@ public class StreamWriteOperatorCoordinator
 
     if (!hasErrors || this.conf.getBoolean(FlinkOptions.IGNORE_FAILED)) {
       HashMap<String, String> checkpointCommitMetadata = new HashMap<>();
+      if (commitEventTimeEnable) {
+        LOG.info("[doCommit] minEventTime: {} ", this.minEventTime);
+        checkpointCommitMetadata.put(FlinkOptions.EVENT_TIME_FIELD.key(), this.minEventTime.toString());
+      }
       if (hasErrors) {
         LOG.warn("Some records failed to merge but forcing commit since commitOnErrors set to true. Errors/Total="
             + totalErrorRecords + "/" + totalRecords);
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java
index e1db125731..afecc6cc49 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java
@@ -59,7 +59,6 @@ public class AppendWriteFunction<I> extends AbstractStreamWriteFunction<I> {
    * Table row type.
    */
   private final RowType rowType;
-
   /**
    * Constructs an AppendWriteFunction.
    *
@@ -133,6 +132,7 @@ public class AppendWriteFunction<I> extends AbstractStreamWriteFunction<I> {
         .writeStatus(writeStatus)
         .lastBatch(true)
         .endInput(endInput)
+        .maxEventTime(this.currentTimeStamp)
         .build();
     this.eventGateway.sendEventToCoordinator(event);
     // nullify the write helper for next ckp
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriteFunction.java
index 06d9fcd851..63560fa123 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriteFunction.java
@@ -63,12 +63,6 @@ public class BulkInsertWriteFunction<I>
    * Helper class for bulk insert mode.
    */
   private transient BulkInsertWriterHelper writerHelper;
-
-  /**
-   * Config options.
-   */
-  private final Configuration config;
-
   /**
    * Table row type.
    */
@@ -105,7 +99,7 @@ public class BulkInsertWriteFunction<I>
    * @param config The config options
    */
   public BulkInsertWriteFunction(Configuration config, RowType rowType) {
-    this.config = config;
+    super(config);
     this.rowType = rowType;
   }
 
@@ -144,6 +138,7 @@ public class BulkInsertWriteFunction<I>
         .writeStatus(writeStatus)
         .lastBatch(true)
         .endInput(true)
+        .maxEventTime(this.currentTimeStamp)
         .build();
     this.eventGateway.sendEventToCoordinator(event);
   }
@@ -180,6 +175,7 @@ public class BulkInsertWriteFunction<I>
         .bootstrap(true)
         .build();
     this.eventGateway.sendEventToCoordinator(event);
+    resetEventTime();
     LOG.info("Send bootstrap write metadata event to coordinator, task[{}].", taskID);
   }
 
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
index 674cd3588a..b4569894a2 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
@@ -58,12 +58,6 @@ public abstract class AbstractStreamWriteFunction<I>
     implements CheckpointedFunction {
 
   private static final Logger LOG = LoggerFactory.getLogger(AbstractStreamWriteFunction.class);
-
-  /**
-   * Config options.
-   */
-  protected final Configuration config;
-
   /**
    * Id of current subtask.
    */
@@ -132,7 +126,7 @@ public abstract class AbstractStreamWriteFunction<I>
    * @param config The config options
    */
   public AbstractStreamWriteFunction(Configuration config) {
-    this.config = config;
+    super(config);
   }
 
   @Override
@@ -166,6 +160,8 @@ public abstract class AbstractStreamWriteFunction<I>
     snapshotState();
     // Reload the snapshot state as the current state.
     reloadWriteMetaState();
+    // reset the event time for the current checkpoint interval
+    resetEventTime();
   }
 
   public abstract void snapshotState();
@@ -225,6 +221,7 @@ public abstract class AbstractStreamWriteFunction<I>
         .instantTime(currentInstant)
         .writeStatus(new ArrayList<>(writeStatuses))
         .bootstrap(true)
+        .maxEventTime(this.currentTimeStamp)
         .build();
     this.writeMetadataState.add(event);
     writeStatuses.clear();
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteFunction.java
index 9e131ff91e..7756a1c2fa 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteFunction.java
@@ -18,10 +18,26 @@
 
 package org.apache.hudi.sink.common;
 
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.flink.configuration.Configuration;
 import org.apache.flink.runtime.operators.coordination.OperatorEvent;
 import org.apache.flink.runtime.operators.coordination.OperatorEventGateway;
 import org.apache.flink.streaming.api.functions.ProcessFunction;
 import org.apache.flink.streaming.api.operators.BoundedOneInput;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.HoodieAvroRecord;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.DataTypeUtils;
+import org.apache.hudi.util.StreamerUtil;
+
+import java.io.IOException;
+import java.util.Objects;
 
 /**
  * Base class for write function.
@@ -29,6 +45,39 @@ import org.apache.flink.streaming.api.operators.BoundedOneInput;
  * @param <I> the input type
  */
 public abstract class AbstractWriteFunction<I> extends ProcessFunction<I, Object> implements BoundedOneInput {
+
+  /**
+   * Config options.
+   */
+  protected final Configuration config;
+  /**
+   * The maximum event time this write task has seen so far.
+   */
+  protected Long currentTimeStamp = -1L;
+
+  protected String eventTimeField;
+
+  protected int eventTimeFieldIndex;
+
+  protected LogicalType eventTimeDataType;
+
+  protected RowData.FieldGetter eventTimeFieldGetter;
+
+  protected Schema writeSchema;
+
+  public AbstractWriteFunction(Configuration config) {
+    this.config = config;
+    if (config.containsKey(FlinkOptions.EVENT_TIME_FIELD.key())) {
+      this.writeSchema = StreamerUtil.getSourceSchema(config);
+      this.eventTimeField = config.getString(FlinkOptions.EVENT_TIME_FIELD);
+      this.eventTimeFieldIndex = this.writeSchema.getField(this.eventTimeField).pos();
+      RowType rowType =
+          (RowType) AvroSchemaConverter.convertToDataType(this.writeSchema).getLogicalType();
+      this.eventTimeDataType = rowType.getTypeAt(eventTimeFieldIndex);
+      this.eventTimeFieldGetter = RowData.createFieldGetter(eventTimeDataType, eventTimeFieldIndex);
+    }
+  }
+
   /**
    * Sets up the event gateway.
    */
@@ -45,4 +94,58 @@ public abstract class AbstractWriteFunction<I> extends ProcessFunction<I, Object
    * @param event The event
    */
   public abstract void handleOperatorEvent(OperatorEvent event);
+
+  /**
+   * Extracts the timestamp from the input value using the configured event time field.
+   *
+   * @param value The input value
+   * @return the extracted event timestamp
+   */
+  public long extractTimestamp(I value) {
+    if (value instanceof HoodieAvroRecord) {
+      return extractTimestamp((HoodieAvroRecord) value);
+    }
+    return extractTimestamp((RowData) value);
+  }
+
+  /**
+   * Returns whether event time extraction from records is enabled.
+   *
+   * @return true if the event time field is configured, false otherwise
+   */
+  public boolean extractTimeStampEnable() {
+    return Objects.nonNull(this.eventTimeField);
+  }
+
+  public long extractTimestamp(HoodieAvroRecord value) {
+    try {
+      GenericRecord record = (GenericRecord) value.getData()
+          .getInsertValue(this.writeSchema).get();
+      Long eventTime = HoodieAvroUtils.getNestedFieldValAsLong(
+          record, eventTimeField,
+          true, -1L);
+      this.currentTimeStamp = Math.max(eventTime, this.currentTimeStamp);
+      return eventTime;
+    } catch (IOException e) {
+      throw new HoodieException("extract event time failed. " + e);
+    }
+  }
+
+  public long extractTimestamp(RowData value) {
+    Long eventTime = -1L;
+    try {
+      Object eventTimeObject =
+          value.isNullAt(eventTimeFieldIndex) ? -1L : this.eventTimeFieldGetter.getFieldOrNull(value);
+      eventTime = DataTypeUtils.getAsLong(eventTimeObject, this.eventTimeDataType);
+      this.currentTimeStamp = Math.max(eventTime, this.currentTimeStamp);
+    } catch (Throwable e) {
+      throw new HoodieException(String.format("Extract event time failed: eventTimeFieldIndex=%s, eventTimeDataType=%s, eventTime=%s",
+          this.eventTimeFieldIndex, this.eventTimeDataType, eventTime), e);
+    }
+    return eventTime;
+  }
+
+  public void resetEventTime() {
+    this.currentTimeStamp = -1L;
+  }
 }
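
For context, the event-time extraction wired into AbstractWriteFunction above is driven entirely by table options. A minimal sketch of switching it on, assuming a placeholder base path and the `ts` column used by the tests further below:

    // minimal sketch: enable event-time extraction for the Hudi Flink sink
    String basePath = "/tmp/hudi_table";                    // placeholder table path
    Configuration conf = new Configuration();               // org.apache.flink.configuration.Configuration
    conf.setString(FlinkOptions.PATH, basePath);
    conf.setString(FlinkOptions.EVENT_TIME_FIELD, "ts");    // makes extractTimeStampEnable() return true
    conf.setBoolean(FlinkOptions.PRE_COMBINE, true);        // optional, mirrors the integration test below

With the option set, AbstractWriteOperator#processElement (next diff) extracts the per-record event time before forwarding the element, and the write function carries the running maximum into its WriteMetadataEvent.
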
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteOperator.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteOperator.java
index e339ccb0b7..bd41898487 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteOperator.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractWriteOperator.java
@@ -23,6 +23,7 @@ import org.apache.flink.runtime.operators.coordination.OperatorEventGateway;
 import org.apache.flink.runtime.operators.coordination.OperatorEventHandler;
 import org.apache.flink.streaming.api.operators.BoundedOneInput;
 import org.apache.flink.streaming.api.operators.ProcessOperator;
+import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
 
 /**
  * Base class for write operator.
@@ -39,6 +40,14 @@ public abstract class AbstractWriteOperator<I>
     this.function = function;
   }
 
+  @Override
+  public void processElement(StreamRecord<I> element) throws Exception {
+    if (this.function.extractTimeStampEnable()) {
+      this.function.extractTimestamp(element.getValue());
+    }
+    super.processElement(element);
+  }
+
   public void setOperatorEventGateway(OperatorEventGateway operatorEventGateway) {
     this.function.setOperatorEventGateway(operatorEventGateway);
   }
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/event/WriteMetadataEvent.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/event/WriteMetadataEvent.java
index 0eb06bdd82..f68052e895 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/event/WriteMetadataEvent.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/event/WriteMetadataEvent.java
@@ -54,6 +54,7 @@ public class WriteMetadataEvent implements OperatorEvent {
    */
   private boolean bootstrap;
 
+  private Long maxEventTime = Long.MIN_VALUE;
   /**
    * Creates an event.
    *
@@ -81,6 +82,18 @@ public class WriteMetadataEvent implements OperatorEvent {
     this.bootstrap = bootstrap;
   }
 
+  private WriteMetadataEvent(
+      int taskID,
+      String instantTime,
+      List<WriteStatus> writeStatuses,
+      boolean lastBatch,
+      boolean endInput,
+      boolean bootstrap,
+      Long maxEventTime) {
+    this(taskID, instantTime, writeStatuses, lastBatch, endInput, bootstrap);
+    this.maxEventTime = maxEventTime;
+  }
+
   // default constructor for efficient serialization
   public WriteMetadataEvent() {
   }
@@ -140,6 +153,14 @@ public class WriteMetadataEvent implements OperatorEvent {
     this.lastBatch = lastBatch;
   }
 
+  public Long getMaxEventTime() {
+    return maxEventTime;
+  }
+
+  public void setMaxEventTime(Long maxEventTime) {
+    this.maxEventTime = maxEventTime;
+  }
+
   /**
    * Merges this event with given {@link WriteMetadataEvent} {@code other}.
    *
@@ -172,6 +193,7 @@ public class WriteMetadataEvent implements OperatorEvent {
         + ", lastBatch=" + lastBatch
         + ", endInput=" + endInput
         + ", bootstrap=" + bootstrap
+        + ", maxEventTime=" + maxEventTime
         + '}';
   }
 
@@ -209,11 +231,13 @@ public class WriteMetadataEvent implements OperatorEvent {
     private boolean endInput = false;
     private boolean bootstrap = false;
 
+    private Long maxEventTime = Long.MIN_VALUE;
+
     public WriteMetadataEvent build() {
       Objects.requireNonNull(taskID);
       Objects.requireNonNull(instantTime);
       Objects.requireNonNull(writeStatus);
-      return new WriteMetadataEvent(taskID, instantTime, writeStatus, lastBatch, endInput, bootstrap);
+      return new WriteMetadataEvent(taskID, instantTime, writeStatus, lastBatch, endInput, bootstrap, maxEventTime);
     }
 
     public Builder taskID(int taskID) {
@@ -226,6 +250,11 @@ public class WriteMetadataEvent implements OperatorEvent {
       return this;
     }
 
+    public Builder maxEventTime(Long maxEventTime) {
+      this.maxEventTime = maxEventTime;
+      return this;
+    }
+
     public Builder writeStatus(List<WriteStatus> writeStatus) {
       this.writeStatus = writeStatus;
       return this;
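
A minimal sketch of the extended event builder above; taskId, instant and writeStatuses are assumed to be in scope, and builder() is the event's existing static factory:

    // taskId, instant and writeStatuses are assumed to be available in the write task
    WriteMetadataEvent event = WriteMetadataEvent.builder()
        .taskID(taskId)
        .instantTime(instant)
        .writeStatus(writeStatuses)
        .lastBatch(true)
        .maxEventTime(currentTimeStamp)    // new: the max event time this task saw in the interval
        .build();
    Long maxEventTime = event.getMaxEventTime();   // read back, e.g. on the coordinator side
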
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/DataTypeUtils.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/DataTypeUtils.java
index e91432b5e3..6c4bb9925c 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/DataTypeUtils.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/DataTypeUtils.java
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.util;
 
+import org.apache.flink.table.api.DataTypes;
+import org.apache.flink.table.data.TimestampData;
 import org.apache.flink.table.types.DataType;
 import org.apache.flink.table.types.logical.LocalZonedTimestampType;
 import org.apache.flink.table.types.logical.LogicalType;
@@ -26,15 +28,41 @@ import org.apache.flink.table.types.logical.LogicalTypeRoot;
 import org.apache.flink.table.types.logical.RowType;
 import org.apache.flink.table.types.logical.TimestampType;
 
+import javax.annotation.Nullable;
 import java.math.BigDecimal;
+import java.time.DateTimeException;
 import java.time.LocalDate;
 import java.time.LocalDateTime;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.time.temporal.TemporalAccessor;
+import java.util.ArrayList;
 import java.util.Arrays;
+import java.util.List;
+
+import static java.time.temporal.ChronoField.DAY_OF_MONTH;
+import static java.time.temporal.ChronoField.HOUR_OF_DAY;
+import static java.time.temporal.ChronoField.MINUTE_OF_HOUR;
+import static java.time.temporal.ChronoField.MONTH_OF_YEAR;
+import static java.time.temporal.ChronoField.NANO_OF_SECOND;
+import static java.time.temporal.ChronoField.SECOND_OF_MINUTE;
+import static java.time.temporal.ChronoField.YEAR;
 
 /**
  * Utilities for {@link org.apache.flink.table.types.DataType}.
  */
 public class DataTypeUtils {
+
+  private static final DateTimeFormatter DEFAULT_TIMESTAMP_FORMATTER =
+      new DateTimeFormatterBuilder()
+          .appendPattern("yyyy-[MM][M]-[dd][d]")
+          .optionalStart()
+          .appendPattern(" [HH][H]:[mm][m]:[ss][s]")
+          .appendFraction(NANO_OF_SECOND, 0, 9, true)
+          .optionalEnd()
+          .toFormatter();
+
   /**
    * Returns whether the given type is TIMESTAMP type.
    */
@@ -123,4 +151,117 @@ public class DataTypeUtils {
                 "Can not convert %s to type %s for partition value", partition, type));
     }
   }
+
+  /**
+   * Ensures the given columns of the row data type are not nullable (for example, the primary keys).
+   *
+   * @param dataType  The row data type; its logical type must be a {@link RowType}
+   * @param pkColumns The primary keys
+   * @return a new row data type if any column nullability is tweaked or the original data type
+   */
+  public static DataType ensureColumnsAsNonNullable(DataType dataType, @Nullable List<String> pkColumns) {
+    if (pkColumns == null || pkColumns.isEmpty()) {
+      return dataType;
+    }
+    LogicalType dataTypeLogicalType = dataType.getLogicalType();
+    if (!(dataTypeLogicalType instanceof RowType)) {
+      throw new RuntimeException("The data type to be converted must be a row type, but was: " + dataTypeLogicalType.getClass());
+    }
+    RowType rowType = (RowType) dataTypeLogicalType;
+    List<DataType> originalFieldTypes = dataType.getChildren();
+    List<String> fieldNames = rowType.getFieldNames();
+    List<DataType> fieldTypes = new ArrayList<>();
+    boolean tweaked = false;
+    for (int i = 0; i < fieldNames.size(); i++) {
+      if (pkColumns.contains(fieldNames.get(i)) && rowType.getTypeAt(i).isNullable()) {
+        fieldTypes.add(originalFieldTypes.get(i).notNull());
+        tweaked = true;
+      } else {
+        fieldTypes.add(originalFieldTypes.get(i));
+      }
+    }
+    if (!tweaked) {
+      return dataType;
+    }
+    List<DataTypes.Field> fields = new ArrayList<>();
+    for (int i = 0; i < fieldNames.size(); i++) {
+      fields.add(DataTypes.FIELD(fieldNames.get(i), fieldTypes.get(i)));
+    }
+    return DataTypes.ROW(fields.stream().toArray(DataTypes.Field[]::new)).notNull();
+  }
+
+  public static Long getAsLong(Object value, LogicalType logicalType) {
+    if (logicalType.getTypeRoot() == LogicalTypeRoot.TIMESTAMP_WITHOUT_TIME_ZONE
+        || logicalType.getTypeRoot() == LogicalTypeRoot.DATE
+        || logicalType.getTypeRoot() == LogicalTypeRoot.TIME_WITHOUT_TIME_ZONE) {
+      return toMills(toLocalDateTime(value.toString()));
+    }
+    return Long.parseLong(value.toString());
+  }
+
+  public static LocalDateTime toLocalDateTime(String timestampString) {
+    try {
+      return parseTimestampData(timestampString);
+    } catch (DateTimeParseException e) {
+      return LocalDateTime.parse(timestampString);
+    }
+  }
+
+  public static LocalDateTime parseTimestampData(String dateStr) throws DateTimeException {
+    // Precision is hardcoded to match signature of TO_TIMESTAMP
+    //  https://issues.apache.org/jira/browse/FLINK-14925
+    return parseTimestampData(dateStr, 3);
+  }
+
+  public static LocalDateTime parseTimestampData(String dateStr, int precision)
+      throws DateTimeException {
+    return fromTemporalAccessor(DEFAULT_TIMESTAMP_FORMATTER.parse(dateStr), precision);
+  }
+
+  public static long toMills(LocalDateTime dateTime) {
+    return TimestampData.fromLocalDateTime(dateTime).getMillisecond();
+  }
+
+  private static LocalDateTime fromTemporalAccessor(TemporalAccessor accessor, int precision) {
+    // complement year with 1970
+    int year = accessor.isSupported(YEAR) ? accessor.get(YEAR) : 1970;
+    // complement month with 1
+    int month = accessor.isSupported(MONTH_OF_YEAR) ? accessor.get(MONTH_OF_YEAR) : 1;
+    // complement day with 1
+    int day = accessor.isSupported(DAY_OF_MONTH) ? accessor.get(DAY_OF_MONTH) : 1;
+    // complement hour with 0
+    int hour = accessor.isSupported(HOUR_OF_DAY) ? accessor.get(HOUR_OF_DAY) : 0;
+    // complement minute with 0
+    int minute = accessor.isSupported(MINUTE_OF_HOUR) ? accessor.get(MINUTE_OF_HOUR) : 0;
+    // complement second with 0
+    int second = accessor.isSupported(SECOND_OF_MINUTE) ? accessor.get(SECOND_OF_MINUTE) : 0;
+    // complement nano_of_second with 0
+    int nanoOfSecond = accessor.isSupported(NANO_OF_SECOND) ? accessor.get(NANO_OF_SECOND) : 0;
+
+    if (precision == 0) {
+      nanoOfSecond = 0;
+    } else if (precision != 9) {
+      nanoOfSecond = (int) floor(nanoOfSecond, powerX(10, 9 - precision));
+    }
+
+    return LocalDateTime.of(year, month, day, hour, minute, second, nanoOfSecond);
+  }
+
+  private static long floor(long a, long b) {
+    long r = a % b;
+    if (r < 0) {
+      return a - r - b;
+    } else {
+      return a - r;
+    }
+  }
+
+  private static long powerX(long a, long b) {
+    long x = 1;
+    while (b > 0) {
+      x *= a;
+      --b;
+    }
+    return x;
+  }
 }
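
A minimal sketch of what the new parsing helpers above accept; the inputs are illustrative only:

    // date-only input: missing time fields are complemented with defaults (00:00:00.000)
    LocalDateTime d = DataTypeUtils.parseTimestampData("2012-12-12");
    // space-separated timestamp with a fraction, matched by DEFAULT_TIMESTAMP_FORMATTER
    LocalDateTime t = DataTypeUtils.parseTimestampData("2012-12-12 12:12:12.123");
    long millis = DataTypeUtils.toMills(t);
    // getAsLong converts temporal types through the same path and parses numeric values directly
    Long asLong = DataTypeUtils.getAsLong("100", DataTypes.BIGINT().getLogicalType());   // 100L
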
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/ITTestDataStreamWrite.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/ITTestDataStreamWrite.java
index 193c0abcd8..9556a94083 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/ITTestDataStreamWrite.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/ITTestDataStreamWrite.java
@@ -18,14 +18,21 @@
 
 package org.apache.hudi.sink;
 
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.configuration.FlinkOptions;
 import org.apache.hudi.configuration.OptionsInference;
 import org.apache.hudi.sink.transform.ChainedTransformer;
 import org.apache.hudi.sink.transform.Transformer;
 import org.apache.hudi.sink.utils.Pipelines;
+import org.apache.hudi.table.HoodieFlinkTable;
 import org.apache.hudi.util.AvroSchemaConverter;
 import org.apache.hudi.util.HoodiePipeline;
 import org.apache.hudi.util.StreamerUtil;
@@ -67,6 +74,9 @@ import java.util.List;
 import java.util.Map;
 import java.util.Objects;
 import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
 
 /**
  * Integration test for Flink Hoodie stream sink.
@@ -109,6 +119,31 @@ public class ITTestDataStreamWrite extends TestLogger {
     testWriteToHoodie(conf, "cow_write", 2, EXPECTED);
   }
 
+  @ParameterizedTest
+  @ValueSource(strings = {"BUCKET", "FLINK_STATE"})
+  public void testWriteCopyOnWriteWithEventTimeExtract(String indexType) throws Exception {
+    Configuration conf = TestConfigurations.getDefaultConf(tempFile.toURI().toString());
+    conf.setString(FlinkOptions.INDEX_TYPE, indexType);
+    conf.setInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, 1);
+    conf.setString(FlinkOptions.INDEX_KEY_FIELD, "id");
+    conf.setBoolean(FlinkOptions.PRE_COMBINE, true);
+    conf.setString(FlinkOptions.EVENT_TIME_FIELD, "ts");
+
+    testWriteToHoodie(conf, "cow_write", 1, EXPECTED);
+    HoodieTableMetaClient metaClient = HoodieTestUtils.init(conf.getString(FlinkOptions.PATH));
+    HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(conf.getString(FlinkOptions.PATH)).build();
+    HoodieFlinkTable<?> table = HoodieFlinkTable.create(config, HoodieFlinkEngineContext.DEFAULT, metaClient);
+    List<HoodieInstant> hoodieInstants = table.getFileSystemView().getTimeline().getInstants().collect(Collectors.toList());
+    assertEquals(1, hoodieInstants.size());
+    byte[] data = table.getFileSystemView().getTimeline().getInstantDetails(hoodieInstants.get(0)).get();
+    Map<String, String> extraMetadata = HoodieCommitMetadata.fromBytes(data, HoodieCommitMetadata.class).getExtraMetadata();
+    if (indexType.equals("BUCKET")) {
+      assertEquals("2000", extraMetadata.get(FlinkOptions.EVENT_TIME_FIELD.key()));
+    } else {
+      assertEquals("4000", extraMetadata.get(FlinkOptions.EVENT_TIME_FIELD.key()));
+    }
+  }
+
   @Test
   public void testWriteCopyOnWriteWithTransformer() throws Exception {
     Transformer transformer = (ds) -> ds.map((rowdata) -> {
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteFunctionEventTimeExtract.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteFunctionEventTimeExtract.java
new file mode 100644
index 0000000000..52ba4e07b9
--- /dev/null
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteFunctionEventTimeExtract.java
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink;
+
+import org.apache.flink.api.common.JobStatus;
+import org.apache.flink.calcite.shaded.com.google.common.collect.Lists;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.execution.JobClient;
+import org.apache.flink.formats.common.TimestampFormat;
+import org.apache.flink.formats.json.JsonRowDataDeserializationSchema;
+import org.apache.flink.streaming.api.CheckpointingMode;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.table.api.DataTypes;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
+import org.apache.flink.table.types.DataType;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.TestLogger;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineUtils;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.DateTimeUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.util.HoodiePipeline;
+import org.apache.hudi.utils.TestConfigurations;
+import org.apache.hudi.utils.source.ContinuousFileSource;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.io.TempDir;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.time.LocalDateTime;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Stream;
+
+public class TestWriteFunctionEventTimeExtract extends TestLogger {
+
+  @TempDir
+  File tempFile;
+
+  public static final DataType ROW_DATA_TYPE = DataTypes.ROW(
+          DataTypes.FIELD("id", DataTypes.VARCHAR(20)),// record key
+          DataTypes.FIELD("data", DataTypes.VARCHAR(10)),
+          DataTypes.FIELD("ts", DataTypes.TIMESTAMP(3)), // precombine field
+          DataTypes.FIELD("partition", DataTypes.VARCHAR(10)))
+      .notNull();
+
+  private static Stream<Arguments> parameters() {
+    return Stream.of(
+        Arguments.of("COPY_ON_WRITE", "FLINK_STATE", 1, false),
+        Arguments.of("COPY_ON_WRITE", "FLINK_STATE", 2, true),
+        Arguments.of("MERGE_ON_READ", "FLINK_STATE", 1, false),
+        Arguments.of("MERGE_ON_READ", "FLINK_STATE", 2, true),
+        Arguments.of("COPY_ON_WRITE", "BUCKET", 1, false),
+        Arguments.of("COPY_ON_WRITE", "BUCKET", 2, true),
+        Arguments.of("MERGE_ON_READ", "BUCKET", 1, false),
+        Arguments.of("MERGE_ON_READ", "BUCKET", 2, true),
+        Arguments.of("MERGE_ON_READ", "NON_INDEX", 1, false),
+        Arguments.of("MERGE_ON_READ", "NON_INDEX", 2, true));
+  }
+
+  @ParameterizedTest
+  @MethodSource("parameters")
+  void testWriteWithEventTime(String tableType, String indexType, int parallelism, boolean partitioned) throws Exception {
+    Configuration conf = getConf(tableType, indexType, parallelism);
+
+    List<Row> rows =
+        Lists.newArrayList(
+            Row.of("1", "hello", LocalDateTime.parse("2012-12-12T12:12:12")),
+            Row.of("2", "world", LocalDateTime.parse("2012-12-12T12:12:01")),
+            Row.of("3", "world", LocalDateTime.parse("2012-12-12T12:12:02")),
+            Row.of("4", "foo", LocalDateTime.parse("2012-12-12T12:12:10")));
+
+    // write rowData with eventTime
+    testWriteToHoodie(conf, parallelism, partitioned, ((RowType) ROW_DATA_TYPE.getLogicalType()), rows);
+
+    // check eventTime
+    checkWriteEventTime(indexType, parallelism, conf);
+  }
+
+  public Configuration getConf(String tableType, String indexType, int parallelism) {
+    Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath(), ROW_DATA_TYPE);
+    conf.setString(FlinkOptions.TABLE_TYPE, tableType);
+    conf.setString(FlinkOptions.INDEX_TYPE, indexType);
+    conf.setString(FlinkOptions.RECORD_KEY_FIELD, "id");
+    conf.setString(FlinkOptions.INDEX_KEY_FIELD, "id");
+    conf.setString(FlinkOptions.EVENT_TIME_FIELD, "ts");
+    conf.setInteger(FlinkOptions.WRITE_TASKS, parallelism);
+    conf.setString("hoodie.metrics.on", "false");
+    conf.setInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, parallelism);
+    return conf;
+  }
+
+  private void checkWriteEventTime(String indexType, int parallelism, Configuration conf) throws IOException {
+    if (parallelism <= 1) { // single mode
+      Assertions.assertEquals(DateTimeUtils.millisFromTimestamp(LocalDateTime.parse("2012-12-12T12:12:12")),
+          getLastEventTime(conf));
+    } else if (indexType.equals("BUCKET") || indexType.equals("NON_INDEX")) { // hash mode
+      Assertions.assertEquals(DateTimeUtils.millisFromTimestamp(LocalDateTime.parse("2012-12-12T12:12:10")),
+          getLastEventTime(conf));
+    } else { // partition mode
+      Assertions.assertEquals(DateTimeUtils.millisFromTimestamp(LocalDateTime.parse("2012-12-12T12:12:02")),
+          getLastEventTime(conf));
+    }
+  }
+
+  public long getLastEventTime(Configuration conf) throws IOException {
+    String path = conf.getString(FlinkOptions.PATH);
+    HoodieTableMetaClient metaClient = HoodieTestUtils.init(path);
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
+    HoodieTimeline timeline = activeTimeline.getCommitsTimeline().filterCompletedInstants();
+    HoodieInstant hoodieInstant = timeline.getReverseOrderedInstants()
+        .findFirst().orElse(null);
+    if (hoodieInstant == null) {
+      return -1L;
+    } else {
+      Option<String> eventtime = TimelineUtils.getMetadataValue(metaClient, FlinkOptions.EVENT_TIME_FIELD.key(),
+          hoodieInstant);
+      return Long.parseLong(eventtime.orElseGet(() -> "-1"));
+    }
+  }
+
+  public List<String> getRowDataString(List<Row> rows, boolean partitioned) {
+    List<String> dataBuffer = new ArrayList<>();
+    for (Row row : rows) {
+      String id = (String) row.getField(0);
+      String data = (String) row.getField(1);
+      LocalDateTime ts = (LocalDateTime) row.getField(2);
+      String rowData = String.format("{\"id\": \"%s\", \"data\": \"%s\", \"ts\": \"%s\", \"partition\": \"%s\"}",
+          id, data, ts.toString(), partitioned ? data : "par");
+      dataBuffer.add(rowData);
+    }
+    return dataBuffer;
+  }
+
+  private void testWriteToHoodie(
+      Configuration conf,
+      int parallelism,
+      boolean partitioned,
+      RowType rowType,
+      List<Row> rows) throws Exception {
+
+    StreamExecutionEnvironment execEnv = StreamExecutionEnvironment.getExecutionEnvironment();
+    execEnv.getConfig().disableObjectReuse();
+    execEnv.setParallelism(parallelism);
+    execEnv.setMaxParallelism(parallelism);
+    // set up checkpoint interval
+    execEnv.enableCheckpointing(4000, CheckpointingMode.EXACTLY_ONCE);
+    execEnv.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
+
+    JsonRowDataDeserializationSchema deserializationSchema = new JsonRowDataDeserializationSchema(
+        rowType,
+        InternalTypeInfo.of(rowType),
+        false,
+        true,
+        TimestampFormat.ISO_8601
+    );
+
+    boolean isMor = conf.getString(FlinkOptions.TABLE_TYPE).equals(HoodieTableType.MERGE_ON_READ.name());
+
+    List<String> dataBuffer = getRowDataString(rows, partitioned);
+
+    DataStream<RowData> dataStream;
+    dataStream = execEnv
+        // use continuous file source to trigger checkpoint
+        .addSource(new ContinuousFileSource.BoundedSourceFunction(null, dataBuffer, 1))
+        .name("continuous_file_source")
+        .setParallelism(1)
+        .map(record -> deserializationSchema.deserialize(record.getBytes(StandardCharsets.UTF_8)))
+        .setParallelism(1);
+
+
+    // sink to the hoodie table using the low-level sink API.
+    HoodiePipeline.Builder builder = HoodiePipeline.builder("t_event_sink")
+        .column("id string not null")
+        .column("data string")
+        .column("`ts` timestamp(3)")
+        .column("`partition` string")
+        .pk("id")
+        .partition("partition")
+        .options(conf.toMap());
+
+    builder.sink(dataStream, false);
+    execute(execEnv, isMor, "EventTime_Sink_Test");
+  }
+
+  public void execute(StreamExecutionEnvironment execEnv, boolean isMor, String jobName) throws Exception {
+    if (isMor) {
+      JobClient client = execEnv.executeAsync(jobName);
+      if (client.getJobStatus().get() != JobStatus.FAILED) {
+        try {
+          TimeUnit.SECONDS.sleep(20); // wait long enough for the compaction to finish
+          client.cancel();
+        } catch (Throwable var1) {
+          // ignored
+        }
+      }
+    } else {
+      // wait for the streaming job to finish
+      execEnv.execute(jobName);
+    }
+  }
+}
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
index b83f3cc478..d31a884827 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
@@ -171,8 +171,11 @@ public class StreamWriteFunctionWrapper<I> implements TestFunctionWrapper<I> {
     } else {
       bucketAssignerFunction.processElement(hoodieRecord, null, collector);
       bucketAssignFunctionContext.setCurrentKey(hoodieRecord.getRecordKey());
-      writeFunction.processElement(collector.getVal(), null, null);
+      if (writeFunction.extractTimeStampEnable()) {
+        writeFunction.extractTimestamp(collector.getVal());
+      }
     }
+    writeFunction.processElement(collector.getVal(), null, null);
   }
 
   public WriteMetadataEvent[] getEventBuffer() {
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestDataTypeUtils.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestDataTypeUtils.java
new file mode 100644
index 0000000000..cda252ca5d
--- /dev/null
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestDataTypeUtils.java
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.utils;
+
+import org.apache.flink.table.api.DataTypes;
+import org.apache.hudi.util.DataTypeUtils;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class TestDataTypeUtils {
+  @Test
+  public void testGetAsLong() {
+    long t1 = DataTypeUtils.getAsLong("2012-12-12T12:12:12", DataTypes.TIMESTAMP(3).getLogicalType());
+    assertEquals(1355314332000L, t1);
+
+    long t2 = DataTypeUtils.getAsLong("2012-12-12 12:12:12", DataTypes.TIME().getLogicalType());
+    assertEquals(1355314332000L, t2);
+
+    long t3 = DataTypeUtils.getAsLong("2012-12-12", DataTypes.DATE().getLogicalType());
+    assertEquals(1355270400000L, t3);
+
+    long t4 = DataTypeUtils.getAsLong(100L, DataTypes.BIGINT().getLogicalType());
+    assertEquals(100L, t4);
+
+    long t5 = DataTypeUtils.getAsLong(100, DataTypes.INT().getLogicalType());
+    assertEquals(100, t5);
+  }
+}
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/source/ContinuousFileSource.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/source/ContinuousFileSource.java
index 2830eefef0..1715593077 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/source/ContinuousFileSource.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/source/ContinuousFileSource.java
@@ -133,6 +133,11 @@ public class ContinuousFileSource implements ScanTableSource {
       this.checkpoints = checkpoints;
     }
 
+    public BoundedSourceFunction(Path path, List<String> dataBuffer, int checkpoints) {
+      this(path, checkpoints);
+      this.dataBuffer = dataBuffer;
+    }
+
     @Override
     public void run(SourceContext<String> context) throws Exception {
       if (this.dataBuffer == null) {


[hudi] 38/45: [MINOR] add integrity check of parquet file for HoodieRowDataParquetWriter.

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0d71705ec2276f6524ab97e765563f6c902f35d9
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Fri Dec 9 12:39:39 2022 +0800

    [MINOR] add integrity check of parquet file for HoodieRowDataParquetWriter.
---
 .../java/org/apache/hudi/io/HoodieMergeHandle.java | 22 ++++++----------------
 .../src/main/java/org/apache/hudi/io/IOUtils.java  | 14 ++++++++++++++
 .../io/storage/row/HoodieRowDataParquetWriter.java |  2 ++
 3 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index 88db25bac4..c569acdda6 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -18,6 +18,10 @@
 
 package org.apache.hudi.io;
 
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.Path;
 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.engine.TaskContextSupplier;
 import org.apache.hudi.common.fs.FSUtils;
@@ -34,7 +38,6 @@ import org.apache.hudi.common.model.IOType;
 import org.apache.hudi.common.util.DefaultSizeEstimator;
 import org.apache.hudi.common.util.HoodieRecordSizeEstimator;
 import org.apache.hudi.common.util.Option;
-import org.apache.hudi.common.util.ParquetUtils;
 import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.common.util.collection.ExternalSpillableMap;
 import org.apache.hudi.config.HoodieWriteConfig;
@@ -47,27 +50,19 @@ import org.apache.hudi.io.storage.HoodieFileWriter;
 import org.apache.hudi.keygen.BaseKeyGenerator;
 import org.apache.hudi.keygen.KeyGenUtils;
 import org.apache.hudi.table.HoodieTable;
-
-import org.apache.avro.Schema;
-import org.apache.avro.generic.GenericRecord;
-import org.apache.avro.generic.IndexedRecord;
-import org.apache.hadoop.fs.Path;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
 import javax.annotation.concurrent.NotThreadSafe;
-
 import java.io.IOException;
 import java.util.Collections;
 import java.util.HashSet;
 import java.util.Iterator;
 import java.util.List;
-import java.util.NoSuchElementException;
 import java.util.Map;
+import java.util.NoSuchElementException;
 import java.util.Set;
 
-import static org.apache.hudi.common.model.HoodieFileFormat.PARQUET;
-
 @SuppressWarnings("Duplicates")
 /**
  * Handle to merge incoming records to those in storage.
@@ -450,12 +445,7 @@ public class HoodieMergeHandle<T extends HoodieRecordPayload, I, K, O> extends H
       return;
     }
 
-    // Fast verify the integrity of the parquet file.
-    // only check the readable of parquet metadata.
-    final String extension = FSUtils.getFileExtension(newFilePath.toString());
-    if (PARQUET.getFileExtension().equals(extension)) {
-      new ParquetUtils().readMetadata(hoodieTable.getHadoopConf(), newFilePath);
-    }
+    IOUtils.checkParquetFileVaid(hoodieTable.getHadoopConf(), newFilePath);
 
     long oldNumWrites = 0;
     try {
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/IOUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/IOUtils.java
index 7636384c3a..b231136ece 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/IOUtils.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/IOUtils.java
@@ -18,11 +18,16 @@
 
 package org.apache.hudi.io;
 
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
 import org.apache.hudi.common.config.HoodieConfig;
 import org.apache.hudi.common.engine.EngineProperty;
 import org.apache.hudi.common.engine.TaskContextSupplier;
+import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
 
+import static org.apache.hudi.common.model.HoodieFileFormat.PARQUET;
 import static org.apache.hudi.config.HoodieMemoryConfig.DEFAULT_MAX_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES;
 import static org.apache.hudi.config.HoodieMemoryConfig.DEFAULT_MIN_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES;
 import static org.apache.hudi.config.HoodieMemoryConfig.MAX_MEMORY_FOR_COMPACTION;
@@ -72,4 +77,13 @@ public class IOUtils {
     String fraction = hoodieConfig.getStringOrDefault(MAX_MEMORY_FRACTION_FOR_COMPACTION);
     return getMaxMemoryAllowedForMerge(context, fraction);
   }
+
+  public static void checkParquetFileVaid(Configuration hadoopConf, Path filePath) {
+    // Fast integrity check of the parquet file:
+    // only verify that the parquet metadata is readable.
+    final String extension = FSUtils.getFileExtension(filePath.toString());
+    if (PARQUET.getFileExtension().equals(extension)) {
+      new ParquetUtils().readMetadata(hadoopConf, filePath);
+    }
+  }
 }
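
A minimal sketch of the helper's intended call site, mirroring the HoodieMergeHandle change above; the file path is a placeholder and hoodieTable is assumed to be the current HoodieTable:

    // validate a freshly written base file; readMetadata throws if the parquet footer is unreadable
    Path newFilePath = new Path("/tmp/hudi_table/par1/file-id_1-0-1_20221209123939.parquet");   // placeholder
    IOUtils.checkParquetFileVaid(hoodieTable.getHadoopConf(), newFilePath);
    // non-parquet files (e.g. log files) are skipped because the extension check does not match
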
diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java
index 17b3b6b37c..fd1edaab84 100644
--- a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java
@@ -20,6 +20,7 @@ package org.apache.hudi.io.storage.row;
 
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
+import org.apache.hudi.io.IOUtils;
 import org.apache.hudi.io.storage.HoodieParquetConfig;
 
 import org.apache.flink.table.data.RowData;
@@ -74,5 +75,6 @@ public class HoodieRowDataParquetWriter extends ParquetWriter<RowData>
   @Override
   public void close() throws IOException {
     super.close();
+    IOUtils.checkParquetFileVaid(fs.getConf(), file);
   }
 }


[hudi] 34/45: exclude hudi-kafka-connect & add some api to support FLIP-27 source

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f2256ec94ccf7c6fc68b49ddd351f5b2aaa8fc35
Author: shaoxiong.zhan <sh...@gmail.com>
AuthorDate: Tue Dec 6 20:15:49 2022 +0800

    exclude hudi-kafka-connect & add some api to support FLIP-27 source
---
 .../apache/hudi/configuration/FlinkOptions.java    |  6 ++++
 .../org/apache/hudi/table/HoodieTableSource.java   | 42 +++++++++++++---------
 .../table/format/mor/MergeOnReadInputFormat.java   | 29 +++++++++++++++
 .../table/format/mor/MergeOnReadInputSplit.java    |  8 ++++-
 4 files changed, 68 insertions(+), 17 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index aa1e3297bd..df2c96c8a9 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -200,6 +200,12 @@ public class FlinkOptions extends HoodieConfig {
       .noDefaultValue()
       .withDescription("Parallelism of tasks that do actual read, default is the parallelism of the execution environment");
 
+  public static final ConfigOption<Integer> NUM_RECORDS_PER_BATCH = ConfigOptions
+      .key("num.records_per.batch")
+      .intType()
+      .defaultValue(10000)
+      .withDescription("num records per batch in single split");
+
   public static final ConfigOption<String> SOURCE_AVRO_SCHEMA_PATH = ConfigOptions
       .key("source.avro-schema.path")
       .stringType()
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java
index 6fac5e4b88..ab270f89b0 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java
@@ -115,8 +115,9 @@ public class HoodieTableSource implements
   private final transient HoodieTableMetaClient metaClient;
   private final long maxCompactionMemoryInBytes;
 
-  private final ResolvedSchema schema;
   private final RowType tableRowType;
+  private final String[] schemaFieldNames;
+  private final DataType[] schemaTypes;
   private final Path path;
   private final List<String> partitionKeys;
   private final String defaultPartName;
@@ -135,34 +136,43 @@ public class HoodieTableSource implements
       List<String> partitionKeys,
       String defaultPartName,
       Configuration conf) {
-    this(schema, path, partitionKeys, defaultPartName, conf, null, null, null, null);
+
+    this(schema.getColumnNames().toArray(new String[0]),
+        schema.getColumnDataTypes().toArray(new DataType[0]),
+        (RowType) schema.toPhysicalRowDataType().notNull().getLogicalType(),
+        path, partitionKeys, defaultPartName, conf, null, null, null, null, null);
   }
 
   public HoodieTableSource(
-      ResolvedSchema schema,
+      String[] schemaFieldNames,
+      DataType[] schemaTypes,
+      RowType rowType,
       Path path,
       List<String> partitionKeys,
       String defaultPartName,
       Configuration conf,
+      @Nullable FileIndex fileIndex,
       @Nullable List<Map<String, String>> requiredPartitions,
       @Nullable int[] requiredPos,
       @Nullable Long limit,
-      @Nullable List<ResolvedExpression> filters) {
-    this.schema = schema;
-    this.tableRowType = (RowType) schema.toPhysicalRowDataType().notNull().getLogicalType();
+      @Nullable HoodieTableMetaClient metaClient) {
+    this.schemaFieldNames = schemaFieldNames;
+    this.schemaTypes = schemaTypes;
+    this.tableRowType = rowType;
     this.path = path;
     this.partitionKeys = partitionKeys;
     this.defaultPartName = defaultPartName;
     this.conf = conf;
+    this.fileIndex = fileIndex == null
+        ? FileIndex.instance(this.path, this.conf, this.tableRowType)
+        : fileIndex;
     this.requiredPartitions = requiredPartitions;
     this.requiredPos = requiredPos == null
         ? IntStream.range(0, this.tableRowType.getFieldCount()).toArray()
         : requiredPos;
     this.limit = limit == null ? NO_LIMIT_CONSTANT : limit;
-    this.filters = filters == null ? Collections.emptyList() : filters;
     this.hadoopConf = HadoopConfigurations.getHadoopConf(conf);
-    this.metaClient = StreamerUtil.metaClientForReader(conf, hadoopConf);
-    this.fileIndex = FileIndex.instance(this.path, this.conf, this.tableRowType);
+    this.metaClient = metaClient == null ? StreamerUtil.metaClientForReader(conf, hadoopConf) : metaClient;
     this.maxCompactionMemoryInBytes = StreamerUtil.getMaxCompactionMemoryInBytes(conf);
   }
 
@@ -210,8 +220,8 @@ public class HoodieTableSource implements
 
   @Override
   public DynamicTableSource copy() {
-    return new HoodieTableSource(schema, path, partitionKeys, defaultPartName,
-        conf, requiredPartitions, requiredPos, limit, filters);
+    return new HoodieTableSource(schemaFieldNames, schemaTypes, tableRowType, path, partitionKeys, defaultPartName,
+        conf, fileIndex, requiredPartitions, requiredPos, limit, metaClient);
   }
 
   @Override
@@ -256,8 +266,8 @@ public class HoodieTableSource implements
   }
 
   private DataType getProducedDataType() {
-    String[] schemaFieldNames = this.schema.getColumnNames().toArray(new String[0]);
-    DataType[] schemaTypes = this.schema.getColumnDataTypes().toArray(new DataType[0]);
+    String[] schemaFieldNames = this.schemaFieldNames;
+    DataType[] schemaTypes = this.schemaTypes;
 
     return DataTypes.ROW(Arrays.stream(this.requiredPos)
             .mapToObj(i -> DataTypes.FIELD(schemaFieldNames[i], schemaTypes[i]))
@@ -266,7 +276,7 @@ public class HoodieTableSource implements
   }
 
   private String getSourceOperatorName(String operatorName) {
-    String[] schemaFieldNames = this.schema.getColumnNames().toArray(new String[0]);
+    String[] schemaFieldNames = this.schemaFieldNames;
     List<String> fields = Arrays.stream(this.requiredPos)
         .mapToObj(i -> schemaFieldNames[i])
         .collect(Collectors.toList());
@@ -450,8 +460,8 @@ public class HoodieTableSource implements
 
     return new CopyOnWriteInputFormat(
         FilePathUtils.toFlinkPaths(paths),
-        this.schema.getColumnNames().toArray(new String[0]),
-        this.schema.getColumnDataTypes().toArray(new DataType[0]),
+        this.schemaFieldNames,
+        this.schemaTypes,
         this.requiredPos,
         this.conf.getString(FlinkOptions.PARTITION_DEFAULT_NAME),
         this.limit == NO_LIMIT_CONSTANT ? Long.MAX_VALUE : this.limit, // ParquetInputFormat always uses the limit value
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
index c9b6561bde..a0026d54ca 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
@@ -59,6 +59,7 @@ import org.apache.flink.table.types.logical.RowType;
 import org.apache.flink.types.RowKind;
 
 import java.io.IOException;
+import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Iterator;
@@ -132,6 +133,8 @@ public class MergeOnReadInputFormat
    */
   private boolean emitDelete;
 
+  private final int numRecordsPerBatch;
+
   /**
    * Flag saying whether the input format has been closed.
    */
@@ -153,6 +156,7 @@ public class MergeOnReadInputFormat
     // because we need to
     this.requiredPos = tableState.getRequiredPositions();
     this.limit = limit;
+    this.numRecordsPerBatch = conf.get(FlinkOptions.NUM_RECORDS_PER_BATCH);
     this.emitDelete = emitDelete;
   }
 
@@ -408,6 +412,31 @@ public class MergeOnReadInputFormat
     };
   }
 
+  public Iterator<RowData> readBatch() throws IOException {
+    List<RowData> result = new ArrayList<>(this.numRecordsPerBatch);
+    int remaining = this.numRecordsPerBatch;
+    RowData next = null;
+    while (!this.isClosed() && remaining-- > 0) {
+      if (!this.reachedEnd()) {
+        next = nextRecord(null);
+        result.add(next);
+      } else {
+        close();
+        break;
+      }
+    }
+
+    if (result.isEmpty()) {
+      return null;
+    }
+    return result.iterator();
+  }
+
+  public MergeOnReadInputFormat copy() {
+    return new MergeOnReadInputFormat(this.conf, this.tableState, this.fieldTypes,
+        this.defaultPartName, this.limit, this.emitDelete);
+  }
+
   private ClosableIterator<RowData> getUnMergedLogFileIterator(MergeOnReadInputSplit split) {
     final Schema tableSchema = new Schema.Parser().parse(tableState.getAvroSchema());
     final Schema requiredSchema = new Schema.Parser().parse(tableState.getRequiredAvroSchema());
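
A minimal sketch of driving the new batch read API above, assuming `format` has already been opened for a MergeOnReadInputSplit and `emit` is a hypothetical downstream handler:

    // set before the format is created: the batch size is captured in the constructor (default 10000)
    conf.setInteger(FlinkOptions.NUM_RECORDS_PER_BATCH, 5000);

    // drain the split batch by batch; readBatch() returns null once the format is exhausted and closed
    Iterator<RowData> batch;
    while ((batch = format.readBatch()) != null) {
      while (batch.hasNext()) {
        emit(batch.next());   // hypothetical downstream emit
      }
    }

    // copy() yields a fresh, un-opened format instance, e.g. one per FLIP-27 split reader
    MergeOnReadInputFormat another = format.copy();
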
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputSplit.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputSplit.java
index cde646e41f..f3bc3361a6 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputSplit.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputSplit.java
@@ -21,6 +21,7 @@ package org.apache.hudi.table.format.mor;
 import org.apache.hudi.common.table.log.InstantRange;
 import org.apache.hudi.common.util.Option;
 
+import org.apache.flink.api.connector.source.SourceSplit;
 import org.apache.flink.core.io.InputSplit;
 
 import javax.annotation.Nullable;
@@ -30,7 +31,7 @@ import java.util.List;
 /**
  * Represents an input split of source, actually a data bucket.
  */
-public class MergeOnReadInputSplit implements InputSplit {
+public class MergeOnReadInputSplit implements InputSplit, SourceSplit {
   private static final long serialVersionUID = 1L;
 
   private static final long NUM_NO_CONSUMPTION = 0L;
@@ -78,6 +79,11 @@ public class MergeOnReadInputSplit implements InputSplit {
     this.fileId = fileId;
   }
 
+  @Override
+  public String splitId() {
+    return getFileId();
+  }
+
   public Option<String> getBasePath() {
     return basePath;
   }


[hudi] 19/45: add log to print scanInternal's logFilePath

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ecd39e3ad76b92a4f1dd0e18beed146736dc0592
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Wed Nov 2 12:26:49 2022 +0800

    add log to print scanInternal's logFilePath
---
 .../table/log/AbstractHoodieLogRecordReader.java   | 41 ++++++++++++----------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
index 4566b1f5cd..eaca33ddcf 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
@@ -72,7 +72,7 @@ import static org.apache.hudi.common.table.log.block.HoodieLogBlock.HoodieLogBlo
 /**
  * Implements logic to scan log blocks and expose valid and deleted log records to subclass implementation. Subclass is
  * free to either apply merging or expose raw data back to the caller.
- *
+ * <p>
  * NOTE: If readBlockLazily is turned on, does not merge, instead keeps reading log blocks and merges everything at once
  * This is an optimization to avoid seek() back and forth to read new block (forward seek()) and lazily read content of
  * seen block (reverse and forward seek()) during merge | | Read Block 1 Metadata | | Read Block 1 Data | | | Read Block
@@ -208,6 +208,8 @@ public abstract class AbstractHoodieLogRecordReader {
     HoodieTimeline commitsTimeline = this.hoodieTableMetaClient.getCommitsTimeline();
     HoodieTimeline completedInstantsTimeline = commitsTimeline.filterCompletedInstants();
     HoodieTimeline inflightInstantsTimeline = commitsTimeline.filterInflights();
+    HoodieLogFile logFile;
+    Path logFilePath = null;
     try {
       // Get the key field based on populate meta fields config
       // and the table type
@@ -216,12 +218,13 @@ public abstract class AbstractHoodieLogRecordReader {
       // Iterate over the paths
       boolean enableRecordLookups = !forceFullScan;
       logFormatReaderWrapper = new HoodieLogFormatReader(fs,
-          logFilePaths.stream().map(logFile -> new HoodieLogFile(new Path(logFile))).collect(Collectors.toList()),
+          logFilePaths.stream().map(log -> new HoodieLogFile(new Path(log))).collect(Collectors.toList()),
           readerSchema, readBlocksLazily, reverseReader, bufferSize, enableRecordLookups, keyField, internalSchema);
 
       Set<HoodieLogFile> scannedLogFiles = new HashSet<>();
       while (logFormatReaderWrapper.hasNext()) {
-        HoodieLogFile logFile = logFormatReaderWrapper.getLogFile();
+        logFile = logFormatReaderWrapper.getLogFile();
+        logFilePath = logFile.getPath();
         LOG.info("Scanning log file " + logFile);
         scannedLogFiles.add(logFile);
         totalLogFiles.set(scannedLogFiles.size());
@@ -250,7 +253,7 @@ public abstract class AbstractHoodieLogRecordReader {
           case HFILE_DATA_BLOCK:
           case AVRO_DATA_BLOCK:
           case PARQUET_DATA_BLOCK:
-            LOG.info("Reading a data block from file " + logFile.getPath() + " at instant "
+            LOG.info("Reading a data block from file " + logFilePath + " at instant "
                 + logBlock.getLogBlockHeader().get(INSTANT_TIME));
             if (isNewInstantBlock(logBlock) && !readBlocksLazily) {
               // If this is an avro data block belonging to a different commit/instant,
@@ -261,7 +264,7 @@ public abstract class AbstractHoodieLogRecordReader {
             currentInstantLogBlocks.push(logBlock);
             break;
           case DELETE_BLOCK:
-            LOG.info("Reading a delete block from file " + logFile.getPath());
+            LOG.info("Reading a delete block from file " + logFilePath);
             if (isNewInstantBlock(logBlock) && !readBlocksLazily) {
               // If this is a delete data block belonging to a different commit/instant,
               // then merge the last blocks and records into the main result
@@ -283,7 +286,7 @@ public abstract class AbstractHoodieLogRecordReader {
             // written per ingestion batch for a file but in reality we need to rollback (B1 & B2)
             // The following code ensures the same rollback block (R1) is used to rollback
             // both B1 & B2
-            LOG.info("Reading a command block from file " + logFile.getPath());
+            LOG.info("Reading a command block from file " + logFilePath);
             // This is a command block - take appropriate action based on the command
             HoodieCommandBlock commandBlock = (HoodieCommandBlock) logBlock;
             String targetInstantForCommandBlock =
@@ -302,23 +305,23 @@ public abstract class AbstractHoodieLogRecordReader {
                   HoodieLogBlock lastBlock = currentInstantLogBlocks.peek();
                   // handle corrupt blocks separately since they may not have metadata
                   if (lastBlock.getBlockType() == CORRUPT_BLOCK) {
-                    LOG.info("Rolling back the last corrupted log block read in " + logFile.getPath());
+                    LOG.info("Rolling back the last corrupted log block read in " + logFilePath);
                     currentInstantLogBlocks.pop();
                     numBlocksRolledBack++;
                   } else if (targetInstantForCommandBlock.contentEquals(lastBlock.getLogBlockHeader().get(INSTANT_TIME))) {
                     // rollback last data block or delete block
-                    LOG.info("Rolling back the last log block read in " + logFile.getPath());
+                    LOG.info("Rolling back the last log block read in " + logFilePath);
                     currentInstantLogBlocks.pop();
                     numBlocksRolledBack++;
                   } else if (!targetInstantForCommandBlock
                       .contentEquals(currentInstantLogBlocks.peek().getLogBlockHeader().get(INSTANT_TIME))) {
                     // invalid or extra rollback block
                     LOG.warn("TargetInstantTime " + targetInstantForCommandBlock
-                        + " invalid or extra rollback command block in " + logFile.getPath());
+                        + " invalid or extra rollback command block in " + logFilePath);
                     break;
                   } else {
                     // this should not happen ideally
-                    LOG.warn("Unable to apply rollback command block in " + logFile.getPath());
+                    LOG.warn("Unable to apply rollback command block in " + logFilePath);
                   }
                 }
                 LOG.info("Number of applied rollback blocks " + numBlocksRolledBack);
@@ -328,7 +331,7 @@ public abstract class AbstractHoodieLogRecordReader {
             }
             break;
           case CORRUPT_BLOCK:
-            LOG.info("Found a corrupt block in " + logFile.getPath());
+            LOG.info("Found a corrupt block in " + logFilePath);
             totalCorruptBlocks.incrementAndGet();
             // If there is a corrupt block - we will assume that this was the next data block
             currentInstantLogBlocks.push(logBlock);
@@ -345,11 +348,11 @@ public abstract class AbstractHoodieLogRecordReader {
       // Done
       progress = 1.0f;
     } catch (IOException e) {
-      LOG.error("Got IOException when reading log file", e);
-      throw new HoodieIOException("IOException when reading log file ", e);
+      LOG.error("Got IOException when reading log file: " + logFilePath, e);
+      throw new HoodieIOException("IOException when reading log file: " + logFilePath, e);
     } catch (Exception e) {
-      LOG.error("Got exception when reading log file", e);
-      throw new HoodieException("Exception when reading log file ", e);
+      LOG.error("Got exception when reading log file: " + logFilePath, e);
+      throw new HoodieException("Exception when reading log file: " + logFilePath, e);
     } finally {
       try {
         if (null != logFormatReaderWrapper) {
@@ -423,10 +426,10 @@ public abstract class AbstractHoodieLogRecordReader {
    * @return HoodieRecord created from the IndexedRecord
    */
   protected HoodieAvroRecord<?> createHoodieRecord(final IndexedRecord rec, final HoodieTableConfig hoodieTableConfig,
-                                               final String payloadClassFQN, final String preCombineField,
-                                               final boolean withOperationField,
-                                               final Option<Pair<String, String>> simpleKeyGenFields,
-                                               final Option<String> partitionName) {
+                                                   final String payloadClassFQN, final String preCombineField,
+                                                   final boolean withOperationField,
+                                                   final Option<Pair<String, String>> simpleKeyGenFields,
+                                                   final Option<String> partitionName) {
     if (this.populateMetaFields) {
       return SpillableMapUtils.convertToHoodieRecordPayload((GenericRecord) rec, payloadClassFQN,
           preCombineField, withOperationField);

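The diff above caches the scanned file in logFilePath so that every log line and the wrapped IOException/HoodieException name the offending log file. A minimal stand-alone sketch of the same pattern; the class and method names below are illustrative, not Hudi APIs:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class ScanWithContext {
      private Path logFilePath; // remembered so errors can name the file being scanned

      public void scanAll(List<Path> files) {
        for (Path file : files) {
          logFilePath = file;
          try {
            Files.readAllBytes(file); // stand-in for the real block-by-block scan
          } catch (IOException e) {
            throw new UncheckedIOException("IOException when reading log file: " + logFilePath, e);
          }
        }
      }
    }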

[hudi] 42/45: improve checkstyle

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 79abc24265debcd0ce4a57adbd1ca33e8591d1a4
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Thu Dec 15 16:45:05 2022 +0800

    improve checkstyle
---
 dev/tencent-install.sh |  5 +++--
 dev/tencent-release.sh |  1 +
 pom.xml                | 10 ++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/dev/tencent-install.sh b/dev/tencent-install.sh
index 1e34f40440..173ca06671 100644
--- a/dev/tencent-install.sh
+++ b/dev/tencent-install.sh
@@ -40,6 +40,7 @@ echo "Preparing source for $tagrc"
 # change version
 echo "Change version for ${version}"
 mvn versions:set -DnewVersion=${version} -DgenerateBackupPom=false -s dev/settings.xml -U
+mvn -N versions:update-child-modules
 mvn versions:commit -s dev/settings.xml -U
 
 function git_push() {
@@ -118,9 +119,9 @@ function deploy_spark() {
   FLINK_VERSION=$3
 
   if [ ${release_repo} = "Y" ]; then
-    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
   else
-    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
   fi
 
   #  INSTALL_OPTIONS="-U -Drat.skip=true -Djacoco.skip=true -Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -DskipTests -s dev/settings.xml -T 2.5C"
diff --git a/dev/tencent-release.sh b/dev/tencent-release.sh
index b788d62dc7..54631f5c0f 100644
--- a/dev/tencent-release.sh
+++ b/dev/tencent-release.sh
@@ -40,6 +40,7 @@ echo "Preparing source for $tagrc"
 # change version
 echo "Change version for ${version}"
 mvn versions:set -DnewVersion=${version} -DgenerateBackupPom=false -s dev/settings.xml -U
+mvn -N versions:update-child-modules
 mvn versions:commit -s dev/settings.xml -U
 
 # create version.txt for this release
diff --git a/pom.xml b/pom.xml
index 01be2c1f89..230b338df6 100644
--- a/pom.xml
+++ b/pom.xml
@@ -583,6 +583,16 @@
             </execution>
           </executions>
         </plugin>
+
+        <!-- One-click update submodule version number -->
+        <plugin>
+          <groupId>org.codehaus.mojo</groupId>
+          <artifactId>versions-maven-plugin</artifactId>
+          <version>2.7</version>
+          <configuration>
+            <generateBackupPoms>false</generateBackupPoms>
+          </configuration>
+        </plugin>
       </plugins>
     </pluginManagement>
   </build>


[hudi] 41/45: fix read log not exist

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4fe2aec44acefc7ec857836b54af69dce5f41bda
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Tue Dec 13 14:52:03 2022 +0800

    fix read log not exist
---
 .../org/apache/hudi/common/table/log/HoodieLogFormatReader.java   | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
index c48107e392..7f67c76870 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
@@ -67,8 +67,12 @@ public class HoodieLogFormatReader implements HoodieLogFormat.Reader {
     this.internalSchema = internalSchema == null ? InternalSchema.getEmptyInternalSchema() : internalSchema;
     if (logFiles.size() > 0) {
       HoodieLogFile nextLogFile = logFiles.remove(0);
-      this.currentReader = new HoodieLogFileReader(fs, nextLogFile, readerSchema, bufferSize, readBlocksLazily, false,
-          enableRecordLookups, recordKeyField, internalSchema);
+      if (fs.exists(nextLogFile.getPath())) {
+        this.currentReader = new HoodieLogFileReader(fs, nextLogFile, readerSchema, bufferSize, readBlocksLazily, false,
+            enableRecordLookups, recordKeyField, internalSchema);
+      } else {
+        LOG.warn("File does not exist: " + nextLogFile.getPath());
+      }
     }
   }
 

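The guard above prevents scan failures when a listed log file has already been removed (for example by cleaning): the reader is only created when the file still exists, otherwise a warning is logged. A rough stand-alone sketch of that guard, using only the Hadoop FileSystem API and log4j (names are illustrative):

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.log4j.LogManager;
    import org.apache.log4j.Logger;

    public class LogFileGuard {
      private static final Logger LOG = LogManager.getLogger(LogFileGuard.class);

      // Returns an open stream, or null when the file has already been removed.
      public static InputStream openIfExists(FileSystem fs, Path logFile) throws IOException {
        if (!fs.exists(logFile)) {
          LOG.warn("File does not exist: " + logFile);
          return null;
        }
        return fs.open(logFile);
      }
    }

Returning null instead of throwing mirrors the patch, which simply leaves the current reader unset and continues.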

[hudi] 12/45: opt procedure backup_invalid_parquet

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 895f260983f3f6875c862b1d8e62c04aed09ff17
Author: shaoxiong.zhan <sh...@gmail.com>
AuthorDate: Thu Sep 22 20:18:37 2022 +0800

    opt procedure backup_invalid_parquet
    
    
    (cherry picked from commit 422f1e53)
    
    903daba5 opt procedure backup_invalid_parquet
---
 .../procedures/BackupInvalidParquetProcedure.scala        | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala
index fbbb1247fa..5c1234b7a2 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BackupInvalidParquetProcedure.scala
@@ -21,15 +21,18 @@ import org.apache.hadoop.fs.Path
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.common.config.SerializableConfiguration
 import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieFileFormat
+import org.apache.hudi.common.util.BaseFileUtils
 import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
 import org.apache.parquet.hadoop.ParquetFileReader
 import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.internal.Logging
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
 
 import java.util.function.Supplier
 
-class BackupInvalidParquetProcedure extends BaseProcedure with ProcedureBuilder {
+class BackupInvalidParquetProcedure extends BaseProcedure with ProcedureBuilder with Logging {
   private val PARAMETERS = Array[ProcedureParameter](
     ProcedureParameter.required(0, "path", DataTypes.StringType, None)
   )
@@ -62,9 +65,15 @@ class BackupInvalidParquetProcedure extends BaseProcedure with ProcedureBuilder
         val filePath = status.getPath
         var isInvalid = false
         if (filePath.toString.endsWith(".parquet")) {
-          try ParquetFileReader.readFooter(serHadoopConf.get(), filePath, SKIP_ROW_GROUPS).getFileMetaData catch {
+          try {
+            // check footer
+            ParquetFileReader.readFooter(serHadoopConf.get(), filePath, SKIP_ROW_GROUPS).getFileMetaData
+
+            // check row group
+            BaseFileUtils.getInstance(HoodieFileFormat.PARQUET).readAvroRecords(serHadoopConf.get(), filePath)
+          } catch {
             case e: Exception =>
-              isInvalid = e.getMessage.contains("is not a Parquet file")
+              isInvalid = true
               filePath.getFileSystem(serHadoopConf.get()).rename(filePath, new Path(backupPath, filePath.getName))
           }
         }

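The broadened check treats any read failure as corruption: reading only the footer catches files that are not parquet at all, while additionally reading the row groups catches files whose data pages are truncated or damaged, and either failure moves the file into the backup directory. A stand-alone sketch of the same idea; method and parameter names are illustrative, and the row-group read is indicated with a comment rather than pulling in Hudi's BaseFileUtils:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;

    import static org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS;

    public class ParquetBackupSketch {
      // Returns true when the file looks readable; on failure it is renamed into backupDir.
      public static boolean isParquetValid(Configuration conf, Path file, Path backupDir) {
        try {
          // 1. footer check: fails fast for files that are not valid parquet at all
          ParquetFileReader.readFooter(conf, file, SKIP_ROW_GROUPS).getFileMetaData();
          // 2. row-group check would follow here, e.g. reading all records back as Avro,
          //    which is what BaseFileUtils...readAvroRecords(conf, file) does in the procedure above
          return true;
        } catch (Exception e) {
          try {
            FileSystem fs = file.getFileSystem(conf);
            fs.rename(file, new Path(backupDir, file.getName()));
          } catch (Exception ignored) {
            // best effort: leave the broken file in place if the rename fails
          }
          return false;
        }
      }
    }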

[hudi] 14/45: [HUDI-5041] Fix lock metric register confict error (#6968)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 8d692f38c14f8d6e6fcb2f5fa1b159d0e5ade678
Author: Bingeng Huang <30...@qq.com>
AuthorDate: Thu Oct 20 01:32:45 2022 +0800

    [HUDI-5041] Fix lock metric register confict error (#6968)
    
    Co-authored-by: hbg <bi...@shopee.com>
    (cherry picked from commit e6eb4e6f683ca9f66cdcca2d63eeb5a1a8d81241)
---
 .../transaction/lock/metrics/HoodieLockMetrics.java   | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/metrics/HoodieLockMetrics.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/metrics/HoodieLockMetrics.java
index 6ea7a1ae14..c33a86bfbe 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/metrics/HoodieLockMetrics.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/metrics/HoodieLockMetrics.java
@@ -18,15 +18,15 @@
 
 package org.apache.hudi.client.transaction.lock.metrics;
 
-import org.apache.hudi.common.util.HoodieTimer;
-import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.metrics.Metrics;
-
 import com.codahale.metrics.Counter;
 import com.codahale.metrics.MetricRegistry;
 import com.codahale.metrics.SlidingWindowReservoir;
 import com.codahale.metrics.Timer;
 
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metrics.Metrics;
+
 import java.util.concurrent.TimeUnit;
 
 public class HoodieLockMetrics {
@@ -46,6 +46,7 @@ public class HoodieLockMetrics {
   private transient Counter failedLockAttempts;
   private transient Timer lockDuration;
   private transient Timer lockApiRequestDuration;
+  private static final Object REGISTRY_LOCK = new Object();
 
   public HoodieLockMetrics(HoodieWriteConfig writeConfig) {
     this.isMetricsEnabled = writeConfig.isLockingMetricsEnabled();
@@ -69,10 +70,12 @@ public class HoodieLockMetrics {
 
   private Timer createTimerForMetrics(MetricRegistry registry, String metric) {
     String metricName = getMetricsName(metric);
-    if (registry.getMetrics().get(metricName) == null) {
-      lockDuration = new Timer(new SlidingWindowReservoir(keepLastNtimes));
-      registry.register(metricName, lockDuration);
-      return lockDuration;
+    synchronized (REGISTRY_LOCK) {
+      if (registry.getMetrics().get(metricName) == null) {
+        lockDuration = new Timer(new SlidingWindowReservoir(keepLastNtimes));
+        registry.register(metricName, lockDuration);
+        return lockDuration;
+      }
     }
     return (Timer) registry.getMetrics().get(metricName);
   }

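The race this fixes: two writers building HoodieLockMetrics at the same time can both find the metric name missing and both call register(), and Dropwizard then fails with an "A metric named ... already exists" error. A condensed sketch of the guarded register-or-reuse pattern (class and method names are illustrative):

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.SlidingWindowReservoir;
    import com.codahale.metrics.Timer;

    public class SafeTimerRegistration {
      private static final Object REGISTRY_LOCK = new Object();

      public static Timer getOrCreateTimer(MetricRegistry registry, String metricName, int windowSize) {
        synchronized (REGISTRY_LOCK) {
          Timer existing = (Timer) registry.getMetrics().get(metricName);
          if (existing != null) {
            return existing;
          }
          Timer timer = new Timer(new SlidingWindowReservoir(windowSize));
          registry.register(metricName, timer);
          return timer;
        }
      }
    }

On recent Dropwizard Metrics versions, registry.timer(name, supplier) offers an atomic get-or-register that achieves much the same without an explicit lock.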

[hudi] 26/45: optimize schema settings

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 00c3443cb4112fbd94281d52465cccfbc00516c5
Author: superche <su...@tencent.com>
AuthorDate: Fri Nov 18 16:24:19 2022 +0800

    optimize schema settings
---
 .../src/main/java/org/apache/hudi/util/HoodiePipeline.java | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java
index f95367c836..0e7e262aeb 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.util;
 
+import org.apache.flink.table.api.Schema;
+import org.apache.flink.table.utils.EncodingUtils;
 import org.apache.hudi.adapter.Utils;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.table.HoodieTableFactory;
@@ -125,6 +127,18 @@ public class HoodiePipeline {
       return this;
     }
 
+    public Builder schema(Schema schema) {
+      for (Schema.UnresolvedColumn column : schema.getColumns()) {
+        column(column.toString());
+      }
+
+      if (schema.getPrimaryKey().isPresent()) {
+        pk(schema.getPrimaryKey().get().getColumnNames().stream().map(EncodingUtils::escapeIdentifier).collect(Collectors.joining(", ")));
+      }
+
+      return this;
+    }
+
     /**
      * Add a config option.
      */

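The new schema(Schema) method lets a caller hand the builder an existing Flink Table API Schema instead of declaring each column and the primary key separately. A rough usage sketch; the table name, columns, path and option values are made up for illustration:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.api.Schema;
    import org.apache.hudi.configuration.FlinkOptions;
    import org.apache.hudi.util.HoodiePipeline;

    public class HoodiePipelineSchemaExample {
      public static HoodiePipeline.Builder buildPipeline() {
        Schema schema = Schema.newBuilder()
            .column("uuid", DataTypes.STRING())
            .column("name", DataTypes.STRING())
            .column("ts", DataTypes.TIMESTAMP(3))
            .primaryKey("uuid")
            .build();

        Map<String, String> options = new HashMap<>();
        options.put(FlinkOptions.PATH.key(), "/tmp/hudi/t1");        // illustrative path
        options.put(FlinkOptions.TABLE_TYPE.key(), "MERGE_ON_READ");

        return HoodiePipeline.builder("t1")
            .schema(schema)   // columns and primary key are derived from the Flink Schema
            .options(options);
      }
    }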

[hudi] 05/45: [HUDI-4475] fix create table with not exists hoodie properties file

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 553bb9eab475a05bb91e150ced3dc9c18df1cd7e
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Tue Aug 23 18:55:25 2022 +0800

    [HUDI-4475] fix create table with not exists hoodie properties file
---
 .../main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
index 025a224373..775f90dae1 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
@@ -24,7 +24,7 @@ import org.apache.hudi.common.config.{DFSPropertiesConfiguration, HoodieMetadata
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.model.HoodieRecord
 import org.apache.hudi.common.table.timeline.{HoodieActiveTimeline, HoodieInstantTimeGenerator}
-import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.common.util.PartitionPathEncodeUtils
 import org.apache.hudi.{AvroConversionUtils, DataSourceOptionsHelper, DataSourceReadOptions, SparkAdapterSupport}
 import org.apache.spark.api.java.JavaSparkContext
@@ -227,7 +227,9 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
     val basePath = new Path(tablePath)
     val fs = basePath.getFileSystem(conf)
     val metaPath = new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME)
-    fs.exists(metaPath)
+    val cfgPath = new Path(metaPath, HoodieTableConfig.HOODIE_PROPERTIES_FILE)
+    val backupCfgPath = new Path(metaPath, HoodieTableConfig.HOODIE_PROPERTIES_FILE_BACKUP)
+    fs.exists(metaPath) && (fs.exists(cfgPath) || fs.exists(backupCfgPath))
   }
 
   /**

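With this fix a path only counts as an existing Hudi table when the .hoodie folder and either hoodie.properties or its backup are present, so a metadata folder missing the properties file is no longer mistaken for a live table during CREATE TABLE. An equivalent stand-alone sketch in Java, with the literal file names written out instead of the HoodieTableMetaClient/HoodieTableConfig constants:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TableExistenceCheck {
      public static boolean tableExistsInPath(Configuration conf, String tablePath) throws IOException {
        Path basePath = new Path(tablePath);
        FileSystem fs = basePath.getFileSystem(conf);
        Path metaPath = new Path(basePath, ".hoodie");                        // METAFOLDER_NAME
        Path cfgPath = new Path(metaPath, "hoodie.properties");               // HOODIE_PROPERTIES_FILE
        Path backupCfgPath = new Path(metaPath, "hoodie.properties.backup");  // HOODIE_PROPERTIES_FILE_BACKUP
        return fs.exists(metaPath) && (fs.exists(cfgPath) || fs.exists(backupCfgPath));
      }
    }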

[hudi] 03/45: fix cherry pick err

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d785b41f01004897e0d9c2882a58f79827f5d9fe
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Sun Oct 23 17:09:49 2022 +0800

    fix cherry pick err
---
 .../hudi/metrics/zhiyan/ZhiyanMetricsReporter.java |  6 -----
 .../apache/hudi/configuration/FlinkOptions.java    | 31 ++++++++++++++--------
 .../main/java/org/apache/hudi/DataSourceUtils.java |  5 ++++
 pom.xml                                            | 12 ++++-----
 4 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java
index 323fe17106..6b820547a0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanMetricsReporter.java
@@ -23,7 +23,6 @@ import com.codahale.metrics.MetricRegistry;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.metrics.MetricsReporter;
 
-import java.io.Closeable;
 import java.util.concurrent.TimeUnit;
 
 public class ZhiyanMetricsReporter extends MetricsReporter {
@@ -54,11 +53,6 @@ public class ZhiyanMetricsReporter extends MetricsReporter {
     reporter.report();
   }
 
-  @Override
-  public Closeable getReporter() {
-    return reporter;
-  }
-
   @Override
   public void stop() {
     reporter.stop();
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index 4a298839fb..31c8b554c0 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -416,7 +416,7 @@ public class FlinkOptions extends HoodieConfig {
       .key("write.bucket_assign.tasks")
       .intType()
       .noDefaultValue()
-      .withDescription("Parallelism of tasks that do bucket assign, default is the parallelism of the execution environment");
+      .withDescription("Parallelism of tasks that do bucket assign, default same as the write task parallelism");
 
   public static final ConfigOption<Integer> WRITE_TASKS = ConfigOptions
       .key("write.tasks")
@@ -585,7 +585,7 @@ public class FlinkOptions extends HoodieConfig {
       .stringType()
       .defaultValue(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name())
       .withDescription("Clean policy to manage the Hudi table. Available option: KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS."
-          +  "Default is KEEP_LATEST_COMMITS.");
+          + "Default is KEEP_LATEST_COMMITS.");
 
   public static final ConfigOption<Integer> CLEAN_RETAIN_COMMITS = ConfigOptions
       .key("clean.retain_commits")
@@ -594,6 +594,14 @@ public class FlinkOptions extends HoodieConfig {
       .withDescription("Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled).\n"
           + "This also directly translates into how much you can incrementally pull on this table, default 30");
 
+  public static final ConfigOption<Integer> CLEAN_RETAIN_HOURS = ConfigOptions
+      .key("clean.retain_hours")
+      .intType()
+      .defaultValue(24)// default 24 hours
+      .withDescription("Number of hours for which commits need to be retained. This config provides a more flexible option as"
+          + "compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group,"
+          + " corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.");
+
   public static final ConfigOption<Integer> CLEAN_RETAIN_FILE_VERSIONS = ConfigOptions
       .key("clean.retain_file_versions")
       .intType()
@@ -657,7 +665,7 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> CLUSTERING_PLAN_PARTITION_FILTER_MODE_NAME = ConfigOptions
       .key("clustering.plan.partition.filter.mode")
       .stringType()
-      .defaultValue("NONE")
+      .defaultValue(ClusteringPlanPartitionFilterMode.NONE.name())
       .withDescription("Partition filter mode used in the creation of clustering plan. Available values are - "
           + "NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate."
           + "RECENT_DAYS: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"
@@ -665,16 +673,16 @@ public class FlinkOptions extends HoodieConfig {
           + "SELECTED_PARTITIONS: keep partitions that are in the specified range ['" + PARTITION_FILTER_BEGIN_PARTITION.key() + "', '"
           + PARTITION_FILTER_END_PARTITION.key() + "'].");
 
-  public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES = ConfigOptions
+  public static final ConfigOption<Long> CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES = ConfigOptions
       .key("clustering.plan.strategy.target.file.max.bytes")
-      .intType()
-      .defaultValue(1024 * 1024 * 1024) // default 1 GB
+      .longType()
+      .defaultValue(1024 * 1024 * 1024L) // default 1 GB
       .withDescription("Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GB");
 
-  public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT = ConfigOptions
+  public static final ConfigOption<Long> CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT = ConfigOptions
       .key("clustering.plan.strategy.small.file.limit")
-      .intType()
-      .defaultValue(600) // default 600 MB
+      .longType()
+      .defaultValue(600L) // default 600 MB
       .withDescription("Files smaller than the size specified here are candidates for clustering, default 600 MB");
 
   public static final ConfigOption<Integer> CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST = ConfigOptions
@@ -698,6 +706,7 @@ public class FlinkOptions extends HoodieConfig {
   // ------------------------------------------------------------------------
   //  Hive Sync Options
   // ------------------------------------------------------------------------
+
   public static final ConfigOption<Boolean> HIVE_SYNC_ENABLED = ConfigOptions
       .key("hive_sync.enable")
       .booleanType()
@@ -725,8 +734,8 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> HIVE_SYNC_MODE = ConfigOptions
       .key("hive_sync.mode")
       .stringType()
-      .defaultValue("jdbc")
-      .withDescription("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'jdbc'");
+      .defaultValue(HiveSyncMode.HMS.name())
+      .withDescription("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'hms'");
 
   public static final ConfigOption<String> HIVE_SYNC_USERNAME = ConfigOptions
       .key("hive_sync.username")
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
index 0df3a0c8da..2b7d8f49b4 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
@@ -194,6 +194,11 @@ public class DataSourceUtils {
         .withProps(parameters).build();
   }
 
+  public static SparkRDDWriteClient createHoodieClient(JavaSparkContext jssc, String schemaStr, String basePath,
+                                                       String tblName, Map<String, String> parameters) {
+    return createHoodieClient(jssc, schemaStr, basePath, "default", tblName, parameters);
+  }
+
   public static SparkRDDWriteClient createHoodieClient(JavaSparkContext jssc, String schemaStr, String basePath,
                                                        String dbName, String tblName, Map<String, String> parameters) {
     return new SparkRDDWriteClient<>(new HoodieSparkEngineContext(jssc), createHoodieConfig(schemaStr, basePath, dbName, tblName, parameters));
diff --git a/pom.xml b/pom.xml
index 159ae2a841..60c13c8f07 100644
--- a/pom.xml
+++ b/pom.xml
@@ -375,12 +375,12 @@
               <rules>
                 <bannedDependencies>
                   <excludes>
-                    <exclude>org.slf4j:slf4j-log4j12</exclude>
-                    <exclude>org.sl4fj:slf4j-simple</exclude>
-                    <exclude>org.sl4fj:slf4j-jdk14</exclude>
-                    <exclude>org.sl4fj:slf4j-nop</exclude>
-                    <exclude>org.sl4fj:slf4j-jcl</exclude>
-                    <exclude>log4j:log4j</exclude>
+<!--                    <exclude>org.slf4j:slf4j-log4j12</exclude>-->
+<!--                    <exclude>org.sl4fj:slf4j-simple</exclude>-->
+<!--                    <exclude>org.sl4fj:slf4j-jdk14</exclude>-->
+<!--                    <exclude>org.sl4fj:slf4j-nop</exclude>-->
+<!--                    <exclude>org.sl4fj:slf4j-jcl</exclude>-->
+<!--                    <exclude>log4j:log4j</exclude>-->
                     <exclude>ch.qos.logback:logback-classic</exclude>
                     <!-- NOTE: We're banning any HBase deps versions other than the approved ${hbase.version},
                                which is aimed at preventing the classpath collisions w/ transitive deps usually) -->


[hudi] 09/45: [MINOR] Adapt to tianqiong spark

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f80900d91cd134cc2150d4e3ee6d85b8b00c2ad7
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Mon Oct 24 17:18:19 2022 +0800

    [MINOR] Adapt to tianqiong spark
---
 .../datasources/Spark31NestedSchemaPruning.scala   | 24 ++++++++++++++--------
 pom.xml                                            | 10 ++++-----
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/Spark31NestedSchemaPruning.scala b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/Spark31NestedSchemaPruning.scala
index 1b29c428bb..76cdb443b4 100644
--- a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/Spark31NestedSchemaPruning.scala
+++ b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/execution/datasources/Spark31NestedSchemaPruning.scala
@@ -17,15 +17,15 @@
 
 package org.apache.spark.sql.execution.datasources
 
-import org.apache.hudi.{HoodieBaseRelation, SparkAdapterSupport}
-import org.apache.spark.sql.HoodieSpark3CatalystPlanUtils
-import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet, Expression, NamedExpression, ProjectionOverSchema}
+import org.apache.hudi.HoodieBaseRelation
+import org.apache.spark.sql.catalyst.expressions.{Alias, And, Attribute, AttributeReference, Expression, NamedExpression, ProjectionOverSchema}
 import org.apache.spark.sql.catalyst.planning.PhysicalOperation
-import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LeafNode, LogicalPlan, Project}
 import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
 import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
-import org.apache.spark.sql.util.SchemaUtils.restoreOriginalOutputNames
 
 /**
  * Prunes unnecessary physical columns given a [[PhysicalOperation]] over a data source relation.
@@ -86,10 +86,8 @@ class Spark31NestedSchemaPruning extends Rule[LogicalPlan] {
       // each schemata, assuming the fields in prunedDataSchema are a subset of the fields
       // in dataSchema.
       if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
-        val planUtils = SparkAdapterSupport.sparkAdapter.getCatalystPlanUtils.asInstanceOf[HoodieSpark3CatalystPlanUtils]
-
         val prunedRelation = outputRelationBuilder(prunedDataSchema)
-        val projectionOverSchema = planUtils.projectOverSchema(prunedDataSchema, AttributeSet(output))
+        val projectionOverSchema = ProjectionOverSchema(prunedDataSchema)
 
         Some(buildNewProjection(projects, normalizedProjects, normalizedFilters,
           prunedRelation, projectionOverSchema))
@@ -195,4 +193,14 @@ class Spark31NestedSchemaPruning extends Rule[LogicalPlan] {
       case _ => 1
     }
   }
+
+  def restoreOriginalOutputNames(
+                                  projectList: Seq[NamedExpression],
+                                  originalNames: Seq[String]): Seq[NamedExpression] = {
+    projectList.zip(originalNames).map {
+      case (attr: Attribute, name) => attr.withName(name)
+      case (alias: Alias, name) => alias
+      case (other, _) => other
+    }
+  }
 }
diff --git a/pom.xml b/pom.xml
index 60c13c8f07..0adb64838b 100644
--- a/pom.xml
+++ b/pom.xml
@@ -381,7 +381,7 @@
 <!--                    <exclude>org.sl4fj:slf4j-nop</exclude>-->
 <!--                    <exclude>org.sl4fj:slf4j-jcl</exclude>-->
 <!--                    <exclude>log4j:log4j</exclude>-->
-                    <exclude>ch.qos.logback:logback-classic</exclude>
+<!--                    <exclude>ch.qos.logback:logback-classic</exclude>-->
                     <!-- NOTE: We're banning any HBase deps versions other than the approved ${hbase.version},
                                which is aimed at preventing the classpath collisions w/ transitive deps usually) -->
                     <exclude>org.apache.hbase:hbase-common:*</exclude>
@@ -389,7 +389,7 @@
                     <exclude>org.apache.hbase:hbase-server:*</exclude>
                   </excludes>
                   <includes>
-                    <include>org.slf4j:slf4j-simple:*:*:test</include>
+<!--                    <include>org.slf4j:slf4j-simple:*:*:test</include>-->
                     <include>org.apache.hbase:hbase-common:${hbase.version}</include>
                     <include>org.apache.hbase:hbase-client:${hbase.version}</include>
                     <include>org.apache.hbase:hbase-server:${hbase.version}</include>
@@ -1864,9 +1864,9 @@
                 <configuration>
                   <rules>
                     <bannedDependencies>
-                      <excludes combine.children="append">
-                        <exclude>*:*_2.11</exclude>
-                      </excludes>
+<!--                      <excludes combine.children="append">-->
+<!--                        <exclude>*:*_2.11</exclude>-->
+<!--                      </excludes>-->
                     </bannedDependencies>
                   </rules>
                 </configuration>


[hudi] 45/45: [HUDI-3572] support DAY_ROLLING strategy in ClusteringPlanPartitionFilterMode (#4966)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 78fe5c73a4acaf5c4b8ff69c3c95fbd7d4c2dbaf
Author: 苏承祥 <11...@qq.com>
AuthorDate: Tue Jan 3 15:31:01 2023 +0800

    [HUDI-3572] support DAY_ROLLING strategy in ClusteringPlanPartitionFilterMode (#4966)
    
    (cherry picked from commit 41bea2fec54ae6c2376f5c88bd5a524b60b74a11)
---
 .../cluster/ClusteringPlanPartitionFilter.java     | 23 +++++++++++++++++
 .../cluster/ClusteringPlanPartitionFilterMode.java |  3 ++-
 .../TestSparkClusteringPlanPartitionFilter.java    | 29 ++++++++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilter.java
index 3a889de753..ecc3706f67 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilter.java
@@ -21,6 +21,9 @@ package org.apache.hudi.table.action.cluster;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieClusteringException;
 
+import org.joda.time.DateTime;
+
+import java.util.ArrayList;
 import java.util.Comparator;
 import java.util.List;
 import java.util.stream.Collectors;
@@ -31,6 +34,11 @@ import java.util.stream.Stream;
  *  NONE: skip filter
  *  RECENT DAYS: output recent partition given skip num and days lookback config
  *  SELECTED_PARTITIONS: output partition falls in the [start, end] condition
+ *  DAY_ROLLING: Clustering all partitions once a day to avoid clustering data of all partitions each time.
+ *  sort partitions asc, choose which partition index % 24 = now_hour.
+ *  tips: If hoodie.clustering.inline=true, try to reach the limit of hoodie.clustering.inline.max.commits every hour.
+ *        If hoodie.clustering.async.enabled=true, try to reach the limit of hoodie.clustering.async.max.commits every hour.
+ *
  */
 public class ClusteringPlanPartitionFilter {
 
@@ -43,11 +51,26 @@ public class ClusteringPlanPartitionFilter {
         return recentDaysFilter(partitions, config);
       case SELECTED_PARTITIONS:
         return selectedPartitionsFilter(partitions, config);
+      case DAY_ROLLING:
+        return dayRollingFilter(partitions, config);
       default:
         throw new HoodieClusteringException("Unknown partition filter, filter mode: " + mode);
     }
   }
 
+  private static List<String> dayRollingFilter(List<String> partitions, HoodieWriteConfig config) {
+    int hour = DateTime.now().getHourOfDay();
+    int len = partitions.size();
+    List<String> selectPt = new ArrayList<>();
+    partitions.sort(String::compareTo);
+    for (int i = 0; i < len; i++) {
+      if (i % 24 == hour) {
+        selectPt.add(partitions.get(i));
+      }
+    }
+    return selectPt;
+  }
+
   private static List<String> recentDaysFilter(List<String> partitions, HoodieWriteConfig config) {
     int targetPartitionsForClustering = config.getTargetPartitionsForClustering();
     int skipPartitionsFromLatestForClustering = config.getSkipPartitionsFromLatestForClustering();
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilterMode.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilterMode.java
index fbaf79797f..261c1874cc 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilterMode.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilterMode.java
@@ -24,5 +24,6 @@ package org.apache.hudi.table.action.cluster;
 public enum ClusteringPlanPartitionFilterMode {
   NONE,
   RECENT_DAYS,
-  SELECTED_PARTITIONS
+  SELECTED_PARTITIONS,
+  DAY_ROLLING
 }
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/cluster/strategy/TestSparkClusteringPlanPartitionFilter.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/cluster/strategy/TestSparkClusteringPlanPartitionFilter.java
index a68a9e3360..70643a327d 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/cluster/strategy/TestSparkClusteringPlanPartitionFilter.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/cluster/strategy/TestSparkClusteringPlanPartitionFilter.java
@@ -26,6 +26,7 @@ import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.table.HoodieSparkCopyOnWriteTable;
 import org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode;
 
+import org.joda.time.DateTime;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 import org.mockito.Mock;
@@ -104,4 +105,32 @@ public class TestSparkClusteringPlanPartitionFilter {
     assertEquals(1, list.size());
     assertSame("20211222", list.get(0));
   }
+
+  @Test
+  public void testDayRollingPartitionFilter() {
+    HoodieWriteConfig config = hoodieWriteConfigBuilder.withClusteringConfig(HoodieClusteringConfig.newBuilder()
+            .withClusteringPlanPartitionFilterMode(ClusteringPlanPartitionFilterMode.DAY_ROLLING)
+            .build())
+        .build();
+    PartitionAwareClusteringPlanStrategy sg = new SparkSizeBasedClusteringPlanStrategy(table, context, config);
+    ArrayList<String> fakeTimeBasedPartitionsPath = new ArrayList<>();
+    for (int i = 0; i < 24; i++) {
+      fakeTimeBasedPartitionsPath.add("20220301" + (i >= 10 ? String.valueOf(i) : "0" + i));
+    }
+    List filterPartitions = sg.filterPartitionPaths(fakeTimeBasedPartitionsPath);
+    assertEquals(1, filterPartitions.size());
+    assertEquals(fakeTimeBasedPartitionsPath.get(DateTime.now().getHourOfDay()), filterPartitions.get(0));
+    fakeTimeBasedPartitionsPath = new ArrayList<>();
+    for (int i = 0; i < 24; i++) {
+      fakeTimeBasedPartitionsPath.add("20220301" + (i >= 10 ? String.valueOf(i) : "0" + i));
+      fakeTimeBasedPartitionsPath.add("20220302" + (i >= 10 ? String.valueOf(i) : "0" + i));
+    }
+    filterPartitions = sg.filterPartitionPaths(fakeTimeBasedPartitionsPath);
+    assertEquals(2, filterPartitions.size());
+
+    int hourOfDay = DateTime.now().getHourOfDay();
+    String suffix = hourOfDay >= 10 ? hourOfDay + "" : "0" + hourOfDay;
+    assertEquals("20220301" + suffix, filterPartitions.get(0));
+    assertEquals("20220302" + suffix, filterPartitions.get(1));
+  }
 }

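A stand-alone sketch of the DAY_ROLLING selection described in the comment above: partitions are sorted and only those whose index modulo 24 equals the current hour are returned, so over a full day every partition is clustered once. The real implementation uses Joda-Time's DateTime.now().getHourOfDay(); java.time is used here, and the input list is copied instead of being sorted in place:

    import java.time.LocalTime;
    import java.util.ArrayList;
    import java.util.List;

    public class DayRollingFilterSketch {
      public static List<String> dayRollingFilter(List<String> partitions) {
        int hour = LocalTime.now().getHour();   // 0..23, same role as getHourOfDay()
        List<String> sorted = new ArrayList<>(partitions);
        sorted.sort(String::compareTo);
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
          if (i % 24 == hour) {                 // each partition is picked once per day
            selected.add(sorted.get(i));
          }
        }
        return selected;
      }
    }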

[hudi] 06/45: [MINOR] fix Invalid value for YearOfEra

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c1ceb628e576dd50f9c3bdf1ab830dcf61f70296
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Sun Oct 23 17:35:58 2022 +0800

    [MINOR] fix Invalid value for YearOfEra
---
 .../apache/hudi/client/BaseHoodieWriteClient.java  | 28 ++++++++++++++--------
 .../apache/hudi/client/SparkRDDWriteClient.java    | 22 +++++++++++++++++
 .../table/timeline/HoodieActiveTimeline.java       | 18 ++++++++++++++
 3 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
index d9f260e633..ff500a617e 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
@@ -104,6 +104,7 @@ import org.apache.log4j.Logger;
 
 import java.io.IOException;
 import java.nio.charset.StandardCharsets;
+import java.text.ParseException;
 import java.util.Collection;
 import java.util.Collections;
 import java.util.HashMap;
@@ -115,6 +116,7 @@ import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
 import static org.apache.hudi.common.model.HoodieCommitMetadata.SCHEMA_KEY;
+import static org.apache.hudi.common.model.TableServiceType.CLEAN;
 
 /**
  * Abstract Write Client providing functionality for performing commit, index updates and rollback
@@ -306,14 +308,20 @@ public abstract class BaseHoodieWriteClient<T extends HoodieRecordPayload, I, K,
   protected abstract HoodieTable<T, I, K, O> createTable(HoodieWriteConfig config, Configuration hadoopConf);
 
   void emitCommitMetrics(String instantTime, HoodieCommitMetadata metadata, String actionType) {
-    if (writeTimer != null) {
-      long durationInMs = metrics.getDurationInMs(writeTimer.stop());
-      // instantTime could be a non-standard value, so use `parseDateFromInstantTimeSafely`
-      // e.g. INIT_INSTANT_TS, METADATA_BOOTSTRAP_INSTANT_TS and FULL_BOOTSTRAP_INSTANT_TS in HoodieTimeline
-      HoodieActiveTimeline.parseDateFromInstantTimeSafely(instantTime).ifPresent(parsedInstant ->
-          metrics.updateCommitMetrics(parsedInstant.getTime(), durationInMs, metadata, actionType)
-      );
-      writeTimer = null;
+    try {
+      if (writeTimer != null) {
+        long durationInMs = metrics.getDurationInMs(writeTimer.stop());
+        long commitEpochTimeInMs = 0;
+        if (HoodieActiveTimeline.checkDateTime(instantTime)) {
+          commitEpochTimeInMs = HoodieActiveTimeline.parseDateFromInstantTime(instantTime).getTime();
+        }
+        metrics.updateCommitMetrics(commitEpochTimeInMs, durationInMs,
+            metadata, actionType);
+        writeTimer = null;
+      }
+    } catch (ParseException e) {
+      throw new HoodieCommitException("Failed to complete commit " + config.getBasePath() + " at time " + instantTime
+          + "Instant time is not of valid format", e);
     }
   }
 
@@ -862,7 +870,7 @@ public abstract class BaseHoodieWriteClient<T extends HoodieRecordPayload, I, K,
       LOG.info("Cleaner started");
       // proceed only if multiple clean schedules are enabled or if there are no pending cleans.
       if (scheduleInline) {
-        scheduleTableServiceInternal(cleanInstantTime, Option.empty(), TableServiceType.CLEAN);
+        scheduleTableServiceInternal(cleanInstantTime, Option.empty(), CLEAN);
         table.getMetaClient().reloadActiveTimeline();
       }
     }
@@ -1286,7 +1294,7 @@ public abstract class BaseHoodieWriteClient<T extends HoodieRecordPayload, I, K,
    * @param extraMetadata Extra Metadata to be stored
    */
   protected boolean scheduleCleaningAtInstant(String instantTime, Option<Map<String, String>> extraMetadata) throws HoodieIOException {
-    return scheduleTableService(instantTime, extraMetadata, TableServiceType.CLEAN).isPresent();
+    return scheduleTableService(instantTime, extraMetadata, CLEAN).isPresent();
   }
 
   /**
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
index 7110e26bb0..32c4a0a06d 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
@@ -45,6 +45,7 @@ import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.data.HoodieJavaRDD;
 import org.apache.hudi.exception.HoodieClusteringException;
+import org.apache.hudi.exception.HoodieCommitException;
 import org.apache.hudi.exception.HoodieWriteConflictException;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.index.SparkHoodieIndexFactory;
@@ -68,6 +69,7 @@ import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 
 import java.nio.charset.StandardCharsets;
+import java.text.ParseException;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
@@ -324,6 +326,16 @@ public class SparkRDDWriteClient<T extends HoodieRecordPayload> extends
       HoodieActiveTimeline.parseDateFromInstantTimeSafely(compactionCommitTime).ifPresent(parsedInstant ->
           metrics.updateCommitMetrics(parsedInstant.getTime(), durationInMs, metadata, HoodieActiveTimeline.COMPACTION_ACTION)
       );
+      try {
+        long commitEpochTimeInMs = 0;
+        if (HoodieActiveTimeline.checkDateTime(compactionCommitTime)) {
+          commitEpochTimeInMs = HoodieActiveTimeline.parseDateFromInstantTime(compactionCommitTime).getTime();
+        }
+        metrics.updateCommitMetrics(commitEpochTimeInMs, durationInMs, metadata, HoodieActiveTimeline.COMPACTION_ACTION);
+      } catch (ParseException e) {
+        throw new HoodieCommitException("Commit time is not of valid format. Failed to commit compaction "
+            + config.getBasePath() + " at time " + compactionCommitTime, e);
+      }
     }
     LOG.info("Compacted successfully on commit " + compactionCommitTime);
   }
@@ -406,6 +418,16 @@ public class SparkRDDWriteClient<T extends HoodieRecordPayload> extends
       HoodieActiveTimeline.parseDateFromInstantTimeSafely(clusteringCommitTime).ifPresent(parsedInstant ->
           metrics.updateCommitMetrics(parsedInstant.getTime(), durationInMs, metadata, HoodieActiveTimeline.REPLACE_COMMIT_ACTION)
       );
+      try {
+        long commitEpochTimeInMs = 0;
+        if (HoodieActiveTimeline.checkDateTime(clusteringCommitTime)) {
+          commitEpochTimeInMs = HoodieActiveTimeline.parseDateFromInstantTime(clusteringCommitTime).getTime();
+        }
+        metrics.updateCommitMetrics(commitEpochTimeInMs, durationInMs, metadata, HoodieActiveTimeline.REPLACE_COMMIT_ACTION);
+      } catch (ParseException e) {
+        throw new HoodieCommitException("Commit time is not of valid format. Failed to commit compaction "
+            + config.getBasePath() + " at time " + clusteringCommitTime, e);
+      }
     }
     LOG.info("Clustering successfully on commit " + clusteringCommitTime);
   }
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
index 2b27d3ab5e..414e92e58b 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
@@ -39,6 +39,8 @@ import java.io.FileNotFoundException;
 import java.io.IOException;
 import java.io.Serializable;
 import java.text.ParseException;
+import java.time.LocalDateTime;
+import java.time.format.DateTimeFormatter;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.Comparator;
@@ -49,6 +51,8 @@ import java.util.Set;
 import java.util.function.Function;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator.SECS_INSTANT_TIMESTAMP_FORMAT;
+
 /**
  * Represents the Active Timeline for the Hoodie table. Instants for the last 12 hours (configurable) is in the
  * ActiveTimeline and the rest are Archived. ActiveTimeline is a special timeline that allows for creation of instants
@@ -125,6 +129,20 @@ public class HoodieActiveTimeline extends HoodieDefaultTimeline {
     return parsedDate;
   }
 
+  /**
+   * Check if the instantTime is in SECS_INSTANT_TIMESTAMP_FORMAT format.
+   */
+  public static boolean checkDateTime(String instantTime) {
+    DateTimeFormatter dtf = DateTimeFormatter.ofPattern(SECS_INSTANT_TIMESTAMP_FORMAT);
+    boolean flag = true;
+    try {
+      LocalDateTime.parse(instantTime, dtf);
+    } catch (Exception e) {
+      flag = false;
+    }
+    return flag;
+  }
+
   /**
    * Format the Date to a String representing the timestamp of a Hoodie Instant.
    */

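The checkDateTime guard above avoids the "Invalid value for YearOfEra" failure by refusing to parse non-standard instants (such as INIT_INSTANT_TS and the bootstrap instants mentioned in the removed comment) and letting the metrics fall back to an epoch time of 0. A stand-alone sketch, assuming the seconds-granularity instant pattern yyyyMMddHHmmss behind SECS_INSTANT_TIMESTAMP_FORMAT:

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    public class InstantTimeCheckSketch {
      // Assumed pattern: Hudi's seconds-granularity instant timestamp format.
      private static final DateTimeFormatter SECS_FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

      // Returns false for instants that would otherwise blow up with "Invalid value for YearOfEra".
      public static boolean isParsableInstant(String instantTime) {
        try {
          LocalDateTime.parse(instantTime, SECS_FORMAT);
          return true;
        } catch (Exception e) {
          return false;
        }
      }

      public static void main(String[] args) {
        System.out.println(isParsableInstant("20221023173558")); // true: a normal commit time
        System.out.println(isParsableInstant("00000000000001")); // false: bootstrap-style instant
      }
    }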

[hudi] 24/45: [HUDI-4526] Improve spillableMapBasePath when disk directory is full (#6284)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e95a9f56ae2703b566e9c21e58bacbd6379fa35e
Author: ForwardXu <fo...@gmail.com>
AuthorDate: Wed Nov 9 13:07:55 2022 +0800

    [HUDI-4526] Improve spillableMapBasePath when disk directory is full (#6284)
    
    (cherry picked from commit 371296173a7c51c325e6f9c3a3ef2ba5f6a89f6e)
---
 .../org/apache/hudi/config/HoodieMemoryConfig.java |  9 ++++--
 .../table/log/HoodieMergedLogRecordScanner.java    |  9 +++---
 .../org/apache/hudi/common/util/FileIOUtils.java   | 36 ++++++++++++++++++++++
 3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java
index 4e37796393..960ec61dc0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java
@@ -22,9 +22,10 @@ import org.apache.hudi.common.config.ConfigClassProperty;
 import org.apache.hudi.common.config.ConfigGroups;
 import org.apache.hudi.common.config.ConfigProperty;
 import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.common.util.Option;
 
 import javax.annotation.concurrent.Immutable;
-
 import java.io.File;
 import java.io.FileReader;
 import java.io.IOException;
@@ -80,7 +81,11 @@ public class HoodieMemoryConfig extends HoodieConfig {
   public static final ConfigProperty<String> SPILLABLE_MAP_BASE_PATH = ConfigProperty
       .key("hoodie.memory.spillable.map.path")
       .defaultValue("/tmp/")
-      .withDocumentation("Default file path prefix for spillable map");
+      .withInferFunction(cfg -> {
+        String[] localDirs = FileIOUtils.getConfiguredLocalDirs();
+        return (localDirs != null && localDirs.length > 0) ? Option.of(localDirs[0]) : Option.empty();
+      })
+      .withDocumentation("Default file path for spillable map");
 
   public static final ConfigProperty<Double> WRITESTATUS_FAILURE_FRACTION = ConfigProperty
       .key("hoodie.memory.writestatus.failure.fraction")
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java
index 5ef0a6821f..45975fbfde 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java
@@ -18,6 +18,9 @@
 
 package org.apache.hudi.common.table.log;
 
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
 import org.apache.hudi.common.config.HoodieCommonConfig;
 import org.apache.hudi.common.model.DeleteRecord;
 import org.apache.hudi.common.model.HoodieAvroRecord;
@@ -34,12 +37,7 @@ import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.SpillableMapUtils;
 import org.apache.hudi.common.util.collection.ExternalSpillableMap;
 import org.apache.hudi.exception.HoodieIOException;
-
-import org.apache.avro.Schema;
-import org.apache.hadoop.fs.FileSystem;
 import org.apache.hudi.internal.schema.InternalSchema;
-
-import org.apache.hadoop.fs.Path;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
@@ -95,6 +93,7 @@ public class HoodieMergedLogRecordScanner extends AbstractHoodieLogRecordReader
       // Store merged records for all versions for this log file, set the in-memory footprint to maxInMemoryMapSize
       this.records = new ExternalSpillableMap<>(maxMemorySizeInBytes, spillableMapBasePath, new DefaultSizeEstimator(),
           new HoodieRecordSizeEstimator(readerSchema), diskMapType, isBitCaskDiskMapCompressionEnabled);
+
       this.maxMemorySizeInBytes = maxMemorySizeInBytes;
     } catch (IOException e) {
       throw new HoodieIOException("IOException when creating ExternalSpillableMap at " + spillableMapBasePath, e);
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java
index 6a9e2e1b35..426a703503 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java
@@ -204,4 +204,40 @@ public class FileIOUtils {
   public static Option<byte[]> readDataFromPath(FileSystem fileSystem, org.apache.hadoop.fs.Path detailPath) {
     return readDataFromPath(fileSystem, detailPath, false);
   }
+
+  /**
+   * Return the configured local directories where hudi can write files. This
+   * method does not create any directories on its own, it only encapsulates the
+   * logic of locating the local directories according to deployment mode.
+   */
+  public static String[] getConfiguredLocalDirs() {
+    if (isRunningInYarnContainer()) {
+      // If we are in yarn mode, systems can have different disk layouts so we must set it
+      // to what Yarn on this system said was available. Note this assumes that Yarn has
+      // created the directories already, and that they are secured so that only the
+      // user has access to them.
+      return getYarnLocalDirs().split(",");
+    } else if (System.getProperty("java.io.tmpdir") != null) {
+      return System.getProperty("java.io.tmpdir").split(",");
+    } else {
+      return null;
+    }
+  }
+
+  private static boolean isRunningInYarnContainer() {
+    // These environment variables are set by YARN.
+    return System.getenv("CONTAINER_ID") != null;
+  }
+
+  /**
+   * Get the Yarn approved local directories.
+   */
+  private static String getYarnLocalDirs() {
+    String localDirs = Option.of(System.getenv("LOCAL_DIRS")).orElse("");
+
+    if (localDirs.isEmpty()) {
+      throw new HoodieIOException("Yarn Local dirs can't be empty");
+    }
+    return localDirs;
+  }
 }

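The inferred spillable path above reduces to a small resolution chain: inside a YARN container (detected through the CONTAINER_ID environment variable) the YARN-provided LOCAL_DIRS win, otherwise java.io.tmpdir is used, and the first configured directory becomes the spillable map base path. A condensed stand-alone sketch; a plain IllegalStateException stands in for HoodieIOException:

    public class LocalDirsSketch {
      // Mirrors the resolution in FileIOUtils.getConfiguredLocalDirs(); no directories are created.
      public static String[] configuredLocalDirs() {
        if (System.getenv("CONTAINER_ID") != null) {       // set by YARN inside a container
          String localDirs = System.getenv("LOCAL_DIRS");
          if (localDirs == null || localDirs.isEmpty()) {
            throw new IllegalStateException("Yarn Local dirs can't be empty");
          }
          return localDirs.split(",");
        }
        String tmp = System.getProperty("java.io.tmpdir");
        return tmp != null ? tmp.split(",") : null;
      }

      public static void main(String[] args) {
        String[] dirs = configuredLocalDirs();
        // the infer function picks the first configured directory as the spillable map base path
        System.out.println(dirs != null && dirs.length > 0 ? dirs[0] : "/tmp/");
      }
    }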

[hudi] 32/45: Reduce the scope and duration of holding checkpoint lock in stream read

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 619b7504ca10dffc517771c212adcb64a3c13f47
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Tue Aug 9 13:18:55 2022 +0800

    Reduce the scope and duration of holding checkpoint lock in stream read
---
 .../hudi/source/StreamReadMonitoringFunction.java  | 33 ++++++++++++----------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
index 3318cecf10..fde5130237 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
@@ -168,9 +168,7 @@ public class StreamReadMonitoringFunction
   public void run(SourceFunction.SourceContext<MergeOnReadInputSplit> context) throws Exception {
     checkpointLock = context.getCheckpointLock();
     while (isRunning) {
-      synchronized (checkpointLock) {
-        monitorDirAndForwardSplits(context);
-      }
+      monitorDirAndForwardSplits(context);
       TimeUnit.SECONDS.sleep(interval);
     }
   }
@@ -195,6 +193,8 @@ public class StreamReadMonitoringFunction
       // table does not exist
       return;
     }
+
+    long start = System.currentTimeMillis();
     IncrementalInputSplits.Result result =
         incrementalInputSplits.inputSplits(metaClient, this.hadoopConf, this.issuedInstant);
     if (result.isEmpty()) {
@@ -202,28 +202,31 @@ public class StreamReadMonitoringFunction
       return;
     }
 
-    for (MergeOnReadInputSplit split : result.getInputSplits()) {
-      context.collect(split);
+    LOG.debug(
+        "Discovered {} splits, time elapsed {}ms",
+        result.getInputSplits().size(),
+        System.currentTimeMillis() - start);
+
+    // only need to hold the checkpoint lock when emitting the splits
+    start = System.currentTimeMillis();
+    synchronized (checkpointLock) {
+      for (MergeOnReadInputSplit split : result.getInputSplits()) {
+        context.collect(split);
+      }
     }
+
     // update the issues instant time
     this.issuedInstant = result.getEndInstant();
     LOG.info("\n"
             + "------------------------------------------------------------\n"
-            + "---------- consumed to instant: {}\n"
+            + "---------- consumed to instant: {}, time elapsed {}ms\n"
             + "------------------------------------------------------------",
-        this.issuedInstant);
+        this.issuedInstant, System.currentTimeMillis() - start);
   }
 
   @Override
   public void close() throws Exception {
-    super.close();
-
-    if (checkpointLock != null) {
-      synchronized (checkpointLock) {
-        issuedInstant = null;
-        isRunning = false;
-      }
-    }
+    cancel();
 
     if (LOG.isDebugEnabled()) {
       LOG.debug("Closed File Monitoring Source for path: " + path + ".");


[hudi] 37/45: remove ZhiyanReporter's report print

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit adc8aa6ebd082788e55519cefc54669c1ad68546
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Thu Dec 8 21:30:54 2022 +0800

    remove ZhiyanReporter's report print
---
 .../src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java     | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java
index 4e5d416989..d0bea0705a 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/zhiyan/ZhiyanReporter.java
@@ -81,7 +81,6 @@ public class ZhiyanReporter extends ScheduledReporter {
 
     String payload = builder.build();
 
-    LOG.info("Payload is:" + payload);
     try {
       client.post(payload);
     } catch (Exception e) {


[hudi] 44/45: improve DropHoodieTableCommand

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4f005ea5d4532c6434548e0858b336fe6a30dd42
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Fri Dec 30 17:28:47 2022 +0800

    improve DropHoodieTableCommand
---
 .../spark/sql/hudi/command/DropHoodieTableCommand.scala     | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/DropHoodieTableCommand.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/DropHoodieTableCommand.scala
index a0252861db..f9cae4369d 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/DropHoodieTableCommand.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/DropHoodieTableCommand.scala
@@ -70,16 +70,17 @@ case class DropHoodieTableCommand(
 
     val basePath = hoodieCatalogTable.tableLocation
     val catalog = sparkSession.sessionState.catalog
+    val hoodieTableExists = hoodieCatalogTable.hoodieTableExists
+    val tableType = hoodieCatalogTable.tableType
 
     // Drop table in the catalog
-    if (hoodieCatalogTable.hoodieTableExists &&
-        HoodieTableType.MERGE_ON_READ == hoodieCatalogTable.tableType && purge) {
+    if (hoodieTableExists && HoodieTableType.MERGE_ON_READ == tableType && purge) {
       val (rtTableOpt, roTableOpt) = getTableRTAndRO(catalog, hoodieCatalogTable)
-      rtTableOpt.foreach(table => catalog.dropTable(table.identifier, true, false))
-      roTableOpt.foreach(table => catalog.dropTable(table.identifier, true, false))
-      catalog.dropTable(table.identifier.copy(table = hoodieCatalogTable.tableName), ifExists, purge)
+      rtTableOpt.foreach(table => catalog.dropTable(table.identifier, ignoreIfNotExists = true, purge = false))
+      roTableOpt.foreach(table => catalog.dropTable(table.identifier, ignoreIfNotExists = true, purge = false))
+      catalog.dropTable(table.identifier.copy(table = hoodieCatalogTable.tableName), ignoreIfNotExists = true, purge = purge)
     } else {
-      catalog.dropTable(table.identifier, ifExists, purge)
+      catalog.dropTable(table.identifier, ignoreIfNotExists = true, purge = purge)
     }
 
     // Recursively delete table directories


[hudi] 01/45: [MINOR] Adapt to tianqiong spark

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ac0d1d81a48d1ce558294757a5812487cd9b2cf0
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Tue Aug 23 11:47:37 2022 +0800

    [MINOR] Adapt to tianqiong spark
---
 dev/settings.xml                                   | 266 +++++++++++++++++++++
 dev/tencent-install.sh                             | 157 ++++++++++++
 dev/tencent-release.sh                             | 154 ++++++++++++
 hudi-cli/pom.xml                                   |   4 +-
 hudi-client/hudi-spark-client/pom.xml              |   4 +-
 hudi-examples/hudi-examples-spark/pom.xml          |   4 +-
 hudi-integ-test/pom.xml                            |   4 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml    |  12 +-
 hudi-spark-datasource/hudi-spark/pom.xml           |  12 +-
 hudi-spark-datasource/hudi-spark2/pom.xml          |  12 +-
 hudi-spark-datasource/hudi-spark3-common/pom.xml   |   2 +-
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml      |   2 +-
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml      |   6 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml      |   6 +-
 hudi-sync/hudi-hive-sync/pom.xml                   |   4 +-
 hudi-utilities/pom.xml                             |  10 +-
 .../org/apache/hudi/utilities/UtilHelpers.java     |  38 ++-
 packaging/hudi-integ-test-bundle/pom.xml           |   8 +-
 pom.xml                                            |  94 +++++---
 19 files changed, 715 insertions(+), 84 deletions(-)

diff --git a/dev/settings.xml b/dev/settings.xml
new file mode 100644
index 0000000000..5f5dfd4fa6
--- /dev/null
+++ b/dev/settings.xml
@@ -0,0 +1,266 @@
+<settings>
+    <proxies>
+        <proxy>
+            <id>dev http</id>
+            <active>true</active>
+            <protocol>http</protocol>
+            <host>web-proxy.oa.com</host>
+            <port>8080</port>
+            <nonProxyHosts>mirrors.tencent.com|qq.com|localhost|127.0.0.1|*.oa.com|repo.maven.apache.org|packages.confluent.io</nonProxyHosts>
+        </proxy>
+        <proxy>
+            <id>dev https</id>
+            <active>true</active>
+            <protocol>https</protocol>
+            <host>web-proxy.oa.com</host>
+            <port>8080</port>
+            <nonProxyHosts>mirrors.tencent.com|qq.com|localhost|127.0.0.1|*.oa.com|repo.maven.apache.org|packages.confluent.io</nonProxyHosts>
+        </proxy>
+    </proxies>
+
+    <offline>false</offline>
+
+    <profiles>
+        <profile>
+            <id>nexus</id>
+            <repositories>
+                <repository>
+                    <id>maven_public</id>
+                    <url>https://mirrors.tencent.com/nexus/repository/maven-public/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </repository>
+                <repository>
+                    <id>tencent_public</id>
+                    <url>https://mirrors.tencent.com/repository/maven/tencent_public/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </repository>
+
+                <repository>
+                    <id>thirdparty</id>
+                    <url>https://mirrors.tencent.com/repository/maven/thirdparty/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </repository>
+
+                <repository>
+                    <id>mqq</id>
+                    <url>https://mirrors.tencent.com/repository/maven/mqq/</url>
+                    <releases>
+                        <enabled>false</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                    </snapshots>
+                </repository>
+
+                <repository>
+                    <id>thirdparty-snapshots</id>
+                    <url>https://mirrors.tencent.com/repository/maven/thirdparty-snapshots/</url>
+                    <releases>
+                        <enabled>false</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                    </snapshots>
+                </repository>
+            </repositories>
+
+            <pluginRepositories>
+                <pluginRepository>
+                    <id>maven-public-plugin</id>
+                    <url>https://mirrors.tencent.com/nexus/repository/maven-public/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </pluginRepository>
+                <pluginRepository>
+                    <id>public-plugin</id>
+                    <url>https://mirrors.tencent.com/repository/maven/tencent_public/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </pluginRepository>
+                <pluginRepository>
+                    <id>thirdparty-plugin</id>
+                    <url>https://mirrors.tencent.com/repository/maven/thirdparty/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </pluginRepository>
+            </pluginRepositories>
+        </profile>
+
+        <profile>
+            <id>tbds</id>
+            <repositories>
+                <repository>
+                    <id>tbds-maven-public</id>
+                    <url>http://tbdsrepo.oa.com/repository/maven-public/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                    </snapshots>
+                </repository>
+                <repository>
+                    <id>tbds</id>
+                    <url>http://tbdsrepo.oa.com/repository/tbds/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                        <checksumPolicy>ignore</checksumPolicy>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                        <checksumPolicy>ignore</checksumPolicy>
+                    </snapshots>
+                </repository>
+            </repositories>
+            <pluginRepositories>
+                <pluginRepository>
+                    <id>tbds</id>
+                    <url>http://tbdsrepo.oa.com/repository/tbds/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                        <checksumPolicy>ignore</checksumPolicy>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                        <checksumPolicy>ignore</checksumPolicy>
+                    </snapshots>
+                </pluginRepository>
+                <pluginRepository>
+                    <id>tbds-maven-public</id>
+                    <url>http://tbdsrepo.oa.com/repository/maven-public/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                        <checksumPolicy>warn</checksumPolicy>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                        <updatePolicy>never</updatePolicy>
+                        <checksumPolicy>ignore</checksumPolicy>
+                    </snapshots>
+                </pluginRepository>
+            </pluginRepositories>
+        </profile>
+
+        <profile>
+            <id>confluent_repo</id>
+            <repositories>
+                <repository>
+                    <id>tencent-repo</id>
+                    <url>https://mirrors.tencent.com/repository/maven/CSIG_TWINS</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </repository>
+                <repository>
+                    <id>confluent</id>
+                    <url>https://packages.confluent.io/maven/</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </repository>
+            </repositories>
+        </profile>
+
+        <profile>
+            <id>tianqiong_releases</id>
+            <repositories>
+                <repository>
+                    <id>tianqiong-releases</id>
+                    <url>https://mirrors.tencent.com/repository/maven/tianqiong-releases</url>
+                    <releases>
+                        <enabled>true</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>false</enabled>
+                    </snapshots>
+                </repository>
+            </repositories>
+        </profile>
+
+        <profile>
+            <id>tianqiong_snapshots</id>
+            <repositories>
+                <repository>
+                    <id>tianqiong-snapshots</id>
+                    <url>https://mirrors.tencent.com/repository/maven/tianqiong-snapshots</url>
+                    <releases>
+                        <enabled>false</enabled>
+                    </releases>
+                    <snapshots>
+                        <enabled>true</enabled>
+                        <updatePolicy>always</updatePolicy>
+                    </snapshots>
+                </repository>
+            </repositories>
+        </profile>
+    </profiles>
+
+    <activeProfiles>
+        <activeProfile>confluent_repo</activeProfile>
+        <activeProfile>tianqiong_releases</activeProfile>
+        <activeProfile>tianqiong_snapshots</activeProfile>
+        <activeProfile>nexus</activeProfile>
+    </activeProfiles>
+    <servers>
+        <server>
+            <id>thirdparty-snapshots</id>
+            <username>ethansu</username>
+            <password>664a1eeceee211e9b3cf6c92bf47000d</password>
+        </server>
+        <server>
+            <id>tbds</id>
+            <username>tbds</username>
+            <password>tbds@Tbds.com</password>
+        </server>
+        <server>
+            <id>tianqiong-releases</id>
+            <username>g_datalake</username>
+            <password>be3c75f8fc9a11e9b2a36c92bf3acd2c</password>
+        </server>
+        <server>
+            <id>tianqiong-snapshots</id>
+            <username>g_datalake</username>
+            <password>be3c75f8fc9a11e9b2a36c92bf3acd2c</password>
+        </server>
+    </servers>
+</settings>
diff --git a/dev/tencent-install.sh b/dev/tencent-install.sh
new file mode 100644
index 0000000000..1e34f40440
--- /dev/null
+++ b/dev/tencent-install.sh
@@ -0,0 +1,157 @@
+#!/bin/bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e # Exit immediately if a command exits with a non-zero status
+
+if [ $# -ne 7 ]; then
+  echo "Usage: $0 <apache-version> <tencent-version> <rc-num> <release-repo-not-snapshot?> <scala_version> <spark_version> <flink_version>"
+  echo "example: $0 0.12.0 1 1 N 2.11 2 1.13"
+  exit
+fi
+
+version=$1-$2-tencent # <apache-version>-<tencent-version>-tencent, e.g. 0.10.0-1-tencent
+if [ $4 = "N" ]; then
+  version=$version-SNAPSHOT
+fi
+rc=$3
+release_repo=$4 # Y for release repo, others for snapshot repo
+
+tag=apache-hudi-$version
+tagrc=${tag}-rc${rc}
+
+echo "Preparing source for $tagrc"
+
+# change version
+echo "Change version for ${version}"
+mvn versions:set -DnewVersion=${version} -DgenerateBackupPom=false -s dev/settings.xml -U
+mvn versions:commit -s dev/settings.xml -U
+
+function git_push() {
+  # create version.txt for this release
+  if [ ${release_repo} = "Y" ]; then
+    git add .
+
+    if [ $# -eq 7 ]; then
+      git commit -m "Add version tag for release ${version} $5 $6"
+    else
+      git commit -m "Add version tag for release ${version}"
+    fi
+  else
+    git add .
+
+    if [ $# -eq 7 ]; then
+      git commit -m"Add snapshot tag ${version} $5 $6"
+    else
+      git commit -m"Add snapshot tag ${version}"
+    fi
+  fi
+
+  set_version_hash=$(git rev-list HEAD 2>/dev/null | head -n 1)
+
+  # delete remote tag
+  git fetch --tags --all
+  tag_exist=`git tag -l ${tagrc} | wc -l`
+  if [ ${tag_exist} -gt 0 ]; then
+    git tag -l ${tagrc} | xargs git tag -d
+    git push origin :refs/tags/${tagrc}
+  fi
+
+  # add remote tag
+  git tag -am "Apache Hudi $version" ${tagrc} ${set_version_hash}
+  remote=$(git remote -v | grep data-lake-technology/hudi.git | head -n 1 | awk '{print $1}')
+  git push ${remote} ${tagrc}
+
+  release_hash=$(git rev-list ${tagrc} 2>/dev/null | head -n 1)
+
+  if [ -z "$release_hash" ]; then
+    echo "Cannot continue: unknown git tag: $tag"
+    exit
+  fi
+
+  echo -e "Using commit ${release_hash}\n"
+
+  #echo "git push origin"
+  #git push origin
+
+  echo -e "begin archive ${release_hash}\n"
+  rm -rf ${tag}*
+  tarball=$tag.tar.gz
+
+  # be conservative and use the release hash, even though git produces the same
+  # archive (identical hashes) using the scm tag
+  git archive $release_hash --worktree-attributes --prefix $tag/ -o $tarball
+
+  # checksum
+  sha512sum $tarball >${tarball}.sha512
+
+  # extract source tarball
+  tar xzf ${tarball}
+
+  cd ${tag}
+  if [ ${release_repo} = "N" ]; then
+    echo $version >version.txt
+  fi
+
+  echo -e "end archive ${release_hash}\n"
+}
+
+function deploy_spark() {
+  echo -------------------------------------------------------
+  SCALA_VERSION=$1
+  SPARK_VERSION=$2
+  FLINK_VERSION=$3
+
+  if [ ${release_repo} = "Y" ]; then
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
+  else
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30 -T 2.5C"
+  fi
+
+  #  INSTALL_OPTIONS="-U -Drat.skip=true -Djacoco.skip=true -Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -DskipTests -s dev/settings.xml -T 2.5C"
+  #
+  #  echo "INSTALL_OPTIONS: mvn clean package ${INSTALL_OPTIONS}"
+  #  mvn clean package ${INSTALL_OPTIONS}
+
+  echo "DEPLOY_OPTIONS: mvn clean install $COMMON_OPTIONS"
+  #  mvn clean package install $COMMON_OPTIONS
+  mvn clean package install $COMMON_OPTIONS -Drat.skip=true
+
+  if [ ${release_repo} = "Y" ]; then
+    echo -e "Published to release repo\n"
+  else
+    echo -e "Published to snapshot repo\n"
+  fi
+  echo -------------------------------------------------------
+}
+
+echo "SCALA_VERSION: $5 SPARK_VERSION: $6"
+deploy_spark $5 $6 $7
+
+## spark 2.4.6
+#deploy_spark 2.11 2
+## spark 3.0.1
+#deploy_spark 2.12 3.0.x
+## spark 3.1.2
+#deploy_spark 2.12 3
+
+# clean
+#rm -rf ../${tag}*
+
+echo "Success! The release candidate [${tagrc}] is available"
+echo "Commit SHA1: ${release_hash}"
diff --git a/dev/tencent-release.sh b/dev/tencent-release.sh
new file mode 100644
index 0000000000..944f497070
--- /dev/null
+++ b/dev/tencent-release.sh
@@ -0,0 +1,154 @@
+#!/bin/bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e  # Exit immediately if a command exits with a non-zero status
+
+if [ $# -ne 7 ]; then
+  echo "Usage: $0 <apache-version> <tencent-version> <rc-num> <release-repo-not-snapshot?> <scala_version> <spark_version> <flink_version>"
+  echo "example: $0 0.12.0 1 1 N 2.11 2 1.13"
+  exit
+fi
+
+version=$1-$2-tencent  # <apache-version>-<tencent-version>-tencent, e.g. 0.10.0-1-tencent
+if [ $4 = "N" ]; then
+  version=$version-SNAPSHOT
+fi
+rc=$3
+release_repo=$4  # Y for release repo, others for snapshot repo
+
+tag=apache-hudi-$version
+tagrc=${tag}-rc${rc}
+
+echo "Preparing source for $tagrc"
+
+# change version
+echo "Change version for ${version}"
+mvn versions:set -DnewVersion=${version} -DgenerateBackupPom=false -s dev/settings.xml -U
+mvn versions:commit -s dev/settings.xml -U
+
+# create version.txt for this release
+if [ ${release_repo} = "Y" ]; then
+  git add .
+
+  if [ $# -eq 7 ]; then
+    git commit -m "Add version tag for release ${version} $5 $6"
+  else
+    git commit -m "Add version tag for release ${version}"
+  fi
+else
+  git add .
+
+  if [ $# -eq 7 ]; then
+    git commit -m"Add snapshot tag ${version} $5 $6"
+  else
+    git commit -m"Add snapshot tag ${version}"
+  fi
+fi
+
+set_version_hash=`git rev-list HEAD 2> /dev/null | head -n 1 `
+
+# delete remote tag
+git fetch --tags --all
+tag_exist=`git tag -l ${tagrc} | wc -l`
+if [ ${tag_exist} -gt 0 ]; then
+  git tag -l ${tagrc} | xargs git tag -d
+  git push origin :refs/tags/${tagrc}
+fi
+
+# add remote tag
+git tag -am "Apache Hudi $version" ${tagrc} ${set_version_hash}
+remote=$(git remote -v | grep data-lake-technology/hudi.git | head -n 1 | awk '{print $1}')
+git push ${remote} ${tagrc}
+
+release_hash=`git rev-list ${tagrc} 2> /dev/null | head -n 1 `
+
+if [ -z "$release_hash" ]; then
+  echo "Cannot continue: unknown git tag: $tag"
+  exit
+fi
+
+echo -e "Using commit ${release_hash}\n"
+
+#echo "git push origin"
+#git push origin
+
+echo -e "begin archive ${release_hash}\n"
+rm -rf ${tag}*
+tarball=$tag.tar.gz
+
+# be conservative and use the release hash, even though git produces the same
+# archive (identical hashes) using the scm tag
+git archive $release_hash --worktree-attributes --prefix $tag/ -o $tarball
+
+# checksum
+sha512sum $tarball > ${tarball}.sha512
+
+# extract source tarball
+tar xzf ${tarball}
+
+cd ${tag}
+if [ ${release_repo} = "N" ]; then
+  echo $version > version.txt
+fi
+
+echo -e "end archive ${release_hash}\n"
+
+function deploy_spark(){
+  echo -------------------------------------------------------
+  SCALA_VERSION=$1
+  SPARK_VERSION=$2
+  FLINK_VERSION=$3
+
+  if [ ${release_repo} = "Y" ]; then
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30"
+  else
+    COMMON_OPTIONS="-Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -Dflink${FLINK_VERSION} -DskipTests -s dev/settings.xml -DretryFailedDeploymentCount=30"
+  fi
+
+#  INSTALL_OPTIONS="-U -Drat.skip=true -Djacoco.skip=true -Dscala-${SCALA_VERSION} -Dspark${SPARK_VERSION} -DskipTests -s dev/settings.xml -T 2.5C"
+#
+#  echo "INSTALL_OPTIONS: mvn clean package ${INSTALL_OPTIONS}"
+#  mvn clean package ${INSTALL_OPTIONS}
+
+  echo "DEPLOY_OPTIONS: mvn clean deploy $COMMON_OPTIONS"
+  mvn deploy $COMMON_OPTIONS
+
+  if [ ${release_repo} = "Y" ]; then
+    echo -e "Published to release repo\n"
+  else
+    echo -e "Published to snapshot repo\n"
+  fi
+  echo -------------------------------------------------------
+}
+
+echo "SCALA_VERSION: $5 SPARK_VERSION: $6"
+deploy_spark $5 $6 $7
+
+## spark 2.4.6
+#deploy_spark 2.11 2
+## spark 3.0.1
+#deploy_spark 2.12 3.0.x
+## spark 3.1.2
+#deploy_spark 2.12 3
+
+# clean
+rm -rf ../${tag}*
+
+echo "Success! The release candidate [${tagrc}] is available"
+echo "Commit SHA1: ${release_hash}"
diff --git a/hudi-cli/pom.xml b/hudi-cli/pom.xml
index ee78bf24b0..27596e779f 100644
--- a/hudi-cli/pom.xml
+++ b/hudi-cli/pom.xml
@@ -250,11 +250,11 @@
 
     <!-- Spark -->
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
     </dependency>
 
diff --git a/hudi-client/hudi-spark-client/pom.xml b/hudi-client/hudi-spark-client/pom.xml
index a7ae3a7049..da1ad6cb9f 100644
--- a/hudi-client/hudi-spark-client/pom.xml
+++ b/hudi-client/hudi-spark-client/pom.xml
@@ -57,11 +57,11 @@
 
     <!-- Spark -->
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
     </dependency>
 
diff --git a/hudi-examples/hudi-examples-spark/pom.xml b/hudi-examples/hudi-examples-spark/pom.xml
index 4eeb11ecb1..d0611c6752 100644
--- a/hudi-examples/hudi-examples-spark/pom.xml
+++ b/hudi-examples/hudi-examples-spark/pom.xml
@@ -189,11 +189,11 @@
 
         <!-- Spark -->
         <dependency>
-            <groupId>org.apache.spark</groupId>
+            <groupId>${spark.groupId}</groupId>
             <artifactId>spark-core_${scala.binary.version}</artifactId>
         </dependency>
         <dependency>
-            <groupId>org.apache.spark</groupId>
+            <groupId>${spark.groupId}</groupId>
             <artifactId>spark-sql_${scala.binary.version}</artifactId>
         </dependency>
 
diff --git a/hudi-integ-test/pom.xml b/hudi-integ-test/pom.xml
index 2134f80bb0..703cbb067f 100644
--- a/hudi-integ-test/pom.xml
+++ b/hudi-integ-test/pom.xml
@@ -62,7 +62,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <exclusions>
         <exclusion>
@@ -89,7 +89,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-avro_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
       <scope>test</scope>
diff --git a/hudi-spark-datasource/hudi-spark-common/pom.xml b/hudi-spark-datasource/hudi-spark-common/pom.xml
index a1016299ba..6fd1d7d458 100644
--- a/hudi-spark-datasource/hudi-spark-common/pom.xml
+++ b/hudi-spark-datasource/hudi-spark-common/pom.xml
@@ -184,7 +184,7 @@
 
     <!-- Spark -->
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <exclusions>
         <exclusion>
@@ -194,29 +194,29 @@
       </exclusions>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-hive_${scala.binary.version}</artifactId>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-catalyst_${scala.binary.version}</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
diff --git a/hudi-spark-datasource/hudi-spark/pom.xml b/hudi-spark-datasource/hudi-spark/pom.xml
index f55cb3359c..f4ad09bb57 100644
--- a/hudi-spark-datasource/hudi-spark/pom.xml
+++ b/hudi-spark-datasource/hudi-spark/pom.xml
@@ -245,7 +245,7 @@
 
     <!-- Spark -->
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
         <exclusions>
            <exclusion>
@@ -255,31 +255,31 @@
         </exclusions>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-hive_${scala.binary.version}</artifactId>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-catalyst_${scala.binary.version}</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
diff --git a/hudi-spark-datasource/hudi-spark2/pom.xml b/hudi-spark-datasource/hudi-spark2/pom.xml
index f74dd96a5b..63cc6f3a4f 100644
--- a/hudi-spark-datasource/hudi-spark2/pom.xml
+++ b/hudi-spark-datasource/hudi-spark2/pom.xml
@@ -21,10 +21,10 @@
   </parent>
   <modelVersion>4.0.0</modelVersion>
 
-  <artifactId>hudi-spark2_${scala.binary.version}</artifactId>
+  <artifactId>hudi-spark2_2.11</artifactId>
   <version>0.12.1</version>
 
-  <name>hudi-spark2_${scala.binary.version}</name>
+  <name>hudi-spark2_2.11</name>
   <packaging>jar</packaging>
 
   <properties>
@@ -185,13 +185,13 @@
     </dependency>
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-spark-common_${scala.binary.version}</artifactId>
+      <artifactId>hudi-spark-common_2.11</artifactId>
       <version>${project.version}</version>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
-      <artifactId>spark-sql_${scala.binary.version}</artifactId>
+      <groupId>${spark.groupId}</groupId>
+      <artifactId>spark-sql_2.11</artifactId>
       <version>${spark2.version}</version>
       <scope>provided</scope>
       <optional>true</optional>
@@ -230,7 +230,7 @@
     </dependency>
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-spark-common_${scala.binary.version}</artifactId>
+      <artifactId>hudi-spark-common_2.11</artifactId>
       <version>${project.version}</version>
       <classifier>tests</classifier>
       <type>test-jar</type>
diff --git a/hudi-spark-datasource/hudi-spark3-common/pom.xml b/hudi-spark-datasource/hudi-spark3-common/pom.xml
index 75957d6d4c..6bbb4e42b4 100644
--- a/hudi-spark-datasource/hudi-spark3-common/pom.xml
+++ b/hudi-spark-datasource/hudi-spark3-common/pom.xml
@@ -160,7 +160,7 @@
     <dependencies>
 
         <dependency>
-            <groupId>org.apache.spark</groupId>
+            <groupId>${spark.groupId}</groupId>
             <artifactId>spark-sql_2.12</artifactId>
             <version>${spark3.version}</version>
             <scope>provided</scope>
diff --git a/hudi-spark-datasource/hudi-spark3.1.x/pom.xml b/hudi-spark-datasource/hudi-spark3.1.x/pom.xml
index 6768e0ce03..fb43cd2855 100644
--- a/hudi-spark-datasource/hudi-spark3.1.x/pom.xml
+++ b/hudi-spark-datasource/hudi-spark3.1.x/pom.xml
@@ -151,7 +151,7 @@
   <dependencies>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_2.12</artifactId>
       <version>${spark31.version}</version>
       <optional>true</optional>
diff --git a/hudi-spark-datasource/hudi-spark3.2.x/pom.xml b/hudi-spark-datasource/hudi-spark3.2.x/pom.xml
index cd6ba3a4b5..51f986e069 100644
--- a/hudi-spark-datasource/hudi-spark3.2.x/pom.xml
+++ b/hudi-spark-datasource/hudi-spark3.2.x/pom.xml
@@ -174,7 +174,7 @@
   <dependencies>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_2.12</artifactId>
       <version>${spark32.version}</version>
       <scope>provided</scope>
@@ -182,7 +182,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-catalyst_2.12</artifactId>
       <version>${spark32.version}</version>
       <scope>provided</scope>
@@ -190,7 +190,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_2.12</artifactId>
       <version>${spark32.version}</version>
       <scope>provided</scope>
diff --git a/hudi-spark-datasource/hudi-spark3.3.x/pom.xml b/hudi-spark-datasource/hudi-spark3.3.x/pom.xml
index 9ab65dca2e..65ce18d2d3 100644
--- a/hudi-spark-datasource/hudi-spark3.3.x/pom.xml
+++ b/hudi-spark-datasource/hudi-spark3.3.x/pom.xml
@@ -174,7 +174,7 @@
   <dependencies>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_2.12</artifactId>
       <version>${spark33.version}</version>
       <scope>provided</scope>
@@ -182,7 +182,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-catalyst_2.12</artifactId>
       <version>${spark33.version}</version>
       <scope>provided</scope>
@@ -190,7 +190,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_2.12</artifactId>
       <version>${spark33.version}</version>
       <scope>provided</scope>
diff --git a/hudi-sync/hudi-hive-sync/pom.xml b/hudi-sync/hudi-hive-sync/pom.xml
index 7cf31550b6..9785d71c9e 100644
--- a/hudi-sync/hudi-hive-sync/pom.xml
+++ b/hudi-sync/hudi-hive-sync/pom.xml
@@ -139,13 +139,13 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <scope>test</scope>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <scope>test</scope>
     </dependency>
diff --git a/hudi-utilities/pom.xml b/hudi-utilities/pom.xml
index 0c2a612d78..93cb94b320 100644
--- a/hudi-utilities/pom.xml
+++ b/hudi-utilities/pom.xml
@@ -184,7 +184,7 @@
 
     <!-- Spark -->
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <exclusions>
         <exclusion>
@@ -199,7 +199,7 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <exclusions>
         <exclusion>
@@ -210,17 +210,17 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-streaming_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
       <classifier>tests</classifier>
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
index 523546c9ef..4a38da6528 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
@@ -83,6 +83,7 @@ import org.apache.spark.util.LongAccumulator;
 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.StringReader;
+import java.lang.reflect.Method;
 import java.nio.ByteBuffer;
 import java.sql.Connection;
 import java.sql.Driver;
@@ -421,7 +422,7 @@ public class UtilHelpers {
       statement.setQueryTimeout(Integer.parseInt(options.get(JDBCOptions.JDBC_QUERY_TIMEOUT())));
       statement.executeQuery();
     } catch (SQLException e) {
-      throw new HoodieException(e);
+      return false;
     }
     return true;
   }
@@ -445,12 +446,23 @@ public class UtilHelpers {
         statement.setQueryTimeout(Integer.parseInt(options.get("queryTimeout")));
         try (ResultSet rs = statement.executeQuery()) {
           StructType structType;
+          Object[] methodParas;
+          Method method = getMethodByName(JdbcUtils.class, "getSchema");
+          int parasCount = getMethodParasCount(method);
+
           if (Boolean.parseBoolean(options.get("nullable"))) {
-            structType = JdbcUtils.getSchema(rs, dialect, true);
+            methodParas = parasCount == 3 ? new Object[] {rs, dialect, true} : new Object[] {method, rs, dialect, url, true};
           } else {
-            structType = JdbcUtils.getSchema(rs, dialect, false);
+            methodParas = parasCount == 3 ? new Object[] {rs, dialect, false} : new Object[] {method, rs, dialect, url, false};
+          }
+
+          structType = getStructTypeReflection(method, methodParas);
+
+          if (structType != null) {
+            return AvroConversionUtils.convertStructTypeToAvroSchema(structType, table, "hoodie." + table);
+          } else {
+            throw new HoodieException(String.format("%s structType can not null!", table));
           }
-          return AvroConversionUtils.convertStructTypeToAvroSchema(structType, table, "hoodie." + table);
         }
       }
     } else {
@@ -572,4 +584,22 @@ public class UtilHelpers {
     Schema schema = schemaResolver.getTableAvroSchema(false);
     return schema.toString();
   }
+
+  public static Method getMethodByName(Class clazz, String methodName) {
+    return Arrays.stream(clazz.getDeclaredMethods())
+        .filter(m -> m.getName().equalsIgnoreCase(methodName))
+        .findFirst().orElse(null);
+  }
+
+  public static int getMethodParasCount(Method method) {
+    return method.getParameterCount();
+  }
+
+  public static StructType getStructTypeReflection(Method method, Object... objs) throws Exception {
+    if (method != null) {
+      return (StructType) method.invoke(null, objs);
+    } else {
+      return null;
+    }
+  }
 }
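
The UtilHelpers change above resolves JdbcUtils.getSchema via reflection because its signature differs across Spark builds. As an isolated sketch of that adaptation pattern (ReflectionShimExample and greet are invented names, not Hudi or Spark code):

import java.lang.reflect.Method;
import java.util.Arrays;

public class ReflectionShimExample {

  // Two "versions" of the same API that differ in arity, standing in for the
  // differing JdbcUtils.getSchema signatures across Spark builds.
  public static String greet(String name) {
    return "hello " + name;
  }

  public static String greet(String name, boolean shout) {
    String s = "hello " + name;
    return shout ? s.toUpperCase() : s;
  }

  public static void main(String[] args) throws Exception {
    Method method = Arrays.stream(ReflectionShimExample.class.getDeclaredMethods())
        .filter(m -> m.getName().equals("greet"))
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("greet not found"));

    // Adapt the argument list to whichever overload was resolved at runtime.
    Object[] callArgs = method.getParameterCount() == 1
        ? new Object[] {"hudi"}
        : new Object[] {"hudi", Boolean.TRUE};

    System.out.println(method.invoke(null, callArgs));
  }
}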
diff --git a/packaging/hudi-integ-test-bundle/pom.xml b/packaging/hudi-integ-test-bundle/pom.xml
index d1789b863a..8323703622 100644
--- a/packaging/hudi-integ-test-bundle/pom.xml
+++ b/packaging/hudi-integ-test-bundle/pom.xml
@@ -646,12 +646,12 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
     </dependency>
 
@@ -662,14 +662,14 @@
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-streaming_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
       <scope>provided</scope>
     </dependency>
 
     <dependency>
-      <groupId>org.apache.spark</groupId>
+      <groupId>${spark.groupId}</groupId>
       <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
diff --git a/pom.xml b/pom.xml
index 47e53fed97..159ae2a841 100644
--- a/pom.xml
+++ b/pom.xml
@@ -125,6 +125,7 @@
     <spark2.version>2.4.4</spark2.version>
     <spark3.version>3.3.0</spark3.version>
     <sparkbundle.version></sparkbundle.version>
+    <spark.groupId>com.tencent.spark</spark.groupId>
     <flink1.15.version>1.15.1</flink1.15.version>
     <flink1.14.version>1.14.5</flink1.14.version>
     <flink1.13.version>1.13.6</flink1.13.version>
@@ -142,7 +143,7 @@
     <flink.clients.artifactId>flink-clients</flink.clients.artifactId>
     <flink.connector.kafka.artifactId>flink-connector-kafka</flink.connector.kafka.artifactId>
     <flink.hadoop.compatibility.artifactId>flink-hadoop-compatibility_2.12</flink.hadoop.compatibility.artifactId>
-    <spark31.version>3.1.3</spark31.version>
+    <spark31.version>3.1.2</spark31.version>
     <spark32.version>3.2.2</spark32.version>
     <spark33.version>3.3.0</spark33.version>
     <hudi.spark.module>hudi-spark2</hudi.spark.module>
@@ -724,7 +725,7 @@
 
       <!-- Spark -->
       <dependency>
-        <groupId>org.apache.spark</groupId>
+        <groupId>${spark.groupId}</groupId>
         <artifactId>spark-core_${scala.binary.version}</artifactId>
         <version>${spark.version}</version>
         <scope>provided</scope>
@@ -740,26 +741,26 @@
         </exclusions>
       </dependency>
       <dependency>
-        <groupId>org.apache.spark</groupId>
+        <groupId>${spark.groupId}</groupId>
         <artifactId>spark-sql_${scala.binary.version}</artifactId>
         <version>${spark.version}</version>
         <scope>provided</scope>
       </dependency>
       <dependency>
-        <groupId>org.apache.spark</groupId>
+        <groupId>${spark.groupId}</groupId>
         <artifactId>spark-hive_${scala.binary.version}</artifactId>
         <version>${spark.version}</version>
         <scope>provided</scope>
       </dependency>
       <dependency>
-        <groupId>org.apache.spark</groupId>
+        <groupId>${spark.groupId}</groupId>
         <artifactId>spark-sql_${scala.binary.version}</artifactId>
         <classifier>tests</classifier>
         <version>${spark.version}</version>
         <scope>test</scope>
       </dependency>
       <dependency>
-        <groupId>org.apache.spark</groupId>
+        <groupId>${spark.groupId}</groupId>
         <artifactId>spark-core_${scala.binary.version}</artifactId>
         <classifier>tests</classifier>
         <version>${spark.version}</version>
@@ -776,7 +777,7 @@
         </exclusions>
       </dependency>
       <dependency>
-        <groupId>org.apache.spark</groupId>
+        <groupId>${spark.groupId}</groupId>
         <artifactId>spark-catalyst_${scala.binary.version}</artifactId>
         <classifier>tests</classifier>
         <version>${spark.version}</version>
@@ -1472,33 +1473,6 @@
       </dependency>
     </dependencies>
   </dependencyManagement>
-  <repositories>
-    <repository>
-      <id>Maven Central</id>
-      <name>Maven Repository</name>
-      <url>https://repo.maven.apache.org/maven2</url>
-      <releases>
-        <enabled>true</enabled>
-      </releases>
-      <snapshots>
-        <enabled>false</enabled>
-      </snapshots>
-    </repository>
-    <repository>
-      <id>cloudera-repo-releases</id>
-      <url>https://repository.cloudera.com/artifactory/public/</url>
-      <releases>
-        <enabled>true</enabled>
-      </releases>
-      <snapshots>
-        <enabled>false</enabled>
-      </snapshots>
-    </repository>
-    <repository>
-      <id>confluent</id>
-      <url>https://packages.confluent.io/maven/</url>
-    </repository>
-  </repositories>
 
   <profiles>
     <profile>
@@ -1985,7 +1959,7 @@
     <profile>
       <id>spark3.1</id>
       <properties>
-        <spark3.version>3.1.3</spark3.version>
+        <spark3.version>3.1.2</spark3.version>
         <spark.version>${spark3.version}</spark.version>
         <sparkbundle.version>3.1</sparkbundle.version>
         <scala.version>${scala12.version}</scala.version>
@@ -2137,6 +2111,8 @@
         <flink.clients.artifactId>flink-clients_${scala.binary.version}</flink.clients.artifactId>
         <flink.connector.kafka.artifactId>flink-connector-kafka_${scala.binary.version}</flink.connector.kafka.artifactId>
         <flink.hadoop.compatibility.artifactId>flink-hadoop-compatibility_${scala.binary.version}</flink.hadoop.compatibility.artifactId>
+        <hudi.flink.module>hudi-flink1.13.x</hudi.flink.module>
+        <flink.bundle.version>1.13</flink.bundle.version>
         <skipITs>true</skipITs>
       </properties>
       <activation>
@@ -2157,6 +2133,54 @@
         </property>
       </activation>
     </profile>
+
+    <profile>
+      <id>community</id>
+      <properties>
+        <spark.groupId>org.apache.spark</spark.groupId>
+      </properties>
+      <repositories>
+        <repository>
+          <id>Maven Central</id>
+          <name>Maven Repository</name>
+          <url>https://repo.maven.apache.org/maven2</url>
+          <releases>
+            <enabled>true</enabled>
+          </releases>
+          <snapshots>
+            <enabled>false</enabled>
+          </snapshots>
+        </repository>
+        <repository>
+          <id>cloudera-repo-releases</id>
+          <url>https://repository.cloudera.com/artifactory/public/</url>
+          <releases>
+            <enabled>true</enabled>
+          </releases>
+          <snapshots>
+            <enabled>false</enabled>
+          </snapshots>
+        </repository>
+        <repository>
+          <id>confluent</id>
+          <url>https://packages.confluent.io/maven/</url>
+        </repository>
+      </repositories>
+    </profile>
   </profiles>
 
+  <distributionManagement>
+    <repository>
+      <id>tianqiong-releases</id>
+      <name>Tianqiong Release Repository</name>
+      <url>https://mirrors.tencent.com/repository/maven/tianqiong-releases</url>
+    </repository>
+
+    <snapshotRepository>
+      <id>tianqiong-snapshots</id>
+      <name>Tianqiong Snapshots Repository</name>
+      <url>https://mirrors.tencent.com/repository/maven/tianqiong-snapshots</url>
+    </snapshotRepository>
+  </distributionManagement>
+
 </project>


[hudi] 29/45: Fix tauth issue (merge request !102)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f7fe437faf8f0d7ec358076973aec49a0d9e29ff
Author: superche <su...@tencent.com>
AuthorDate: Wed Nov 23 16:43:27 2022 +0800

    Fix tauth issue (merge request !102)
    
    Squash merge branch 'fix_tauth_issue' into 'release-0.12.1'
    <img width="" src="/uploads/96CCBC0A860C477FBA33C4AAE4965D3B/图片" alt="图片" />
    
    Cause: on the `presto` worker nodes the user information in `UserGroupInformation` changes to the default `root`, which causes TAuth authentication to fail.
    
    Fix: always call `UserGroupInformation.setConfiguration(hadoopConf.get());` before obtaining the `fileSystem`.
---
 hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java        | 2 ++
 .../java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java    | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 1350108a11..15a729a812 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.common.fs;
 
+import org.apache.hadoop.security.UserGroupInformation;
 import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.SerializableConfiguration;
 import org.apache.hudi.common.engine.HoodieEngineContext;
@@ -107,6 +108,7 @@ public class FSUtils {
     FileSystem fs;
     prepareHadoopConf(conf);
     try {
+      UserGroupInformation.setConfiguration(conf);
       fs = path.getFileSystem(conf);
     } catch (IOException e) {
       throw new HoodieIOException("Failed to get instance of " + FileSystem.class.getName(), e);
diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
index bcfd891711..db1eeaed7e 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.metadata;
 
+import org.apache.hadoop.security.UserGroupInformation;
 import org.apache.hudi.avro.model.HoodieMetadataColumnStats;
 import org.apache.hudi.common.bloom.BloomFilter;
 import org.apache.hudi.common.config.SerializableConfiguration;
@@ -96,6 +97,7 @@ public class FileSystemBackedTableMetadata implements HoodieTableMetadata {
         // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions.
         // and second entry holds optionally a directory path to be processed further.
         List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
+          UserGroupInformation.setConfiguration(hadoopConf.get());
           FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get());
           if (fileStatus.isDirectory()) {
             if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) {
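
The fix is essentially an ordering rule: re-apply the Hadoop configuration to UserGroupInformation before resolving a FileSystem, so the authentication context is not the worker's default login user. A minimal, hypothetical sketch of that ordering (the class name and path are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureFileSystemExample {

  public static FileSystem open(Path path, Configuration conf) throws IOException {
    // Re-apply the configuration first; otherwise the static UGI state may still
    // reflect the default login user of the worker process.
    UserGroupInformation.setConfiguration(conf);
    return path.getFileSystem(conf);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = open(new Path("/tmp"), conf);
    System.out.println("filesystem scheme: " + fs.getScheme());
  }
}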


[hudi] 36/45: [HUDI-5223] Partial failover for flink (#7208)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit bddf061a794df35936eb532ba7d747e7aa3fbb47
Author: Danny Chan <yu...@gmail.com>
AuthorDate: Wed Nov 16 14:47:38 2022 +0800

    [HUDI-5223] Partial failover for flink (#7208)
    
    Before this patch, when a partial failover happened within the write tasks, the write task's current instant was initialized to the latest inflight instant; the task then waited for a new instant to write with, so it hung and failed over continuously.
    
    For a task recovered from a failover (attempt number greater than 0), the latest inflight instant can actually be reused, and the intermediate data files can be cleaned up with MARKER files post commit.
    
    (cherry picked from commit d3f957755abf76c64ff06fac6d857cba9bdbbacf)
---
 .../src/main/java/org/apache/hudi/io/FlinkMergeHandle.java   |  8 +-------
 .../apache/hudi/sink/common/AbstractStreamWriteFunction.java | 12 ++++++++++--
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java
index 69121a9a04..a44783f99e 100644
--- a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java
@@ -143,13 +143,7 @@ public class FlinkMergeHandle<T extends HoodieRecordPayload, I, K, O>
           break;
         }
 
-        // Override the old file name,
-        // In rare cases, when a checkpoint was aborted and the instant time
-        // is reused, the merge handle generates a new file name
-        // with the reused instant time of last checkpoint, which is duplicate,
-        // use the same name file as new base file in case data loss.
-        oldFilePath = newFilePath;
-        rolloverPaths.add(oldFilePath);
+        rolloverPaths.add(newFilePath);
         newFileName = newFileNameWithRollover(rollNumber++);
         newFilePath = makeNewFilePath(partitionPath, newFileName);
         LOG.warn("Duplicate write for MERGE bucket with path: " + oldFilePath + ", rolls over to new path: " + newFilePath);
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
index b4569894a2..7642e9f28f 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
@@ -188,10 +188,9 @@ public abstract class AbstractStreamWriteFunction<I>
   // -------------------------------------------------------------------------
 
   private void restoreWriteMetadata() throws Exception {
-    String lastInflight = lastPendingInstant();
     boolean eventSent = false;
     for (WriteMetadataEvent event : this.writeMetadataState.get()) {
-      if (Objects.equals(lastInflight, event.getInstantTime())) {
+      if (Objects.equals(this.currentInstant, event.getInstantTime())) {
         // Reset taskID for event
         event.setTaskID(taskID);
         // The checkpoint succeed but the meta does not commit,
@@ -207,6 +206,15 @@ public abstract class AbstractStreamWriteFunction<I>
   }
 
   private void sendBootstrapEvent() {
+    int attemptId = getRuntimeContext().getAttemptNumber();
+    if (attemptId > 0) {
+      // either a partial or global failover, reuses the current inflight instant
+      if (this.currentInstant != null) {
+        LOG.info("Recover task[{}] for instant [{}] with attemptId [{}]", taskID, this.currentInstant, attemptId);
+        this.currentInstant = null;
+      }
+      return;
+    }
     this.eventGateway.sendEventToCoordinator(WriteMetadataEvent.emptyBootstrap(taskID));
     LOG.info("Send bootstrap write metadata event to coordinator, task[{}].", taskID);
   }
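
A minimal sketch of the recovery decision described in the commit message above, assuming only that Flink exposes the task's attempt number through RuntimeContext#getAttemptNumber (which the patch relies on); BootstrapDecision is an illustrative helper, not Hudi code:

// Illustrative only: a fresh task (attempt 0) asks the coordinator for an instant by sending
// an empty bootstrap event; a restarted task (attempt > 0) keeps the inflight instant restored
// from its checkpointed state and skips the bootstrap round trip.
object BootstrapDecision {
  def shouldSendBootstrapEvent(attemptNumber: Int): Boolean = attemptNumber == 0

  def main(args: Array[String]): Unit = {
    println(shouldSendBootstrapEvent(0)) // true  -> fresh start, bootstrap from the coordinator
    println(shouldSendBootstrapEvent(2)) // false -> failover, reuse the restored instant
  }
}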


[hudi] 23/45: [HUDI-5178] Add Call show_table_properties for spark sql (#7161)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 7d6654c1d0ac00d01f322d4c9cd80fdfc5a03828
Author: ForwardXu <fo...@gmail.com>
AuthorDate: Wed Nov 9 10:41:03 2022 +0800

    [HUDI-5178] Add Call show_table_properties for spark sql (#7161)
    
    (cherry picked from commit 1d1181a4410154ff0615f374cfee97630b425e88)
---
 .../hudi/command/procedures/BaseProcedure.scala    |  4 +-
 .../hudi/command/procedures/HoodieProcedures.scala |  1 +
 .../procedures/ShowTablePropertiesProcedure.scala  | 71 ++++++++++++++++++++++
 .../TestShowTablePropertiesProcedure.scala         | 45 ++++++++++++++
 4 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BaseProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BaseProcedure.scala
index d0404664f4..67930cb3ed 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BaseProcedure.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/BaseProcedure.scala
@@ -22,7 +22,7 @@ import org.apache.hudi.client.SparkRDDWriteClient
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.common.model.HoodieRecordPayload
 import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
-import org.apache.hudi.exception.HoodieClusteringException
+import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.index.HoodieIndex.IndexType
 import org.apache.spark.api.java.JavaSparkContext
 import org.apache.spark.sql.SparkSession
@@ -111,7 +111,7 @@ abstract class BaseProcedure extends Procedure {
       t => HoodieCLIUtils.getHoodieCatalogTable(sparkSession, t.asInstanceOf[String]).tableLocation)
       .getOrElse(
         tablePath.map(p => p.asInstanceOf[String]).getOrElse(
-          throw new HoodieClusteringException("Table name or table path must be given one"))
+          throw new HoodieException("Table name or table path must be given one"))
       )
   }
 
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index fabfda9367..d6131353c5 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -83,6 +83,7 @@ object HoodieProcedures {
       ,(BackupInvalidParquetProcedure.NAME, BackupInvalidParquetProcedure.builder)
       ,(CopyToTempView.NAME, CopyToTempView.builder)
       ,(ShowCommitExtraMetadataProcedure.NAME, ShowCommitExtraMetadataProcedure.builder)
+      ,(ShowTablePropertiesProcedure.NAME, ShowTablePropertiesProcedure.builder)
     )
   }
 }
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowTablePropertiesProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowTablePropertiesProcedure.scala
new file mode 100644
index 0000000000..d75df07fc9
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowTablePropertiesProcedure.scala
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hudi.HoodieCLIUtils
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.Supplier
+import scala.collection.JavaConversions._
+
+class ShowTablePropertiesProcedure() extends BaseProcedure with ProcedureBuilder {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.optional(0, "table", DataTypes.StringType, None),
+    ProcedureParameter.optional(1, "path", DataTypes.StringType, None),
+    ProcedureParameter.optional(2, "limit", DataTypes.IntegerType, 10)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("key", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("value", DataTypes.StringType, nullable = true, Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val tableName = getArgValueOrDefault(args, PARAMETERS(0))
+    val tablePath = getArgValueOrDefault(args, PARAMETERS(1))
+    val limit = getArgValueOrDefault(args, PARAMETERS(2)).get.asInstanceOf[Int]
+
+    val basePath: String = getBasePath(tableName, tablePath)
+    val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+    val tableProps = metaClient.getTableConfig.getProps
+
+    val rows = new util.ArrayList[Row]
+    tableProps.foreach(p => rows.add(Row(p._1, p._2)))
+    rows.stream().limit(limit).toArray().map(r => r.asInstanceOf[Row]).toList
+  }
+
+  override def build: Procedure = new ShowTablePropertiesProcedure()
+
+}
+
+object ShowTablePropertiesProcedure {
+  val NAME = "show_table_properties"
+
+  def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] {
+    override def get() = new ShowTablePropertiesProcedure()
+  }
+}
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestShowTablePropertiesProcedure.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestShowTablePropertiesProcedure.scala
new file mode 100644
index 0000000000..0488920458
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestShowTablePropertiesProcedure.scala
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.procedure
+
+class TestShowTablePropertiesProcedure extends HoodieSparkProcedureTestBase {
+  test("Test Call show_table_properties Procedure") {
+    withTempDir { tmp =>
+      val tableName = generateTableName
+      val tablePath = tmp.getCanonicalPath + "/" + tableName
+      // create table
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  name string,
+           |  price double,
+           |  ts long
+           |) using hudi
+           | location '$tablePath'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  preCombineField = 'ts'
+           | )
+       """.stripMargin)
+
+      val result = spark.sql(s"""call show_table_properties(path => '$tablePath')""").collect()
+      assertResult(true) {result.length > 0}
+    }
+  }
+}
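
A minimal usage sketch for the new procedure, assuming a Hudi table named hudi_tbl already exists and the Hudi Spark bundle is on the classpath; the table name, master and Spark configs below are placeholders, not taken from the patch:

import org.apache.spark.sql.SparkSession

object ShowTablePropertiesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("show-table-properties-example")
      .master("local[1]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    // either table => '<name>' or path => '<base path>' must be given; limit defaults to 10
    spark.sql("call show_table_properties(table => 'hudi_tbl', limit => 20)").show(false)
    spark.stop()
  }
}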


[hudi] 22/45: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table (#6741)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 738e2cce8f437ed3a2f7fc474e4143bb42fbad22
Author: xiarixiaoyao <me...@qq.com>
AuthorDate: Thu Nov 3 23:11:39 2022 +0800

    [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table (#6741)
    
    * [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
    
    * Update HiveAvroSerializer.java; otherwise a string-typed payload combine field will cause a cast exception
    
    (cherry picked from commit cd314b8cfa58c32f731f7da2aa6377a09df4c6f9)
---
 .../realtime/AbstractRealtimeRecordReader.java     |  72 +++-
 .../realtime/HoodieHFileRealtimeInputFormat.java   |   2 +-
 .../realtime/HoodieParquetRealtimeInputFormat.java |  14 +-
 .../realtime/RealtimeCompactedRecordReader.java    |  25 +-
 .../hudi/hadoop/utils/HiveAvroSerializer.java      | 409 +++++++++++++++++++++
 .../utils/HoodieRealtimeInputFormatUtils.java      |  19 +-
 .../utils/HoodieRealtimeRecordReaderUtils.java     |   5 +
 .../hudi/hadoop/utils/TestHiveAvroSerializer.java  | 148 ++++++++
 8 files changed, 678 insertions(+), 16 deletions(-)

diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
index dfdda9dfc8..83b69812e1 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
@@ -18,26 +18,34 @@
 
 package org.apache.hudi.hadoop.realtime;
 
-import org.apache.hudi.common.model.HoodieAvroPayload;
-import org.apache.hudi.common.model.HoodiePayloadProps;
-import org.apache.hudi.common.table.HoodieTableMetaClient;
-import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.common.table.TableSchemaResolver;
-import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
-
 import org.apache.avro.Schema;
 import org.apache.avro.Schema.Field;
 import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;
+import org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector;
+import org.apache.hadoop.hive.serde.serdeConstants;
 import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
 import org.apache.hadoop.mapred.JobConf;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodiePayloadProps;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.hadoop.utils.HiveAvroSerializer;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
+import java.util.Locale;
 import java.util.Map;
 import java.util.Properties;
+import java.util.Set;
 import java.util.stream.Collectors;
 
 /**
@@ -55,6 +63,10 @@ public abstract class AbstractRealtimeRecordReader {
   private Schema writerSchema;
   private Schema hiveSchema;
   private HoodieTableMetaClient metaClient;
+  // support merge operation
+  protected boolean supportPayload = true;
+  // handle hive type to avro record
+  protected HiveAvroSerializer serializer;
 
   public AbstractRealtimeRecordReader(RealtimeSplit split, JobConf job) {
     this.split = split;
@@ -62,6 +74,7 @@ public abstract class AbstractRealtimeRecordReader {
     LOG.info("cfg ==> " + job.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR));
     LOG.info("columnIds ==> " + job.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
     LOG.info("partitioningColumns ==> " + job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, ""));
+    this.supportPayload = Boolean.parseBoolean(job.get("hoodie.support.payload", "true"));
     try {
       metaClient = HoodieTableMetaClient.builder().setConf(jobConf).setBasePath(split.getBasePath()).build();
       if (metaClient.getTableConfig().getPreCombineField() != null) {
@@ -73,6 +86,7 @@ public abstract class AbstractRealtimeRecordReader {
     } catch (Exception e) {
       throw new HoodieException("Could not create HoodieRealtimeRecordReader on path " + this.split.getPath(), e);
     }
+    prepareHiveAvroSerializer();
   }
 
   private boolean usesCustomPayload(HoodieTableMetaClient metaClient) {
@@ -80,6 +94,34 @@ public abstract class AbstractRealtimeRecordReader {
         || metaClient.getTableConfig().getPayloadClass().contains("org.apache.hudi.OverwriteWithLatestAvroPayload"));
   }
 
+  private void prepareHiveAvroSerializer() {
+    try {
+      // hive will append virtual columns at the end of the column list; we should remove those columns.
+      // eg: the current table is col1, col2, col3; jobConf.get(serdeConstants.LIST_COLUMNS) returns col1, col2, col3, BLOCK__OFFSET__INSIDE__FILE ...
+      Set<String> writerSchemaColNames = writerSchema.getFields().stream().map(f -> f.name().toLowerCase(Locale.ROOT)).collect(Collectors.toSet());
+      List<String> columnNameList = Arrays.stream(jobConf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+      List<TypeInfo> columnTypeList = TypeInfoUtils.getTypeInfosFromTypeString(jobConf.get(serdeConstants.LIST_COLUMN_TYPES));
+      int columnNameListLen = columnNameList.size() - 1;
+      for (int i = columnNameListLen; i >= 0; i--) {
+        String lastColName = columnNameList.get(columnNameList.size() - 1);
+        // virtual columns are only appended at the end of the column list, so it is ok to break the loop.
+        if (writerSchemaColNames.contains(lastColName)) {
+          break;
+        }
+        LOG.debug(String.format("remove virtual column: %s", lastColName));
+        columnNameList.remove(columnNameList.size() - 1);
+        columnTypeList.remove(columnTypeList.size() - 1);
+      }
+      StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+      this.serializer = new HiveAvroSerializer(new ArrayWritableObjectInspector(rowTypeInfo), columnNameList, columnTypeList);
+    } catch (Exception e) {
+      // fall back to the original logic
+      LOG.warn("failed to init HiveAvroSerializer to support payload merge", e);
+      this.supportPayload = false;
+    }
+
+  }
+
   /**
    * Gets schema from HoodieTableMetaClient. If not, falls
    * back to the schema from the latest parquet file. Finally, sets the partition column and projection fields into the
@@ -135,6 +177,10 @@ public abstract class AbstractRealtimeRecordReader {
     return hiveSchema;
   }
 
+  protected Schema getLogScannerReaderSchema() {
+    return usesCustomPayload ? writerSchema : readerSchema;
+  }
+
   public Schema getReaderSchema() {
     return readerSchema;
   }
@@ -154,4 +200,16 @@ public abstract class AbstractRealtimeRecordReader {
   public JobConf getJobConf() {
     return jobConf;
   }
+
+  public void setReaderSchema(Schema readerSchema) {
+    this.readerSchema = readerSchema;
+  }
+
+  public void setWriterSchema(Schema writerSchema) {
+    this.writerSchema = writerSchema;
+  }
+
+  public void setHiveSchema(Schema hiveSchema) {
+    this.hiveSchema = hiveSchema;
+  }
 }
diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieHFileRealtimeInputFormat.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieHFileRealtimeInputFormat.java
index 799d90bce5..cdc062475f 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieHFileRealtimeInputFormat.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieHFileRealtimeInputFormat.java
@@ -72,7 +72,7 @@ public class HoodieHFileRealtimeInputFormat extends HoodieMergeOnReadTableInputF
           // For e:g _hoodie_record_key would be missing and merge step would throw exceptions.
           // TO fix this, hoodie columns are appended late at the time record-reader gets built instead of construction
           // time.
-          HoodieRealtimeInputFormatUtils.addRequiredProjectionFields(jobConf, Option.empty());
+          HoodieRealtimeInputFormatUtils.addRequiredProjectionFields(jobConf, Option.empty(), Option.empty());
 
           this.conf = jobConf;
           this.conf.set(HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP, "true");
diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
index e8c806ed2c..78768104d9 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
@@ -27,6 +27,10 @@ import org.apache.hadoop.mapred.RecordReader;
 import org.apache.hadoop.mapred.Reporter;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.hadoop.HoodieParquetInputFormat;
 import org.apache.hudi.hadoop.UseFileSplitsFromInputFormat;
 import org.apache.hudi.hadoop.UseRecordReaderFromInputFormat;
@@ -61,7 +65,10 @@ public class HoodieParquetRealtimeInputFormat extends HoodieParquetInputFormat {
     ValidationUtils.checkArgument(split instanceof RealtimeSplit,
         "HoodieRealtimeRecordReader can only work on RealtimeSplit and not with " + split);
     RealtimeSplit realtimeSplit = (RealtimeSplit) split;
-    addProjectionToJobConf(realtimeSplit, jobConf);
+    // add preCombineKey
+    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(jobConf).setBasePath(realtimeSplit.getBasePath()).build();
+    HoodieTableConfig tableConfig = metaClient.getTableConfig();
+    addProjectionToJobConf(realtimeSplit, jobConf, metaClient.getTableConfig().getPreCombineField());
     LOG.info("Creating record reader with readCols :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR)
         + ", Ids :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
 
@@ -74,7 +81,7 @@ public class HoodieParquetRealtimeInputFormat extends HoodieParquetInputFormat {
         super.getRecordReader(split, jobConf, reporter));
   }
 
-  void addProjectionToJobConf(final RealtimeSplit realtimeSplit, final JobConf jobConf) {
+  void addProjectionToJobConf(final RealtimeSplit realtimeSplit, final JobConf jobConf, String preCombineKey) {
     // Hive on Spark invokes multiple getRecordReaders from different threads in the same spark task (and hence the
     // same JVM) unlike Hive on MR. Due to this, accesses to JobConf, which is shared across all threads, is at the
     // risk of experiencing race conditions. Hence, we synchronize on the JobConf object here. There is negligible
@@ -94,7 +101,8 @@ public class HoodieParquetRealtimeInputFormat extends HoodieParquetInputFormat {
           // TO fix this, hoodie columns are appended late at the time record-reader gets built instead of construction
           // time.
           if (!realtimeSplit.getDeltaLogPaths().isEmpty()) {
-            HoodieRealtimeInputFormatUtils.addRequiredProjectionFields(jobConf, realtimeSplit.getVirtualKeyInfo());
+            HoodieRealtimeInputFormatUtils.addRequiredProjectionFields(jobConf, realtimeSplit.getVirtualKeyInfo(),
+                StringUtils.isNullOrEmpty(preCombineKey) ? Option.empty() : Option.of(preCombineKey));
           }
           jobConf.set(HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP, "true");
           setConf(jobConf);
diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
index b917f004bc..0143672fa0 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
@@ -33,6 +33,7 @@ import org.apache.hudi.common.model.HoodieRecordPayload;
 import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.hadoop.config.HoodieRealtimeConfig;
+import org.apache.hudi.hadoop.utils.HiveAvroSerializer;
 import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils;
 import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
 import org.apache.log4j.LogManager;
@@ -81,7 +82,7 @@ class RealtimeCompactedRecordReader extends AbstractRealtimeRecordReader
         .withFileSystem(FSUtils.getFs(split.getPath().toString(), jobConf))
         .withBasePath(split.getBasePath())
         .withLogFilePaths(split.getDeltaLogPaths())
-        .withReaderSchema(usesCustomPayload ? getWriterSchema() : getReaderSchema())
+        .withReaderSchema(getLogScannerReaderSchema())
         .withLatestInstantTime(split.getMaxCommitTime())
         .withMaxMemorySizeInBytes(HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes(jobConf))
         .withReadBlocksLazily(Boolean.parseBoolean(jobConf.get(HoodieRealtimeConfig.COMPACTION_LAZY_BLOCK_READ_ENABLED_PROP, HoodieRealtimeConfig.DEFAULT_COMPACTION_LAZY_BLOCK_READ_ENABLED)))
@@ -112,9 +113,7 @@ class RealtimeCompactedRecordReader extends AbstractRealtimeRecordReader
         if (deltaRecordMap.containsKey(key)) {
           // mark the key as handled
           this.deltaRecordKeys.remove(key);
-          // TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the
-          // deltaRecord may not be a full record and needs values of columns from the parquet
-          Option<GenericRecord> rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
+          Option<GenericRecord> rec = supportPayload ? mergeRecord(deltaRecordMap.get(key), arrayWritable) : buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
           // If the record is not present, this is a delete record using an empty payload so skip this base record
           // and move to the next record
           if (!rec.isPresent()) {
@@ -173,6 +172,24 @@ class RealtimeCompactedRecordReader extends AbstractRealtimeRecordReader
     }
   }
 
+  private Option<GenericRecord> mergeRecord(HoodieRecord<? extends HoodieRecordPayload> newRecord, ArrayWritable writableFromParquet) throws IOException {
+    GenericRecord oldRecord = convertArrayWritableToHoodieRecord(writableFromParquet);
+    // presto will not append partition columns to jobConf.get(serdeConstants.LIST_COLUMNS), but hive will. This leads to the following results:
+    // eg: the current table is col1: int, col2: int, par: string, and column par is the partition column.
+    // for the hive engine, the hiveSchema will be col1,col2,par and the writerSchema will be col1,col2,par
+    // for the presto engine, the hiveSchema will be col1,col2 but the writerSchema will be col1,col2,par
+    // so to be compatible with both hive and presto, we should rewrite oldRecord before we call combineAndGetUpdateValue;
+    // once presto on hudi has its own mor reader, we can remove the rewrite logic.
+    Option<GenericRecord> combinedValue = newRecord.getData().combineAndGetUpdateValue(HiveAvroSerializer.rewriteRecordIgnoreResultCheck(oldRecord,
+        getLogScannerReaderSchema()), getLogScannerReaderSchema(), payloadProps);
+    return combinedValue;
+  }
+
+  private GenericRecord convertArrayWritableToHoodieRecord(ArrayWritable arrayWritable) {
+    GenericRecord record = serializer.serialize(arrayWritable, getHiveSchema());
+    return record;
+  }
+
   @Override
   public NullWritable createKey() {
     return parquetReader.createKey();
diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
new file mode 100644
index 0000000000..16942ba8b3
--- /dev/null
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
@@ -0,0 +1,409 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.utils;
+
+import org.apache.avro.JsonProperties;
+import org.apache.avro.LogicalTypes;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericEnumSymbol;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.specific.SpecificRecordBase;
+import org.apache.avro.util.Utf8;
+import org.apache.hadoop.hive.common.type.HiveChar;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.common.type.HiveVarchar;
+import org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils;
+import org.apache.hadoop.hive.serde2.avro.InstanceCache;
+import org.apache.hadoop.hive.serde2.io.DateWritable;
+import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.StructField;
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.UnionObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.DateObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.TimestampObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableDateObjectInspector;
+import org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.MapTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.UnionTypeInfo;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.math.BigDecimal;
+import java.sql.Timestamp;
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.avro.AvroSchemaUtils.resolveUnionSchema;
+import static org.apache.hudi.avro.HoodieAvroUtils.isMetadataField;
+
+/**
+ * Helper class to serialize hive writable type to avro record.
+ */
+public class HiveAvroSerializer {
+
+  private final List<String> columnNames;
+  private final List<TypeInfo> columnTypes;
+  private final ObjectInspector objectInspector;
+
+  private static final Logger LOG = LogManager.getLogger(HiveAvroSerializer.class);
+
+  public HiveAvroSerializer(ObjectInspector objectInspector, List<String> columnNames, List<TypeInfo> columnTypes) {
+    this.columnNames = columnNames;
+    this.columnTypes = columnTypes;
+    this.objectInspector = objectInspector;
+  }
+
+  private static final Schema STRING_SCHEMA = Schema.create(Schema.Type.STRING);
+
+  public GenericRecord serialize(Object o, Schema schema) {
+
+    StructObjectInspector soi = (StructObjectInspector) objectInspector;
+    GenericData.Record record = new GenericData.Record(schema);
+
+    List<? extends StructField> outputFieldRefs = soi.getAllStructFieldRefs();
+    if (outputFieldRefs.size() != columnNames.size()) {
+      throw new HoodieException("Number of input columns was different than output columns (in = " + columnNames.size() + " vs out = " + outputFieldRefs.size());
+    }
+
+    int size = schema.getFields().size();
+
+    List<? extends StructField> allStructFieldRefs = soi.getAllStructFieldRefs();
+    List<Object> structFieldsDataAsList = soi.getStructFieldsDataAsList(o);
+
+    for (int i  = 0; i < size; i++) {
+      Schema.Field field = schema.getFields().get(i);
+      if (i >= columnTypes.size()) {
+        break;
+      }
+      try {
+        setUpRecordFieldFromWritable(columnTypes.get(i), structFieldsDataAsList.get(i),
+            allStructFieldRefs.get(i).getFieldObjectInspector(), record, field);
+      } catch (Exception e) {
+        LOG.error(String.format("current columnNames: %s", columnNames.stream().collect(Collectors.joining(","))));
+        LOG.error(String.format("current type: %s", columnTypes.stream().map(f -> f.getTypeName()).collect(Collectors.joining(","))));
+        LOG.error(String.format("current value: %s", HoodieRealtimeRecordReaderUtils.arrayWritableToString((ArrayWritable) o)));
+        throw e;
+      }
+    }
+    return record;
+  }
+
+  private void setUpRecordFieldFromWritable(TypeInfo typeInfo, Object structFieldData, ObjectInspector fieldOI, GenericData.Record record, Schema.Field field) {
+    Object val = serialize(typeInfo, fieldOI, structFieldData, field.schema());
+    if (val == null) {
+      if (field.defaultVal() instanceof JsonProperties.Null) {
+        record.put(field.name(), null);
+      } else {
+        record.put(field.name(), field.defaultVal());
+      }
+    } else {
+      record.put(field.name(), val);
+    }
+  }
+
+  /**
+   * Determine if an Avro schema is of type Union[T, NULL].  Avro supports nullable
+   * types via a union of type T and null.  This is a very common use case.
+   * As such, we want to silently convert it to just T and allow the value to be null.
+   *
+   * When a Hive union type is used with AVRO, the schema type becomes
+   * Union[NULL, T1, T2, ...]. The NULL in the union should be silently removed
+   *
+   * @return true if type represents Union[T, Null], false otherwise
+   */
+  public static boolean isNullableType(Schema schema) {
+    if (!schema.getType().equals(Schema.Type.UNION)) {
+      return false;
+    }
+
+    List<Schema> itemSchemas = schema.getTypes();
+    if (itemSchemas.size() < 2) {
+      return false;
+    }
+
+    for (Schema itemSchema : itemSchemas) {
+      if (Schema.Type.NULL.equals(itemSchema.getType())) {
+        return true;
+      }
+    }
+
+    // [null, null] not allowed, so this check is ok.
+    return false;
+  }
+
+  /**
+   * If the union schema is a nullable union, get the schema for the non-nullable type.
+   * This method does no checking that the provided Schema is nullable. If the provided
+   * union schema is non-nullable, it simply returns the union schema
+   */
+  public static Schema getOtherTypeFromNullableType(Schema unionSchema) {
+    final List<Schema> types = unionSchema.getTypes();
+    if (types.size() == 2) { // most common scenario
+      if (types.get(0).getType() == Schema.Type.NULL) {
+        return types.get(1);
+      }
+      if (types.get(1).getType() == Schema.Type.NULL) {
+        return types.get(0);
+      }
+      // not a nullable union
+      return unionSchema;
+    }
+
+    final List<Schema> itemSchemas = new ArrayList<>();
+    for (Schema itemSchema : types) {
+      if (!Schema.Type.NULL.equals(itemSchema.getType())) {
+        itemSchemas.add(itemSchema);
+      }
+    }
+
+    if (itemSchemas.size() > 1) {
+      return Schema.createUnion(itemSchemas);
+    } else {
+      return itemSchemas.get(0);
+    }
+  }
+
+  private Object serialize(TypeInfo typeInfo, ObjectInspector fieldOI, Object structFieldData, Schema schema) throws HoodieException {
+    if (null == structFieldData) {
+      return null;
+    }
+
+    if (isNullableType(schema)) {
+      schema = getOtherTypeFromNullableType(schema);
+    }
+    /* Because we use Hive's 'string' type when Avro calls for enum, we have to expressly check for enum-ness */
+    if (Schema.Type.ENUM.equals(schema.getType())) {
+      assert fieldOI instanceof PrimitiveObjectInspector;
+      return serializeEnum((PrimitiveObjectInspector) fieldOI, structFieldData, schema);
+    }
+    switch (typeInfo.getCategory()) {
+      case PRIMITIVE:
+        assert fieldOI instanceof PrimitiveObjectInspector;
+        return serializePrimitive((PrimitiveObjectInspector) fieldOI, structFieldData, schema);
+      case MAP:
+        assert fieldOI instanceof MapObjectInspector;
+        assert typeInfo instanceof MapTypeInfo;
+        return serializeMap((MapTypeInfo) typeInfo, (MapObjectInspector) fieldOI, structFieldData, schema);
+      case LIST:
+        assert fieldOI instanceof ListObjectInspector;
+        assert typeInfo instanceof ListTypeInfo;
+        return serializeList((ListTypeInfo) typeInfo, (ListObjectInspector) fieldOI, structFieldData, schema);
+      case UNION:
+        assert fieldOI instanceof UnionObjectInspector;
+        assert typeInfo instanceof UnionTypeInfo;
+        return serializeUnion((UnionTypeInfo) typeInfo, (UnionObjectInspector) fieldOI, structFieldData, schema);
+      case STRUCT:
+        assert fieldOI instanceof StructObjectInspector;
+        assert typeInfo instanceof StructTypeInfo;
+        return serializeStruct((StructTypeInfo) typeInfo, (StructObjectInspector) fieldOI, structFieldData, schema);
+      default:
+        throw new HoodieException("Ran out of TypeInfo Categories: " + typeInfo.getCategory());
+    }
+  }
+
+  /** private cache to avoid lots of EnumSymbol creation while serializing.
+   *  Two levels because the enum symbol is specific to a schema.
+   *  Object because we want to avoid the overhead of repeated toString calls while maintaining compatibility.
+   *  Provided there are few enum types per record, and few symbols per enum, memory use should be moderate.
+   *  eg 20 types with 50 symbols each as length-10 Strings should be on the order of 100KB per AvroSerializer.
+   */
+  final InstanceCache<Schema, InstanceCache<Object, GenericEnumSymbol>> enums = new InstanceCache<Schema, InstanceCache<Object, GenericEnumSymbol>>() {
+    @Override
+    protected InstanceCache<Object, GenericEnumSymbol> makeInstance(final Schema schema,
+                                                                    Set<Schema> seenSchemas) {
+      return new InstanceCache<Object, GenericEnumSymbol>() {
+        @Override
+        protected GenericEnumSymbol makeInstance(Object seed, Set<Object> seenSchemas) {
+          return new GenericData.EnumSymbol(schema, seed.toString());
+        }
+      };
+    }
+  };
+
+  private Object serializeEnum(PrimitiveObjectInspector fieldOI, Object structFieldData, Schema schema) throws HoodieException {
+    try {
+      return enums.retrieve(schema).retrieve(serializePrimitive(fieldOI, structFieldData, schema));
+    } catch (Exception e) {
+      throw new HoodieException(e);
+    }
+  }
+
+  private Object serializeStruct(StructTypeInfo typeInfo, StructObjectInspector ssoi, Object o, Schema schema) {
+    int size = schema.getFields().size();
+    List<? extends StructField> allStructFieldRefs = ssoi.getAllStructFieldRefs();
+    List<Object> structFieldsDataAsList = ssoi.getStructFieldsDataAsList(o);
+    GenericData.Record record = new GenericData.Record(schema);
+    ArrayList<TypeInfo> allStructFieldTypeInfos = typeInfo.getAllStructFieldTypeInfos();
+
+    for (int i  = 0; i < size; i++) {
+      Schema.Field field = schema.getFields().get(i);
+      setUpRecordFieldFromWritable(allStructFieldTypeInfos.get(i), structFieldsDataAsList.get(i),
+          allStructFieldRefs.get(i).getFieldObjectInspector(), record, field);
+    }
+    return record;
+  }
+
+  private Object serializePrimitive(PrimitiveObjectInspector fieldOI, Object structFieldData, Schema schema) throws HoodieException {
+    switch (fieldOI.getPrimitiveCategory()) {
+      case BINARY:
+        if (schema.getType() == Schema.Type.BYTES) {
+          return AvroSerdeUtils.getBufferFromBytes((byte[])fieldOI.getPrimitiveJavaObject(structFieldData));
+        } else if (schema.getType() == Schema.Type.FIXED) {
+          GenericData.Fixed fixed = new GenericData.Fixed(schema, (byte[])fieldOI.getPrimitiveJavaObject(structFieldData));
+          return fixed;
+        } else {
+          throw new HoodieException("Unexpected Avro schema for Binary TypeInfo: " + schema.getType());
+        }
+      case DECIMAL:
+        HiveDecimal dec = (HiveDecimal)fieldOI.getPrimitiveJavaObject(structFieldData);
+        LogicalTypes.Decimal decimal = (LogicalTypes.Decimal)schema.getLogicalType();
+        BigDecimal bd = new BigDecimal(dec.toString()).setScale(decimal.getScale());
+        return HoodieAvroUtils.DECIMAL_CONVERSION.toFixed(bd, schema, decimal);
+      case CHAR:
+        HiveChar ch = (HiveChar)fieldOI.getPrimitiveJavaObject(structFieldData);
+        return new Utf8(ch.getStrippedValue());
+      case VARCHAR:
+        HiveVarchar vc = (HiveVarchar)fieldOI.getPrimitiveJavaObject(structFieldData);
+        return new Utf8(vc.getValue());
+      case STRING:
+        String string = (String)fieldOI.getPrimitiveJavaObject(structFieldData);
+        return new Utf8(string);
+      case DATE:
+        return DateWritable.dateToDays(((DateObjectInspector)fieldOI).getPrimitiveJavaObject(structFieldData));
+      case TIMESTAMP:
+        Timestamp timestamp =
+            ((TimestampObjectInspector) fieldOI).getPrimitiveJavaObject(structFieldData);
+        return timestamp.getTime();
+      case INT:
+        if (schema.getLogicalType() != null && schema.getLogicalType().getName().equals("date")) {
+          return DateWritable.dateToDays(new WritableDateObjectInspector().getPrimitiveJavaObject(structFieldData));
+        }
+        return fieldOI.getPrimitiveJavaObject(structFieldData);
+      case UNKNOWN:
+        throw new HoodieException("Received UNKNOWN primitive category.");
+      case VOID:
+        return null;
+      default: // All other primitive types are simple
+        return fieldOI.getPrimitiveJavaObject(structFieldData);
+    }
+  }
+
+  private Object serializeUnion(UnionTypeInfo typeInfo, UnionObjectInspector fieldOI, Object structFieldData, Schema schema) throws HoodieException {
+    byte tag = fieldOI.getTag(structFieldData);
+
+    // Invariant that Avro's tag ordering must match Hive's.
+    return serialize(typeInfo.getAllUnionObjectTypeInfos().get(tag),
+        fieldOI.getObjectInspectors().get(tag),
+        fieldOI.getField(structFieldData),
+        schema.getTypes().get(tag));
+  }
+
+  private Object serializeList(ListTypeInfo typeInfo, ListObjectInspector fieldOI, Object structFieldData, Schema schema) throws HoodieException {
+    List<?> list = fieldOI.getList(structFieldData);
+    List<Object> deserialized = new GenericData.Array<Object>(list.size(), schema);
+
+    TypeInfo listElementTypeInfo = typeInfo.getListElementTypeInfo();
+    ObjectInspector listElementObjectInspector = fieldOI.getListElementObjectInspector();
+    Schema elementType = schema.getElementType().getField("element") == null ? schema.getElementType() : schema.getElementType().getField("element").schema();
+
+    for (int i = 0; i < list.size(); i++) {
+      Object childFieldData = list.get(i);
+      if (childFieldData instanceof ArrayWritable && ((ArrayWritable) childFieldData).get().length != ((StructTypeInfo) listElementTypeInfo).getAllStructFieldNames().size()) {
+        deserialized.add(i, serialize(listElementTypeInfo, listElementObjectInspector, ((ArrayWritable) childFieldData).get()[0], elementType));
+      } else {
+        deserialized.add(i, serialize(listElementTypeInfo, listElementObjectInspector, childFieldData, elementType));
+      }
+    }
+    return deserialized;
+  }
+
+  private Object serializeMap(MapTypeInfo typeInfo, MapObjectInspector fieldOI, Object structFieldData, Schema schema) throws HoodieException {
+    // Avro only allows maps with string keys
+    if (!mapHasStringKey(fieldOI.getMapKeyObjectInspector())) {
+      throw new HoodieException("Avro only supports maps with keys as Strings.  Current Map is: " + typeInfo.toString());
+    }
+
+    ObjectInspector mapKeyObjectInspector = fieldOI.getMapKeyObjectInspector();
+    ObjectInspector mapValueObjectInspector = fieldOI.getMapValueObjectInspector();
+    TypeInfo mapKeyTypeInfo = typeInfo.getMapKeyTypeInfo();
+    TypeInfo mapValueTypeInfo = typeInfo.getMapValueTypeInfo();
+    Map<?,?> map = fieldOI.getMap(structFieldData);
+    Schema valueType = schema.getValueType();
+
+    Map<Object, Object> deserialized = new LinkedHashMap<Object, Object>(fieldOI.getMapSize(structFieldData));
+
+    for (Map.Entry<?, ?> entry : map.entrySet()) {
+      deserialized.put(serialize(mapKeyTypeInfo, mapKeyObjectInspector, entry.getKey(), STRING_SCHEMA),
+          serialize(mapValueTypeInfo, mapValueObjectInspector, entry.getValue(), valueType));
+    }
+
+    return deserialized;
+  }
+
+  private boolean mapHasStringKey(ObjectInspector mapKeyObjectInspector) {
+    return mapKeyObjectInspector instanceof PrimitiveObjectInspector
+        && ((PrimitiveObjectInspector) mapKeyObjectInspector).getPrimitiveCategory().equals(PrimitiveObjectInspector.PrimitiveCategory.STRING);
+  }
+
+  public static GenericRecord rewriteRecordIgnoreResultCheck(GenericRecord oldRecord, Schema newSchema) {
+    GenericRecord newRecord = new GenericData.Record(newSchema);
+    boolean isSpecificRecord = oldRecord instanceof SpecificRecordBase;
+    for (Schema.Field f : newSchema.getFields()) {
+      if (!(isSpecificRecord && isMetadataField(f.name()))) {
+        copyOldValueOrSetDefault(oldRecord, newRecord, f);
+      }
+    }
+    return newRecord;
+  }
+
+  private static void copyOldValueOrSetDefault(GenericRecord oldRecord, GenericRecord newRecord, Schema.Field field) {
+    Schema oldSchema = oldRecord.getSchema();
+    Object fieldValue = oldSchema.getField(field.name()) == null ? null : oldRecord.get(field.name());
+
+    if (fieldValue != null) {
+      // In case field's value is a nested record, we have to rewrite it as well
+      Object newFieldValue;
+      if (fieldValue instanceof GenericRecord) {
+        GenericRecord record = (GenericRecord) fieldValue;
+        newFieldValue = rewriteRecordIgnoreResultCheck(record, resolveUnionSchema(field.schema(), record.getSchema().getFullName()));
+      } else {
+        newFieldValue = fieldValue;
+      }
+      newRecord.put(field.name(), newFieldValue);
+    } else if (field.defaultVal() instanceof JsonProperties.Null) {
+      newRecord.put(field.name(), null);
+    } else {
+      newRecord.put(field.name(), field.defaultVal());
+    }
+  }
+}
+
diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
index 4b351d1205..44e6dd7f93 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
@@ -19,6 +19,7 @@
 package org.apache.hudi.hadoop.utils;
 
 import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.serde.serdeConstants;
 import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
 import org.apache.hadoop.mapred.FileSplit;
 import org.apache.hadoop.mapred.JobConf;
@@ -31,6 +32,10 @@ import org.apache.hudi.hadoop.realtime.RealtimeSplit;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
 import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
 
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
@@ -78,7 +83,7 @@ public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
     return conf;
   }
 
-  public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo) {
+  public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo, Option<String> preCombineKeyOpt) {
     // Need this to do merge records in HoodieRealtimeRecordReader
     if (!hoodieVirtualKeyInfo.isPresent()) {
       addProjectionField(configuration, HoodieRecord.RECORD_KEY_METADATA_FIELD, HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS);
@@ -91,6 +96,18 @@ public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
         addProjectionField(configuration, hoodieVirtualKey.getPartitionPathField().get(), hoodieVirtualKey.getPartitionPathFieldIndex().get());
       }
     }
+
+    if (preCombineKeyOpt.isPresent()) {
+      // infer col pos
+      String preCombineKey = preCombineKeyOpt.get();
+      List<String> columnNameList = Arrays.stream(configuration.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+      int pos = columnNameList.indexOf(preCombineKey);
+      if (pos != -1) {
+        addProjectionField(configuration, preCombineKey, pos);
+        LOG.info(String.format("add preCombineKey: %s to project columns with position %s", preCombineKey, pos));
+      }
+    }
+
   }
 
   public static boolean requiredProjectionFieldsExistInConf(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo) {
diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java
index bf4cbff666..e3466a6401 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java
@@ -30,6 +30,7 @@ import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hive.serde2.io.DateWritable;
 import org.apache.hadoop.hive.serde2.io.DoubleWritable;
 import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
+import org.apache.hadoop.hive.serde2.io.TimestampWritable;
 import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;
 import org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils;
 import org.apache.hadoop.io.ArrayWritable;
@@ -54,6 +55,7 @@ import java.io.IOException;
 import java.nio.ByteBuffer;
 import java.util.ArrayList;
 import java.util.Arrays;
+import java.sql.Timestamp;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Map;
@@ -176,6 +178,9 @@ public class HoodieRealtimeRecordReaderUtils {
         }
         return new IntWritable((Integer) value);
       case LONG:
+        if (schema.getLogicalType() != null && "timestamp-micros".equals(schema.getLogicalType().getName())) {
+          return new TimestampWritable(new Timestamp((Long) value));
+        }
         return new LongWritable((Long) value);
       case FLOAT:
         return new FloatWritable((Float) value);
diff --git a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/utils/TestHiveAvroSerializer.java b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/utils/TestHiveAvroSerializer.java
new file mode 100644
index 0000000000..9de4630877
--- /dev/null
+++ b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/utils/TestHiveAvroSerializer.java
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.utils;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+
+import org.apache.avro.LogicalTypes;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.Writable;
+import org.junit.jupiter.api.Test;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestHiveAvroSerializer {
+
+  private static final String SIMPLE_SCHEMA = "{\"type\":\"record\",\"name\":\"h0_record\",\"namespace\":\"hoodie.h0\",\"fields\""
+      + ":[{\"name\":\"id\",\"type\":[\"null\",\"int\"],\"default\":null},"
+      + "{\"name\":\"col1\",\"type\":[\"null\",\"long\"],\"default\":null},"
+      + "{\"name\":\"col2\",\"type\":[\"null\",\"float\"],\"default\":null},"
+      + "{\"name\":\"col3\",\"type\":[\"null\",\"double\"],\"default\":null},"
+      + "{\"name\":\"col4\",\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"fixed\",\"namespace\":\"hoodie.h0.h0_record.col4\","
+      + "\"size\":5,\"logicalType\":\"decimal\",\"precision\":10,\"scale\":4}],\"default\":null},"
+      + "{\"name\":\"col5\",\"type\":[\"null\",\"string\"],\"default\":null},"
+      + "{\"name\":\"col6\",\"type\":[\"null\",{\"type\":\"int\",\"logicalType\":\"date\"}],\"default\":null},"
+      + "{\"name\":\"col7\",\"type\":[\"null\",{\"type\":\"long\",\"logicalType\":\"timestamp-micros\"}],\"default\":null},"
+      + "{\"name\":\"col8\",\"type\":[\"null\",\"boolean\"],\"default\":null},"
+      + "{\"name\":\"col9\",\"type\":[\"null\",\"bytes\"],\"default\":null},"
+      + "{\"name\":\"par\",\"type\":[\"null\",{\"type\":\"int\",\"logicalType\":\"date\"}],\"default\":null}]}";
+  private static final String NESTED_CHEMA = "{\"name\":\"MyClass\",\"type\":\"record\",\"namespace\":\"com.acme.avro\",\"fields\":["
+      + "{\"name\":\"firstname\",\"type\":\"string\"},"
+      + "{\"name\":\"lastname\",\"type\":\"string\"},"
+      + "{\"name\":\"student\",\"type\":{\"name\":\"student\",\"type\":\"record\",\"fields\":["
+      + "{\"name\":\"firstname\",\"type\":[\"null\" ,\"string\"],\"default\": null},{\"name\":\"lastname\",\"type\":[\"null\" ,\"string\"],\"default\": null}]}}]}";
+
+  @Test
+  public void testSerialize() {
+    Schema avroSchema = new Schema.Parser().parse(SIMPLE_SCHEMA);
+    // create a test record with avroSchema
+    GenericData.Record avroRecord = new GenericData.Record(avroSchema);
+    avroRecord.put("id", 1);
+    avroRecord.put("col1", 1000L);
+    avroRecord.put("col2", -5.001f);
+    avroRecord.put("col3", 12.999d);
+    Schema currentDecimalType = avroSchema.getField("col4").schema().getTypes().get(1);
+    BigDecimal bd = new BigDecimal("123.456").setScale(((LogicalTypes.Decimal) currentDecimalType.getLogicalType()).getScale());
+    avroRecord.put("col4", HoodieAvroUtils.DECIMAL_CONVERSION.toFixed(bd, currentDecimalType, currentDecimalType.getLogicalType()));
+    avroRecord.put("col5", "2011-01-01");
+    avroRecord.put("col6", 18987);
+    avroRecord.put("col7", 1640491505000000L);
+    avroRecord.put("col8", false);
+    ByteBuffer bb = ByteBuffer.wrap(new byte[]{97, 48, 53});
+    avroRecord.put("col9", bb);
+    assertTrue(GenericData.get().validate(avroSchema, avroRecord));
+    ArrayWritable writable = (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(avroRecord, avroSchema);
+
+    List<Writable> writableList = Arrays.stream(writable.get()).collect(Collectors.toList());
+    writableList.remove(writableList.size() - 1);
+    ArrayWritable clipWritable = new ArrayWritable(writable.getValueClass(), writableList.toArray(new Writable[0]));
+
+    List<TypeInfo> columnTypeList = createHiveTypeInfoFrom("int,bigint,float,double,decimal(10,4),string,date,timestamp,boolean,binary,date");
+    List<String> columnNameList = createHiveColumnsFrom("id,col1,col2,col3,col4,col5,col6,col7,col8,col9,par");
+    StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+    GenericRecord testRecord = new HiveAvroSerializer(new ArrayWritableObjectInspector(rowTypeInfo), columnNameList, columnTypeList).serialize(writable, avroSchema);
+    assertTrue(GenericData.get().validate(avroSchema, testRecord));
+    // verify serialization when the trailing partition column is clipped from the writable
+    List<TypeInfo> columnTypeListClip = createHiveTypeInfoFrom("int,bigint,float,double,decimal(10,4),string,date,timestamp,boolean,binary");
+    List<String> columnNameListClip = createHiveColumnsFrom("id,col1,col2,col3,col4,col5,col6,col7,col8,col9");
+    StructTypeInfo rowTypeInfoClip = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameListClip, columnTypeListClip);
+    GenericRecord testRecordClip = new HiveAvroSerializer(new ArrayWritableObjectInspector(rowTypeInfoClip), columnNameListClip, columnTypeListClip).serialize(clipWritable, avroSchema);
+    assertTrue(GenericData.get().validate(avroSchema, testRecordClip));
+
+  }
+
+  @Test
+  public void testNestedValueSerialize() {
+    Schema nestedSchema = new Schema.Parser().parse(NESTED_SCHEMA);
+    GenericRecord avroRecord = new GenericData.Record(nestedSchema);
+    avroRecord.put("firstname", "person1");
+    avroRecord.put("lastname", "person2");
+    GenericRecord studentRecord = new GenericData.Record(avroRecord.getSchema().getField("student").schema());
+    studentRecord.put("firstname", "person1");
+    studentRecord.put("lastname", "person2");
+    avroRecord.put("student", studentRecord);
+
+    assertTrue(GenericData.get().validate(nestedSchema, avroRecord));
+    ArrayWritable writable = (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(avroRecord, nestedSchema);
+
+    List<TypeInfo> columnTypeList = createHiveTypeInfoFrom("string,string,struct<firstname:string,lastname:string>");
+    List<String> columnNameList = createHiveColumnsFrom("firstname,lastname,student");
+    StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+    GenericRecord testRecord = new HiveAvroSerializer(new ArrayWritableObjectInspector(rowTypeInfo), columnNameList, columnTypeList).serialize(writable, nestedSchema);
+    assertTrue(GenericData.get().validate(nestedSchema, testRecord));
+  }
+
+  private List<String> createHiveColumnsFrom(final String columnNamesStr) {
+    List<String> columnNames;
+    if (columnNamesStr.length() == 0) {
+      columnNames = new ArrayList<>();
+    } else {
+      columnNames = Arrays.asList(columnNamesStr.split(","));
+    }
+
+    return columnNames;
+  }
+
+  private List<TypeInfo> createHiveTypeInfoFrom(final String columnsTypeStr) {
+    List<TypeInfo> columnTypes;
+
+    if (columnsTypeStr.length() == 0) {
+      columnTypes = new ArrayList<>();
+    } else {
+      columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnsTypeStr);
+    }
+
+    return columnTypes;
+  }
+}


[hudi] 39/45: [HUDI-5350] Fix oom cause compaction event lost problem (#7408)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e0faa0bbe1021204281e5dccc51fb99610326987
Author: Bingeng Huang <30...@qq.com>
AuthorDate: Fri Dec 9 15:24:29 2022 +0800

    [HUDI-5350] Fix oom cause compaction event lost problem (#7408)
    
    Co-authored-by: hbg <bi...@shopee.com>
    (cherry picked from commit 115584c46e30998e0369b0e5550cc60eac8295ab)
---
 .../main/java/org/apache/hudi/sink/utils/NonThrownExecutor.java   | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/NonThrownExecutor.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/NonThrownExecutor.java
index 4364d1d16d..4ed1716545 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/NonThrownExecutor.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/NonThrownExecutor.java
@@ -136,15 +136,15 @@ public class NonThrownExecutor implements AutoCloseable {
   }
 
   private void handleException(Throwable t, ExceptionHook hook, Supplier<String> actionString) {
-    // if we have a JVM critical error, promote it immediately, there is a good
-    // chance the
-    // logging or job failing will not succeed any more
-    ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
     final String errMsg = String.format("Executor executes action [%s] error", actionString.get());
     logger.error(errMsg, t);
     if (hook != null) {
       hook.apply(errMsg, t);
     }
+    // if we have a JVM critical error, promote it immediately, there is a good
+    // chance the
+    // logging or job failing will not succeed any more
+    ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
   }
 
   private Supplier<String> getActionString(String actionName, Object... actionParams) {
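
A minimal, self-contained sketch of the ordering the patch above establishes: log the failure and invoke the exception hook first, and only then promote JVM-critical errors such as OOM, so the hook still runs when the throwable is fatal. Class and method names below are illustrative assumptions, not the Hudi API.

import java.util.function.Supplier;

class ExceptionOrderingSketch {

  interface ExceptionHook {
    void apply(String errMsg, Throwable t);
  }

  // Log and notify the hook first, then rethrow JVM-critical errors,
  // mirroring the reordered handleException in the diff above.
  static void handle(Throwable t, ExceptionHook hook, Supplier<String> actionString) {
    String errMsg = "Executor executes action [" + actionString.get() + "] error";
    System.err.println(errMsg);
    if (hook != null) {
      hook.apply(errMsg, t);
    }
    if (t instanceof OutOfMemoryError) {
      throw (OutOfMemoryError) t;
    }
  }
}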


[hudi] 33/45: Merge branch 'release-0.12.1' of https://git.woa.com/data-lake-technology/hudi into release-0.12.1

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b1f204fa5510bd02c758bb6403107232243150a9
Merge: 619b7504ca c070e0963a
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Wed Dec 7 13:16:44 2022 +0800

    Merge branch 'release-0.12.1' of https://git.woa.com/data-lake-technology/hudi into release-0.12.1

 .../src/main/java/org/apache/hudi/common/fs/FSUtils.java   |  2 ++
 .../hudi/metadata/FileSystemBackedTableMetadata.java       |  2 ++
 .../apache/hudi/sink/StreamWriteOperatorCoordinator.java   |  5 +++++
 .../src/main/java/org/apache/hudi/util/HoodiePipeline.java | 14 ++++++++++++++
 4 files changed, 23 insertions(+)


[hudi] 16/45: [HUDI-2624] Implement Non Index type for HUDI

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 8ba01dc70a718ddab3044b45f96a545f0aae4084
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Fri Oct 28 12:33:22 2022 +0800

    [HUDI-2624] Implement Non Index type for HUDI
---
 .../org/apache/hudi/config/HoodieIndexConfig.java  |  43 ++++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  12 +
 .../java/org/apache/hudi/index/HoodieIndex.java    |   2 +-
 .../java/org/apache/hudi/io/HoodieMergeHandle.java |   3 +-
 .../apache/hudi/keygen/EmptyAvroKeyGenerator.java  |  65 +++++
 .../hudi/table/action/commit/BucketInfo.java       |   4 +
 .../hudi/table/action/commit/BucketType.java       |   2 +-
 .../apache/hudi/index/FlinkHoodieIndexFactory.java |   2 +
 .../org/apache/hudi/index/FlinkHoodieNonIndex.java |  65 +++++
 .../apache/hudi/index/SparkHoodieIndexFactory.java |   3 +
 .../hudi/index/nonindex/SparkHoodieNonIndex.java   |  73 ++++++
 .../hudi/io/storage/row/HoodieRowCreateHandle.java |   5 +-
 .../org/apache/hudi/keygen/EmptyKeyGenerator.java  |  80 +++++++
 .../commit/BaseSparkCommitActionExecutor.java      |  17 ++
 .../table/action/commit/UpsertPartitioner.java     |  35 ++-
 .../org/apache/hudi/common/model/FileSlice.java    |  13 +
 .../org/apache/hudi/common/model/HoodieKey.java    |   2 +
 .../table/log/HoodieMergedLogRecordScanner.java    |   3 +-
 .../apache/hudi/configuration/OptionsResolver.java |   4 +
 .../org/apache/hudi/sink/StreamWriteFunction.java  |  32 ++-
 .../hudi/sink/StreamWriteOperatorCoordinator.java  |   5 +
 .../sink/nonindex/NonIndexStreamWriteFunction.java | 265 +++++++++++++++++++++
 .../sink/nonindex/NonIndexStreamWriteOperator.java |  25 +-
 .../java/org/apache/hudi/sink/utils/Pipelines.java |   7 +
 .../org/apache/hudi/table/HoodieTableFactory.java  |  12 +
 .../org/apache/hudi/sink/TestWriteMergeOnRead.java |  54 +++++
 .../hudi/sink/utils/InsertFunctionWrapper.java     |   6 +
 .../sink/utils/StreamWriteFunctionWrapper.java     |  23 +-
 .../hudi/sink/utils/TestFunctionWrapper.java       |   6 +
 .../org/apache/hudi/sink/utils/TestWriteBase.java  |  48 ++++
 .../test/java/org/apache/hudi/utils/TestData.java  |  34 +++
 .../test/scala/org/apache/hudi/TestNonIndex.scala  | 110 +++++++++
 32 files changed, 1030 insertions(+), 30 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
index ee5b83a43a..b5edaf4abc 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
@@ -270,6 +270,34 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Index key. It is used to index the record and find its file group. "
           + "If not set, use record key field as default");
 
+  /**
+   *   public static final String NON_INDEX_PARTITION_FILE_GROUP_CACHE_INTERVAL_MINUTE = "hoodie.non.index.partition.file.group.cache.interval.minute";  //minutes
+   *   public static final String DEFAULT_NON_INDEX_PARTITION_FILE_GROUP_CACHE_INTERVAL_MINUTE = "1800";
+   *
+   *   public static final String NON_INDEX_PARTITION_FILE_GROUP_STORAGE_TYPE = "hoodie.non.index.partition.file.group.storage.type";
+   *   public static final String DEFAULT_NON_INDEX_PARTITION_FILE_GROUP_CACHE_STORAGE_TYPE = "IN_MEMORY";
+   *
+   *   public static final String NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE = "hoodie.non.index.partition.file.group.cache.size";  //byte
+   *   public static final String DEFAULT_NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE = String.valueOf(1048576000);
+   */
+  public static final ConfigProperty<Integer> NON_INDEX_PARTITION_FILE_GROUP_CACHE_INTERVAL_MINUTE = ConfigProperty
+      .key("hoodie.non.index.partition.file.group.cache.interval.minute")
+      .defaultValue(1800)
+      .withDocumentation("Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, "
+          + "and each partition is divided to N buckets.");
+
+  public static final ConfigProperty<String> NON_INDEX_PARTITION_FILE_GROUP_STORAGE_TYPE = ConfigProperty
+      .key("hoodie.non.index.partition.file.group.storage.type")
+      .defaultValue("IN_MEMORY")
+      .withDocumentation("Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, "
+          + "and each partition is divided to N buckets.");
+
+  public static final ConfigProperty<Long> NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE = ConfigProperty
+      .key("hoodie.non.index.partition.file.group.cache.size")
+      .defaultValue(1024 * 1024 * 1024L)
+      .withDocumentation("Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, "
+          + "and each partition is divided to N buckets.");
+
   /**
    * Deprecated configs. These are now part of {@link HoodieHBaseIndexConfig}.
    */
@@ -606,6 +634,21 @@ public class HoodieIndexConfig extends HoodieConfig {
       return this;
     }
 
+    public Builder withPartitionFileGroupCacheInterval(int cacheInterval) {
+      hoodieIndexConfig.setValue(NON_INDEX_PARTITION_FILE_GROUP_CACHE_INTERVAL_MINUTE, String.valueOf(cacheInterval));
+      return this;
+    }
+
+    public Builder withPartitionFileGroupStorageType(String storageType) {
+      hoodieIndexConfig.setValue(NON_INDEX_PARTITION_FILE_GROUP_STORAGE_TYPE, storageType);
+      return this;
+    }
+
+    public Builder withPartitionFileGroupCacheSize(long size) {
+      hoodieIndexConfig.setValue(NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE, String.valueOf(size));
+      return this;
+    }
+
     public HoodieIndexConfig build() {
       hoodieIndexConfig.setDefaultValue(INDEX_TYPE, getDefaultIndexType(engineType));
       hoodieIndexConfig.setDefaults(HoodieIndexConfig.class.getName());
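
For orientation, a hedged sketch of how the three new options above could be assembled through the builder. HoodieIndexConfig.newBuilder() and withIndexType() are existing Hudi APIs, the NON_INDEX enum value is introduced elsewhere in this commit, and the literal values are placeholders only.

import org.apache.hudi.config.HoodieIndexConfig;
import org.apache.hudi.index.HoodieIndex;

class NonIndexConfigSketch {

  static HoodieIndexConfig nonIndexConfig() {
    return HoodieIndexConfig.newBuilder()
        .withIndexType(HoodieIndex.IndexType.NON_INDEX)
        // minutes a cached partition-to-file-group assignment is reused
        .withPartitionFileGroupCacheInterval(1800)
        // IN_MEMORY or FILE_SYSTEM
        .withPartitionFileGroupStorageType("IN_MEMORY")
        // bytes a file group may accumulate before rolling to a new one
        .withPartitionFileGroupCacheSize(1024 * 1024 * 1024L)
        .build();
  }
}
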
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 9610ad382b..034519e64a 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -1673,6 +1673,18 @@ public class HoodieWriteConfig extends HoodieConfig {
     return getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD);
   }
 
+  public int getNonIndexPartitionFileGroupCacheIntervalMinute() {
+    return getIntOrDefault(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_CACHE_INTERVAL_MINUTE);
+  }
+
+  public String getNonIndexPartitionFileGroupStorageType() {
+    return getString(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_STORAGE_TYPE);
+  }
+
+  public long getNonIndexPartitionFileGroupCacheSize() {
+    return getLong(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE);
+  }
+
   /**
    * storage properties.
    */
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
index 7ebd94748a..074a90fe5d 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
@@ -141,7 +141,7 @@ public abstract class HoodieIndex<I, O> implements Serializable {
   }
 
   public enum IndexType {
-    HBASE, INMEMORY, BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, BUCKET, FLINK_STATE
+    HBASE, INMEMORY, BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, BUCKET, FLINK_STATE, NON_INDEX
   }
 
   public enum BucketIndexEngineType {
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index 82c6de5761..e629c6a51e 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -333,8 +333,9 @@ public class HoodieMergeHandle<T extends HoodieRecordPayload, I, K, O> extends H
    */
   public void write(GenericRecord oldRecord) {
     String key = KeyGenUtils.getRecordKeyFromGenericRecord(oldRecord, keyGeneratorOpt);
+    boolean isEmptyRecordKey = key.equals(HoodieKey.EMPTY_RECORD_KEY);
     boolean copyOldRecord = true;
-    if (keyToNewRecords.containsKey(key)) {
+    if (!isEmptyRecordKey && keyToNewRecords.containsKey(key)) {
       // If we have duplicate records that we are updating, then the hoodie record will be deflated after
       // writing the first record. So make a copy of the record to be merged
       HoodieRecord<T> hoodieRecord = keyToNewRecords.get(key).newInstance();
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java
new file mode 100644
index 0000000000..01536f95e4
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.avro.generic.GenericRecord;
+
+import org.apache.hudi.common.model.HoodieKey;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Arrays;
+import java.util.stream.Collectors;
+
+/**
+ * Avro key generator for empty record key Hudi tables.
+ */
+public class EmptyAvroKeyGenerator extends BaseKeyGenerator {
+
+  private static final Logger LOG = LogManager.getLogger(EmptyAvroKeyGenerator.class);
+  public static final String EMPTY_RECORD_KEY = HoodieKey.EMPTY_RECORD_KEY;
+  private static final List<String> EMPTY_RECORD_KEY_FIELD_LIST = Collections.emptyList();
+
+  public EmptyAvroKeyGenerator(TypedProperties props) {
+    super(props);
+    if (config.containsKey(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key())) {
+      LOG.warn(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key() + " will be ignored while using "
+          + this.getClass().getSimpleName());
+    }
+    this.recordKeyFields = EMPTY_RECORD_KEY_FIELD_LIST;
+    this.partitionPathFields = Arrays.stream(props.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key())
+        .split(",")).map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList());
+  }
+
+  @Override
+  public String getRecordKey(GenericRecord record) {
+    return EMPTY_RECORD_KEY;
+  }
+
+  @Override
+  public String getPartitionPath(GenericRecord record) {
+    return KeyGenUtils.getRecordPartitionPath(record, getPartitionPathFields(), hiveStylePartitioning, encodePartitionPath, isConsistentLogicalTimestampEnabled());
+  }
+}
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java
index 6547da6425..6db5b270b0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java
@@ -48,6 +48,10 @@ public class BucketInfo implements Serializable {
     return partitionPath;
   }
 
+  public void setBucketType(BucketType bucketType) {
+    this.bucketType = bucketType;
+  }
+
   @Override
   public String toString() {
     final StringBuilder sb = new StringBuilder("BucketInfo {");
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java
index 70ee473d24..b32d32db40 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java
@@ -19,5 +19,5 @@
 package org.apache.hudi.table.action.commit;
 
 public enum BucketType {
-  UPDATE, INSERT
+  UPDATE, INSERT, APPEND
 }
diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieIndexFactory.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieIndexFactory.java
index b10014b918..ba2f5a39fd 100644
--- a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieIndexFactory.java
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieIndexFactory.java
@@ -60,6 +60,8 @@ public final class FlinkHoodieIndexFactory {
         return new HoodieGlobalSimpleIndex(config, Option.empty());
       case BUCKET:
         return new HoodieSimpleBucketIndex(config);
+      case NON_INDEX:
+        return new FlinkHoodieNonIndex(config);
       default:
         throw new HoodieIndexException("Unsupported index type " + config.getIndexType());
     }
diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieNonIndex.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieNonIndex.java
new file mode 100644
index 0000000000..a7eb2c2262
--- /dev/null
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/FlinkHoodieNonIndex.java
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index;
+
+import java.util.List;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.table.HoodieTable;
+
+public class FlinkHoodieNonIndex<T extends HoodieRecordPayload> extends FlinkHoodieIndex<T> {
+
+  public FlinkHoodieNonIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public boolean rollbackCommit(String instantTime) {
+    return true;
+  }
+
+  @Override
+  public boolean isGlobal() {
+    throw new UnsupportedOperationException("Unsupport operation.");
+  }
+
+  @Override
+  public boolean canIndexLogFiles() {
+    return false;
+  }
+
+  @Override
+  public boolean isImplicitWithStorage() {
+    return true;
+  }
+
+  @Override
+  public List<WriteStatus> updateLocation(List<WriteStatus> writeStatuses, HoodieEngineContext context, HoodieTable hoodieTable) throws HoodieIndexException {
+    return null;
+  }
+
+  @Override
+  public List<HoodieRecord<T>> tagLocation(List<HoodieRecord<T>> hoodieRecords, HoodieEngineContext context, HoodieTable hoodieTable) throws HoodieIndexException {
+    return null;
+  }
+}
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkHoodieIndexFactory.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkHoodieIndexFactory.java
index 4525490c8d..4f07a7f2aa 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkHoodieIndexFactory.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkHoodieIndexFactory.java
@@ -32,6 +32,7 @@ import org.apache.hudi.index.bucket.HoodieSimpleBucketIndex;
 import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
 import org.apache.hudi.index.hbase.SparkHoodieHBaseIndex;
 import org.apache.hudi.index.inmemory.HoodieInMemoryHashIndex;
+import org.apache.hudi.index.nonindex.SparkHoodieNonIndex;
 import org.apache.hudi.index.simple.HoodieGlobalSimpleIndex;
 import org.apache.hudi.index.simple.HoodieSimpleIndex;
 import org.apache.hudi.keygen.BaseKeyGenerator;
@@ -74,6 +75,8 @@ public final class SparkHoodieIndexFactory {
           default:
             throw new HoodieIndexException("Unknown bucket index engine type: " + config.getBucketIndexEngineType());
         }
+      case NON_INDEX:
+        return new SparkHoodieNonIndex<>(config);
       default:
         throw new HoodieIndexException("Index type unspecified, set " + config.getIndexType());
     }
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/nonindex/SparkHoodieNonIndex.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/nonindex/SparkHoodieNonIndex.java
new file mode 100644
index 0000000000..d239b7d6f7
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/nonindex/SparkHoodieNonIndex.java
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.nonindex;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.index.SparkHoodieIndex;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.spark.api.java.JavaRDD;
+
+public class SparkHoodieNonIndex<T extends HoodieRecordPayload<T>> extends SparkHoodieIndex<T> {
+
+  public SparkHoodieNonIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD,
+                                             HoodieEngineContext context,
+                                             HoodieTable hoodieTable)
+      throws HoodieIndexException {
+    return writeStatusRDD;
+  }
+
+  @Override
+  public boolean rollbackCommit(String instantTime) {
+    return true;
+  }
+
+  @Override
+  public boolean isGlobal() {
+    return false;
+  }
+
+  @Override
+  public boolean canIndexLogFiles() {
+    return false;
+  }
+
+  @Override
+  public boolean isImplicitWithStorage() {
+    return true;
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> records,
+                                              HoodieEngineContext context,
+                                              HoodieTable hoodieTable)
+      throws HoodieIndexException {
+    throw new UnsupportedOperationException("Unsupport operation.");
+  }
+}
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
index 9da04f7260..da63a0f2c5 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
@@ -169,7 +169,10 @@ public class HoodieRowCreateHandle implements Serializable {
       //          and [[String]])
       //          - Repeated computations (for ex, converting file-path to [[UTF8String]] over and
       //          over again)
-      UTF8String recordKey = row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD);
+      UTF8String recordKey = null;
+      if (!row.isNullAt(HoodieRecord.RECORD_KEY_META_FIELD_ORD)) {
+        recordKey = row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD);
+      }
       UTF8String partitionPath = row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD);
       // This is the only meta-field that is generated dynamically, hence conversion b/w
       // [[String]] and [[UTF8String]] is unavoidable if preserveHoodieMetadata is false
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java
new file mode 100644
index 0000000000..9e4090a537
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+
+import java.util.Arrays;
+import java.util.stream.Collectors;
+
+/**
+ * Key generator for Hudi tables without record key.
+ */
+public class EmptyKeyGenerator extends BuiltinKeyGenerator {
+
+  public static final String EMPTY_RECORD_KEY = EmptyAvroKeyGenerator.EMPTY_RECORD_KEY;
+
+  private final EmptyAvroKeyGenerator emptyAvroKeyGenerator;
+
+  public EmptyKeyGenerator(TypedProperties config) {
+    super(config);
+    this.emptyAvroKeyGenerator = new EmptyAvroKeyGenerator(config);
+    this.recordKeyFields = emptyAvroKeyGenerator.getRecordKeyFieldNames();
+    this.partitionPathFields = Arrays.stream(config.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key())
+        .split(",")).map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList());
+  }
+
+  @Override
+  public String getRecordKey(GenericRecord record) {
+    return EMPTY_RECORD_KEY;
+  }
+
+  @Override
+  public String getRecordKey(Row row) {
+    return EMPTY_RECORD_KEY;
+  }
+
+  @Override
+  public UTF8String getRecordKey(InternalRow internalRow, StructType schema) {
+    return combineCompositeRecordKeyUnsafe(EMPTY_RECORD_KEY);
+  }
+
+  @Override
+  public String getPartitionPath(GenericRecord record) {
+    return emptyAvroKeyGenerator.getPartitionPath(record);
+  }
+
+  @Override
+  public String getPartitionPath(Row row) {
+    tryInitRowAccessor(row.schema());
+    return combinePartitionPath(rowAccessor.getRecordPartitionPathValues(row));
+  }
+
+  @Override
+  public UTF8String getPartitionPath(InternalRow row, StructType schema) {
+    tryInitRowAccessor(schema);
+    return combinePartitionPathUnsafe(rowAccessor.getRecordPartitionPathValues(row));
+  }
+}
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
index f8e4b31ff6..bb9e00e74e 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
@@ -45,6 +45,7 @@ import org.apache.hudi.exception.HoodieUpsertException;
 import org.apache.hudi.execution.SparkLazyInsertIterable;
 import org.apache.hudi.io.CreateHandleFactory;
 import org.apache.hudi.io.HoodieConcatHandle;
+import org.apache.hudi.io.HoodieAppendHandle;
 import org.apache.hudi.io.HoodieMergeHandle;
 import org.apache.hudi.io.HoodieSortedMergeHandle;
 import org.apache.hudi.keygen.BaseKeyGenerator;
@@ -320,6 +321,8 @@ public abstract class BaseSparkCommitActionExecutor<T extends HoodieRecordPayloa
         return handleInsert(binfo.fileIdPrefix, recordItr);
       } else if (btype.equals(BucketType.UPDATE)) {
         return handleUpdate(binfo.partitionPath, binfo.fileIdPrefix, recordItr);
+      } else if (btype.equals(BucketType.APPEND)) {
+        return handleAppend(binfo.partitionPath, binfo.fileIdPrefix, recordItr);
       } else {
         throw new HoodieUpsertException("Unknown bucketType " + btype + " for partition :" + partition);
       }
@@ -335,6 +338,20 @@ public abstract class BaseSparkCommitActionExecutor<T extends HoodieRecordPayloa
     return handleUpsertPartition(instantTime, partition, recordItr, partitioner);
   }
 
+  @SuppressWarnings("unchecked")
+  protected Iterator<List<WriteStatus>> handleAppend(String partitionPath, String fileId,
+                                                    Iterator<HoodieRecord<T>> recordItr) throws IOException {
+    // This is needed since sometimes some buckets are never picked in getPartition() and end up with 0 records
+    if (!recordItr.hasNext()) {
+      LOG.info("Empty partition with fileId => " + fileId);
+      return Collections.singletonList((List<WriteStatus>) Collections.EMPTY_LIST).iterator();
+    }
+    HoodieAppendHandle<?, ?, ?, ?> appendHandle = new HoodieAppendHandle<>(config, instantTime,
+        table, partitionPath, fileId, recordItr, taskContextSupplier);
+    appendHandle.doAppend();
+    return Collections.singletonList(appendHandle.close()).iterator();
+  }
+
   @Override
   public Iterator<List<WriteStatus>> handleUpdate(String partitionPath, String fileId,
                                                   Iterator<HoodieRecord<T>> recordItr)
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
index c2f5a43066..66cef31ff4 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
@@ -33,6 +33,7 @@ import org.apache.hudi.common.util.NumericUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.nonindex.SparkHoodieNonIndex;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.table.WorkloadProfile;
 import org.apache.hudi.table.WorkloadStat;
@@ -84,6 +85,8 @@ public class UpsertPartitioner<T extends HoodieRecordPayload<T>> extends SparkHo
 
   protected final HoodieWriteConfig config;
 
+  private int recordCnt;
+
   public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext context, HoodieTable table,
       HoodieWriteConfig config) {
     super(profile, table);
@@ -106,7 +109,7 @@ public class UpsertPartitioner<T extends HoodieRecordPayload<T>> extends SparkHo
       WorkloadStat outputWorkloadStats = profile.getOutputPartitionPathStatMap().getOrDefault(partitionStat.getKey(), new WorkloadStat());
       for (Map.Entry<String, Pair<String, Long>> updateLocEntry :
           partitionStat.getValue().getUpdateLocationToCount().entrySet()) {
-        addUpdateBucket(partitionStat.getKey(), updateLocEntry.getKey());
+        addNewBucket(partitionStat.getKey(), updateLocEntry.getKey());
         if (profile.hasOutputWorkLoadStats()) {
           HoodieRecordLocation hoodieRecordLocation = new HoodieRecordLocation(updateLocEntry.getValue().getKey(), updateLocEntry.getKey());
           outputWorkloadStats.addUpdates(hoodieRecordLocation, updateLocEntry.getValue().getValue());
@@ -118,10 +121,16 @@ public class UpsertPartitioner<T extends HoodieRecordPayload<T>> extends SparkHo
     }
   }
 
-  private int addUpdateBucket(String partitionPath, String fileIdHint) {
+  private int addNewBucket(String partitionPath, String fileIdHint) {
+    BucketInfo bucketInfo;
+    if (table.getIndex() instanceof SparkHoodieNonIndex) {
+      bucketInfo = new BucketInfo(BucketType.APPEND, fileIdHint, partitionPath);
+    } else {
+      bucketInfo = new BucketInfo(BucketType.UPDATE, fileIdHint, partitionPath);
+    }
+
     int bucket = totalBuckets;
     updateLocationToBucket.put(fileIdHint, bucket);
-    BucketInfo bucketInfo = new BucketInfo(BucketType.UPDATE, fileIdHint, partitionPath);
     bucketInfoMap.put(totalBuckets, bucketInfo);
     totalBuckets++;
     return bucket;
@@ -196,7 +205,7 @@ public class UpsertPartitioner<T extends HoodieRecordPayload<T>> extends SparkHo
               bucket = updateLocationToBucket.get(smallFile.location.getFileId());
               LOG.info("Assigning " + recordsToAppend + " inserts to existing update bucket " + bucket);
             } else {
-              bucket = addUpdateBucket(partitionPath, smallFile.location.getFileId());
+              bucket = addNewBucket(partitionPath, smallFile.location.getFileId());
               LOG.info("Assigning " + recordsToAppend + " inserts to new update bucket " + bucket);
             }
             if (profile.hasOutputWorkLoadStats()) {
@@ -336,12 +345,18 @@ public class UpsertPartitioner<T extends HoodieRecordPayload<T>> extends SparkHo
     } else {
       String partitionPath = keyLocation._1().getPartitionPath();
       List<InsertBucketCumulativeWeightPair> targetBuckets = partitionPathToInsertBucketInfos.get(partitionPath);
-      // pick the target bucket to use based on the weights.
-      final long totalInserts = Math.max(1, profile.getWorkloadStat(partitionPath).getNumInserts());
-      final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", keyLocation._1().getRecordKey());
-      final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / totalInserts;
-
-      int index = Collections.binarySearch(targetBuckets, new InsertBucketCumulativeWeightPair(new InsertBucket(), r));
+      int index;
+      if (keyLocation._1.getRecordKey().equals(HoodieKey.EMPTY_RECORD_KEY)) {
+        // round robin
+        index = (++recordCnt) % targetBuckets.size();
+      } else {
+        // pick the target bucket to use based on the weights.
+        final long totalInserts = Math.max(1, profile.getWorkloadStat(partitionPath).getNumInserts());
+        final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", keyLocation._1().getRecordKey());
+        final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / totalInserts;
+
+        index = Collections.binarySearch(targetBuckets, new InsertBucketCumulativeWeightPair(new InsertBucket(), r));
+      }
 
       if (index >= 0) {
         return targetBuckets.get(index).getKey().bucketNumber;
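
The empty-key branch above exists because, with NON_INDEX, every record carries the same EMPTY_RECORD_KEY: hashing a constant key would funnel all inserts into one bucket, so a simple round-robin counter spreads them instead. A small illustrative sketch of the two paths (names and the plain hash are stand-ins, not the Hudi implementation):

class BucketPickSketch {

  private int recordCnt;

  // Round-robin placement for key-less (NON_INDEX) records, hash-based placement otherwise.
  int pickBucket(String recordKey, int numBuckets) {
    if ("_hoodie_empty_record_key_".equals(recordKey)) {
      return (++recordCnt) % numBuckets;
    }
    return Math.floorMod(recordKey.hashCode(), numBuckets);
  }
}
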
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/model/FileSlice.java b/hudi-common/src/main/java/org/apache/hudi/common/model/FileSlice.java
index 0fc580db0b..6cb1ed90a7 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/model/FileSlice.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/model/FileSlice.java
@@ -116,6 +116,19 @@ public class FileSlice implements Serializable {
     return (baseFile == null) && (logFiles.isEmpty());
   }
 
+  public long getFileGroupSize() {
+    long totalSize = 0;
+    if (baseFile != null) {
+      totalSize += baseFile.getFileSize();
+    }
+    if (logFiles != null) {
+      for (HoodieLogFile logFile : logFiles) {
+        totalSize += logFile.getFileSize();
+      }
+    }
+    return totalSize;
+  }
+
   @Override
   public String toString() {
     final StringBuilder sb = new StringBuilder("FileSlice {");
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieKey.java b/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieKey.java
index 9030204099..68ae58e5f8 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieKey.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieKey.java
@@ -29,6 +29,8 @@ import java.util.Objects;
  */
 public class HoodieKey implements Serializable {
 
+  public static final String EMPTY_RECORD_KEY = "_hoodie_empty_record_key_";
+
   private String recordKey;
   private String partitionPath;
 
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java
index e3d8554d00..5ef0a6821f 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java
@@ -77,6 +77,7 @@ public class HoodieMergedLogRecordScanner extends AbstractHoodieLogRecordReader
   private long maxMemorySizeInBytes;
   // Stores the total time taken to perform reading and merging of log blocks
   private long totalTimeTakenToReadAndMergeBlocks;
+  private int emptyKeySuffix;
 
   @SuppressWarnings("unchecked")
   protected HoodieMergedLogRecordScanner(FileSystem fs, String basePath, List<String> logFilePaths, Schema readerSchema,
@@ -158,7 +159,7 @@ public class HoodieMergedLogRecordScanner extends AbstractHoodieLogRecordReader
       }
     } else {
       // Put the record as is
-      records.put(key, hoodieRecord);
+      records.put(key.equals(HoodieKey.EMPTY_RECORD_KEY) ? key + (emptyKeySuffix++) : key, hoodieRecord);
     }
   }
 
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
index 00ebf09426..72f8f39011 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
@@ -119,6 +119,10 @@ public class OptionsResolver {
     return conf.getString(FlinkOptions.INDEX_TYPE).equals(HoodieIndex.IndexType.BUCKET.name());
   }
 
+  public static boolean isNonIndexType(Configuration conf) {
+    return conf.getString(FlinkOptions.INDEX_TYPE).equals(HoodieIndex.IndexType.NON_INDEX.name());
+  }
+
   /**
    * Returns whether the source should emit changelog.
    *
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
index 2748af5290..a0f994f04a 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
@@ -202,7 +202,7 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
    * <p>A {@link HoodieRecord} was firstly transformed into a {@link DataItem}
    * for buffering, it then transforms back to the {@link HoodieRecord} before flushing.
    */
-  private static class DataItem {
+  protected static class DataItem {
     private final String key; // record key
     private final String instant; // 'U' or 'I'
     private final HoodieRecordPayload<?> data; // record payload
@@ -283,7 +283,7 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
    * Tool to detect if to flush out the existing buffer.
    * Sampling the record to compute the size with 0.01 percentage.
    */
-  private static class BufferSizeDetector {
+  protected static class BufferSizeDetector {
     private final Random random = new Random(47);
     private static final int DENOMINATOR = 100;
 
@@ -292,11 +292,11 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
     private long lastRecordSize = -1L;
     private long totalSize = 0L;
 
-    BufferSizeDetector(double batchSizeMb) {
+    public BufferSizeDetector(double batchSizeMb) {
       this.batchSizeBytes = batchSizeMb * 1024 * 1024;
     }
 
-    boolean detect(Object record) {
+    public boolean detect(Object record) {
       if (lastRecordSize == -1 || sampling()) {
         lastRecordSize = ObjectSizeCalculator.getObjectSize(record);
       }
@@ -304,15 +304,29 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
       return totalSize > this.batchSizeBytes;
     }
 
-    boolean sampling() {
+    public boolean detect(long recordSize) {
+      lastRecordSize = recordSize;
+      totalSize += lastRecordSize;
+      return totalSize > this.batchSizeBytes;
+    }
+
+    public boolean sampling() {
       // 0.01 sampling percentage
       return random.nextInt(DENOMINATOR) == 1;
     }
 
-    void reset() {
+    public void reset() {
       this.lastRecordSize = -1L;
       this.totalSize = 0L;
     }
+
+    public void setTotalSize(long totalSize) {
+      this.totalSize = totalSize;
+    }
+
+    public long getLastRecordSize() {
+      return lastRecordSize;
+    }
   }
 
   /**
@@ -381,7 +395,7 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
 
     bucket.records.add(item);
 
-    boolean flushBucket = bucket.detector.detect(item);
+    boolean flushBucket = shouldFlushBucket(bucket.detector, item, value.getPartitionPath());
     boolean flushBuffer = this.tracer.trace(bucket.detector.lastRecordSize);
     if (flushBucket) {
       if (flushBucket(bucket)) {
@@ -484,4 +498,8 @@ public class StreamWriteFunction<I> extends AbstractStreamWriteFunction<I> {
     // blocks flushing until the coordinator starts a new instant
     this.confirming = true;
   }
+
+  protected boolean shouldFlushBucket(BufferSizeDetector detector, DataItem item, String partitionPath) {
+    return detector.detect(item);
+  }
 }
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
index c87d5b2443..670748b90f 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
@@ -562,6 +562,11 @@ public class StreamWriteOperatorCoordinator
     return instant;
   }
 
+  @VisibleForTesting
+  public HoodieFlinkWriteClient getWriteClient() {
+    return writeClient;
+  }
+
   @VisibleForTesting
   public Context getContext() {
     return context;
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteFunction.java
new file mode 100644
index 0000000000..517234fc92
--- /dev/null
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteFunction.java
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.nonindex;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.stream.Collectors;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.runtime.state.FunctionInitializationContext;
+import org.apache.flink.runtime.state.FunctionSnapshotContext;
+import org.apache.flink.streaming.api.functions.ProcessFunction;
+import org.apache.flink.util.Collector;
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.table.view.SyncableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.configuration.HadoopConfigurations;
+import org.apache.hudi.sink.StreamWriteFunction;
+import org.apache.hudi.table.HoodieFlinkTable;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.util.StreamerUtil;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+/**
+ * Stream write function used for MOR tables when the index type is NON_INDEX.
+ */
+public class NonIndexStreamWriteFunction<I>
+    extends StreamWriteFunction<I> {
+
+  private static final Logger LOG = LogManager.getLogger(NonIndexStreamWriteFunction.class);
+
+  private transient Map<String, FileGroupInfo> partitionPathToFileId;
+
+  private transient HoodieWriteConfig writeConfig;
+
+  private transient PartitionFileGroupHandle partitionFileGroupHandle;
+
+  /**
+   * Constructs a NonIndexStreamWriteFunction.
+   *
+   * @param config The config options
+   */
+  public NonIndexStreamWriteFunction(Configuration config) {
+    super(config);
+  }
+
+  @Override
+  public void initializeState(FunctionInitializationContext context) throws Exception {
+    super.initializeState(context);
+    partitionPathToFileId = new ConcurrentHashMap<>();
+    writeConfig = StreamerUtil.getHoodieClientConfig(this.config);
+    partitionFileGroupHandle = getPartitionHandle(PartitionFileGroupStorageType.valueOf(writeConfig.getNonIndexPartitionFileGroupStorageType()));
+    partitionFileGroupHandle.init();
+  }
+
+  @Override
+  public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception {
+    super.snapshotState(functionSnapshotContext);
+    partitionFileGroupHandle.reset();
+  }
+
+  @Override
+  public void processElement(I value, ProcessFunction<I, Object>.Context context, Collector<Object> collector) throws Exception {
+    HoodieRecord<?> record = (HoodieRecord<?>) value;
+    assignFileId(record);
+    bufferRecord(record);
+  }
+
+  private void assignFileId(HoodieRecord<?> record) {
+    String partitionPath = record.getPartitionPath();
+    FileGroupInfo fileGroupInfo = partitionPathToFileId.computeIfAbsent(partitionPath,
+        key -> getLatestFileGroupInfo(record.getPartitionPath()));
+    if (fileGroupInfo == null || !fileGroupInfo.canAssign(record)) {
+      fileGroupInfo = new FileGroupInfo(FSUtils.getFileId(FSUtils.createNewFileIdPfx()));
+      partitionPathToFileId.put(partitionPath, fileGroupInfo);
+    }
+    record.unseal();
+    record.setCurrentLocation(new HoodieRecordLocation("U", fileGroupInfo.getFileId()));
+    record.seal();
+  }
+
+  private FileGroupInfo getLatestFileGroupInfo(String partitionPath) {
+    return partitionFileGroupHandle.getLatestFileGroupInfo(partitionPath);
+  }
+
+  @Override
+  public void endInput() {
+    super.endInput();
+    partitionPathToFileId.clear();
+  }
+
+  private abstract class PartitionFileGroupHandle {
+
+    abstract void init();
+
+    abstract FileGroupInfo getLatestFileGroupInfo(String partitionPath);
+
+    abstract void reset();
+
+  }
+
+  private class InMemoryPartitionFileGroupHandle extends PartitionFileGroupHandle {
+
+    private long cacheIntervalMills;
+
+    @Override
+    public void init() {
+      cacheIntervalMills = writeConfig.getNonIndexPartitionFileGroupCacheIntervalMinute() * 60 * 1000;
+    }
+
+    @Override
+    public FileGroupInfo getLatestFileGroupInfo(String partitionPath) {
+      return null;
+    }
+
+    @Override
+    public void reset() {
+      long curTime = System.currentTimeMillis();
+      for (Entry<String, FileGroupInfo> entry : partitionPathToFileId.entrySet()) {
+        if (curTime - entry.getValue().getCreateTime() >= cacheIntervalMills) {
+          partitionPathToFileId.remove(entry.getKey());
+        }
+      }
+    }
+  }
+
+  @Override
+  protected boolean shouldFlushBucket(BufferSizeDetector detector, DataItem item, String partitionPath) {
+    FileGroupInfo fileGroupInfo = partitionPathToFileId.get(partitionPath);
+    return fileGroupInfo == null || fileGroupInfo.getLastRecordSize() == -1 ? detector.detect(item)
+        : detector.detect(fileGroupInfo.getLastRecordSize());
+  }
+
+  private class FileSystemPartitionFileGroupHandle extends PartitionFileGroupHandle {
+
+    private String writeToken;
+
+    private HoodieTable<?, ?, ?, ?> table;
+
+    private HoodieFlinkEngineContext engineContext;
+
+    private SyncableFileSystemView fileSystemView;
+
+    private boolean needSync = false;
+
+    @Override
+    public void init() {
+      this.engineContext = new HoodieFlinkEngineContext(
+          new SerializableConfiguration(HadoopConfigurations.getHadoopConf(config)),
+          new FlinkTaskContextSupplier(getRuntimeContext()));
+      this.writeToken = FSUtils.makeWriteToken(
+          engineContext.getTaskContextSupplier().getPartitionIdSupplier().get(),
+          engineContext.getTaskContextSupplier().getStageIdSupplier().get(),
+          engineContext.getTaskContextSupplier().getAttemptIdSupplier().get());
+      this.table = HoodieFlinkTable.create(writeConfig, engineContext);
+      this.fileSystemView = table.getHoodieView();
+    }
+
+    @Override
+    public FileGroupInfo getLatestFileGroupInfo(String partitionPath) {
+      if (needSync) {
+        fileSystemView.sync();
+        needSync = false;
+      }
+      List<HoodieLogFile> hoodieLogFiles = fileSystemView.getAllFileSlices(partitionPath)
+          .map(FileSlice::getLatestLogFile)
+          .filter(Option::isPresent).map(Option::get)
+          .filter(logFile -> FSUtils.getWriteTokenFromLogPath(logFile.getPath()).equals(writeToken))
+          .sorted(HoodieLogFile.getReverseLogFileComparator()).collect(Collectors.toList());
+      if (hoodieLogFiles.size() > 0) {
+        HoodieLogFile hoodieLogFile = hoodieLogFiles.get(0);
+        Option<FileSlice> fileSlice = fileSystemView.getLatestFileSlice(partitionPath, hoodieLogFile.getFileId());
+        if (fileSlice.isPresent()) {
+          return new FileGroupInfo(FSUtils.getFileIdFromFilePath(hoodieLogFile.getPath()),
+              fileSlice.get().getFileGroupSize());
+        }
+        LOG.warn("Can location fileSlice, partitionPath: " + partitionPath + ", fileId: "
+            + hoodieLogFile.getFileId());
+      }
+      return null;
+    }
+
+    @Override
+    public void reset() {
+      partitionPathToFileId.clear();
+      needSync = true;
+    }
+  }
+
+  private PartitionFileGroupHandle getPartitionHandle(PartitionFileGroupStorageType storageType) {
+    switch (storageType) {
+      case IN_MEMORY:
+        return new InMemoryPartitionFileGroupHandle();
+      case FILE_SYSTEM:
+        return new FileSystemPartitionFileGroupHandle();
+      default:
+        throw new IllegalArgumentException("Unsupported storage type: " + storageType.name());
+    }
+  }
+
+  private enum PartitionFileGroupStorageType {
+    IN_MEMORY, FILE_SYSTEM
+  }
+
+  private class FileGroupInfo {
+
+    private final String fileId;
+    private final long createTime;
+    private final BufferSizeDetector detector;
+
+    public FileGroupInfo(String fileId) {
+      this.fileId = fileId;
+      this.createTime = System.currentTimeMillis();
+      this.detector = new BufferSizeDetector((double) writeConfig.getNonIndexPartitionFileGroupCacheSize() / 1024 / 1024);
+    }
+
+    public FileGroupInfo(String fileId, long initFileSize) {
+      this(fileId);
+      detector.setTotalSize(initFileSize);
+    }
+
+    public boolean canAssign(HoodieRecord<?> record) {
+      return !detector.detect(record);
+    }
+
+    public String getFileId() {
+      return fileId;
+    }
+
+    public long getCreateTime() {
+      return createTime;
+    }
+
+    public long getLastRecordSize() {
+      return detector.getLastRecordSize();
+    }
+  }
+
+}
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteOperator.java
similarity index 51%
copy from hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java
copy to hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteOperator.java
index 70ee473d24..560716aa9c 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketType.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/nonindex/NonIndexStreamWriteOperator.java
@@ -7,7 +7,7 @@
  * "License"); you may not use this file except in compliance
  * with the License.  You may obtain a copy of the License at
  *
- *      http://www.apache.org/licenses/LICENSE-2.0
+ * http://www.apache.org/licenses/LICENSE-2.0
  *
  * Unless required by applicable law or agreed to in writing, software
  * distributed under the License is distributed on an "AS IS" BASIS,
@@ -16,8 +16,25 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.table.action.commit;
+package org.apache.hudi.sink.nonindex;
 
-public enum BucketType {
-  UPDATE, INSERT
+import org.apache.hudi.sink.common.AbstractWriteOperator;
+import org.apache.hudi.sink.common.WriteOperatorFactory;
+
+import org.apache.flink.configuration.Configuration;
+
+/**
+ * Operator for {@link NonIndexStreamWriteFunction}.
+ *
+ * @param <I> The input type
+ */
+public class NonIndexStreamWriteOperator<I> extends AbstractWriteOperator<I> {
+
+  public NonIndexStreamWriteOperator(Configuration conf) {
+    super(new NonIndexStreamWriteFunction<>(conf));
+  }
+
+  public static <I> WriteOperatorFactory<I> getFactory(Configuration conf) {
+    return WriteOperatorFactory.instance(conf, new NonIndexStreamWriteOperator<>(conf));
+  }
 }
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
index 82761adf73..55428610d9 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
@@ -44,6 +44,7 @@ import org.apache.hudi.sink.compact.CompactionCommitEvent;
 import org.apache.hudi.sink.compact.CompactionCommitSink;
 import org.apache.hudi.sink.compact.CompactionPlanEvent;
 import org.apache.hudi.sink.compact.CompactionPlanOperator;
+import org.apache.hudi.sink.nonindex.NonIndexStreamWriteOperator;
 import org.apache.hudi.sink.partitioner.BucketAssignFunction;
 import org.apache.hudi.sink.partitioner.BucketIndexPartitioner;
 import org.apache.hudi.sink.transform.RowDataToHoodieFunctions;
@@ -323,6 +324,12 @@ public class Pipelines {
           .transform(opIdentifier("bucket_write", conf), TypeInformation.of(Object.class), operatorFactory)
           .uid("uid_bucket_write" + conf.getString(FlinkOptions.TABLE_NAME))
           .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS));
+    } else if (OptionsResolver.isNonIndexType(conf)) {
+      WriteOperatorFactory<HoodieRecord> operatorFactory = NonIndexStreamWriteOperator.getFactory(conf);
+      return dataStream.transform("non_index_write",
+              TypeInformation.of(Object.class), operatorFactory)
+          .uid("uid_non_index_write" + conf.getString(FlinkOptions.TABLE_NAME))
+          .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS));
     } else {
       WriteOperatorFactory<HoodieRecord> operatorFactory = StreamWriteOperator.getFactory(conf);
       return dataStream
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
index 1718175240..b0380c5878 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
@@ -19,12 +19,14 @@
 package org.apache.hudi.table;
 
 import org.apache.hudi.common.model.DefaultHoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.configuration.FlinkOptions;
 import org.apache.hudi.configuration.OptionsResolver;
 import org.apache.hudi.exception.HoodieValidationException;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.ComplexAvroKeyGenerator;
+import org.apache.hudi.keygen.EmptyAvroKeyGenerator;
 import org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator;
 import org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
@@ -56,6 +58,7 @@ import java.util.concurrent.TimeUnit;
 import java.util.stream.Collectors;
 
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
+import static org.apache.hudi.configuration.FlinkOptions.REALTIME_SKIP_MERGE;
 
 /**
  * Hoodie data source/sink factory.
@@ -185,6 +188,15 @@ public class HoodieTableFactory implements DynamicTableSourceFactory, DynamicTab
     setupWriteOptions(conf);
     // infer avro schema from physical DDL schema
     inferAvroSchema(conf, schema.toPhysicalRowDataType().notNull().getLogicalType());
+    setupDefaultOptionsForNonIndex(conf);
+  }
+
+  private static void setupDefaultOptionsForNonIndex(Configuration conf) {
+    if (OptionsResolver.isNonIndexType(conf)) {
+      conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, EmptyAvroKeyGenerator.class.getName());
+      conf.setString(FlinkOptions.OPERATION, WriteOperationType.INSERT.value());
+      conf.setString(FlinkOptions.MERGE_TYPE, REALTIME_SKIP_MERGE);
+    }
   }
 
   /**
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteMergeOnRead.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteMergeOnRead.java
index df01fc9076..7efdc181e9 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteMergeOnRead.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteMergeOnRead.java
@@ -20,7 +20,13 @@ package org.apache.hudi.sink;
 
 import org.apache.hudi.common.model.EventTimeAvroPayload;
 import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.keygen.EmptyAvroKeyGenerator;
 import org.apache.hudi.utils.TestData;
 
 import org.apache.flink.configuration.Configuration;
@@ -141,6 +147,54 @@ public class TestWriteMergeOnRead extends TestWriteCopyOnWrite {
         .end();
   }
 
+  @Test
+  public void testInsertNonIndexOnMemoryMode() throws Exception {
+    testInsertNonIndex("IN_MEMORY");
+  }
+
+  @Test
+  public void testInsertNonIndexOnFileSystemMode() throws Exception {
+    testInsertNonIndex("FILE_SYSTEM");
+  }
+
+  public void testInsertNonIndex(String storageType) throws Exception {
+    // open the function and ingest data
+    conf.setString(FlinkOptions.INDEX_TYPE, HoodieIndex.IndexType.NON_INDEX.name());
+    conf.setString(FlinkOptions.OPERATION, WriteOperationType.INSERT.name());
+    conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, EmptyAvroKeyGenerator.class.getName());
+    conf.setLong(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE.key(), 2000L);
+    conf.setInteger(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_CACHE_INTERVAL_MINUTE.key(), 0);
+    conf.setString(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_STORAGE_TYPE.key(), storageType);
+    conf.setString(FileSystemViewStorageConfig.VIEW_TYPE.key(), FileSystemViewStorageConfig.VIEW_TYPE.defaultValue().name());
+    conf.setString(HoodieWriteConfig.EMBEDDED_TIMELINE_SERVER_ENABLE.key(), "false");
+
+    TestHarness harness = TestHarness.instance().preparePipeline(tempFile, conf)
+        .consume(TestData.DATA_SET_INSERT_NOM_INDEX)
+        .emptyEventBuffer()
+        .checkpoint(1)
+        .assertPartitionFileGroups("par1", 2)
+        .assertPartitionFileGroups("par2", 1)
+        .assertNextEvent()
+        .checkpointComplete(1)
+        .consume(TestData.DATA_SET_INSERT_NOM_INDEX)
+        .emptyEventBuffer()
+        .checkpoint(2);
+
+    if (storageType.equals("FILE_SYSTEM")) {
+      harness = harness.assertPartitionFileGroups("par2", 1);
+    } else if (storageType.equals("IN_MEMORY")) {
+      harness = harness
+          .assertPartitionFileGroups("par1", 4)
+          .assertPartitionFileGroups("par2", 2);
+    }
+
+    harness
+        .assertNextEvent()
+        .checkpointComplete(2)
+        .checkWrittenData(EXPECTED6, 4)
+        .end();
+  }
+
   @Override
   public void testInsertClustering() {
     // insert clustering is only valid for cow table.
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java
index 707fe45c47..3666940b4e 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.sink.utils;
 
+import org.apache.hudi.client.HoodieFlinkWriteClient;
 import org.apache.hudi.configuration.OptionsResolver;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.sink.StreamWriteOperatorCoordinator;
@@ -144,6 +145,11 @@ public class InsertFunctionWrapper<I> implements TestFunctionWrapper<I> {
     return coordinator;
   }
 
+  @Override
+  public HoodieFlinkWriteClient getWriteClient() {
+    return coordinator.getWriteClient();
+  }
+
   @Override
   public void close() throws Exception {
     this.coordinator.close();
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
index a1a14456e3..b83f3cc478 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.sink.utils;
 
+import org.apache.hudi.client.HoodieFlinkWriteClient;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.configuration.FlinkOptions;
@@ -27,6 +28,7 @@ import org.apache.hudi.sink.StreamWriteFunction;
 import org.apache.hudi.sink.StreamWriteOperatorCoordinator;
 import org.apache.hudi.sink.bootstrap.BootstrapOperator;
 import org.apache.hudi.sink.event.WriteMetadataEvent;
+import org.apache.hudi.sink.nonindex.NonIndexStreamWriteFunction;
 import org.apache.hudi.sink.partitioner.BucketAssignFunction;
 import org.apache.hudi.sink.transform.RowDataToHoodieFunction;
 import org.apache.hudi.utils.TestConfigurations;
@@ -164,9 +166,13 @@ public class StreamWriteFunctionWrapper<I> implements TestFunctionWrapper<I> {
   public void invoke(I record) throws Exception {
     HoodieRecord<?> hoodieRecord = toHoodieFunction.map((RowData) record);
     ScalaCollector<HoodieRecord<?>> collector = ScalaCollector.getInstance();
-    bucketAssignerFunction.processElement(hoodieRecord, null, collector);
-    bucketAssignFunctionContext.setCurrentKey(hoodieRecord.getRecordKey());
-    writeFunction.processElement(collector.getVal(), null, null);
+    if (OptionsResolver.isNonIndexType(conf)) {
+      writeFunction.processElement(hoodieRecord, null, null);
+    } else {
+      bucketAssignerFunction.processElement(hoodieRecord, null, collector);
+      bucketAssignFunctionContext.setCurrentKey(hoodieRecord.getRecordKey());
+      writeFunction.processElement(collector.getVal(), null, null);
+    }
   }
 
   public WriteMetadataEvent[] getEventBuffer() {
@@ -233,6 +239,11 @@ public class StreamWriteFunctionWrapper<I> implements TestFunctionWrapper<I> {
     return coordinator;
   }
 
+  @Override
+  public HoodieFlinkWriteClient getWriteClient() {
+    return coordinator.getWriteClient();
+  }
+
   public MockOperatorCoordinatorContext getCoordinatorContext() {
     return coordinatorContext;
   }
@@ -254,7 +265,11 @@ public class StreamWriteFunctionWrapper<I> implements TestFunctionWrapper<I> {
   // -------------------------------------------------------------------------
 
   private void setupWriteFunction() throws Exception {
-    writeFunction = new StreamWriteFunction<>(conf);
+    if (OptionsResolver.isNonIndexType(conf)) {
+      writeFunction = new NonIndexStreamWriteFunction<>(conf);
+    } else {
+      writeFunction = new StreamWriteFunction<>(conf);
+    }
     writeFunction.setRuntimeContext(runtimeContext);
     writeFunction.setOperatorEventGateway(gateway);
     writeFunction.initializeState(this.stateInitializationContext);
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestFunctionWrapper.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestFunctionWrapper.java
index d2fe819650..967fdd3b24 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestFunctionWrapper.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestFunctionWrapper.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.sink.utils;
 
+import org.apache.hudi.client.HoodieFlinkWriteClient;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.sink.StreamWriteOperatorCoordinator;
@@ -68,6 +69,11 @@ public interface TestFunctionWrapper<I> {
    */
   StreamWriteOperatorCoordinator getCoordinator();
 
+  /**
+   * Returns the write client.
+   */
+  HoodieFlinkWriteClient getWriteClient();
+
   /**
    * Returns the data buffer of the write task.
    */
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java
index b6ae0767d6..05ce879bcb 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java
@@ -18,12 +18,15 @@
 
 package org.apache.hudi.sink.utils;
 
+import org.apache.hudi.client.HoodieFlinkWriteClient;
 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieFileGroup;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.view.SyncableFileSystemView;
 import org.apache.hudi.configuration.OptionsResolver;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.sink.event.WriteMetadataEvent;
@@ -69,6 +72,8 @@ public class TestWriteBase {
 
   protected static final Map<String, List<String>> EXPECTED5 = new HashMap<>();
 
+  protected static final Map<String, String> EXPECTED6 = new HashMap<>();
+
   static {
     EXPECTED1.put("par1", "[id1,par1,id1,Danny,23,1,par1, id2,par1,id2,Stephen,33,2,par1]");
     EXPECTED1.put("par2", "[id3,par2,id3,Julian,53,3,par2, id4,par2,id4,Fabian,31,4,par2]");
@@ -102,6 +107,38 @@ public class TestWriteBase {
         "id1,par1,id1,Danny,23,3,par1",
         "id1,par1,id1,Danny,23,4,par1",
         "id1,par1,id1,Danny,23,4,par1"));
+
+    EXPECTED6.put("par1", "["
+        + "_hoodie_empty_record_key_,par1,id1,Danny,23,1,par1, "
+        + "_hoodie_empty_record_key_,par1,id1,Danny,23,1,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1, "
+        + "_hoodie_empty_record_key_,par1,id2,Stephen,33,2,par1]");
+
+    EXPECTED6.put("par2", "["
+        + "_hoodie_empty_record_key_,par2,id3,Julian,53,3,par2, "
+        + "_hoodie_empty_record_key_,par2,id3,Julian,53,3,par2]");
+
+    EXPECTED6.put("par3", "["
+        + "_hoodie_empty_record_key_,par3,id5,Sophia,18,5,par3, "
+        + "_hoodie_empty_record_key_,par3,id5,Sophia,18,5,par3]");
+
+    EXPECTED6.put("par4", "["
+        + "_hoodie_empty_record_key_,par4,id7,Bob,44,7,par4, "
+        + "_hoodie_empty_record_key_,par4,id7,Bob,44,7,par4]");
   }
 
   // -------------------------------------------------------------------------
@@ -224,6 +261,17 @@ public class TestWriteBase {
       return this;
     }
 
+    public TestHarness assertPartitionFileGroups(String partitionPath, int numFileGroups) {
+      HoodieFlinkWriteClient writeClient = this.pipeline.getWriteClient();
+      SyncableFileSystemView fileSystemView = writeClient.getHoodieTable().getHoodieView();
+      fileSystemView.sync();
+      // check the number of file groups under the given partition
+      List<HoodieFileGroup> fileGroups = fileSystemView.getAllFileGroups(partitionPath).collect(
+          Collectors.toList());
+      assertEquals(numFileGroups, fileGroups.size());
+      return this;
+    }
+
     /**
      * Checkpoints the pipeline, which triggers the data write and event send.
      */
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java
index 8e1dd9964c..6d0183053f 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java
@@ -286,6 +286,40 @@ public class TestData {
           TimestampData.fromEpochMillis(5), StringData.fromString("par1"))
   );
 
+  public static List<RowData> DATA_SET_INSERT_NOM_INDEX = Arrays.asList(
+      //partition1
+      insertRow(StringData.fromString("id1"), StringData.fromString("Danny"), 23,
+          TimestampData.fromEpochMillis(1), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+      insertRow(StringData.fromString("id2"), StringData.fromString("Stephen"), 33,
+          TimestampData.fromEpochMillis(2), StringData.fromString("par1")),
+
+      //partition2
+      insertRow(StringData.fromString("id3"), StringData.fromString("Julian"), 53,
+          TimestampData.fromEpochMillis(3), StringData.fromString("par2")),
+
+      //partition3
+      insertRow(StringData.fromString("id5"), StringData.fromString("Sophia"), 18,
+          TimestampData.fromEpochMillis(5), StringData.fromString("par3")),
+
+      //partition4
+      insertRow(StringData.fromString("id7"), StringData.fromString("Bob"), 44,
+          TimestampData.fromEpochMillis(7), StringData.fromString("par4"))
+  );
+
   public static List<RowData> DATA_SET_SINGLE_INSERT = Collections.singletonList(
       insertRow(StringData.fromString("id1"), StringData.fromString("Danny"), 23,
           TimestampData.fromEpochMillis(1), StringData.fromString("par1")));
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestNonIndex.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestNonIndex.scala
new file mode 100644
index 0000000000..cfe9564a35
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestNonIndex.scala
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
+import org.apache.hudi.index.HoodieIndex
+import org.apache.hudi.keygen.EmptyKeyGenerator
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions
+import org.apache.hudi.testutils.{DataSourceTestUtils, HoodieClientTestBase}
+import org.apache.spark.sql.{Dataset, Row, SaveMode}
+import org.junit.jupiter.api.Test
+
+import scala.collection.JavaConversions._
+import scala.collection.JavaConverters
+
+class TestNonIndex extends HoodieClientTestBase {
+  val commonOpts = Map(
+    "hoodie.insert.shuffle.parallelism" -> "4",
+    "hoodie.upsert.shuffle.parallelism" -> "4",
+    KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key() -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "timestamp",
+    HoodieIndexConfig.INDEX_TYPE.key() -> HoodieIndex.IndexType.NON_INDEX.name(),
+    HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key() -> classOf[EmptyKeyGenerator].getName
+  )
+
+  @Test
+  def testNonIndexMORInsert(): Unit = {
+    val spark = sqlContext.sparkSession
+
+    val records1 = recordsToStrings(dataGen.generateInserts("001", 100)).toList
+    // first insert, parquet files
+    val inputDF1: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    val records2 = recordsToStrings(dataGen.generateInserts("002", 100)).toList
+    // second insert, log files
+    val inputDF2: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+      .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+
+    val fs = FSUtils.getFs(basePath, spark.sparkContext.hadoopConfiguration)
+    assert(fs.globStatus(new Path(basePath, "201*/*/*/.*.log*")).length > 0)
+    assert(fs.globStatus(new Path(basePath, "201*/*/*/*.parquet")).length > 0)
+
+    // check data
+    val result = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
+      .selectExpr(inputDF2.schema.map(_.name): _*)
+    val inputAll = inputDF1.unionAll(inputDF2)
+    assert(result.except(inputAll).count() == 0)
+    assert(inputAll.except(result).count == 0)
+  }
+
+  @Test
+  def testBulkInsertDatasetWithOutIndex(): Unit = {
+    val spark = sqlContext.sparkSession
+    val schema = DataSourceTestUtils.getStructTypeExampleSchema
+
+    // create a new table
+    val fooTableModifier = commonOpts
+      .updated(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
+      .updated(DataSourceWriteOptions.ENABLE_ROW_WRITER.key, "true")
+      .updated(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.NON_INDEX.toString)
+      .updated(HoodieWriteConfig.AVRO_SCHEMA_STRING.key(), schema.toString)
+      .updated(HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key(), "4")
+      .updated("path", basePath)
+    val fooTableParams = HoodieWriterUtils.parametersWithWriteDefaults(fooTableModifier)
+
+    val structType = AvroConversionUtils.convertAvroSchemaToStructType(schema)
+    for (_ <- 0 to 0) {
+      // generate the inserts
+      val records = DataSourceTestUtils.generateRandomRows(200)
+      val recordsSeq = JavaConverters.asScalaIteratorConverter(records.iterator).asScala.toSeq
+      val df = spark.createDataFrame(spark.sparkContext.parallelize(recordsSeq), structType)
+      // write to Hudi
+      HoodieSparkSqlWriter.write(sqlContext, SaveMode.Append, fooTableParams ++ Seq(), df)
+
+      // Fetch records from entire dataset
+      val actualDf = sqlContext.read.format("org.apache.hudi").load(basePath + "/*/*/*")
+
+      assert(actualDf.where("_hoodie_record_key = '_hoodie_empty_record_key_'").count() == 200)
+    }
+  }
+}
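
For orientation, below is a minimal configuration sketch for driving a Flink streaming write job onto this non-index path. It reuses only option names that appear in the test code above; the NON_INDEX index type, EmptyAvroKeyGenerator and the NON_INDEX_PARTITION_FILE_GROUP_* keys are introduced by this patch (they are not upstream Hudi), the base path is a hypothetical placeholder, and the cache size is assumed to be in bytes based on the conversion done in FileGroupInfo above.

    import org.apache.flink.configuration.Configuration;
    import org.apache.hudi.common.model.WriteOperationType;
    import org.apache.hudi.config.HoodieIndexConfig;
    import org.apache.hudi.configuration.FlinkOptions;
    import org.apache.hudi.index.HoodieIndex;
    import org.apache.hudi.keygen.EmptyAvroKeyGenerator;

    public class NonIndexWriteConfigSketch {

      // Hedged sketch: build the Flink configuration for the non-index write path.
      public static Configuration nonIndexWriteConf() {
        Configuration conf = new Configuration();
        conf.setString(FlinkOptions.PATH, "file:///tmp/hudi/non_index_table");   // hypothetical base path
        conf.setString(FlinkOptions.TABLE_NAME, "non_index_table");
        conf.setString(FlinkOptions.TABLE_TYPE, FlinkOptions.TABLE_TYPE_MERGE_ON_READ);
        // Selecting NON_INDEX makes Pipelines#hoodieStreamWrite (hunk above) route records to
        // NonIndexStreamWriteOperator instead of the bucket-assign + StreamWriteOperator path.
        conf.setString(FlinkOptions.INDEX_TYPE, HoodieIndex.IndexType.NON_INDEX.name());
        // HoodieTableFactory#setupDefaultOptionsForNonIndex forces these for SQL jobs; repeated
        // here only to make the effective configuration explicit.
        conf.setString(FlinkOptions.OPERATION, WriteOperationType.INSERT.value());
        conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, EmptyAvroKeyGenerator.class.getName());
        // Patch-specific tuning for how file-group assignments per partition are cached
        // (see the IN_MEMORY / FILE_SYSTEM handles above); 128 MB is an illustrative value.
        conf.setLong(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_CACHE_SIZE.key(), 128 * 1024 * 1024L);
        conf.setString(HoodieIndexConfig.NON_INDEX_PARTITION_FILE_GROUP_STORAGE_TYPE.key(), "FILE_SYSTEM");
        return conf;
      }
    }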


[hudi] 21/45: [MINOR] add integrity check of merged parquet file for HoodieMergeHandle.

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 3c364bdf721651ed20980c30ee9b521e3535286e
Author: xiarixiaoyao <me...@qq.com>
AuthorDate: Wed Sep 28 15:05:26 2022 +0800

    [MINOR] add integrity check of merged parquet file for HoodieMergeHandle.
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java    | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index e629c6a51e..88db25bac4 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -34,6 +34,7 @@ import org.apache.hudi.common.model.IOType;
 import org.apache.hudi.common.util.DefaultSizeEstimator;
 import org.apache.hudi.common.util.HoodieRecordSizeEstimator;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
 import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.common.util.collection.ExternalSpillableMap;
 import org.apache.hudi.config.HoodieWriteConfig;
@@ -65,6 +66,8 @@ import java.util.NoSuchElementException;
 import java.util.Map;
 import java.util.Set;
 
+import static org.apache.hudi.common.model.HoodieFileFormat.PARQUET;
+
 @SuppressWarnings("Duplicates")
 /**
  * Handle to merge incoming records to those in storage.
@@ -447,6 +450,13 @@ public class HoodieMergeHandle<T extends HoodieRecordPayload, I, K, O> extends H
       return;
     }
 
+    // Fast integrity check of the merged parquet file:
+    // only verify that the parquet metadata (footer) is readable.
+    final String extension = FSUtils.getFileExtension(newFilePath.toString());
+    if (PARQUET.getFileExtension().equals(extension)) {
+      new ParquetUtils().readMetadata(hoodieTable.getHadoopConf(), newFilePath);
+    }
+
     long oldNumWrites = 0;
     try {
       HoodieFileReader reader = HoodieFileReaderFactory.getFileReader(hoodieTable.getHadoopConf(), oldFilePath);
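
The check added above is just a cheap footer read. A minimal standalone sketch of the same idea follows; it reuses the ParquetUtils#readMetadata call shown in the hunk (its visibility is assumed from the diff), and the class and method names of the sketch itself are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hudi.common.util.ParquetUtils;

    public class ParquetIntegrityCheckSketch {

      // Cheap integrity check: parsing the footer is enough to catch truncated or corrupt files;
      // an unreadable footer makes readMetadata throw before the write handle reports success.
      public static void verifyParquetFooter(Configuration hadoopConf, Path file) {
        if (file.getName().endsWith(".parquet")) {
          new ParquetUtils().readMetadata(hadoopConf, file); // same call as in the hunk above
        }
      }
    }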


[hudi] 43/45: check parquet file does not exist

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d0b3b36e96d326d9f683f1d557527e4e50d92e78
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Wed Dec 28 10:06:17 2022 +0800

    check parquet file does not exist
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java       | 8 +++++++-
 .../apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java    | 4 +++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index c569acdda6..17bbb2f7f0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -445,7 +445,13 @@ public class HoodieMergeHandle<T extends HoodieRecordPayload, I, K, O> extends H
       return;
     }
 
-    IOUtils.checkParquetFileVaid(hoodieTable.getHadoopConf(), newFilePath);
+    try {
+      if (fs.exists(newFilePath)) {
+        IOUtils.checkParquetFileVaid(hoodieTable.getHadoopConf(), newFilePath);
+      }
+    } catch (IOException e) {
+      throw new HoodieUpsertException("Failed to validate the merged parquet file", e);
+    }
 
     long oldNumWrites = 0;
     try {
diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java
index fd1edaab84..66a830887a 100644
--- a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java
@@ -75,6 +75,8 @@ public class HoodieRowDataParquetWriter extends ParquetWriter<RowData>
   @Override
   public void close() throws IOException {
     super.close();
-    IOUtils.checkParquetFileVaid(fs.getConf(), file);
+    if (fs.exists(file)) {
+      IOUtils.checkParquetFileVaid(fs.getConf(), file);
+    }
   }
 }


[hudi] 15/45: Remove proxy

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ee07cc6a3ba5287c11fb682faf40ee20975d4fb7
Author: simonssu <ba...@gmail.com>
AuthorDate: Thu Oct 20 14:17:43 2022 +0800

    Remove proxy
    
    (cherry picked from commit b90d8a2e101b1cbc2bca85ee10eb8d6740caf5b6)
---
 dev/settings.xml | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/dev/settings.xml b/dev/settings.xml
index 5f5dfd4fa6..cad54797c9 100644
--- a/dev/settings.xml
+++ b/dev/settings.xml
@@ -1,23 +1,4 @@
 <settings>
-    <proxies>
-        <proxy>
-            <id>dev http</id>
-            <active>true</active>
-            <protocol>http</protocol>
-            <host>web-proxy.oa.com</host>
-            <port>8080</port>
-            <nonProxyHosts>mirrors.tencent.com|qq.com|localhost|127.0.0.1|*.oa.com|repo.maven.apache.org|packages.confluent.io</nonProxyHosts>
-        </proxy>
-        <proxy>
-            <id>dev https</id>
-            <active>true</active>
-            <protocol>https</protocol>
-            <host>web-proxy.oa.com</host>
-            <port>8080</port>
-            <nonProxyHosts>mirrors.tencent.com|qq.com|localhost|127.0.0.1|*.oa.com|repo.maven.apache.org|packages.confluent.io</nonProxyHosts>
-        </proxy>
-    </proxies>
-
     <offline>false</offline>
 
     <profiles>


[hudi] 30/45: Merge branch 'optimize_schema_settings' into 'release-0.12.1' (merge request !108)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ab5deef087a4c23bf10106c4f8a3ff5474e8ea73
Merge: f7fe437faf 00c3443cb4
Author: forwardxu <fo...@tencent.com>
AuthorDate: Thu Nov 24 08:31:35 2022 +0000

    Merge branch 'optimize_schema_settings' into 'release-0.12.1' (merge request !108)
    
    Add schema settings with stream api
    Optimize the usage of the `stream api`: users no longer need to call `.column()` for every `field`; a single `.schema()` call is enough.

 .../src/main/java/org/apache/hudi/util/HoodiePipeline.java | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
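
To make the intent concrete, a hedged usage sketch is given below. HoodiePipeline.builder/pk/partition/options are existing APIs; the schema(...) call and its parameter type are assumptions about the 14 added lines, which this archive does not show.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.api.Schema;
    import org.apache.hudi.configuration.FlinkOptions;
    import org.apache.hudi.util.HoodiePipeline;

    public class HoodiePipelineSchemaSketch {

      // Hypothetical usage: Builder#schema(...) is the new API this merge request adds; its exact
      // signature is not visible here, so the Schema parameter type is an assumption.
      public static HoodiePipeline.Builder builderWithSchema(String basePath) {
        Map<String, String> options = new HashMap<>();
        options.put(FlinkOptions.PATH.key(), basePath);

        Schema schema = Schema.newBuilder()
            .column("uuid", DataTypes.STRING())
            .column("name", DataTypes.STRING())
            .column("ts", DataTypes.TIMESTAMP(3))
            .column("partition", DataTypes.STRING())
            .build();

        return HoodiePipeline.builder("hudi_tbl")
            .schema(schema)          // replaces one .column(...) call per field
            .pk("uuid")
            .partition("partition")
            .options(options);
      }
    }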


[hudi] 17/45: temp_view_support (#6990)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 97ce2b7f7ba905337ea1d3b55e6febca1cf52833
Author: 苏承祥 <sc...@aliyun.com>
AuthorDate: Wed Oct 26 13:11:15 2022 +0800

    temp_view_support (#6990)
    
    Co-authored-by: 苏承祥 <su...@tuya.com>
    
    (cherry picked from commit e13b2129dc144ca505e39c0d7fa479c47362bb56)
---
 .../hudi/command/procedures/CopyToTempView.scala   | 114 ++++++++++++++
 .../hudi/command/procedures/HoodieProcedures.scala |   1 +
 .../procedure/TestCopyToTempViewProcedure.scala    | 168 +++++++++++++++++++++
 3 files changed, 283 insertions(+)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala
new file mode 100644
index 0000000000..13259c4964
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hudi.DataSourceReadOptions
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util.function.Supplier
+
+class CopyToTempView extends BaseProcedure with ProcedureBuilder with Logging {
+
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType, None),
+    ProcedureParameter.optional(1, "query_type", DataTypes.StringType, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL),
+    ProcedureParameter.required(2, "view_name", DataTypes.StringType, None),
+    ProcedureParameter.optional(3, "begin_instance_time", DataTypes.StringType, ""),
+    ProcedureParameter.optional(4, "end_instance_time", DataTypes.StringType, ""),
+    ProcedureParameter.optional(5, "as_of_instant", DataTypes.StringType, ""),
+    ProcedureParameter.optional(6, "replace", DataTypes.BooleanType, false),
+    ProcedureParameter.optional(7, "global", DataTypes.BooleanType, false)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("status", DataTypes.IntegerType, nullable = true, Metadata.empty))
+  )
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val tableName = getArgValueOrDefault(args, PARAMETERS(0))
+    val queryType = getArgValueOrDefault(args, PARAMETERS(1)).get.asInstanceOf[String]
+    val viewName = getArgValueOrDefault(args, PARAMETERS(2)).get.asInstanceOf[String]
+    val beginInstance = getArgValueOrDefault(args, PARAMETERS(3)).get.asInstanceOf[String]
+    val endInstance = getArgValueOrDefault(args, PARAMETERS(4)).get.asInstanceOf[String]
+    val asOfInstant = getArgValueOrDefault(args, PARAMETERS(5)).get.asInstanceOf[String]
+    val replace = getArgValueOrDefault(args, PARAMETERS(6)).get.asInstanceOf[Boolean]
+    val global = getArgValueOrDefault(args, PARAMETERS(7)).get.asInstanceOf[Boolean]
+
+    val tablePath = getBasePath(tableName)
+
+    val sourceDataFrame = queryType match {
+      case DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL => if (asOfInstant.nonEmpty) {
+        sparkSession.read
+          .format("org.apache.hudi")
+          .option(DataSourceReadOptions.QUERY_TYPE.key, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
+          .option(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key, asOfInstant)
+          .load(tablePath)
+      } else {
+        sparkSession.read
+          .format("org.apache.hudi")
+          .option(DataSourceReadOptions.QUERY_TYPE.key, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
+          .load(tablePath)
+      }
+      case DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL =>
+        assert(beginInstance.nonEmpty && endInstance.nonEmpty, "when the query_type is incremental, begin_instance_time and end_instance_time must not be empty.")
+        sparkSession.read
+          .format("org.apache.hudi")
+          .option(DataSourceReadOptions.QUERY_TYPE.key, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+          .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key, beginInstance)
+          .option(DataSourceReadOptions.END_INSTANTTIME.key, endInstance)
+          .load(tablePath)
+      case DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL =>
+        sparkSession.read
+          .format("org.apache.hudi")
+          .option(DataSourceReadOptions.QUERY_TYPE.key, DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
+          .load(tablePath)
+    }
+    if (global) {
+      if (replace) {
+        sourceDataFrame.createOrReplaceGlobalTempView(viewName)
+      } else {
+        sourceDataFrame.createGlobalTempView(viewName)
+      }
+    } else {
+      if (replace) {
+        sourceDataFrame.createOrReplaceTempView(viewName)
+      } else {
+        sourceDataFrame.createTempView(viewName)
+      }
+    }
+    Seq(Row(0))
+  }
+
+  override def build = new CopyToTempView()
+}
+
+object CopyToTempView {
+  val NAME = "copy_to_temp_view"
+
+  def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] {
+    override def get() = new CopyToTempView()
+  }
+}
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index 0917c2b70e..b308480c6d 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -81,6 +81,7 @@ object HoodieProcedures {
       ,(ShowInvalidParquetProcedure.NAME, ShowInvalidParquetProcedure.builder)
       ,(HiveSyncProcedure.NAME, HiveSyncProcedure.builder)
       ,(BackupInvalidParquetProcedure.NAME, BackupInvalidParquetProcedure.builder)
+      ,(CopyToTempView.NAME, CopyToTempView.builder)
     )
   }
 }
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCopyToTempViewProcedure.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCopyToTempViewProcedure.scala
new file mode 100644
index 0000000000..13da259df1
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCopyToTempViewProcedure.scala
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.procedure
+
+import org.apache.spark.sql.hudi.HoodieSparkSqlTestBase
+
+class TestCopyToTempViewProcedure extends HoodieSparkSqlTestBase {
+
+
+  test("Test Call copy_to_temp_view Procedure with default params") {
+    withTempDir { tmp =>
+      val tableName = generateTableName
+      // create table
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  name string,
+           |  price double,
+           |  ts long
+           |) using hudi
+           | location '${tmp.getCanonicalPath}/$tableName'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  preCombineField = 'ts'
+           | )
+       """.stripMargin)
+
+      // insert data to table
+      spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
+      spark.sql(s"insert into $tableName select 2, 'a2', 20, 1500")
+      spark.sql(s"insert into $tableName select 3, 'a3', 30, 2000")
+      spark.sql(s"insert into $tableName select 4, 'a4', 40, 2500")
+
+      // Check required fields
+      checkExceptionContain(s"call copy_to_temp_view(table=>'$tableName')")(s"Argument: view_name is required")
+
+      val viewName = generateTableName
+
+      val row = spark.sql(s"""call copy_to_temp_view(table=>'$tableName',view_name=>'$viewName')""").collectAsList()
+      assert(row.size() == 1 && row.get(0).get(0) == 0)
+      val copyTableCount = spark.sql(s"""select count(1) from $viewName""").collectAsList()
+      assert(copyTableCount.size() == 1 && copyTableCount.get(0).get(0) == 4)
+    }
+  }
+
+  test("Test Call copy_to_temp_view Procedure with replace params") {
+    withTempDir { tmp =>
+      val tableName = generateTableName
+      // create table
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  name string,
+           |  price double,
+           |  ts long
+           |) using hudi
+           | location '${tmp.getCanonicalPath}/$tableName'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  preCombineField = 'ts'
+           | )
+       """.stripMargin)
+
+      // insert data to table
+      spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
+      spark.sql(s"insert into $tableName select 2, 'a2', 20, 1500")
+      spark.sql(s"insert into $tableName select 3, 'a3', 30, 2000")
+      spark.sql(s"insert into $tableName select 4, 'a4', 40, 2500")
+
+      // Check required fields
+      checkExceptionContain(s"call copy_to_temp_view(table=>'$tableName')")(s"Argument: view_name is required")
+
+      // 1: copyToTempView
+      val viewName = generateTableName
+      val row = spark.sql(s"""call copy_to_temp_view(table=>'$tableName',view_name=>'$viewName')""").collectAsList()
+      assert(row.size() == 1 && row.get(0).get(0) == 0)
+      val copyTableCount = spark.sql(s"""select count(1) from $viewName""").collectAsList()
+      assert(copyTableCount.size() == 1 && copyTableCount.get(0).get(0) == 4)
+
+      // 2: add new record to hudi table
+      spark.sql(s"insert into $tableName select 5, 'a5', 40, 2500")
+
+      // 3: copyToTempView with replace=false
+      checkExceptionContain(s"""call copy_to_temp_view(table=>'$tableName',view_name=>'$viewName',replace=>false)""")(s"Temporary view '$viewName' already exists")
+      // 4: copyToTempView with replace=true
+      val row2 = spark.sql(s"""call copy_to_temp_view(table=>'$tableName',view_name=>'$viewName',replace=>true)""").collectAsList()
+      assert(row2.size() == 1 && row2.get(0).get(0) == 0)
+      // 5: query new replace view ,count=5
+      val newViewCount = spark.sql(s"""select count(1) from $viewName""").collectAsList()
+      assert(newViewCount.size() == 1 && newViewCount.get(0).get(0) == 5)
+    }
+  }
+
+  test("Test Call copy_to_temp_view Procedure with global params") {
+    withTempDir { tmp =>
+      val tableName = generateTableName
+      // create table
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  name string,
+           |  price double,
+           |  ts long
+           |) using hudi
+           | location '${tmp.getCanonicalPath}/$tableName'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  preCombineField = 'ts'
+           | )
+       """.stripMargin)
+
+      // insert data to table
+      spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
+      spark.sql(s"insert into $tableName select 2, 'a2', 20, 1500")
+      spark.sql(s"insert into $tableName select 3, 'a3', 30, 2000")
+      spark.sql(s"insert into $tableName select 4, 'a4', 40, 2500")
+
+      // Check required fields
+      checkExceptionContain(s"call copy_to_temp_view(table=>'$tableName')")(s"Argument: view_name is required")
+
+      // 1: copyToTempView with global=false
+      val viewName = generateTableName
+      val row = spark.sql(s"""call copy_to_temp_view(table=>'$tableName',view_name=>'$viewName',global=>false)""").collectAsList()
+      assert(row.size() == 1 && row.get(0).get(0) == 0)
+      val copyTableCount = spark.sql(s"""select count(1) from $viewName""").collectAsList()
+      assert(copyTableCount.size() == 1 && copyTableCount.get(0).get(0) == 4)
+
+      // 2: query view in other session
+      var newSession = spark.newSession()
+      var hasException = false
+      val errorMsg = s"Table or view not found: $viewName"
+      try {
+        newSession.sql(s"""select count(1) from $viewName""")
+      } catch {
+        case e: Throwable if e.getMessage.contains(errorMsg) => hasException = true
+        case f: Throwable => fail("Exception should contain: " + errorMsg + ", error message: " + f.getMessage, f)
+      }
+      assertResult(true)(hasException)
+      // 3: copyToTempView with global=true,
+      val row2 = spark.sql(s"""call copy_to_temp_view(table=>'$tableName',view_name=>'$viewName',global=>true,replace=>true)""").collectAsList()
+      assert(row2.size() == 1 && row2.get(0).get(0) == 0)
+
+      newSession = spark.newSession()
+      // 4: query view in other session
+      val newViewCount = spark.sql(s"""select count(1) from $viewName""").collectAsList()
+      assert(newViewCount.size() == 1 && newViewCount.get(0).get(0) == 4)
+
+    }
+  }
+}


[hudi] 13/45: fix RowDataProjection with project and projectAsValues's NPE

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4e66857849d1ac793ad77d211d602295d08f827f
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Mon Aug 15 15:06:03 2022 +0800

    fix RowDataProjection with project and projectAsValues's NPE
---
 .../org/apache/hudi/util/RowDataProjection.java     | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/RowDataProjection.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/RowDataProjection.java
index 8076d982b9..51df29faae 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/RowDataProjection.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/RowDataProjection.java
@@ -24,6 +24,8 @@ import org.apache.flink.table.data.GenericRowData;
 import org.apache.flink.table.data.RowData;
 import org.apache.flink.table.types.logical.LogicalType;
 import org.apache.flink.table.types.logical.RowType;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
 
 import java.io.Serializable;
 import java.util.Arrays;
@@ -33,14 +35,19 @@ import java.util.List;
  * Utilities to project the row data with given positions.
  */
 public class RowDataProjection implements Serializable {
+  private static final Logger LOG = LogManager.getLogger(RowDataProjection.class);
+
   private static final long serialVersionUID = 1L;
 
   private final RowData.FieldGetter[] fieldGetters;
 
+  private final LogicalType[] types;
+
   private RowDataProjection(LogicalType[] types, int[] positions) {
     ValidationUtils.checkArgument(types.length == positions.length,
         "types and positions should have the equal number");
     this.fieldGetters = new RowData.FieldGetter[types.length];
+    this.types = types;
     for (int i = 0; i < types.length; i++) {
       final LogicalType type = types[i];
       final int pos = positions[i];
@@ -69,7 +76,12 @@ public class RowDataProjection implements Serializable {
   public RowData project(RowData rowData) {
     GenericRowData genericRowData = new GenericRowData(this.fieldGetters.length);
     for (int i = 0; i < this.fieldGetters.length; i++) {
-      final Object val = this.fieldGetters[i].getFieldOrNull(rowData);
+      Object val = null;
+      try {
+        val = rowData.isNullAt(i) ? null : this.fieldGetters[i].getFieldOrNull(rowData);
+      } catch (Throwable e) {
+        LOG.error(String.format("position=%s, fieldType=%s,\n data=%s", i, types[i].toString(), rowData.toString()));
+      }
       genericRowData.setField(i, val);
     }
     return genericRowData;
@@ -81,7 +93,12 @@ public class RowDataProjection implements Serializable {
   public Object[] projectAsValues(RowData rowData) {
     Object[] values = new Object[this.fieldGetters.length];
     for (int i = 0; i < this.fieldGetters.length; i++) {
-      final Object val = this.fieldGetters[i].getFieldOrNull(rowData);
+      Object val = null;
+      try {
+        val = rowData.isNullAt(i) ? null : this.fieldGetters[i].getFieldOrNull(rowData);
+      } catch (Throwable e) {
+        LOG.error(String.format("position=%s, fieldType=%s,\n data=%s", i, types[i].toString(), rowData.toString()));
+      }
       values[i] = val;
     }
     return values;
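
A minimal standalone sketch of the guard pattern this patch applies follows (plain Flink table APIs, not Hudi code; class and method names are hypothetical): an explicit isNullAt check before the field getter protects against rows that carry a null in a column whose LogicalType was declared non-nullable, the mismatch behind the NPE being worked around here.

    import org.apache.flink.table.data.GenericRowData;
    import org.apache.flink.table.data.RowData;
    import org.apache.flink.table.data.StringData;
    import org.apache.flink.table.types.logical.IntType;
    import org.apache.flink.table.types.logical.LogicalType;
    import org.apache.flink.table.types.logical.VarCharType;

    public class NullSafeProjectionSketch {

      // Project the given positions out of a row, returning null for any null field, even when
      // the declared type is non-nullable (the case that otherwise NPEs inside the getter).
      public static Object[] project(RowData row, LogicalType[] types, int[] positions) {
        Object[] values = new Object[types.length];
        for (int i = 0; i < types.length; i++) {
          RowData.FieldGetter getter = RowData.createFieldGetter(types[i], positions[i]);
          values[i] = row.isNullAt(positions[i]) ? null : getter.getFieldOrNull(row);
        }
        return values;
      }

      public static void main(String[] args) {
        GenericRowData row = new GenericRowData(2);
        row.setField(0, StringData.fromString("id1"));
        row.setField(1, null); // data violates the non-nullable INT declared below
        Object[] projected = project(row,
            new LogicalType[] {new VarCharType(10), new IntType(false)}, new int[] {0, 1});
        System.out.println(projected[0] + ", " + projected[1]); // prints: id1, null
      }
    }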


[hudi] 18/45: [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (#7091)

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 90c09053da765e735fee90a19ede2d78dba62a2b
Author: ForwardXu <fo...@gmail.com>
AuthorDate: Mon Oct 31 18:21:29 2022 +0800

    [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (#7091)
    
    * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql
    
    (cherry picked from commit 79ad3571db62b51e8fe8cc9183c8c787e9ef57fe)
---
 .../hudi/command/procedures/HoodieProcedures.scala |   1 +
 .../ShowCommitExtraMetadataProcedure.scala         | 138 +++++++++++++++++++++
 .../sql/hudi/procedure/TestCommitsProcedure.scala  |  54 +++++++-
 3 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index b308480c6d..fabfda9367 100644
--- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -82,6 +82,7 @@ object HoodieProcedures {
       ,(HiveSyncProcedure.NAME, HiveSyncProcedure.builder)
       ,(BackupInvalidParquetProcedure.NAME, BackupInvalidParquetProcedure.builder)
       ,(CopyToTempView.NAME, CopyToTempView.builder)
+      ,(ShowCommitExtraMetadataProcedure.NAME, ShowCommitExtraMetadataProcedure.builder)
     )
   }
 }
diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCommitExtraMetadataProcedure.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCommitExtraMetadataProcedure.scala
new file mode 100644
index 0000000000..1a8f4dd9e4
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCommitExtraMetadataProcedure.scala
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hudi.HoodieCLIUtils
+import org.apache.hudi.common.model.{HoodieCommitMetadata, HoodieReplaceCommitMetadata}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieTimeline}
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.Supplier
+import scala.collection.JavaConversions._
+
+class ShowCommitExtraMetadataProcedure() extends BaseProcedure with ProcedureBuilder {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType, None),
+    ProcedureParameter.optional(1, "limit", DataTypes.IntegerType, 100),
+    ProcedureParameter.optional(2, "instant_time", DataTypes.StringType, None),
+    ProcedureParameter.optional(3, "metadata_key", DataTypes.StringType, None)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("instant_time", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("action", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("metadata_key", DataTypes.StringType, nullable = true, Metadata.empty),
+    StructField("metadata_value", DataTypes.StringType, nullable = true, Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+    super.checkArgs(PARAMETERS, args)
+
+    val table = getArgValueOrDefault(args, PARAMETERS(0)).get.asInstanceOf[String]
+    val limit = getArgValueOrDefault(args, PARAMETERS(1)).get.asInstanceOf[Int]
+    val instantTime = getArgValueOrDefault(args, PARAMETERS(2))
+    val metadataKey = getArgValueOrDefault(args, PARAMETERS(3))
+
+    val hoodieCatalogTable = HoodieCLIUtils.getHoodieCatalogTable(sparkSession, table)
+    val basePath = hoodieCatalogTable.tableLocation
+    val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+    val activeTimeline = metaClient.getActiveTimeline
+    val timeline = activeTimeline.getCommitsTimeline.filterCompletedInstants
+
+    val hoodieInstantOption: Option[HoodieInstant] = if (instantTime.isEmpty) {
+      getCommitForLastInstant(timeline)
+    } else {
+      getCommitForInstant(timeline, instantTime.get.asInstanceOf[String])
+    }
+
+    if (hoodieInstantOption.isEmpty) {
+      throw new HoodieException(s"Commit $instantTime not found in Commits $timeline.")
+    }
+
+    val commitMetadataOptional = getHoodieCommitMetadata(timeline, hoodieInstantOption)
+
+    if (commitMetadataOptional.isEmpty) {
+      throw new HoodieException(s"No commitMetadata found for commit $instantTime in Commits $timeline.")
+    }
+
+    val meta = commitMetadataOptional.get
+    val timestamp: String = hoodieInstantOption.get.getTimestamp
+    val action: String = hoodieInstantOption.get.getAction
+    val metadatas: util.Map[String, String] = if (metadataKey.isEmpty) {
+      meta.getExtraMetadata
+    } else {
+      meta.getExtraMetadata.filter(r => r._1.equals(metadataKey.get.asInstanceOf[String].trim))
+    }
+
+    val rows = new util.ArrayList[Row]
+    metadatas.foreach(r => rows.add(Row(timestamp, action, r._1, r._2)))
+    rows.stream().limit(limit).toArray().map(r => r.asInstanceOf[Row]).toList
+  }
+
+  override def build: Procedure = new ShowCommitExtraMetadataProcedure()
+
+  private def getCommitForLastInstant(timeline: HoodieTimeline): Option[HoodieInstant] = {
+    val instantOptional = timeline.getReverseOrderedInstants
+      .findFirst
+    if (instantOptional.isPresent) {
+      Option.apply(instantOptional.get())
+    } else {
+      Option.empty
+    }
+  }
+
+  private def getCommitForInstant(timeline: HoodieTimeline, instantTime: String): Option[HoodieInstant] = {
+    val instants: util.List[HoodieInstant] = util.Arrays.asList(
+      new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, instantTime),
+      new HoodieInstant(false, HoodieTimeline.REPLACE_COMMIT_ACTION, instantTime),
+      new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, instantTime))
+
+    val hoodieInstant: Option[HoodieInstant] = instants.find((i: HoodieInstant) => timeline.containsInstant(i))
+    hoodieInstant
+  }
+
+  private def getHoodieCommitMetadata(timeline: HoodieTimeline, hoodieInstant: Option[HoodieInstant]): Option[HoodieCommitMetadata] = {
+    if (hoodieInstant.isDefined) {
+      if (hoodieInstant.get.getAction == HoodieTimeline.REPLACE_COMMIT_ACTION) {
+        Option(HoodieReplaceCommitMetadata.fromBytes(timeline.getInstantDetails(hoodieInstant.get).get,
+          classOf[HoodieReplaceCommitMetadata]))
+      } else {
+        Option(HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(hoodieInstant.get).get,
+          classOf[HoodieCommitMetadata]))
+      }
+    } else {
+      Option.empty
+    }
+  }
+}
+
+object ShowCommitExtraMetadataProcedure {
+  val NAME = "show_commit_extra_metadata"
+
+  def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] {
+    override def get() = new ShowCommitExtraMetadataProcedure()
+  }
+}
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala
index 2840b22434..03cf26800d 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestCommitsProcedure.scala
@@ -61,9 +61,7 @@ class TestCommitsProcedure extends HoodieSparkProcedureTestBase {
       // collect archived commits for table
       val endTs = commits(0).get(0).toString
       val archivedCommits = spark.sql(s"""call show_archived_commits(table => '$tableName', end_ts => '$endTs')""").collect()
-      assertResult(4) {
-        archivedCommits.length
-      }
+      assertResult(4){archivedCommits.length}
     }
   }
 
@@ -109,9 +107,7 @@ class TestCommitsProcedure extends HoodieSparkProcedureTestBase {
       // collect archived commits for table
       val endTs = commits(0).get(0).toString
       val archivedCommits = spark.sql(s"""call show_archived_commits_metadata(table => '$tableName', end_ts => '$endTs')""").collect()
-      assertResult(4) {
-        archivedCommits.length
-      }
+      assertResult(4){archivedCommits.length}
     }
   }
 
@@ -288,4 +284,50 @@ class TestCommitsProcedure extends HoodieSparkProcedureTestBase {
       assertResult(1){result.length}
     }
   }
+
+  test("Test Call show_commit_extra_metadata Procedure") {
+    withTempDir { tmp =>
+      val tableName = generateTableName
+      // create table
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  name string,
+           |  price double,
+           |  ts long
+           |) using hudi
+           | location '${tmp.getCanonicalPath}/$tableName'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  preCombineField = 'ts'
+           | )
+     """.stripMargin)
+
+      // insert data to table
+      spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
+      spark.sql(s"insert into $tableName select 2, 'a2', 20, 1500")
+
+      // Check required fields
+      checkExceptionContain(s"""call show_commit_extra_metadata()""")(
+        s"arguments is empty")
+
+      // collect commits for table
+      val commits = spark.sql(s"""call show_commits(table => '$tableName', limit => 10)""").collect()
+      assertResult(2){commits.length}
+
+      val instant_time = commits(0).get(0).toString
+      // get the specified instantTime's extraMetadatas
+      val metadatas1 = spark.sql(s"""call show_commit_extra_metadata(table => '$tableName', instant_time => '$instant_time')""").collect()
+      assertResult(true){metadatas1.length > 0}
+
+      // get last instantTime's extraMetadatas
+      val metadatas2 = spark.sql(s"""call show_commit_extra_metadata(table => '$tableName')""").collect()
+      assertResult(true){metadatas2.length > 0}
+
+      // get last instantTime's extraMetadatas and filter extraMetadatas with metadata_key
+      val metadatas3 = spark.sql(s"""call show_commit_extra_metadata(table => '$tableName', metadata_key => 'schema')""").collect()
+      assertResult(1){metadatas3.length}
+    }
+  }
 }
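
For reference, the procedure added above can be invoked from any Spark session that has the Hudi SQL extensions enabled, exactly as the Scala test does. A hedged Java sketch follows (the table name 'hudi_tbl' and the 'schema' metadata key are placeholders); the returned columns follow OUTPUT_TYPE above: instant_time, action, metadata_key, metadata_value.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ShowCommitExtraMetadataExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("show-commit-extra-metadata-example")
            .master("local[*]")
            // the Hudi session extension provides the CALL syntax and procedure registry
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
            .getOrCreate();

        // Latest completed commit, filtered to a single extra-metadata key.
        Dataset<Row> rows = spark.sql(
            "call show_commit_extra_metadata(table => 'hudi_tbl', metadata_key => 'schema')");
        rows.show(false);

        spark.stop();
      }
    }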


[hudi] 27/45: fix none index partition format

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f02fef936b7db27137229c4bd64397a7456b915c
Author: superche <su...@tencent.com>
AuthorDate: Thu Nov 17 16:02:31 2022 +0800

    fix none index partition format
---
 .../java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java    | 11 ++++++++---
 .../apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java    |  4 ++--
 .../main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java   |  3 ++-
 .../main/java/org/apache/hudi/table/HoodieTableFactory.java   |  7 +++++++
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java
index 01536f95e4..6759c3dc8e 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/EmptyAvroKeyGenerator.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.keygen;
 
+import java.io.IOException;
 import org.apache.avro.generic.GenericRecord;
 
 import org.apache.hudi.common.model.HoodieKey;
@@ -36,13 +37,13 @@ import java.util.stream.Collectors;
 /**
  * Avro key generator for empty record key Hudi tables.
  */
-public class EmptyAvroKeyGenerator extends BaseKeyGenerator {
+public class EmptyAvroKeyGenerator extends TimestampBasedAvroKeyGenerator {
 
   private static final Logger LOG = LogManager.getLogger(EmptyAvroKeyGenerator.class);
   public static final String EMPTY_RECORD_KEY = HoodieKey.EMPTY_RECORD_KEY;
   private static final List<String> EMPTY_RECORD_KEY_FIELD_LIST = Collections.emptyList();
 
-  public EmptyAvroKeyGenerator(TypedProperties props) {
+  public EmptyAvroKeyGenerator(TypedProperties props) throws IOException {
     super(props);
     if (config.containsKey(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key())) {
       LOG.warn(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key() + " will be ignored while using "
@@ -60,6 +61,10 @@ public class EmptyAvroKeyGenerator extends BaseKeyGenerator {
 
   @Override
   public String getPartitionPath(GenericRecord record) {
-    return KeyGenUtils.getRecordPartitionPath(record, getPartitionPathFields(), hiveStylePartitioning, encodePartitionPath, isConsistentLogicalTimestampEnabled());
+    if (this.timestampType == TimestampType.NO_TIMESTAMP) {
+      return KeyGenUtils.getRecordPartitionPath(record, getPartitionPathFields(), hiveStylePartitioning, encodePartitionPath, isConsistentLogicalTimestampEnabled());
+    } else {
+      return super.getPartitionPath(record);
+    }
   }
 }
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java
index 60ccc694f9..77863fd869 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java
@@ -49,11 +49,11 @@ import static java.util.concurrent.TimeUnit.SECONDS;
  */
 public class TimestampBasedAvroKeyGenerator extends SimpleAvroKeyGenerator {
   public enum TimestampType implements Serializable {
-    UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR
+    UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR, NO_TIMESTAMP
   }
 
   private final TimeUnit timeUnit;
-  private final TimestampType timestampType;
+  protected final TimestampType timestampType;
   private final String outputDateFormat;
   private transient Option<DateTimeFormatter> inputFormatter;
   private transient DateTimeFormatter partitionFormatter;
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java
index 9e4090a537..2ba0d5cf32 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/EmptyKeyGenerator.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.keygen;
 
+import java.io.IOException;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
@@ -38,7 +39,7 @@ public class EmptyKeyGenerator extends BuiltinKeyGenerator {
 
   private final EmptyAvroKeyGenerator emptyAvroKeyGenerator;
 
-  public EmptyKeyGenerator(TypedProperties config) {
+  public EmptyKeyGenerator(TypedProperties config) throws IOException {
     super(config);
     this.emptyAvroKeyGenerator = new EmptyAvroKeyGenerator(config);
     this.recordKeyFields = emptyAvroKeyGenerator.getRecordKeyFieldNames();
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
index b0380c5878..612aa623e5 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
@@ -196,6 +196,13 @@ public class HoodieTableFactory implements DynamicTableSourceFactory, DynamicTab
       conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, EmptyAvroKeyGenerator.class.getName());
       conf.setString(FlinkOptions.OPERATION, WriteOperationType.INSERT.value());
       conf.setString(FlinkOptions.MERGE_TYPE, REALTIME_SKIP_MERGE);
+      TimestampBasedAvroKeyGenerator.TimestampType timestampType = TimestampBasedAvroKeyGenerator.TimestampType
+          .valueOf(conf.toMap().getOrDefault(KeyGeneratorOptions.Config.TIMESTAMP_TYPE_FIELD_PROP, TimestampBasedAvroKeyGenerator.TimestampType.NO_TIMESTAMP.name()));
+      if (timestampType == TimestampBasedAvroKeyGenerator.TimestampType.NO_TIMESTAMP) {
+        conf.setString(KeyGeneratorOptions.Config.TIMESTAMP_TYPE_FIELD_PROP, TimestampBasedAvroKeyGenerator.TimestampType.NO_TIMESTAMP.name());
+        // the option value is not actually used; it is set only to pass validation
+        conf.setString(KeyGeneratorOptions.Config.TIMESTAMP_OUTPUT_DATE_FORMAT_PROP, FlinkOptions.PARTITION_FORMAT_HOUR);
+      }
     }
   }
 


[hudi] 08/45: fix zhiyan reporter for metadata

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6dbe53e6232de8b85c7548fefda670d6f4359ec1
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Sun Jun 5 15:06:30 2022 +0800

    fix zhiyan reporter for metadata
---
 .../main/java/org/apache/hudi/config/HoodieWriteConfig.java   |  8 ++++++++
 .../apache/hudi/metadata/HoodieBackedTableMetadataWriter.java | 11 +++++++++++
 2 files changed, 19 insertions(+)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 23bc0ee329..9610ad382b 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -2234,6 +2234,8 @@ public class HoodieWriteConfig extends HoodieConfig {
     private boolean isPreCommitValidationConfigSet = false;
     private boolean isMetricsJmxConfigSet = false;
     private boolean isMetricsGraphiteConfigSet = false;
+
+    private boolean isMetricsZhiyanConfig = false;
     private boolean isLayoutConfigSet = false;
     private boolean isTdbankConfigSet = false;
 
@@ -2429,6 +2431,12 @@ public class HoodieWriteConfig extends HoodieConfig {
       return this;
     }
 
+    public Builder withMetricsZhiyanConfig(HoodieMetricsZhiyanConfig metricsZhiyanConfig) {
+      writeConfig.getProps().putAll(metricsZhiyanConfig.getProps());
+      isMetricsZhiyanConfig = true;
+      return this;
+    }
+
     public Builder withPreCommitValidatorConfig(HoodiePreCommitValidatorConfig validatorConfig) {
       writeConfig.getProps().putAll(validatorConfig.getProps());
       isPreCommitValidationConfigSet = true;
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 962875fb92..405db43a51 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -65,6 +65,7 @@ import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsGraphiteConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsJmxConfig;
+import org.apache.hudi.config.metrics.HoodieMetricsZhiyanConfig;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIndexException;
 import org.apache.hudi.exception.HoodieMetadataException;
@@ -316,6 +317,16 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta
               .toJmxHost(writeConfig.getJmxHost())
               .build());
           break;
+        case ZHIYAN:
+          builder.withMetricsZhiyanConfig(HoodieMetricsZhiyanConfig.newBuilder()
+              .withReportServiceUrl(writeConfig.getZhiyanReportServiceURL())
+              .withApiTimeout(writeConfig.getZhiyanApiTimeoutSeconds())
+              .withAppMask(writeConfig.getZhiyanAppMask())
+              .withReportPeriodSeconds(writeConfig.getZhiyanReportPeriodSeconds())
+              .withSeclvlEnvName(writeConfig.getZhiyanSeclvlEnvName())
+              .withJobName(writeConfig.getZhiyanHoodieJobName())
+              .build());
+          break;
         case DATADOG:
         case PROMETHEUS:
         case PROMETHEUS_PUSHGATEWAY:
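
The hunk above only propagates the Zhiyan settings into the metadata table's writer; the reporter itself is configured on the regular write config through the new withMetricsZhiyanConfig builder method added to HoodieWriteConfig. A rough, hedged sketch of that wiring, reusing only builder methods visible in this patch; the endpoint, app mask, period and job name are placeholders and the parameter types are assumed.

    import org.apache.hudi.config.HoodieWriteConfig;
    import org.apache.hudi.config.metrics.HoodieMetricsZhiyanConfig;

    public class ZhiyanMetricsConfigSketch {
      public static HoodieWriteConfig buildWriteConfig(String basePath) {
        return HoodieWriteConfig.newBuilder()
            .withPath(basePath)
            .forTable("my_table") // placeholder table name
            .withMetricsZhiyanConfig(HoodieMetricsZhiyanConfig.newBuilder()
                .withReportServiceUrl("http://zhiyan.example:8080") // placeholder endpoint
                .withAppMask("my-app-mask")                         // placeholder app mask
                .withReportPeriodSeconds(30)                        // assumed to take seconds
                .withJobName("hudi-write-job")                      // placeholder job name
                .build())
            .build();
      }
    }

Note that metrics still have to be switched on through HoodieMetricsConfig with the reporter type set to ZHIYAN, which is what the ZHIYAN case in the switch above reacts to.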


[hudi] 11/45: fix file not exists for getFileSize

Posted by fo...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch release-0.12.1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e45564102b0d7e0d4ff35152262274de7737af0e
Author: XuQianJin-Stars <fo...@apache.com>
AuthorDate: Sun Oct 9 17:20:41 2022 +0800

    fix file not exists for getFileSize
---
 .../src/main/java/org/apache/hudi/common/fs/FSUtils.java   | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 0257f8015b..1350108a11 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -191,8 +191,18 @@ public class FSUtils {
     return fullFileName.split("_")[2].split("\\.")[0];
   }
 
-  public static long getFileSize(FileSystem fs, Path path) throws IOException {
-    return fs.getFileStatus(path).getLen();
+  public static long getFileSize(FileSystem fs, Path path) {
+    try {
+      if (fs.exists(path)) {
+        return fs.getFileStatus(path).getLen();
+      } else {
+        LOG.warn("getFileSize: " + path + " file not exists!");
+        return 0L;
+      }
+    } catch (IOException e) {
+      LOG.error("getFileSize: " + path + " error:", e);
+      return 0L;
+    }
   }
 
   public static String getFileId(String fullFileName) {