Posted to commits@hudi.apache.org by "Yao Zhang (Jira)" <ji...@apache.org> on 2022/09/01 09:53:00 UTC

[jira] [Created] (HUDI-4765) When inserting data via spark-sql versus spark-shell, _hoodie_record_key generation logic is different, which might affect data upsert

Yao Zhang created HUDI-4765:
-------------------------------

             Summary: When inserting data via spark-sql versus spark-shell, _hoodie_record_key generation logic is different, which might affect data upsert
                 Key: HUDI-4765
                 URL: https://issues.apache.org/jira/browse/HUDI-4765
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark, spark-sql
    Affects Versions: 0.11.1
         Environment: Spark 3.1.1
Hudi 0.11.1
            Reporter: Yao Zhang


Create table using spark-sql:
{code:java}
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
)
location 'hdfs:///hudi/hudi_mor_tbl'; {code}
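One way to see which key generator the SQL path recorded for this table is to inspect the hoodie.properties file under the table's .hoodie directory. A minimal spark-shell sketch, assuming the table path above and that the relevant property names (e.g. hoodie.table.keygenerator.class) match this Hudi version:
{code:java}
// Print the table metadata Hudi persisted at create time, including the
// key generator class (exact property names may vary across Hudi versions).
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(new java.net.URI("hdfs:///hudi/hudi_mor_tbl"), hadoopConf)
val in = fs.open(new Path("hdfs:///hudi/hudi_mor_tbl/.hoodie/hoodie.properties"))
scala.io.Source.fromInputStream(in).getLines().foreach(println)
in.close() {code}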
And then insert data via spark-shell and spark-sql respectively:
{code:java}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SaveMode.Append
// Option-key constants (PRECOMBINE_FIELD_OPT_KEY, RECORDKEY_FIELD_OPT_KEY,
// TABLE_TYPE_OPT_KEY, TABLE_NAME) come from the Hudi Spark datasource:
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val fields = Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("price", DoubleType, true),
      StructField("ts", LongType, true)
  )
val simpleSchema = StructType(fields)
val data = Seq(Row(2, "a2", 200.0, 100L))
// createDataFrame needs an RDD[Row] (or a java.util.List) plus a schema
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)
df.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "id").
  option(TABLE_NAME, "hudi_mor_tbl").
  option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
  mode(Append).
  save("hdfs:///hudi/hudi_mor_tbl") {code}
{code:java}
insert into hudi_mor_tbl select 1, 'a1', 20, 1000; {code}
After that, querying the table shows the following two rows:
{code:java}
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
|  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
|  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+ {code}
The '_hoodie_record_key' value for the row inserted via spark-sql is 'id:1', while for the row inserted via spark-shell it is simply '2'. It seems that spark-sql constructs '_hoodie_record_key' as '[primaryKey_field_name]:[primaryKey_field_value]', whereas spark-shell uses only the key value.
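The 'field:value' format looks like what Hudi's ComplexKeyGenerator produces, so one plausible workaround (my assumption, not a confirmed fix) would be to pin the datasource write to that generator via hoodie.datasource.write.keygenerator.class so both paths produce the same keys:
{code:java}
// Hypothetical workaround sketch: force the spark-shell write path to use
// ComplexKeyGenerator so it emits "id:1"-style keys like the spark-sql path.
df.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "id").
  option(TABLE_NAME, "hudi_mor_tbl").
  option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
  option("hoodie.datasource.write.keygenerator.class",
         "org.apache.hudi.keygen.ComplexKeyGenerator").
  mode(Append).
  save("hdfs:///hudi/hudi_mor_tbl") {code}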

As a result, if we insert one row via spark-sql and then upsert it via spark-shell, we get two duplicate rows instead of one updated row, as sketched below. That is not what we expect.
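For concreteness, a minimal sketch of the upsert that should reproduce the duplicate, reusing the imports and schema from the spark-shell snippet above (OPERATION_OPT_KEY follows the same deprecated-constant style as the other keys):
{code:java}
// Upsert id = 1 from spark-shell. Its key generator produces key "1", which
// does not match the "id:1" key written by spark-sql, so Hudi inserts a
// second row instead of updating the existing one.
val upsertData = Seq(Row(1, "a1_new", 30.0, 2000L))
val upsertDf = spark.createDataFrame(spark.sparkContext.parallelize(upsertData), simpleSchema)
upsertDf.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "id").
  option(TABLE_NAME, "hudi_mor_tbl").
  option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
  option(OPERATION_OPT_KEY, "upsert").
  mode(Append).
  save("hdfs:///hudi/hudi_mor_tbl") {code}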

Did I miss some configuration that might lead to this issue? If not, I personally think the default record key generation logic should be made consistent between the two paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)