Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/09/07 01:05:00 UTC

[jira] [Updated] (HUDI-4765) When inserting data via spark-sql vs. spark-shell, _hoodie_record_key generation logic differs, which might affect data upserts

     [ https://issues.apache.org/jira/browse/HUDI-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4765:
--------------------------------------
    Fix Version/s: 0.12.1

> When inserting data via spark-sql vs. spark-shell, _hoodie_record_key generation logic differs, which might affect data upserts
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-4765
>                 URL: https://issues.apache.org/jira/browse/HUDI-4765
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark, spark-sql
>    Affects Versions: 0.11.1
>         Environment: Spark 3.1.1
> Hudi 0.11.1
>            Reporter: Yao Zhang
>            Priority: Critical
>             Fix For: 0.12.1
>
>
> Create a table using spark-sql:
> {code:java}
> create table hudi_mor_tbl (
>   id int,
>   name string,
>   price double,
>   ts bigint
> ) using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts'
> )
> location 'hdfs:///hudi/hudi_mor_tbl'; {code}
> Then insert data via spark-shell and spark-sql, respectively:
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SaveMode.Append
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> 
> // Build a one-row DataFrame matching the table schema
> val fields = Array(
>       StructField("id", IntegerType, true),
>       StructField("name", StringType, true),
>       StructField("price", DoubleType, true),
>       StructField("ts", LongType, true)
>   )
> val simpleSchema = StructType(fields)
> val data = Seq(Row(2, "a2", 200.0, 100L))
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)
> 
> // Append to the same table path created by the spark-sql DDL above
> df.write.format("hudi").
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "id").
>   option(TABLE_NAME, "hudi_mor_tbl").
>   option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
>   mode(Append).
>   save("hdfs:///hudi/hudi_mor_tbl") {code}
> {code:java}
> insert into hudi_mor_tbl select 1, 'a1', 20, 1000; {code}
> After that, querying the table shows the two rows below:
> {code:java}
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> |  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
> |  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+ {code}
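> For reference, the output above can be reproduced in spark-shell either through SQL or by reading the table path directly (a minimal sketch; both are plain Spark/Hudi reads):
> {code:java}
> // Query via the SQL table name...
> spark.sql("select * from hudi_mor_tbl").show()
> // ...or read the Hudi table directly by path
> spark.read.format("hudi").load("hdfs:///hudi/hudi_mor_tbl").show()
> {code}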
> The '_hoodie_record_key' field for the spark-sql inserted row is 'id:1', while for the spark-shell inserted row it is '2'. It seems that spark-sql constructs '_hoodie_record_key' as '[primaryKey_field_name]:[primaryKey_field_value]', which differs from spark-shell.
> As a result, if we insert a row via spark-sql and then upsert it via spark-shell, we end up with two duplicate rows. That is not what we expect.
> Did I miss some configuration that might explain this? If not, I personally think the default record key generation logic should be made consistent.
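> If the cause is that spark-sql defaults to a ComplexKeyGenerator-style key ('[field]:[value]') while the DataFrame writer defaults to SimpleKeyGenerator, one possible workaround (a sketch only; I have not verified this is the root cause) would be to pin the key generator on the spark-shell write so both paths produce the same key format:
> {code:java}
> // Sketch: force ComplexKeyGenerator on the DataFrame write so its record
> // keys take the same "id:<value>" form as the spark-sql inserts.
> // Assumes the spark-sql path indeed produces ComplexKeyGenerator-style keys.
> df.write.format("hudi").
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "id").
>   option(TABLE_NAME, "hudi_mor_tbl").
>   option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
>   mode(Append).
>   save("hdfs:///hudi/hudi_mor_tbl") {code}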



--
This message was sent by Atlassian Jira
(v8.20.10#820010)