Posted to commits@hudi.apache.org by "Yao Zhang (Jira)" <ji...@apache.org> on 2022/09/01 09:53:00 UTC
[jira] [Created] (HUDI-4765) Compared inserting data via spark-sql with spark-shell, _hoodie_record_key generation logic is different, which might affect data upsert
Yao Zhang created HUDI-4765:
-------------------------------
Summary: Compared inserting data via spark-sql with spark-shell, _hoodie_record_key generation logic is different, which might affect data upsert
Key: HUDI-4765
URL: https://issues.apache.org/jira/browse/HUDI-4765
Project: Apache Hudi
Issue Type: Bug
Components: spark, spark-sql
Affects Versions: 0.11.1
Environment: Spark 3.1.1
Hudi 0.11.1
Reporter: Yao Zhang
Create table using spark-sql:
{code:java}
create table hudi_mor_tbl (
id int,
name string,
price double,
ts bigint
) using hudi
tblproperties (
type = 'mor',
primaryKey = 'id',
preCombineField = 'ts'
)
location 'hdfs:///hudi/hudi_mor_tbl'; {code}
And then insert data via spark-shell and spark-sql respectively:
{code:java}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val fields = Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("price", DoubleType, true),
StructField("ts", LongType, true)
)
val simpleSchema = StructType(fields)
val data = Seq(Row(2, "a2", 200.0, 100L))
val df = spark.createDataFrame(data, simpleSchema)
df.write.format("hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "id").
option(TABLE_NAME, "hudi_mor_tbl").
option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
mode(Append).
save("hdfs:///hudi/hudi_mor_tbl") {code}
{code:java}
insert into hudi_mor_tbl select 1, 'a1', 20, 1000; {code}
Querying the table afterwards shows the two rows below:
{code:java}
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|price| ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
| 20220902012710792|20220902012710792...| 2| |c3eff8c8-fa47-48c...| 2| a2|200.0| 100|
| 20220902012813658|20220902012813658...| id:1| |c3eff8c8-fa47-48c...| 1| a1| 20.0|1000|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+ {code}
The '_hoodie_record_key' of the row inserted via spark-sql is 'id:1', while that of the row inserted via spark-shell is '2'. It seems that spark-sql builds '_hoodie_record_key' as '[primaryKey_field_name]:[primaryKey_field_value]', which differs from what spark-shell does.
As a result, if we insert a row via spark-sql and then upsert it via spark-shell, we end up with two duplicate rows instead of one updated row. That is not what we expect.
Did I miss a configuration that leads to this behavior? If not, I think the default record key generation logic should be made consistent.
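To make the duplicate-row effect concrete, here is a minimal sketch in plain Scala. It only mimics the two key shapes seen in the query output above and is not Hudi's actual key generation code. The names `plainKey`, `prefixedKey` and `upsert`, and the updated row values, are hypothetical and exist only for illustration.

```scala
// Illustrative only: these helpers mimic the two key shapes seen in the
// query output; they are NOT Hudi's actual key generator code.
object RecordKeyMismatch {
  // Key shape observed for the spark-shell write: the bare field value.
  def plainKey(value: Any): String = value.toString

  // Key shape observed for the spark-sql insert: "<field>:<value>".
  def prefixedKey(field: String, value: Any): String = s"$field:$value"

  // A toy "table" indexed by record key: an upsert replaces the entry with
  // the same key, or adds a new entry when no key matches.
  def upsert(table: Map[String, String], key: String, row: String): Map[String, String] =
    table + (key -> row)

  def main(args: Array[String]): Unit = {
    // Insert id = 1 using the spark-sql key shape.
    val afterSqlInsert = upsert(Map.empty, prefixedKey("id", 1), "1,a1,20.0,1000")
    // Upsert the same id using the spark-shell key shape (made-up new values).
    val afterShellUpsert = upsert(afterSqlInsert, plainKey(1), "1,a1,25.0,2000")
    // Both entries describe id = 1, but the keys differ, so nothing was replaced.
    afterShellUpsert.foreach { case (k, v) => println(s"$k -> $v") }
    println(s"rows for id 1: ${afterShellUpsert.size}")
  }
}
```

Because 'id:1' and '1' are different strings, the second write is treated as a new record rather than an update, which matches the duplicate rows described above.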