Posted to commits@hudi.apache.org by "xi chaomin (Jira)" <ji...@apache.org> on 2022/10/18 08:22:00 UTC
[jira] [Updated] (HUDI-5047) With hoodie.datasource.write.drop.partition.columns=true, updated records can't be read in a MOR table.
[ https://issues.apache.org/jira/browse/HUDI-5047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xi chaomin updated HUDI-5047:
-----------------------------
Description:
When I sync to Hive with hoodie.datasource.write.drop.partition.columns=false, a query with the partition column in the WHERE clause can't return the matching records, e.g. "select * from mor_table where partition=$partition".
So I set hoodie.datasource.write.drop.partition.columns=true instead.
Steps to reproduce:
# Write data and query
{code:java}
import org.apache.spark.sql.SaveMode

val df1 = Seq(
  ("100", "1001", "2022-01-01"),
  ("200", "1002", "2022-01-01"),
  ("300", "1003", "2022-01-01"),
  ("400", "1004", "2022-01-02"),
  ("500", "1005", "2022-01-02"),
  ("600", "1006", "2022-01-02")
).toDF("id", "name", "dt")

val hudiOptions = Map(
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.datasource.write.precombine.field" -> "name",
  "hoodie.datasource.write.partitionpath.field" -> "dt",
  "hoodie.index.type" -> "BLOOM",
  "hoodie.table.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator",
  "hoodie.datasource.write.drop.partition.columns" -> "true"
)

df1.write.format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(basePath)

val viewDF = spark.read
  .format("org.apache.hudi")
  .load(basePath)
viewDF.createOrReplaceTempView(tableName)
spark.sql(s"select * from $tableName where dt='2022-01-01'").show() {code}
The query returns the expected records.
# Update the records and query again
{code:java}
import org.apache.spark.sql.SaveMode

val df2 = Seq(
  ("100", "10010", "2022-01-01"),
  ("200", "10020", "2022-01-01"),
  ("300", "10030", "2022-01-01"),
  ("400", "10040", "2022-01-02"),
  ("500", "10050", "2022-01-02"),
  ("600", "10060", "2022-01-02")
).toDF("id", "name", "dt")

df2.write.format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(basePath)

val viewDF2 = spark.read
  .format("org.apache.hudi")
  .load(basePath)
viewDF2.createOrReplaceTempView(tableName)
spark.sql(s"select * from $tableName where dt='2022-01-01'").show() {code}
The same query now returns 0 records.
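Not part of the original report, but a diagnostic sketch that may help narrow this down (it reuses {{spark}} and {{basePath}} from the steps above): with drop.partition.columns=true the partition column is not stored in the data files and has to be reconstructed from the partition path at read time, so comparing the snapshot view against the read-optimized view (base files only, no log-file merge) shows whether the merge path is where "dt" gets lost after the upsert:
{code:java}
// Diagnostic sketch only; assumes the same spark session and basePath as above.
// Snapshot query: base files merged with log files.
val snapshotDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)
snapshotDF.printSchema() // is "dt" still present after the upsert?
snapshotDF.filter("dt = '2022-01-01'").show()

// Read-optimized query: base files only, no log merge.
val roDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)
roDF.filter("dt = '2022-01-01'").show()
{code}
If the read-optimized query still filters correctly while the snapshot query returns nothing, that would point at the log-file merge path failing to rebuild the dropped partition column.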
> With hoodie.datasource.write.drop.partition.columns=true, updated records can't be read in a MOR table.
> -------------------------------------------------------------------------------------------------------
>
> Key: HUDI-5047
> URL: https://issues.apache.org/jira/browse/HUDI-5047
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: xi chaomin
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)