Posted to commits@hudi.apache.org by "xi chaomin (Jira)" <ji...@apache.org> on 2022/10/18 08:22:00 UTC

[jira] [Updated] (HUDI-5047) With hoodie.datasource.write.drop.partition.columns=true, updated records can't be read in a MOR table.

     [ https://issues.apache.org/jira/browse/HUDI-5047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xi chaomin updated HUDI-5047:
-----------------------------
    Description: 
When I sync to Hive with hoodie.datasource.write.drop.partition.columns=false, queries that filter on the partition column in the WHERE clause can't return the matching records, e.g. "select * from mor_table where partition=$partition".

So I set hoodie.datasource.write.drop.partition.columns=true instead.

Steps to reproduce:
 # Write data and query

{code:java}
    // Assumes a SparkSession `spark` and pre-defined `tableName` / `basePath`.
    import org.apache.spark.sql.SaveMode.Append
    import spark.implicits._

    val df1 = Seq(
      ("100", "1001", "2022-01-01"),
      ("200", "1002", "2022-01-01"),
      ("300", "1003", "2022-01-01"),
      ("400", "1004", "2022-01-02"),
      ("500", "1005", "2022-01-02"),
      ("600", "1006", "2022-01-02")
    ).toDF("id", "name", "dt")

    val hudiOptions = Map(
      "hoodie.table.name" -> tableName,
      "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
      "hoodie.datasource.write.operation" -> "upsert",
      "hoodie.datasource.write.recordkey.field" -> "id",
      "hoodie.datasource.write.precombine.field" -> "name",
      "hoodie.datasource.write.partitionpath.field" -> "dt",
      "hoodie.index.type" -> "BLOOM",
      "hoodie.table.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator",
      "hoodie.datasource.write.drop.partition.columns" -> "true"
    )

    df1.write.format("hudi")
      .options(hudiOptions)
      .mode(Append)
      .save(basePath)

    val viewDF = spark
      .read
      .format("org.apache.hudi")
      .load(basePath)

    viewDF.createOrReplaceTempView(tableName)
    spark.sql(s"select * from $tableName where dt='2022-01-01'").show() {code}
The query returns the expected records.
 # Update the records and query again

{code:java}
    val df2 = Seq(
      ("100", "10010", "2022-01-01"),
      ("200", "10020", "2022-01-01"),
      ("300", "10030", "2022-01-01"),
      ("400", "10040", "2022-01-02"),
      ("500", "10050", "2022-01-02"),
      ("600", "10060", "2022-01-02")
    ).toDF("id", "name", "dt")

    df2.write.format("hudi")
      .options(hudiOptions)
      .mode(Append)
      .save(basePath)

    val viewDF2 = spark
      .read
      .format("org.apache.hudi")
      .load(basePath)

    viewDF2.createOrReplaceTempView(tableName)
    spark.sql(s"select * from $tableName where dt='2022-01-01'").show() {code}
The same query now returns 0 records.
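The symptom is consistent with updated rows losing the partition column on the merged read. Below is a plain-Scala sketch of that hypothesis, not Hudi's actual code (Record, base, merged, and queryByDt are illustrative names): with drop.partition.columns=true the partition value is not stored in the data files, so the reader must restore it from the partition path; if that restoration is skipped for rows merged from log files, a filter on dt matches nothing.

```scala
// Hypothetical simplified model of the observed symptom; NOT Hudi internals.
case class Record(id: String, name: String, dt: Option[String])

// Base-file rows: the reader restores `dt` from the partition path.
val base = Seq(Record("100", "1001", Some("2022-01-01")))

// Rows merged with log-file updates: if the restoration step is
// skipped, `dt` comes back empty.
val merged = Seq(Record("100", "10010", None))

// A partition-column filter, like `where dt='2022-01-01'`.
def queryByDt(rows: Seq[Record], dt: String): Seq[Record] =
  rows.filter(_.dt.contains(dt))

println(queryByDt(base, "2022-01-01").size)
println(queryByDt(merged, "2022-01-01").size)
```

Under this model the filter finds the base-file row but none of the merged rows, which would match the correct result before the update and the 0-row result after.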

> With hoodie.datasource.write.drop.partition.columns=true, updated records can't be read in a MOR table.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5047
>                 URL: https://issues.apache.org/jira/browse/HUDI-5047
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: xi chaomin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)