Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/01/21 05:59:01 UTC
[jira] [Closed] (HUDI-1196) Record being placed in incorrect partition during upsert on COW/MOR global indexed tables

     [ https://issues.apache.org/jira/browse/HUDI-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar closed HUDI-1196.
--------------------------------
    Resolution: Fixed

> Record being placed in incorrect partition during upsert on COW/MOR global indexed tables
> -----------------------------------------------------------------------------------------
>
>                 Key: HUDI-1196
>                 URL: https://issues.apache.org/jira/browse/HUDI-1196
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ryan Pifer
>            Assignee: Ryan Pifer
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> When upserting into a table that uses a global index (both global and HBase indexes), if a single batch contains multiple versions of a record in different partitions, the record is deduplicated correctly but written to the wrong partition. This occurred with "hoodie.bloom.update.partition.path=true" set as well.
>  
> Batch with multiple versions of a record in different partitions:
> ```
> scala> spark.read.format("parquet").load(inputDataPath).show()
> +--------+---------+----------------+-------------+-------------+
> |     wbn|    cs_ss|     action_date|           ad|   ad_updated|
> +--------+---------+----------------+-------------+-------------+
> |12345678|InTransit|1596716921000601|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000602|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000603|2020-08-06-13|2020-08-06-13|
> +--------+---------+----------------+-------------+-------------+
> ```
>  
> Values returned when querying the _ro and _rt tables:
> ```
> scala> spark.sql("select * from gb_update_partition_1_ro").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817220935|  20200817220935_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>   
> scala> spark.sql("select * from gb_update_partition_1_rt").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817221924|  20200817221924_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>  ```
>  
> We can see that the record shows the most recent version of the data, except that the partition values come from the older versions.
>  
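
For illustration only, the outcome the report appears to expect can be sketched in plain Scala, without Spark or Hudi. This is not Hudi's implementation. It assumes that wbn is the record key, that action_date acts as the ordering (precombine) field, and that ad is the partition path field; the issue text does not state these explicitly. The names InputRecord and dedupeLatest are hypothetical.

```scala
// Hypothetical model of one input row; fields follow the columns in the batch above.
final case class InputRecord(wbn: String, csSs: String, actionDate: Long, ad: String)

// Assumption: for each record key, the row with the largest action_date wins,
// and the surviving row keeps its own partition value (ad).
def dedupeLatest(batch: Seq[InputRecord]): Map[String, InputRecord] =
  batch.groupBy(_.wbn).map { case (key, rows) => key -> rows.maxBy(_.actionDate) }

val batch = Seq(
  InputRecord("12345678", "InTransit", 1596716921000601L, "2020-08-06-12"),
  InputRecord("12345678", "Pending",   1596716921000602L, "2020-08-06-12"),
  InputRecord("12345678", "Pending",   1596716921000603L, "2020-08-06-13")
)

val latest = dedupeLatest(batch)("12345678")
println(s"${latest.csSs} ${latest.actionDate} partition=${latest.ad}")
// prints: Pending 1596716921000603 partition=2020-08-06-13
```

Under these assumptions the latest version belongs in partition 2020-08-06-13, whereas the queries above show it stored under 2020-08-06-12.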



--
This message was sent by Atlassian Jira
(v8.3.4#803005)