Posted to commits@hudi.apache.org by "Ashok Kumar (Jira)" <ji...@apache.org> on 2020/08/27 02:34:00 UTC

[jira] [Updated] (HUDI-1231) Duplicate record while querying

     [ https://issues.apache.org/jira/browse/HUDI-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashok Kumar updated HUDI-1231:
------------------------------
    Description: 
I am writing in upsert mode with the precombine field enabled. Still, when I query the table I see the same record three times in the same parquet file.
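For context, the precombine step is expected to collapse records that share a record key, keeping only the one with the largest precombine value. A minimal sketch of that expected behaviour in plain Scala (the `Record` type and field names here are illustrative, not Hudi API):

```scala
// Sketch of the deduplication that precombine is expected to perform:
// among records sharing a key, only the record with the highest
// precombine value (e.g. an event timestamp) should survive.
case class Record(key: String, precombine: Long, payload: String)

def precombineDedup(records: Seq[Record]): Seq[Record] =
  records
    .groupBy(_.key)              // bucket records by record key
    .values
    .map(_.maxBy(_.precombine))  // keep the latest record per key
    .toSeq

val in = Seq(
  Record("k1", 1L, "old"),
  Record("k1", 3L, "new"),
  Record("k2", 2L, "only")
)
val out = precombineDedup(in)
// out holds exactly one record per key, the one with the highest precombine value
```

If this invariant held in the written files, a query by `_hoodie_record_key` could never return more than one row per key.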

 

spark.sql("select _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name from hudi5_mor_ro where id1=1086187 and timestamp=1598461500 and _hoodie_record_key='timestamp:1598461500,id1:1086187,id2:1872725,flowId:23'").show(10,false)
+-------------------+---------------------------+------------------------------------------------------+----------------------+------------------------------------------------------------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno       |_hoodie_record_key                                    |_hoodie_partition_path|_hoodie_file_name                                                             |
+-------------------+---------------------------+------------------------------------------------------+----------------------+------------------------------------------------------------------------------+
|20200826171813     |20200826171813_13856_855766|timestamp:1598461500,id1:1086187,id2:1872725,flowId:23|1086187/2020082617    |5ecb020f-29be-4eed-b130-8c02ae819603-0_13856-104-296775_20200826171813.parquet|
|20200826171813     |20200826171813_13856_855766|timestamp:1598461500,id1:1086187,id2:1872725,flowId:23|1086187/2020082617    |5ecb020f-29be-4eed-b130-8c02ae819603-0_13856-104-296775_20200826171813.parquet|
|20200826171813     |20200826171813_13856_855766|timestamp:1598461500,id1:1086187,id2:1872725,flowId:23|1086187/2020082617    |5ecb020f-29be-4eed-b130-8c02ae819603-0_13856-104-296775_20200826171813.parquet|
+-------------------+---------------------------+------------------------------------------------------+----------------------+------------------------------------------------------------------------------+

 

I am hitting this issue with both table types, i.e. COW and MOR.

I have tried version 0.6.3, and I had also tried 0.5.3; the bug appeared there as well.

The issue does not occur with a small data set.

 

The strange thing is that when I query the parquet file directly, it returns only one record (i.e. the correct result):

df.filter(col("_hoodie_record_key")==="timestamp:1598461500,id1:1086187,id2:1872725,flowId:23").count
res13: Long = 1
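To see how widespread the duplication is, a table-wide diagnostic can help. A hedged sketch against the table name from the report (assumes an active `spark` session with the Hudi table registered; this query is illustrative, not part of the original report):

```scala
// Hypothetical diagnostic: list record keys that appear more than once
// in the read-optimized view, with the number of copies of each.
spark.sql("""
  SELECT _hoodie_record_key, COUNT(*) AS copies
  FROM hudi5_mor_ro
  GROUP BY _hoodie_record_key
  HAVING COUNT(*) > 1
""").show(20, false)
```

If the duplicates were introduced by the writer rather than the query path, every key listed here should also show multiple rows when the parquet files are read directly.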

  was:
I am writing in upsert mode with the precombine field enabled. Still, when I query the table I see the same record three times in the same parquet file.

 

spark.sql("select _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name from hudi5_mor_ro where id1=1086187 and timestamp=1598461500 and _hoodie_record_key='timestamp:1598461500,id1:1086187,id2:1872725,flowId:23'").show(10,false)
+-------------------+---------------------------+----------------------------------------------------------------+----------------------+------------------------------------------------------------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key |_hoodie_partition_path|_hoodie_file_name |
+-------------------+---------------------------+----------------------------------------------------------------+----------------------+------------------------------------------------------------------------------+
|20200826171813 |20200826171813_13856_855766|timestamp:1598461500,id1:1086187,id2:1872725,flowId:23|1086187/2020082617 |5ecb020f-29be-4eed-b130-8c02ae819603-0_13856-104-296775_20200826171813.parquet|
|20200826171813 |20200826171813_13856_855766|timestamp:1598461500,id1:1086187,id2:1872725,flowId:23|1086187/2020082617 |5ecb020f-29be-4eed-b130-8c02ae819603-0_13856-104-296775_20200826171813.parquet|
|20200826171813 |20200826171813_13856_855766|timestamp:1598461500,id1:1086187,id2:1872725,flowId:23|1086187/2020082617 |5ecb020f-29be-4eed-b130-8c02ae819603-0_13856-104-296775_20200826171813.parquet|
+-------------------+---------------------------+----------------------------------------------------------------+----------------------+------------------------------------------------------------------------------+

 

I am hitting this issue with both table types, i.e. COW and MOR.

I have tried version 0.6.3, and I had also tried 0.5.3; the bug appeared there as well.

The issue does not occur with a small data set.


> Duplicate record while querying
> -------------------------------
>
>                 Key: HUDI-1231
>                 URL: https://issues.apache.org/jira/browse/HUDI-1231
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ashok Kumar
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)