Posted to commits@hudi.apache.org by "kazdy (Jira)" <ji...@apache.org> on 2023/02/24 11:36:00 UTC
[jira] [Comment Edited] (HUDI-5839) Insert in non-strict mode deduplicates dataset in "append" mode - spark

    [ https://issues.apache.org/jira/browse/HUDI-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692748#comment-17692748 ] 

kazdy edited comment on HUDI-5839 at 2/24/23 11:35 AM:
-------------------------------------------------------

Hi [~codope], could you take a look at this issue? I'm not sure if this is working "as expected" or if it's a bug.
Judging by the behavior in 0.12.1, it is a bug: in 0.12.1, an insert in non-strict mode using Spark append mode creates duplicates.


was (Author: JIRAUSER284048):
Hi [~codope], could you take a look at this issue? I'm not sure if this is working "as expected" or if it's a bug.

> Insert in non-strict mode deduplicates dataset in "append" mode - spark
> -----------------------------------------------------------------------
>
>                 Key: HUDI-5839
>                 URL: https://issues.apache.org/jira/browse/HUDI-5839
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark, writer-core
>    Affects Versions: 0.13.0
>            Reporter: kazdy
>            Priority: Major
>
> There seems to be a bug in non-strict insert mode when the precombine field is not defined (I have not checked the case where it is defined).
> With the Spark datasource, duplicates are inserted only in overwrite mode, or in append mode when data is written to the table for the first time. A second insert in append mode deduplicates the dataset as if it were running in upsert mode. Found in master (0.13.0).
> It appears to be a regression, because I rely on this functionality in Hudi 0.12.1.
> {code:python}
> from pyspark.sql.functions import expr
> path = "/tmp/huditbl"
> opt_insert = {
>     'hoodie.table.name': 'huditbl',
>     'hoodie.datasource.write.recordkey.field': 'keyid',
>     'hoodie.datasource.write.table.name': 'huditbl',
>     'hoodie.datasource.write.operation': 'insert',
>     'hoodie.sql.insert.mode': 'non-strict',
>     'hoodie.upsert.shuffle.parallelism': 2,
>     'hoodie.insert.shuffle.parallelism': 2,
>     'hoodie.combine.before.upsert': 'false',
>     'hoodie.combine.before.insert': 'false',
>     'hoodie.datasource.write.insert.drop.duplicates': 'false'
> }
> df = spark.range(0, 10).toDF("keyid") \
>   .withColumn("age", expr("keyid + 1000"))
> df.write.format("hudi"). \
> options(**opt_insert). \
> mode("overwrite"). \
> save(path)
> spark.read.format("hudi").load(path).count() # returns 10
> df = df.union(df) # creates duplicates
> df.write.format("hudi"). \
> options(**opt_insert). \
> mode("append"). \
> save(path)
> spark.read.format("hudi").load(path).count() # returns 10 but should return 20 
> # Note: writing the same 20-row DataFrame (already duplicated by the union above)
> # in overwrite mode works as expected:
> df.write.format("hudi"). \
> options(**opt_insert). \
> mode("overwrite"). \
> save(path)
> spark.read.format("hudi").load(path).count() # returns 20 as it should{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)