Posted to commits@hudi.apache.org by "whight (via GitHub)" <gi...@apache.org> on 2023/04/17 11:04:59 UTC

[GitHub] [hudi] whight commented on issue #8451: [SUPPORT] Insert write operation pre combined problem

whight commented on issue #8451:
URL: https://github.com/apache/hudi/issues/8451#issuecomment-1511137106

   @ad1happy2go 
   I found that the setting's default value is false, but there are still a number of duplicate rows in the table. Is this a bug?
   
   I checked the FAQ page; the section "How does Hudi handle duplicate record keys in an input" reads as follows:
   
   > For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset would also contain duplicates. If you don't want duplicate records either issue an upsert or consider specifying option to de-duplicate input in either [datasource](https://hudi.apache.org/docs/configurations.html#INSERT_DROP_DUPS_OPT_KEY) or [deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229).
   
   I suggest adding a tip about this setting to that part of the FAQ.
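   
   For illustration only (not from the FAQ; the table name, key/precombine fields, and path below are placeholders), here is a minimal Spark Scala sketch of an `insert` write that turns on the datasource de-duplication option the FAQ links to:
   
   ```scala
   // Minimal sketch: placeholders throughout, not a reproduction of the original issue.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("hudi-insert-dedup-sketch")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .getOrCreate()
   import spark.implicits._

   // Input batch that deliberately contains a duplicate record key (id = 1).
   val inputDf = Seq((1, "a", 100L), (1, "a", 200L), (2, "b", 100L)).toDF("id", "name", "ts")

   inputDf.write
     .format("hudi")
     .option("hoodie.table.name", "dedup_demo")
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     // Drops duplicate keys from the incoming batch before writing; without it,
     // a plain "insert" keeps whatever duplicates the input contains.
     .option("hoodie.datasource.write.insert.drop.duplicates", "true")
     .mode("append")
     .save("/tmp/hudi/dedup_demo")
   ```
   
   With `upsert` instead of `insert`, the input is pre-combined on the precombine field by default, so the drop-duplicates option is not needed there.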
   
   
   > @whight can you try setting hoodie.merge.allow.duplicate.on.inserts as true. I was able to reproduce your error and it got fixed with above setting.
   > 
   > Else you can also use Bulk insert which is fast but will not do small file handling. You can schedule separate clustering job for the same.
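   
   For reference (again placeholders only, reusing `inputDf`, `spark`, and the table from the sketch above; not taken from the issue), the two alternatives quoted above could look like this:
   
   ```scala
   // Alternative A: stay on "insert" but allow duplicate keys to survive when new
   // records are packed into existing small files (the setting quoted above).
   inputDf.write
     .format("hudi")
     .option("hoodie.table.name", "dedup_demo")
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.merge.allow.duplicate.on.inserts", "true")
     .mode("append")
     .save("/tmp/hudi/dedup_demo")

   // Alternative B: "bulk_insert" skips small-file handling entirely; file sizing
   // can then be addressed by a separately scheduled clustering job.
   inputDf.write
     .format("hudi")
     .option("hoodie.table.name", "dedup_demo")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .mode("append")
     .save("/tmp/hudi/dedup_demo")
   ```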
   
   

