Posted to commits@hudi.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/04 01:56:00 UTC

[jira] [Updated] (HUDI-993) Use hoodie.delete.shuffle.parallelism for Delete API

     [ https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-993:
--------------------------------
    Labels: pull-request-available  (was: )

> Use hoodie.delete.shuffle.parallelism for Delete API
> ----------------------------------------------------
>
>                 Key: HUDI-993
>                 URL: https://issues.apache.org/jira/browse/HUDI-993
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Dongwook Kwon
>            Priority: Minor
>              Labels: pull-request-available
>
> HUDI-328 introduced the Delete API, but I noticed that the [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57] method does not pass any parallelism to its RDD operation, whereas [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104] for upsert does.
> Also, "hoodie.delete.shuffle.parallelism" doesn't seem to be used.
>  
> I found that in certain cases, for example when the input RDD has few partitions but the target table has large files, some Spark jobs suffer from low parallelism. In such cases, an upsert with "EmptyHoodieRecordPayload" is faster than the delete API.
> This is only because "hoodie.combine.before.upsert" is true by default; when it is disabled, upsert would have the same issue.
> So I wonder whether the input RDD should be repartitioned to "hoodie.delete.shuffle.parallelism" even when "hoodie.combine.before.delete" is false, so that performance improves regardless of "hoodie.combine.before.delete".
>  
>  
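
For reference, the two delete-related settings named in the description could be written as in the fragment below. This is only a sketch of the scenario the reporter describes: the property names come from the issue text, while the values are illustrative assumptions, not defaults or recommendations.

```properties
# Illustrative values only (assumptions, not Hudi defaults).

# Parallelism the issue proposes to apply on the delete path.
hoodie.delete.shuffle.parallelism=200

# When false, incoming keys are not de-duplicated before the delete;
# the issue asks that the input RDD still be repartitioned in this case.
hoodie.combine.before.delete=false
```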



--
This message was sent by Atlassian Jira
(v8.3.4#803005)