Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/01/21 06:21:00 UTC

[jira] [Closed] (HUDI-993) Use hoodie.delete.shuffle.parallelism for Delete API

     [ https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar closed HUDI-993.
-------------------------------
    Resolution: Fixed

> Use hoodie.delete.shuffle.parallelism for Delete API
> ----------------------------------------------------
>
>                 Key: HUDI-993
>                 URL: https://issues.apache.org/jira/browse/HUDI-993
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Dongwook Kwon
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> While HUDI-328 introduced the Delete API, I noticed that the [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57] method doesn't pass any parallelism to its RDD operation, whereas [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104] for upsert does use parallelism on the RDD.
> {{Also, "hoodie.delete.shuffle.parallelism" doesn't seem to be used.}}
>  
> I found that in certain cases, for example when the input RDD has low parallelism but the target table has large files, the performance of certain Spark jobs suffers from that low parallelism. In such cases, an upsert with "EmptyHoodieRecordPayload" is faster than the delete API.
> This is also because "hoodie.combine.before.upsert" is true by default; when it is not enabled, upsert would have the same issue.
> So I wonder whether the input RDD should be repartitioned to "hoodie.delete.shuffle.parallelism" for better performance regardless of "hoodie.combine.before.delete", that is, even when it is false.
>  
>  
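The behaviour proposed in the description can be sketched roughly as follows. This is not Hudi code: plain Python lists stand in for RDD partitions, and the function names, the round-robin redistribution, and the config defaults are all assumptions made for illustration. Only the two config key names come from the issue text. In Spark itself, the corresponding steps would presumably map onto RDD calls such as distinct(numPartitions) and repartition(numPartitions).

```python
from typing import Dict, List

# Config key names are taken from the issue text.
DELETE_PARALLELISM_KEY = "hoodie.delete.shuffle.parallelism"
COMBINE_BEFORE_DELETE_KEY = "hoodie.combine.before.delete"


def redistribute(keys: List[str], parallelism: int) -> List[List[str]]:
    """Spread keys round-robin over `parallelism` buckets (stand-in for a shuffle)."""
    buckets: List[List[str]] = [[] for _ in range(parallelism)]
    for i, key in enumerate(keys):
        buckets[i % parallelism].append(key)
    return buckets


def prepare_delete_keys(
    partitions: List[List[str]], config: Dict[str, str]
) -> List[List[str]]:
    """Hypothetical helper: always redistribute the delete keys to the
    configured parallelism, and deduplicate first only if combining is enabled."""
    # Assumed default: keep the input's partition count if no parallelism is set.
    default_parallelism = max(1, len(partitions))
    parallelism = int(config.get(DELETE_PARALLELISM_KEY, default_parallelism))
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")

    # Assumed default: combining is on unless explicitly disabled.
    combine = str(config.get(COMBINE_BEFORE_DELETE_KEY, "true")).lower() == "true"

    keys = [key for part in partitions for key in part]
    if combine:
        # Order-preserving deduplication of record keys.
        keys = list(dict.fromkeys(keys))

    # The redistribution happens whether or not deduplication ran.
    return redistribute(keys, parallelism)


if __name__ == "__main__":
    single_partition = [["a", "b", "a", "c", "b", "d"]]
    conf = {DELETE_PARALLELISM_KEY: "4", COMBINE_BEFORE_DELETE_KEY: "false"}
    for bucket in prepare_delete_keys(single_partition, conf):
        print(bucket)
```

The point of the sketch is that the redistribution step does not depend on the combine flag: a single-partition input ends up spread over the configured number of buckets either way, and only the deduplication is conditional.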



--
This message was sent by Atlassian Jira
(v8.3.4#803005)