Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/08/05 04:09:00 UTC

[jira] [Updated] (HUDI-1054) Address performance issues with finalizing writes on S3

     [ https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-1054:
---------------------------------
    Status: Closed  (was: Patch Available)

> Address performance issues with finalizing writes on S3
> -------------------------------------------------------
>
>                 Key: HUDI-1054
>                 URL: https://issues.apache.org/jira/browse/HUDI-1054
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: bootstrap, Common Core, Performance
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> I have identified 3 performance bottlenecks in the [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378] function that become more prominent with the new bootstrap mechanism on S3:
>  * [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425] is a serial operation performed at the driver, and it can take a long time when you have several partitions and a large number of files.
>  * The invalid data paths are stored in a List instead of a Set, so the following operation becomes O(N^2) and takes significant time to compute at the driver: [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
>  * [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473] does a recursive delete of the marker directory at the driver. This is again extremely expensive when you have a large number of partitions and files.
>  
> When tested with a 1 TB data set with 8000 partitions and approximately 190000 files, this whole process takes *35 minutes*. These performance issues can be addressed with Spark parallelization and appropriate data structures.
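
To illustrate the second bottleneck above, here is a minimal Java sketch of why moving from a List to a Set helps. It is not the actual Hudi code or the patch for this issue. The class name InvalidPathFilter, the method findInvalidPaths, and the sample file names are hypothetical and exist only for this example. The idea is that List.contains (or List.removeAll against another List) scans the whole list for every element, while a HashSet lookup takes roughly constant time, so the overall filter goes from quadratic to linear in the number of files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Hudi.
public class InvalidPathFilter {

    // Returns the paths that have a marker but were not part of the commit.
    // Copying the committed paths into a HashSet makes each membership
    // check roughly O(1) instead of a full scan of a List.
    static List<String> findInvalidPaths(List<String> markerPaths,
                                         List<String> committedPaths) {
        Set<String> committed = new HashSet<>(committedPaths);
        List<String> invalid = new ArrayList<>();
        for (String path : markerPaths) {
            if (!committed.contains(path)) {
                invalid.add(path);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Made-up file names, used only for the demonstration.
        List<String> markers = Arrays.asList(
                "p1/f1.parquet", "p1/f2.parquet", "p2/f3.parquet");
        List<String> committed = Arrays.asList(
                "p1/f1.parquet", "p2/f3.parquet");
        System.out.println(findInvalidPaths(markers, committed));
        // prints [p1/f2.parquet]
    }
}
```

With around 190000 files, as in the test described above, the List-based approach can need on the order of 190000 x 190000 comparisons in the worst case, while the Set-based version does one hash lookup per file.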



--
This message was sent by Atlassian Jira
(v8.3.4#803005)