Posted to commits@hudi.apache.org by "Balaji Varadarajan (Jira)" <ji...@apache.org> on 2019/11/05 17:26:00 UTC

[jira] [Commented] (HUDI-80) Incrementalize cleaning based on timeline metadata

    [ https://issues.apache.org/jira/browse/HUDI-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967698#comment-16967698 ] 

Balaji Varadarajan commented on HUDI-80:
----------------------------------------

The proposed solution is to:

(a) Retain the clean-by-versions mode, but enable incremental cleaning only for the clean-by-commits mode.

(b) Incremental cleaning avoids listing all partitions to find files to clean. Instead, it determines the next set of partitions to clean by looking at newer commits incrementally.

(c) We still rely on the embedded timeline server to reduce RPC calls. When deltastreamer runs in continuous mode, we can leverage this benefit.
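
To illustrate point (b), here is a minimal sketch of deriving the partitions to examine from commit metadata rather than from a full partition listing. It is not Hudi code: the Commit class, the partitions_to_scan function, the instant strings, and the partition paths are all hypothetical names chosen for illustration, and it assumes each commit records the partitions it wrote to and that instants sort lexicographically.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Commit:
    """Hypothetical timeline entry: a sortable instant plus the partitions it wrote to."""
    instant: str
    partitions: FrozenSet[str]


def partitions_to_scan(timeline: Iterable[Commit],
                       last_cleaned_instant: Optional[str]) -> Set[str]:
    """Return the partitions touched by commits newer than the last clean.

    Assumption: with no previous clean recorded, every partition seen in the
    timeline is returned; a real system would likely fall back to a full
    partition listing in that case.
    """
    result: Set[str] = set()
    for commit in timeline:
        if last_cleaned_instant is None or commit.instant > last_cleaned_instant:
            result.update(commit.partitions)
    return result


# Hypothetical timeline with three commits.
timeline = [
    Commit("20191101", frozenset({"2019/11/01"})),
    Commit("20191102", frozenset({"2019/11/02"})),
    Commit("20191103", frozenset({"2019/11/02", "2019/11/03"})),
]

# Only partitions written after the last clean need to be examined.
print(sorted(partitions_to_scan(timeline, "20191101")))
```

Under these assumptions, the work done per clean scales with the number of partitions touched since the previous clean rather than with the total number of partitions in the dataset.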

 

> Incrementalize cleaning based on timeline metadata
> --------------------------------------------------
>
>                 Key: HUDI-80
>                 URL: https://issues.apache.org/jira/browse/HUDI-80
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Write Client
>            Reporter: Vinoth Chandar
>            Assignee: Balaji Varadarajan
>            Priority: Major
>             Fix For: 0.5.1
>
>
> Currently, cleaning lists all partitions once and then picks the file groups to clean from DFS. This is partly because we also support retaining the last X versions of a file group (in addition to the default mode of retaining the last X commits). This can be expensive in some cases. See [https://github.com/apache/incubator-hudi/issues/613] for a reported issue.
>  
> This task tracks work to:
>  * Determine whether we can get rid of the "last X versions" cleaning mode
>  * Implement cleaning based on file metadata in the Hudi timeline itself
>  * Ensure the resulting RPC calls to DFS are O(number of file groups cleaned) or O(number of partitions touched in the last X commits)
>  
> HUDI-1 implements a timeline service for writing that promotes caching of file system metadata. This task can be implemented on top of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)