Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/12/26 01:15:00 UTC

[jira] [Updated] (HUDI-112) Supporting a Collapse type of operation

     [ https://issues.apache.org/jira/browse/HUDI-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-112:
--------------------------------
    Status: New  (was: Open)

> Supporting a Collapse type of operation
> ---------------------------------------
>
>                 Key: HUDI-112
>                 URL: https://issues.apache.org/jira/browse/HUDI-112
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> Currently, for COPY_ON_WRITE tables, Hudi automatically handles small files by packing new inserts and routing them to a particular small file, based on the small file size limits set in the client config.
> One side effect of this is that the time taken to rewrite the small files into larger ones is borne by the writer (or the ingestor). In cases where we continuously want really low ingestion latency (< 5 mins), having the writer enlarge the small files may not be preferable.
> It would help if there were a way for the writer to schedule a collapse sort of operation that can later be picked up asynchronously by a job/thread (different from the ingestor) that collapses N files into M files, thereby also enlarging the file sizes.
> The mechanism should support different strategies for scheduling a collapse so that we can perform even smarter data layout during such rewriting, e.g., grouping certain record_keys from N different files together in a single file to allow for better query performance.
> MERGE_ON_READ, on the other hand, solves this in a different way. We can send inserts to log files (for a base columnar file), and when compaction kicks in, it automatically resizes the file. However, the reader (realtime query) has to pay a small penalty here to merge the log files with the base columnar files to get the freshest data.
> In any case, we need a mechanism to collapse older, smaller files into larger ones while also keeping the query cost low. I am creating this ticket to discuss this further.
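
To make the "collapse N files into M files" idea above concrete, here is a minimal sketch of how a scheduler might plan such an operation. It is not Hudi code: `CollapseGroup`, `plan_collapse`, `small_file_limit`, and `target_file_size` are hypothetical names chosen for illustration, and the greedy first-fit-decreasing packing is just one possible strategy among those the ticket asks the mechanism to support. The sketch assumes `small_file_limit` is no larger than `target_file_size`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CollapseGroup:
    """One planned output file and the small input files merged into it."""
    inputs: List[str] = field(default_factory=list)
    total_bytes: int = 0


def plan_collapse(file_sizes: Dict[str, int],
                  small_file_limit: int,
                  target_file_size: int) -> List[CollapseGroup]:
    """Plan a collapse with greedy first-fit-decreasing packing.

    Hypothetical helper, not a Hudi API. Files at or above
    small_file_limit are left untouched. Each remaining file goes into
    the first group that stays within target_file_size.
    """
    # Largest small files first; ties broken by name for a stable plan.
    small = sorted(
        ((name, size) for name, size in file_sizes.items()
         if size < small_file_limit),
        key=lambda item: (-item[1], item[0]),
    )
    groups: List[CollapseGroup] = []
    for name, size in small:
        for group in groups:
            if group.total_bytes + size <= target_file_size:
                group.inputs.append(name)
                group.total_bytes += size
                break
        else:
            # No existing group has room, so open a new one.
            groups.append(CollapseGroup([name], size))
    # A group with a single input would only rewrite that file; skip it.
    return [g for g in groups if len(g.inputs) > 1]
```

The writer would only need to persist such a plan; a separate job could then execute it asynchronously, which keeps the rewrite cost off the ingestion path. A layout-aware strategy (for example, grouping by record key range) could replace the size-only packing without changing the shape of the plan.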



--
This message was sent by Atlassian Jira
(v8.3.4#803005)