Posted to commits@hudi.apache.org by "liwei (Jira)" <ji...@apache.org> on 2020/05/14 12:16:00 UTC
[jira] [Assigned] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liwei reassigned HUDI-897:
--------------------------
Assignee: liwei
> hudi support log append scenario with better write and asynchronous compaction
> ------------------------------------------------------------------------------
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
>
> I. Scenario
> The business scenarios of the data lake mainly include analysis of databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also targets these three scenarios.
> [https://databricks.com/product/delta-lake-on-databricks]
> II. Hudi's current situation
> At present, Hudi supports the database-CDC incremental-write scenario well, and support for bulk-loading files into Hudi is also in progress.
> However, there is no good native support for the log scenario (high-throughput writes, no updates or deletes, and a focus on small files). Today such data can be written as inserts without deduplication, but records are still merged on the write path.
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" set to 100 MB, every small batch still spends time merging into existing small files, which reduces write throughput.
> * Merge-on-read is not a good fit for this scenario either.
> * The scenario really only needs each batch to be written straight to a Parquet file, with compaction offered asynchronously afterwards (similar to Delta Lake).
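As a rough illustration of the insert path described above, the Hudi write options below sketch how inserts can skip pre-write deduplication while the small-file limit mentioned above still triggers merging. The exact values are an example, not a recommendation:

```
# Write as plain inserts (no upsert / index lookup)
hoodie.datasource.write.operation=insert
# Skip deduplication before insert
hoodie.combine.before.insert=false
# Files below this size (in bytes) are treated as "small" and merged into
# on the write path; 104857600 bytes = 100 MB, as discussed above
hoodie.parquet.small.file.limit=104857600
```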
> III. What we can do
>
> 1. On the write side, write each batch to a Parquet file based on the snapshot mechanism. Keep merging enabled by default, and let users disable auto-merge for higher write throughput.
> 2. Hudi should support asynchronously merging small Parquet files, similar to the OPTIMIZE command in Databricks Delta Lake:
> [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
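The OPTIMIZE-style asynchronous compaction proposed in point 2 can be sketched as a bin-packing plan over small files: files under the small-file limit are grouped into bins of roughly a target size, and each bin would be rewritten as one larger file. This is a minimal, self-contained illustration of the idea; the file names, sizes, target size, and `plan_compaction` helper are hypothetical, not Hudi's actual API:

```python
from dataclasses import dataclass

TARGET_FILE_SIZE = 128 * 1024 * 1024  # hypothetical target output size (128 MB)

@dataclass
class DataFile:
    path: str
    size: int  # bytes

def plan_compaction(files, small_file_limit, target_size=TARGET_FILE_SIZE):
    """Group files below `small_file_limit` into bins of at most
    `target_size` total bytes; each bin would be rewritten as one file."""
    small = sorted((f for f in files if f.size < small_file_limit),
                   key=lambda f: f.size, reverse=True)
    bins = []
    for f in small:
        # first-fit decreasing: place the file in the first bin it fits
        for b in bins:
            if sum(x.size for x in b) + f.size <= target_size:
                b.append(f)
                break
        else:
            bins.append([f])
    # only bins holding more than one file are worth rewriting
    return [b for b in bins if len(b) > 1]

files = [DataFile("a.parquet", 10 << 20),
         DataFile("b.parquet", 20 << 20),
         DataFile("c.parquet", 200 << 20),   # already large, left alone
         DataFile("d.parquet", 30 << 20)]
plan = plan_compaction(files, small_file_limit=100 << 20)
```

Here the three sub-100 MB files fit in a single bin, so the plan rewrites them into one larger file while the 200 MB file is untouched; running such a planner in a background job, decoupled from the write path, is the essence of the proposal.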
--
This message was sent by Atlassian Jira
(v8.3.4#803005)