Posted to commits@hudi.apache.org by "liwei (Jira)" <ji...@apache.org> on 2020/05/14 12:16:00 UTC
[jira] [Assigned] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liwei reassigned HUDI-897:
--------------------------
Assignee: liwei
> hudi support log append scenario with better write and asynchronous compaction
> ------------------------------------------------------------------------------
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
>
> I. Scenario
> The business scenarios of the data lake mainly include analysis of databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also targets these three scenarios.
> [https://databricks.com/product/delta-lake-on-databricks]
> II. Hudi's current situation
> At present, Hudi supports the database-CDC incremental-write scenario well, and support for bulk-loading files into Hudi is also in progress.
> However, there is no good native support for the log scenario (high-throughput writes, no updates or deletes, and a focus on small files). Today such data can be written as inserts without deduplication, but records are still merged on the write path.
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" set to 100 MB, every small batch still spends time merging into existing small files, which reduces write throughput.
> * Merge-on-read is not a good fit for this scenario either.
> * The scenario really only needs each batch to be written straight to a Parquet file, with compaction offered asynchronously afterwards (similar to Delta Lake).
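As a rough illustration of the insert path described above, the Hudi write options below sketch how inserts can skip pre-write deduplication while the small-file limit mentioned above still triggers merging. The exact values are an example, not a recommendation:

```
# Write as plain inserts (no upsert / index lookup)
hoodie.datasource.write.operation=insert
# Skip deduplication before insert
hoodie.combine.before.insert=false
# Files below this size (in bytes) are treated as "small" and merged into
# on the write path; 104857600 bytes = 100 MB, as discussed above
hoodie.parquet.small.file.limit=104857600
```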
> III. What we can do
>
> 1. On the write side, write each batch to a Parquet file based on the snapshot mechanism. Keep merging enabled by default, and let users disable auto-merge for higher write throughput.
> 2. Hudi should support asynchronously merging small Parquet files, similar to the OPTIMIZE command in Databricks Delta Lake:
> [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
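The OPTIMIZE-style asynchronous compaction proposed in point 2 can be sketched as a bin-packing plan over small files: files under the small-file limit are grouped into bins of roughly a target size, and each bin would be rewritten as one larger file. This is a minimal, self-contained illustration of the idea; the file names, sizes, target size, and `plan_compaction` helper are hypothetical, not Hudi's actual API:

```python
from dataclasses import dataclass

TARGET_FILE_SIZE = 128 * 1024 * 1024  # hypothetical target output size (128 MB)

@dataclass
class DataFile:
    path: str
    size: int  # bytes

def plan_compaction(files, small_file_limit, target_size=TARGET_FILE_SIZE):
    """Group files below `small_file_limit` into bins of at most
    `target_size` total bytes; each bin would be rewritten as one file."""
    small = sorted((f for f in files if f.size < small_file_limit),
                   key=lambda f: f.size, reverse=True)
    bins = []
    for f in small:
        # first-fit decreasing: place the file in the first bin it fits
        for b in bins:
            if sum(x.size for x in b) + f.size <= target_size:
                b.append(f)
                break
        else:
            bins.append([f])
    # only bins holding more than one file are worth rewriting
    return [b for b in bins if len(b) > 1]

files = [DataFile("a.parquet", 10 << 20),
         DataFile("b.parquet", 20 << 20),
         DataFile("c.parquet", 200 << 20),   # already large, left alone
         DataFile("d.parquet", 30 << 20)]
plan = plan_compaction(files, small_file_limit=100 << 20)
```

Here the three sub-100 MB files fit in a single bin, so the plan rewrites them into one larger file while the 200 MB file is untouched; running such a planner in a background job, decoupled from the write path, is the essence of the proposal.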
--
This message was sent by Atlassian Jira
(v8.3.4#803005)