Posted to commits@hudi.apache.org by "lamber-ken (Jira)" <ji...@apache.org> on 2020/05/14 14:51:00 UTC
[jira] [Commented] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction
[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107378#comment-17107378 ]
lamber-ken commented on HUDI-897:
---------------------------------
Great addition from my side (y)
> hudi support log append scenario with better write and asynchronous compaction
> ------------------------------------------------------------------------------
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Compaction, Performance
> Reporter: liwei
> Assignee: liwei
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-05-14-19-51-37-938.png, image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The main business scenarios for a data lake are analyzing data from databases, logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also targets these three scenarios. [1]
>
> 2. Hudi's current situation
> At present, Hudi supports the database CDC scenario (incremental writes into Hudi) well, and work is underway on bulk loading files into Hudi.
> However, there is no good native support for the log scenario, which requires high-throughput writes, has no updates or deletes, and mainly produces small files. Today such data can be written as inserts without deduplication, but small files are still merged on the write path:
> * In copy-on-write mode, with "hoodie.parquet.small.file.limit" at 100 MB, every small batch spends time merging into existing files, which reduces write throughput.
> * Merge-on-read is not a good fit for this scenario either.
> * What the scenario actually needs is to write each batch directly to Parquet, and then compact the files asynchronously afterwards (similar to Delta Lake).
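> The insert path described above can be sketched as follows (a minimal Scala/Spark sketch, not the definitive implementation: the DataFrame `df`, table name, path, and column names are hypothetical; the `hoodie.*` option keys are real Hudi write configs):

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical log-append batch written as plain inserts to a
// copy-on-write Hudi table; no deduplication by record key.
df.write
  .format("hudi")
  .option("hoodie.table.name", "logs")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.recordkey.field", "uuid")      // hypothetical column
  .option("hoodie.datasource.write.precombine.field", "timestamp") // hypothetical column
  // Files below this size are candidates for merging with new inserts;
  // this write-side merge is the throughput cost discussed above.
  .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
  // Skip combining records by key before insert.
  .option("hoodie.combine.before.insert", "false")
  .mode(SaveMode.Append)
  .save("/data/hudi/logs")
```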
> 3. What we can do
>
> 1. On the write side, simply write every batch to a Parquet file based on the snapshot mechanism. Merging stays enabled by default, but users can disable auto-merge for higher write throughput.
> 2. Hudi should support asynchronously merging small Parquet files, like Databricks Delta Lake's OPTIMIZE command [2].
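> For reference, the Delta Lake command that item 2 proposes mirroring looks like this (a sketch only; the table name is hypothetical, and OPTIMIZE runs on Databricks Delta Lake [2], not in Hudi today):

```scala
// Bin-compact the small files of a Delta table into larger ones.
spark.sql("OPTIMIZE logs")

// Optionally co-locate related data while compacting (Z-ordering).
spark.sql("OPTIMIZE logs ZORDER BY (timestamp)")
```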
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)