Posted to commits@hudi.apache.org by "ZiyueGuan (Jira)" <ji...@apache.org> on 2021/05/04 10:01:00 UTC

[jira] [Created] (HUDI-1875) Improve perf of MOR table upsert based on HDFS

ZiyueGuan created HUDI-1875:
-------------------------------

             Summary: Improve perf of MOR table upsert based on HDFS
                 Key: HUDI-1875
                 URL: https://issues.apache.org/jira/browse/HUDI-1875
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: ZiyueGuan


Problem: When we use upsert on an MOR table, Hudi assigns one task per fileId that needs to be created or updated. In such a situation, close to one million tasks may be created, most of which simply append a few records to a fileId. This process can be slow, and a few skewed tasks appear.

Reason: Hudi uses hsync to guarantee that data is durably stored. Calling hsync that many times against an HDFS cluster within two minutes or less drives up disk IOPS. In addition, creating so many tasks adds heavy scheduling overhead just to append two or three records to a file.
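
To make the pattern concrete, here is a minimal sketch of the per-fileId append path using plain HDFS client calls; it is not Hudi's actual writer code, and the path and helper method are hypothetical. The point is that every small append ends in an hsync, so roughly one flush-to-disk per fileId hits the DataNodes in a short window.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PerFileIdAppend {
    // Hypothetical helper: append a handful of records to one fileId's log file.
    static void appendFewRecords(FileSystem fs, Path logFile, List<String> records) throws IOException {
        try (FSDataOutputStream out = fs.exists(logFile) ? fs.append(logFile) : fs.create(logFile)) {
            for (String r : records) {
                out.write((r + "\n").getBytes(StandardCharsets.UTF_8));
            }
            // One hsync per fileId per commit; with ~1M fileIds this means
            // ~1M synchronous disk flushes against the cluster in a short window.
            out.hsync();
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        appendFewRecords(fs, new Path("/tmp/hudi-demo/fileId-0001.log"), List.of("rec1", "rec2"));
    }
}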

TODO: 

Option one: use hflush instead of hsync. This may lead to data loss if all DataNodes holding a replica shut down at the same time; however, that has a quite low chance of occurring when HDFS is deployed across availability zones.
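
A small sketch of what the swap could look like. hflush pushes the data to every DataNode in the write pipeline (visible to readers, held in DataNode memory), while hsync additionally forces it to disk. The USE_HSYNC flag below is a hypothetical illustration, not an existing Hudi config.

import org.apache.hadoop.fs.FSDataOutputStream;

import java.io.IOException;

public class FlushPolicy {
    // Hypothetical switch for illustration only.
    static final boolean USE_HSYNC = false;

    static void flush(FSDataOutputStream out) throws IOException {
        if (USE_HSYNC) {
            out.hsync();   // durable: data forced to DataNode disks
        } else {
            out.hflush();  // data in all pipeline DataNodes' memory; lost only if every replica's DN dies
        }
    }
}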

Option two: make the hsync call asynchronous and let more than one write run in the same task. This will reduce the task count but increase memory use.
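
A rough sketch of how the asynchronous hsync could be wired up with a small thread pool inside a task; this is an assumption about the approach, not existing Hudi code. The writer thread keeps appending while syncs run in the background, and the task waits for all of them before reporting success.

import org.apache.hadoop.fs.FSDataOutputStream;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncHsync implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final List<CompletableFuture<Void>> pending = new ArrayList<>();

    // Schedule an hsync without blocking the writing thread.
    public void syncAsync(FSDataOutputStream out) {
        pending.add(CompletableFuture.runAsync(() -> {
            try {
                out.hsync();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }, pool));
    }

    // Block until every scheduled hsync has completed (or failed).
    public void awaitAll() {
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pending.clear();
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}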

I may try option one first, as it is simple enough.

When



--
This message was sent by Atlassian Jira
(v8.3.4#803005)