Posted to commits@hudi.apache.org by "Yanjia Gary Li (Jira)" <ji...@apache.org> on 2020/05/08 01:39:00 UTC

[jira] [Comment Edited] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

    [ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101055#comment-17101055 ] 

Yanjia Gary Li edited comment on HUDI-494 at 5/8/20, 1:38 AM:
--------------------------------------------------------------

-Ok, I see what happened here. Root cause is [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]-

So basically commit 1 wrote a very small file (let's say 200 records) to a new partition day=05. Then, when commit 2 was trying to write, it looked back at commit 1 to get an estimated size for each record, but because commit 1 had too few records, the estimate was inaccurate and way too big. Hudi then calculates records-per-file using that inflated record size and ends up with a very small records-per-file value. This leads to many small files.
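To illustrate the mechanism, here is a rough sketch (simplified arithmetic, not Hudi's actual code; the byte sizes, file-size limit, and class name below are made-up example values):

{code:java}
// Rough sketch (not Hudi's actual code) of how a tiny previous commit can
// inflate the estimated bytes-per-record and shrink the records-per-file target.
public class RecordSizeEstimateSketch {
    public static void main(String[] args) {
        // Commit 1: a very small file, e.g. 200 records. Fixed per-file overhead
        // (parquet footer, bloom filter, metadata) dominates, so the derived
        // bytes-per-record comes out far larger than the true average record size.
        long commit1FileBytes = 4L * 1024 * 1024;                    // ~4 MB, assumed mostly overhead
        long commit1Records = 200;
        long estBytesPerRecord = commit1FileBytes / commit1Records;  // ~20 KB per record

        // Commit 2: target records per file ~= max file size / estimated record size.
        long maxFileSizeBytes = 120L * 1024 * 1024;                  // assumed 120 MB target file size
        long recordsPerFile = maxFileSizeBytes / estBytesPerRecord;  // ~6,000 records

        // Millions of incoming records divided by a tiny records-per-file target
        // fan out into a huge number of small files (and write tasks).
        long incomingRecords = 3_000_000L;
        long filesCreated = (incomingRecords + recordsPerFile - 1) / recordsPerFile;

        System.out.println("estimated bytes/record = " + estBytesPerRecord);
        System.out.println("records per file       = " + recordsPerFile);
        System.out.println("files created          = " + filesCreated);
    }
}
{code}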


was (Author: garyli1019):
Ok, I see what happened here. Root cause is [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]

So basically commit 1 wrote a very small file (let's say 200 records) to a new partition day=05. And then when commit 2 was trying to write to day=05, it looked up the affected partition and used the Bloom index range from the existing files, so it used 200 here. Commit 2 has many more records than 200, so it creates tons of files since the Bloom index range is too small.

I am not really familiar with the indexing part of the code. Please let me know if I understand this correctly so we can figure out a fix. [~lamber-ken] [~vinoth]

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
>                 Key: HUDI-494
>                 URL: https://issues.apache.org/jira/browse/HUDI-494
>             Project: Apache Hudi (incubating)
>          Issue Type: Test
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Major
>         Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65] commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into HDFS. It seems to be related to the input size: with 7.7 GB of input it was 3.2 million tasks, and with 9 GB of input it was 3.7 million, both with a parallelism of 10.
> I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My first guess would be something related to the bloom filter index. Maybe something triggers repartitioning with the bloom filter index? But I am not really familiar with that part of the code.
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)