Posted to dev@hudi.apache.org by Tanuj <ta...@gmail.com> on 2020/06/09 07:38:11 UTC

S3 Performance Issue in finalizing the writes

Hi,
I tried to ingest records into S3 in two runs - one with 20K partitions and one with 50K partitions - using bulk_insert mode and a COW table.
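
For context, the write is set up roughly like this (a simplified sketch; the table name, column names and bucket paths are placeholders, not the real job):

import org.apache.spark.sql.SaveMode

// Simplified sketch of the ingestion job (run via spark-shell / spark-submit).
// Table name, column names and bucket paths are placeholders. The table type
// is left at the default, COPY_ON_WRITE, which is what we use.
val df = spark.read.parquet("s3a://my-bucket/source/events/")

df.write.format("org.apache.hudi").
  option("hoodie.table.name", "events_cow").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.recordkey.field", "event_id").
  option("hoodie.datasource.write.partitionpath.field", "event_date").
  option("hoodie.datasource.write.precombine.field", "event_ts").
  mode(SaveMode.Overwrite).
  save("s3a://my-bucket/hudi/events_cow/")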

All the stages look reasonably fast except the last one, where the writes are finalized (HoodieTable.finalizeWrite): it has to scan the whole directory structure in S3, and I can see a big pause there.

I tried the same on HDFS and it is very quick. Has anyone used Hudi on S3? What is the recommended number of partitions for S3? We have on average 20M records per table that we need to ingest into S3. The relevant logs are below:


2020-06-09 05:49:18,641 [Spark Context Cleaner] INFO  org.apache.spark.ContextCleaner - Cleaned accumulator 144
2020-06-09 06:42:15,158 [dispatcher-event-loop-3] INFO  org.apache.spark.scheduler.BlacklistTracker - Removing executors Set(2, 33, 24, 26, 4, 6, 16, 3, 25, 13) from blacklist because the blacklist for those executors has timed out
2020-06-09 06:59:37,471 [Driver] INFO  org.apache.hudi.table.HoodieTable - Removing duplicate data files created due to spark retries before committing. Paths=[s3a:
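
In case it matters, I was also going to experiment with raising the S3A client parallelism before the write runs; a rough sketch is below (the values are guesses to try, not tested recommendations):

// Rough sketch: raise S3A client parallelism on the Hadoop configuration
// used by the job. The values are guesses to experiment with.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.connection.maximum", "200")
hadoopConf.set("fs.s3a.threads.max", "64")
hadoopConf.set("fs.s3a.max.total.tasks", "128")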

Re: S3 Performance Issue in finalizing the writes

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

Hudi is used extensively on top of S3. Do you want to give this a quick
shot using the 0.5.3-RC2 we just put out?
What you are describing sounds close to an issue that was fixed there.
Based on the results, we can proceed from there.
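
If you build with sbt, pulling the RC in should just be a dependency swap along these lines (the staging repository URL and exact coordinates below are placeholders; the vote thread has the real ones):

// build.sbt sketch for trying the release candidate.
// The staging repository URL is a placeholder; use the one from the vote thread.
resolvers += "Apache staging" at "https://repository.apache.org/content/repositories/staging/"
libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle_2.11" % "0.5.3"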

thanks
vinoth

On Tue, Jun 9, 2020 at 12:38 AM Tanuj <ta...@gmail.com> wrote:

> Hi,
> I tried to ingest records into S3 in two runs - one with 20K partitions and
> one with 50K partitions - using bulk_insert mode and a COW table.
>
> All the stages look reasonably fast except the last one, where the writes
> are finalized (HoodieTable.finalizeWrite): it has to scan the whole
> directory structure in S3, and I can see a big pause there.
>
> I tried the same on HDFS and it is very quick. Has anyone used Hudi on S3?
> What is the recommended number of partitions for S3? We have on average
> 20M records per table that we need to ingest into S3. The relevant logs
> are below:
>
>
> 2020-06-09 05:49:18,641 [Spark Context Cleaner] INFO
> org.apache.spark.ContextCleaner - Cleaned accumulator 144
> 2020-06-09 06:42:15,158 [dispatcher-event-loop-3] INFO
> org.apache.spark.scheduler.BlacklistTracker - Removing executors Set(2, 33,
> 24, 26, 4, 6, 16, 3, 25, 13) from blacklist because the blacklist for those
> executors has timed out
> 2020-06-09 06:59:37,471 [Driver] INFO  org.apache.hudi.table.HoodieTable -
> Removing duplicate data files created due to spark retries before
> committing. Paths=[s3a:
>