Posted to dev@flink.apache.org by Enrico Agnoli <en...@workday.com> on 2020/02/07 10:19:58 UTC

Re: performances of S3 writing with many buckets in parallel

I finally found the time to dig a little more on this and found the real problem.
The culprit of the slow-down is this piece of code:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L543-L551

This alone takes around 4-5 secs, with a total of 6 secs to open the file. Logs from an instrumented call:
2020-02-07 08:51:05,825 INFO  BucketingSink  - openNewPartFile FS verification
2020-02-07 08:51:09,906 INFO  BucketingSink  - openNewPartFile FS verification - done
2020-02-07 08:51:11,181 INFO  BucketingSink  - openNewPartFile FS - completed partPath = s3a://....

This, together with the bucketing sink's default 60-second inactivity rollover
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L195
means that with more than 10 parallel buckets on a slot, by the time we finish creating the last bucket the first one has already become stale, so it needs to be rotated, generating a blocking situation.
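A quick back-of-the-envelope check of that blocking scenario (the 6 s open time and 60 s rollover come from the thread above; the helper class and method names are illustrative):

```java
// Sketch: with ~6 s per part-file open and the sink's default 60 s
// inactivity rollover, 10 buckets per slot is already enough for the
// first bucket to go stale before the last one is created.
public class BucketRolloverMath {

    // Seconds needed to open part files for all buckets sequentially.
    static long timeToOpenAll(long buckets, long openSecondsPerBucket) {
        return buckets * openSecondsPerBucket;
    }

    public static void main(String[] args) {
        long openSeconds = 6;       // observed file-open time (see logs above)
        long rolloverSeconds = 60;  // BucketingSink default inactivity rollover
        long buckets = 10;

        boolean firstBucketStaleBeforeLastOpens =
                timeToOpenAll(buckets, openSeconds) >= rolloverSeconds;
        System.out.println(firstBucketStaleBeforeLastOpens); // prints "true"
    }
}
```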

We solved this by deleting the FS check mentioned above (file opening now takes ~1.2 secs) and raising the default inactive threshold to 5 mins. With these changes we can easily handle more than 200 buckets per slot (once the job picks up speed it ingests on all the slots, which keeps postponing the inactivity timeout).
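For the threshold part of the fix, no code change is needed: the inactivity threshold is configurable on the sink. A minimal sketch (the S3 path and check interval are illustrative assumptions, not from the thread):

```java
// Config sketch: raise BucketingSink's inactive-bucket threshold to 5 minutes
// so slow S3 part-file opens across many buckets do not immediately make the
// first bucket stale. Path is hypothetical.
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class SinkConfig {
    public static BucketingSink<String> buildSink() {
        BucketingSink<String> sink = new BucketingSink<>("s3a://my-bucket/output");
        // Default is 60 000 ms; raise it to 5 minutes as described above.
        sink.setInactiveBucketThreshold(5 * 60 * 1000L);
        // How often the sink scans for inactive buckets (default 60 000 ms).
        sink.setInactiveBucketCheckInterval(60 * 1000L);
        return sink;
    }
}
```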

-Enrico

Re: performances of S3 writing with many buckets in parallel

Posted by Kostas Kloudas <kk...@gmail.com>.
Hi Enrico,

Nice to hear from you and thanks for checking it out!

This can be helpful for people using the BucketingSink, but I would
recommend switching to the StreamingFileSink, which is the "new
version" of the BucketingSink. In fact, the BucketingSink is going to
be removed in one of the upcoming releases, as it has been deprecated
for quite a while.

If you try the StreamingFileSink, let us know if the problem persists.
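For reference, a hedged sketch of the suggested migration: a StreamingFileSink writing row-encoded strings to S3. The path, element type, and rolling settings are illustrative assumptions, not taken from the thread.

```java
// Config sketch: StreamingFileSink replacement for the BucketingSink,
// mirroring the 5-minute inactivity threshold discussed above.
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class NewSinkConfig {
    public static StreamingFileSink<String> buildSink() {
        return StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/output"), // hypothetical path
                        new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withInactivityInterval(5 * 60 * 1000L)
                                .build())
                .build();
    }
}
```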

Cheers,
Kostas


On Fri, Feb 7, 2020 at 11:20 AM Enrico Agnoli <en...@workday.com> wrote:
>
> I finally found the time to dig a little more on this and found the real problem.
> [...]