You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Paul Bernier <pb...@splunk.com.INVALID> on 2020/07/31 17:58:24 UTC

[DISCUSS] StreamingFileSink: parallelizing active buckets checkpointing?

Hi all,

I was trying to use S3 StreamingFileSink with a high number of active buckets (>1000). I found that checkpointing duration will grow linearly with the number of active buckets, which makes achieving high number of active buckets difficult. One reason for that is that each active buckets are snapshotted sequentially in a loop<https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L245>. Given that operation involves waiting for some data to finish being uploaded to S3 that can become quite a long wait.

My question is: could this loop be safely multi-threaded?
Each Bucket seems independent (they do share the bucketWriter though). I have also done some basic prototyping and validation and it looks ok. So I wondering if I am overlooking anything and if this approach is viable?

Note: the same approach would also need to be applied to the onSuccessfulCompletionOfCheckpoint step with this while loop committing files to S3<https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L208>.

Thank you.

Paul


Re: [DISCUSS] StreamingFileSink: parallelizing active buckets checkpointing?

Posted by Till Rohrmann <tr...@apache.org>.
Hi Paul,

looking briefly at the code, it should be possible to execute the snapshot
procedure for every Bucket in snapshotActiveBuckets in parallel and to wait
for the result in the very same method. I've also pulled in Klou who
implemented this feature and who might give a more profound feedback.

Cheers,
Till

On Fri, Jul 31, 2020 at 7:59 PM Paul Bernier <pb...@splunk.com.invalid>
wrote:

> Hi all,
>
> I was trying to use S3 StreamingFileSink with a high number of active
> buckets (>1000). I found that checkpointing duration will grow linearly
> with the number of active buckets, which makes achieving high number of
> active buckets difficult. One reason for that is that each active buckets
> are snapshotted sequentially in a loop<
> https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L245>.
> Given that operation involves waiting for some data to finish being
> uploaded to S3 that can become quite a long wait.
>
> My question is: could this loop be safely multi-threaded?
> Each Bucket seems independent (they do share the bucketWriter though). I
> have also done some basic prototyping and validation and it looks ok. So I
> wondering if I am overlooking anything and if this approach is viable?
>
> Note: the same approach would also need to be applied to the
> onSuccessfulCompletionOfCheckpoint step with this while loop committing
> files to S3<
> https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L208
> >.
>
> Thank you.
>
> Paul
>
>