You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Till Rohrmann <tr...@apache.org> on 2020/08/03 08:33:44 UTC

Re: StreamingFileSink: any risk parallelizing active buckets checkpointing?

I've responded on the dev ML. Let's continue the discussion there:
https://lists.apache.org/thread.html/r251e395c759193d9c75f97b8bfc4917772219ea48bb1848ccc23d26e%40%3Cdev.flink.apache.org%3E

Cheers,
Till

On Thu, Jul 30, 2020 at 8:57 PM Paul Bernier <pb...@splunk.com> wrote:

> Hi experts,
>
>
>
> I am trying to use S3 StreamingFileSink with a high number of active
> buckets (>1000). I found that checkpointing duration will grow linearly
> with the number of active buckets, which makes achieving high number of
> active buckets difficult. One reason for that is the each active buckets
> are snapshotted sequentially in a loop
> <https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L245>.
> Given that operation involves waiting for some data to finish being
> uploaded to S3 that can become quite a long wait.
>
>
>
> My question is: could this loop be safely multi-threaded?
>
> Each Bucket seems independent (they do share the bucketWriter though). I
> have also done some basic prototyping and validation and it looks ok. So I
> wondering if I am overlooking anything and if my approach is viable?
>
>
>
> Note: the same approach would also need to be applied to the
> onSuccessfulCompletionOfCheckpoint step with this while loop committing
> files to S3
> <https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L208>
> .
>
>
>
> Thank you.
>
>
>
> Paul
>