Posted to user@flink.apache.org by Li Peng <li...@doordash.com> on 2019/11/26 01:59:21 UTC

Streaming Files to S3

Hey folks, I'm trying to stream a large volume of data and write it as CSV
files to S3, and one of the restrictions is to keep each file below 100 MB
(compressed) and to write one file per minute. I wanted to verify my
understanding of StreamingFileSink:

1. From the docs, StreamingFileSink will use multipart upload with S3, so
even with many workers writing to S3, it will still output only one file
for all of them for each time window, right?
2. StreamingFileSink.forRowFormat can be configured to write individual
rows and then commit to disk as per the above rules, by specifying a
RollingPolicy with the file size limit and the rollover interval, correct?
And do the limit and the interval apply to the entire file, not to each
part file?
3. To write to S3, is it enough to just add flink-s3-fs-hadoop as a
dependency and specify the file path as "s3://file"?

Thanks,
Li

Re: Streaming Files to S3

Posted by Arvid Heise <ar...@ververica.com>.
Hi Li,

The S3 file sink will write data into prefixes, with as many part-files as
the degree of parallelism. This structure comes from the good ol' Hadoop
days, where an output folder also contained part-files; the layout itself is
independent of S3. Each of the part-files, however, will be uploaded in a
multipart fashion, which is S3 specific.

For your questions:
1. It will create one part-file for each parallel instance of the window
operator. If you run the window operator and the sink with a parallelism of
1, you will get exactly one file per roll, as you wished.
2. A RollingPolicy can indeed be used, but again it applies at the
part-file level, so you would need a parallelism of 1 here as well. The
RollingPolicy also only decides when the next part-file is started, so you
would probably have to roll at something like 99 MB and hope that the last
record does not push the file over 100 MB. Alternatively, you live with some
files being a tad larger than 100 MB.
3. Yes, exactly. If you also want to use the Presto S3 filesystem
(flink-s3-fs-presto) for checkpoints in the future, it's safer to specify
"s3a://file" for the sink path. A rough sketch of how 2 and 3 could fit
together follows below.
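
To make 2 and 3 concrete, here is a minimal sketch of the sink setup,
assuming the Flink 1.9-era API; the class name, bucket path
(s3a://my-bucket/output) and the placeholder source are made up, so adjust
them to your actual job:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class CsvToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // StreamingFileSink only finalizes (commits) part-files on checkpoints.
        env.enableCheckpointing(60_000L);

        // Hypothetical placeholder source; replace with the real CSV-line stream.
        DataStream<String> csvLines = env.fromElements("a,b,c", "d,e,f");

        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/output"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.create()
                                .withRolloverInterval(60_000L)       // start a new part-file at least once a minute
                                .withMaxPartSize(99L * 1024 * 1024)  // roll a bit below the 100 MB limit
                                .build())
                .build();

        // Parallelism 1 on the sink gives you a single part-file per roll, as discussed above.
        csvLines.addSink(sink).setParallelism(1);

        env.execute("csv-to-s3");
    }
}

With the default bucket assigner you would end up with hourly prefixes and
part-files named roughly like .../2019-11-26--01/part-0-0; if you want a
different layout (e.g. per-minute prefixes), you can plug in a custom
BucketAssigner via withBucketAssigner.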

Best,

Arvid
