Posted to user@flink.apache.org by Chirag Dewan via user <us...@flink.apache.org> on 2023/03/29 07:04:57 UTC

Questions on S3 File Sink Behavior

Hi,

 


We are trying to use Flink's File sink to write files to AWS S3 storage. We are using the Flink-provided Hadoop s3a connector as a plugin.
We have some observations that we would like to clarify:

1. When using the File sink with the local filesystem, we can see that the sink creates 3 sets of files: in-progress, pending (on rolling) and finished (upon checkpointing). But with the S3 File sink we can see only the finished files in the S3 buckets.
So we wanted to understand: where does the sink create the in-progress and pending files for the S3 File sink?

2. We can also see that with the local filesystem sink, the in-progress and pending file names follow the pattern: .<prefix>-<uid>-<partFileIndex>.inprogress.uid-<suffix>

There is a dot at the beginning of the filename, so maybe Flink is trying to create these files as hidden files. But this is not mentioned in the Flink documentation.
So can we assume that the in-progress and pending filenames will always start with a dot?
Thanks a lot in advance.



Re: Questions on S3 File Sink Behavior

Posted by Mate Czagany <cz...@gmail.com>.
Hi,

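1. In the case of the S3 FileSystem, Flink uses the multipart upload
process [1] for better performance. Parts uploaded this way are not
visible as objects in the bucket until the upload is completed, which is
why you only see the finished files. It might not be obvious at first by
looking at the docs, but it's noted at the bottom of the FileSystem page [2].
For more information, you can also check FLINK-9751 and FLINK-9752.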

2. In the case of the local FileSystem, the name always starts with a dot
according to LocalRecoverableWriter [3], but make sure to check the
implementation of RecoverableWriter for the FileSystem you want to use.
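
For anyone who needs to tell finished files from in-progress ones on a local
filesystem, here is a minimal sketch of the naming convention discussed in
point 2. It is not Flink code: the class PartFileNames and both of its methods
are hypothetical helpers, and the ".<name>.inprogress.<uuid>" layout is an
assumption based on the observed file names and on the linked
LocalRecoverableWriter [3]. Other FileSystem implementations may name their
temporary files differently, or may not create visible files at all.

```java
import java.util.UUID;

// Illustrative helper only; this class is hypothetical and not part of Flink.
final class PartFileNames {

    // Marker observed in local in-progress/pending file names (assumption).
    private static final String MARKER = ".inprogress.";

    // Builds a staging name of the form ".<finalName>.inprogress.<random-uuid>".
    static String stagingNameFor(String finalName) {
        return "." + finalName + MARKER + UUID.randomUUID();
    }

    // Returns true if the name starts with a dot and contains the marker,
    // i.e. it looks like a local in-progress or pending part file.
    static boolean looksInProgress(String fileName) {
        return fileName.startsWith(".") && fileName.contains(MARKER);
    }

    public static void main(String[] args) {
        String staging = stagingNameFor("part-4a1b-0");
        System.out.println(staging + " -> " + looksInProgress(staging));
        System.out.println("part-4a1b-0 -> " + looksInProgress("part-4a1b-0"));
    }
}
```

A consumer reading the output directory could use a check like
looksInProgress to skip files that the sink has not yet finalized. Because the
dot prefix is not documented, treat it as an implementation detail that may
change between Flink versions.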

Regards,
Mate

[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/datastream/filesystem/#s3-specific
[3]
https://github.com/apache/flink/blob/1e0b58aa8d962469fa9dd7b470037aeaece43500/flink-core/src/main/java/org/apache/flink/core/fs/local/LocalRecoverableWriter.java#L129
