Posted to user@flink.apache.org by Dmytro Dragan <dd...@softserveinc.com> on 2020/06/16 10:03:17 UTC

Writing to S3 parquet files in Blink batch mode. Flink 1.10

Hi guys,

In our use case we are considering writing data to AWS S3 in Parquet format using Blink batch mode.
As far as I can see, the valid approach for writing Parquet files is to use StreamingFileSink with the Parquet bulk-encoded format, but according to the documentation and our tests it works only with OnCheckpointRollingPolicy.

Blink batch mode, however, requires checkpointing to be disabled.

Has anyone faced a similar issue?
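
For reference, a minimal sketch of the setup in question (the `Event` POJO and the bucket path are made up for illustration; note that it needs checkpointing enabled, which is exactly what batch mode forbids):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetS3Sketch {

    /** Simple POJO written to Parquet via Avro reflection. */
    public static class Event {
        public String value;
        public Event() {}
        public Event(String value) { this.value = value; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk-encoded formats only support OnCheckpointRollingPolicy:
        // part files are rolled and committed on each checkpoint.
        env.enableCheckpointing(60_000);

        StreamingFileSink<Event> sink = StreamingFileSink
            .forBulkFormat(
                new Path("s3://my-bucket/output"), // hypothetical bucket
                ParquetAvroWriters.forReflectRecord(Event.class))
            .build();

        env.fromElements(new Event("a"), new Event("b")).addSink(sink);
        env.execute("parquet-to-s3");
    }
}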


Re: Writing to S3 parquet files in Blink batch mode. Flink 1.10

Posted by Dmytro Dragan <dd...@softserveinc.com>.
Hi Jingsong,

Thank you for the detailed clarification.

Best regards,
Dmytro Dragan | ddrag@softserveinc.com | Lead Big Data Engineer | Big Data & Analytics | SoftServe


Re: Writing to S3 parquet files in Blink batch mode. Flink 1.10

Posted by Jingsong Li <ji...@gmail.com>.
Hi Dmytro,

Yes, batch mode requires checkpointing to be disabled, so StreamingFileSink
cannot be used in batch mode (StreamingFileSink requires checkpointing
regardless of the format). We are refactoring it to be more generic so that
it can also be used in batch mode, but that is a future topic.
Currently, in batch mode, a sink must use an `OutputFormat` with
`FinalizeOnMaster` instead of a `SinkFunction`, and the file committing must
be implemented in the `FinalizeOnMaster` method. If you have enough time,
you can implement a custom `OutputFormat`, but it is complicated.
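
A minimal sketch of the shape such a sink could take (the staging-file
scheme and paths are made up for illustration; a real implementation would
write Parquet through a BulkWriter and move files via Flink's FileSystem
API rather than java.nio):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.apache.flink.api.common.io.FinalizeOnMaster;
import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.configuration.Configuration;

/**
 * Each parallel task writes a hidden staging file; once the whole job
 * has finished, the master promotes all staging files to final names.
 */
public class StagedFileOutputFormat implements OutputFormat<String>, FinalizeOnMaster {

    private final String baseDir; // hypothetical output directory

    private transient PrintWriter writer;

    public StagedFileOutputFormat(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public void configure(Configuration parameters) {
        // No-op; could carry compression settings, naming scheme, etc.
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        // One staging file per parallel task, invisible until committed.
        writer = new PrintWriter(Files.newBufferedWriter(
            Paths.get(baseDir, ".staging-part-" + taskNumber)));
    }

    @Override
    public void writeRecord(String record) {
        writer.println(record);
    }

    @Override
    public void close() {
        // Finish this task's staging file; committing happens on the master.
        if (writer != null) {
            writer.close();
        }
    }

    @Override
    public void finalizeGlobal(int parallelism) throws IOException {
        // Runs once on the JobManager after all tasks have succeeded:
        // atomically promote every staging file to its final name.
        for (int task = 0; task < parallelism; task++) {
            java.nio.file.Path staged = Paths.get(baseDir, ".staging-part-" + task);
            if (Files.exists(staged)) {
                Files.move(staged, Paths.get(baseDir, "part-" + task),
                    StandardCopyOption.ATOMIC_MOVE);
            }
        }
    }
}

With the DataSet API this would plug in as `data.output(new StagedFileOutputFormat("/tmp/out"))`.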

The status quo is:
- For 1.10, Blink batch supports writing to Hive tables; if you can convert
your table to a Hive table backed by Parquet on S3, that works. [1]
- For 1.11, there is a new `filesystem` connector [2]; you can define a
table with Parquet on S3 and write to it by SQL (see the sketch after the
links below).
- For 1.11, moreover, both the Hive and filesystem connectors support
streaming writes by reusing StreamingFileSink internally.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/read_write_hive.html#writing-to-hive
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/filesystem.html
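
To make the 1.11 filesystem connector concrete, a minimal sketch of writing
Parquet to S3 by SQL (table names, schema, and the bucket path are made up;
the connector options follow [2]):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FilesystemConnectorSketch {
    public static void main(String[] args) {
        // Blink planner in batch mode, no checkpointing involved.
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .useBlinkPlanner()
            .inBatchMode()
            .build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Bounded demo source; any batch source table works here.
        tEnv.executeSql(
            "CREATE TABLE src (id BIGINT, name STRING) WITH (" +
            " 'connector' = 'datagen', 'number-of-rows' = '100')");

        // Filesystem connector table backed by Parquet files on S3.
        tEnv.executeSql(
            "CREATE TABLE s3_sink (id BIGINT, name STRING) WITH (" +
            " 'connector' = 'filesystem'," +
            " 'path' = 's3://my-bucket/output'," + // hypothetical bucket
            " 'format' = 'parquet')");

        // Submits the batch job; files are committed when it finishes.
        tEnv.executeSql("INSERT INTO s3_sink SELECT id, name FROM src");
    }
}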

Best,
Jingsong


-- 
Best, Jingsong Lee