Posted to dev@flink.apache.org by Kostas Kloudas <k....@data-artisans.com> on 2018/12/03 14:06:45 UTC

Re: support/docs for compression in StreamingFileSink

Hi Addison,

Sorry for the late reply.

I agree that the documentation can be significantly improved
and that adding compression could be a nice thing to have.

There is already a PR open that adds support for writing SequenceFiles with
the StreamingFileSink (https://github.com/apache/flink/pull/6774). Once it
is merged, you will be able to use compression when writing SequenceFiles.

If this is not enough and you want to write plain text and compress
it when you finalise your part-file, then you are right that you will need
to write your own BulkWriter.
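
To make the idea concrete, here is a minimal sketch of what such a writer
could look like. It is only an illustration under a few assumptions:
SimpleBulkWriter is a hypothetical stand-in that mirrors the three methods of
Flink's BulkWriter (addElement, flush, finish) so that the snippet compiles
without Flink on the classpath, and GzipStringWriter and GzipStringWriterDemo
are made-up names. In a real job you would implement Flink's BulkWriter
directly and hand the sink a BulkWriter.Factory that creates it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// ASSUMPTION: hypothetical stand-in for Flink's BulkWriter<T>, declared here
// only so that the sketch is self-contained.
interface SimpleBulkWriter<T> {
    void addElement(T element) throws IOException;
    void flush() throws IOException;
    void finish() throws IOException;
}

// Writes each element as one UTF-8 line into a gzip stream that wraps the
// part-file stream.
final class GzipStringWriter implements SimpleBulkWriter<String> {
    private final GZIPOutputStream gzip;

    GzipStringWriter(OutputStream partFileStream) throws IOException {
        // syncFlush = true so that flush() pushes pending compressed bytes out.
        this.gzip = new GZIPOutputStream(partFileStream, true);
    }

    @Override
    public void addElement(String element) throws IOException {
        gzip.write(element.getBytes(StandardCharsets.UTF_8));
        gzip.write('\n');
    }

    @Override
    public void flush() throws IOException {
        gzip.flush();
    }

    @Override
    public void finish() throws IOException {
        // Writes the gzip trailer but does NOT close the underlying stream.
        gzip.finish();
    }
}

// Hypothetical demo: writes lines into an in-memory "part file" and reads
// them back through a GZIPInputStream.
final class GzipStringWriterDemo {
    static String roundTrip(String... lines) {
        try {
            ByteArrayOutputStream partFile = new ByteArrayOutputStream();
            GzipStringWriter writer = new GzipStringWriter(partFile);
            for (String line : lines) {
                writer.addElement(line);
            }
            writer.finish();

            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (GZIPInputStream in = new GZIPInputStream(
                    new ByteArrayInputStream(partFile.toByteArray()))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    plain.write(buf, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("first line", "second line"));
    }
}
```

The detail to pay attention to is finish(): it calls
GZIPOutputStream.finish(), which completes the compressed data without closing
the underlying stream. As far as I know, that matches what the sink expects,
since it closes the part file itself.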

As you said, BulkWriters currently support only one RollingPolicy: they
roll on every checkpoint. There are plans to lift this limitation in the
future.

Cheers,
Kostas


On Thu, Nov 15, 2018 at 10:25 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Addison,
>
> I think it is a good idea to add some more details to the documentation.
> Thus, it would be great if you could contribute a description of how to
> enable compression.
>
> Concerning the RollingPolicy, I've pulled in Klou who might give you more
> details about the design decisions.
>
> Cheers,
> Till
>
> On Wed, Nov 14, 2018 at 10:07 PM Addison Higham <ad...@gmail.com>
> wrote:
>
>> Just noticed one detail about using the BulkWriter interface: you can no
>> longer assign a rolling policy. That makes sense for formats like
>> ORC/Parquet, but perhaps not for simple text compression.
>>
>>
>>
>> On Wed, Nov 14, 2018 at 1:43 PM Addison Higham <ad...@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > I am moving some code to use the StreamingFileSink. Currently, it doesn't
>> > look like there is any native support for compression (gzip or otherwise)
>> > built into Flink when using the StreamingFileSink. This seems like a
>> > really common need that, as far as I could tell, isn't represented in Jira.
>> >
>> > After a fair amount of digging, it seems like the way to do that is to
>> > implement the BulkWriter interface, where you can trivially wrap the
>> > output stream with something like a GZIPOutputStream.
>> >
>> > Until compression functionality is built into the StreamingFileSink, it
>> > might make sense to add some docs on how to use compression with the
>> > StreamingFileSink.
>> >
>> > I am willing to spend a bit of time documenting that, but before I do, I
>> > wanted to make sure that this is in fact the correct way to think about
>> > this problem, and to get your thoughts.
>> >
>> > Thanks!
>> >
>> >
>> >
>>
>