Posted to dev@flink.apache.org by Addison Higham <ad...@gmail.com> on 2018/11/14 20:43:22 UTC

support/docs for compression in StreamingFileSink

Hi all,

I am moving some code to use the StreamingFileSink. Currently, it doesn't
look like there is any native support for compression (gzip or otherwise)
built into Flink when using the StreamingFileSink. This seems like a really
common need that, as far as I could tell, isn't tracked in Jira.

After a fair amount of digging, it seems like the way to do that is to
implement the BulkWriter interface, where you can trivially wrap the
output stream with something like a GZIPOutputStream.
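
As a rough sketch of that idea (an illustration, not a tested Flink
integration): the code below wraps a part-file stream in a GZIPOutputStream
and writes one line per record. So that it runs without Flink on the
classpath, it declares a local stand-in interface with the same three
methods as Flink's BulkWriter (addElement, flush, finish). The names
GzipBulkWriterSketch, GzipLineWriter and gunzip are hypothetical, and a
ByteArrayOutputStream stands in for the part-file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBulkWriterSketch {

    // Local stand-in (assumption) mirroring the three methods of Flink's
    // org.apache.flink.api.common.serialization.BulkWriter<T>.
    interface BulkWriter<T> {
        void addElement(T element) throws IOException;
        void flush() throws IOException;
        void finish() throws IOException;
    }

    // Hypothetical writer: one gzip-compressed, newline-terminated line per record.
    static final class GzipLineWriter implements BulkWriter<String> {
        private final GZIPOutputStream gzip;

        GzipLineWriter(OutputStream partFileStream) throws IOException {
            // syncFlush = true so flush() pushes pending data through the compressor.
            this.gzip = new GZIPOutputStream(partFileStream, true);
        }

        @Override
        public void addElement(String element) throws IOException {
            gzip.write(element.getBytes(StandardCharsets.UTF_8));
            gzip.write('\n');
        }

        @Override
        public void flush() throws IOException {
            gzip.flush();
        }

        @Override
        public void finish() throws IOException {
            // Writes the gzip trailer but leaves the underlying stream open;
            // the caller (in Flink, the sink) is the one that closes it.
            gzip.finish();
        }
    }

    // Demo helper: decompress gzip bytes back into text.
    static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
        }
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream partFile = new ByteArrayOutputStream();
        BulkWriter<String> writer = new GzipLineWriter(partFile);
        writer.addElement("first record");
        writer.addElement("second record");
        writer.finish();
        // Round trip: print the decompressed contents of the "part file".
        System.out.print(gunzip(partFile.toByteArray()));
    }
}
```

In a real job, the writer would presumably implement Flink's own BulkWriter
and be returned from a BulkWriter.Factory passed to
StreamingFileSink.forBulkFormat(...). Calling finish() on the gzip stream
rather than close() is deliberate here, since the sink is expected to close
the part-file stream itself.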

Until compression functionality is built into the StreamingFileSink, it
might make sense to add some docs on how to use compression with it.

I am willing to spend a bit of time documenting that, but before I do, I
wanted to make sure that this is in fact the correct way to think about
the problem, and to get your thoughts.

Thanks!

Re: support/docs for compression in StreamingFileSink

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Addison,

Sorry for the late reply.

I agree that the documentation can be significantly improved
and that adding compression could be a nice thing to have.

There is already a PR open for supporting writing SequenceFiles with
the StreamingFileSink. When this gets merged, you will be able to use
compression when writing SequenceFiles (
https://github.com/apache/flink/pull/6774).

If this is not enough and you want to write plain-text and compress
it when you finalise your part-file, then you are right that you will need
to write your own BulkWriter.

As you said, BulkWriters support only one RollingPolicy, which rolls the
part file on every checkpoint, but there are plans to lift this limitation
in the future.

Cheers,
Kostas


On Thu, Nov 15, 2018 at 10:25 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Addison,
>
> I think it is a good idea to add some more details to the documentation.
> Thus, it would be great if you could contribute how to enable compression.
>
> Concerning the RollingPolicy, I've pulled in Klou who might give you more
> details about the design decisions.
>
> Cheers,
> Till

Re: support/docs for compression in StreamingFileSink

Posted by Till Rohrmann <tr...@apache.org>.
Hi Addison,

I think it is a good idea to add some more details to the documentation.
So it would be great if you could contribute a section on how to enable
compression.

Concerning the RollingPolicy, I've pulled in Klou who might give you more
details about the design decisions.

Cheers,
Till

On Wed, Nov 14, 2018 at 10:07 PM Addison Higham <ad...@gmail.com> wrote:

> Just noticed one detail about using the BulkWriter interface, you no longer
> can assign a rolling policy. That makes sense for formats like orc/parquet,
> but perhaps not for simple text compression.

Re: support/docs for compression in StreamingFileSink

Posted by Addison Higham <ad...@gmail.com>.
I just noticed one detail about using the BulkWriter interface: you can no
longer assign a rolling policy. That makes sense for formats like ORC/Parquet,
but perhaps not for simple text compression.


