Posted to user@flink.apache.org by Wesley Kerr <we...@gmail.com> on 2016/08/18 00:13:03 UTC

Compress DataSink Output

Hello -

Forgive me if this has been asked before, but I'm trying to determine the
best way to add compression to DataSink outputs (starting with
TextOutputFormat). Ideally, I would like each partition file (one per
parallel task) to be compressed independently with gzip, but I'm open to
other solutions.

My first thought was to extend TextOutputFormat with a new class that
compresses after closing and before returning, but I'm not sure that would
work across all possible file systems (S3, Local, and HDFS).
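
For illustration, here is the rough shape of what I have in mind (untested,
and the class name is just a placeholder). Rather than re-compressing the
file after close(), it wraps each task's output stream in a GZIPOutputStream
inside open(), so every parallel task writes its own .gz file and the file is
never touched again after it is closed:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.GZIPOutputStream;

import org.apache.flink.api.common.io.FileOutputFormat;
import org.apache.flink.core.fs.Path;

// Hypothetical format: gzip-compresses records on the fly, one file per task.
public class GzipTextOutputFormat<T> extends FileOutputFormat<T> {

    private transient Writer writer;

    public GzipTextOutputFormat(Path outputPath) {
        super(outputPath);
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        // super.open() creates this.stream on whatever file system the output
        // path points to (local, HDFS, S3, ...), so the gzip wrapping itself
        // is file-system agnostic.
        super.open(taskNumber, numTasks);
        this.writer = new OutputStreamWriter(new GZIPOutputStream(this.stream), "UTF-8");
    }

    @Override
    public void writeRecord(T record) throws IOException {
        writer.write(record.toString());
        writer.write('\n');
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            // Closing the writer finishes the gzip trailer and closes the
            // underlying stream before the parent cleans up.
            writer.close();
            writer = null;
        }
        super.close();
    }
}

Something like
data.output(new GzipTextOutputFormat<String>(new Path("hdfs:///tmp/out")))
should then produce one gzip file per parallel task.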

Any thoughts?

Thanks!

Wes

Re: Compress DataSink Output

Posted by Wesley Kerr <we...@gmail.com>.
That looks good.  Thanks!

Re: Compress DataSink Output

Posted by Robert Metzger <rm...@apache.org>.
Hi Wes,

Flink's own OutputFormats don't support compression, but we have some tools
to use Hadoop's OutputFormats with Flink [1], and those support
compression:
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html

Let me know if you need more information.

Regards,
Robert

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/hadoop_compatibility.html
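
To make that concrete, here is a rough, untested sketch of that route (the
output path and the sample data are placeholders): Hadoop's mapreduce
TextOutputFormat is wrapped in Flink's HadoopOutputFormat, and gzip is
enabled through the Hadoop Job configuration, which gives you one .gz file
per parallel task.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GzipOutputExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the real DataSet<String> you want to write out.
        DataSet<String> lines = env.fromElements("first line", "second line");

        // Hadoop's FileOutputFormat carries the path and compression settings
        // in the Job configuration.
        Job job = Job.getInstance();
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/out"));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        HadoopOutputFormat<Text, NullWritable> outputFormat =
                new HadoopOutputFormat<>(new TextOutputFormat<Text, NullWritable>(), job);

        // Hadoop output formats expect (key, value) pairs; NullWritable as the
        // value means only the text line ends up in the file.
        lines.map(new MapFunction<String, Tuple2<Text, NullWritable>>() {
                @Override
                public Tuple2<Text, NullWritable> map(String line) {
                    return new Tuple2<>(new Text(line), NullWritable.get());
                }
            })
            .output(outputFormat);

        env.execute("gzip-compressed text output");
    }
}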

