Posted to issues@flink.apache.org by "Huyen Levan (JIRA)" <ji...@apache.org> on 2018/10/07 07:56:00 UTC

[jira] [Commented] (FLINK-9753) Support Parquet for StreamingFileSink

    [ https://issues.apache.org/jira/browse/FLINK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640990#comment-16640990 ] 

Huyen Levan commented on FLINK-9753:
------------------------------------

[~kkl0u] In the description you mentioned two approaches: (1) encoding only when publishing, and (2) encoding directly while writing to the in-progress files. May I ask which approach you chose in the end? If both are provided, how can the user choose which one to use?
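
For context, below is a minimal sketch of how I would expect approach (2), encoding directly to Parquet, to look from the user side with the 1.6 StreamingFileSink builders (assuming ParquetAvroWriters from the flink-parquet module is the intended factory; the POJO, paths, and checkpoint interval are placeholders I made up, so this may not match the actual implementation):

{code:java}
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkSketch {

    /** Placeholder POJO, only for illustration. */
    public static class MyEvent {
        public String user;
        public long timestamp;

        public MyEvent() {}

        public MyEvent(String user, long timestamp) {
            this.user = user;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return user + "," + timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Part files of a bulk format are finalized on checkpoints.
        env.enableCheckpointing(60_000);

        DataStream<MyEvent> stream = env.fromElements(
                new MyEvent("alice", 1L),
                new MyEvent("bob", 2L));

        // Approach (2): records are pushed through the Parquet writer as they
        // arrive, and a part file is rolled on every checkpoint.
        StreamingFileSink<MyEvent> parquetSink = StreamingFileSink
                .forBulkFormat(new Path("/tmp/out-parquet"),
                        ParquetAvroWriters.forReflectRecord(MyEvent.class))
                .build();

        // Row-wise encoding, shown only for contrast with the builder API;
        // as far as I can tell this is not the staged approach (1).
        StreamingFileSink<MyEvent> rowSink = StreamingFileSink
                .forRowFormat(new Path("/tmp/out-rows"),
                        new SimpleStringEncoder<MyEvent>("UTF-8"))
                .build();

        stream.addSink(parquetSink);
        stream.addSink(rowSink);

        env.execute("StreamingFileSink format sketch");
    }
}
{code}

If approach (1), staging the raw TypeSerializer-encoded records and re-encoding them on publish, is also available, I could not find its entry point, which is what prompted the question above.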
Thanks!

> Support Parquet for StreamingFileSink
> -------------------------------------
>
>                 Key: FLINK-9753
>                 URL: https://issues.apache.org/jira/browse/FLINK-9753
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.0
>
>
> Formats like Parquet and ORC are great at compressing data and making it fast to scan/filter/project the data.
> However, these formats are only efficient if they can columnarize and compress a significant amount of data in their columnar format. If they compress only a few rows at a time, they produce many short column vectors and are thus much less efficient.
> The Bucketing Sink has the requirement that data is persisted to the target FileSystem on each checkpoint.
> Pushing data through a Parquet or ORC encoder and flushing on each checkpoint means that for frequent checkpoints, the amount of data compressed/columnarized in a block is small. Hence, the result is an inefficiently compressed file.
> Making this efficient independently of the checkpoint interval would mean that the sink needs to first collect (and persist) a good amount of data and then push it through the Parquet/ORC writers.
> I would suggest approaching this as follows:
>  - When writing to the "in progress files", write the raw records (TypeSerializer encoding)
>  - When the "in progress file" is rolled over (published), the sink pushes the data through the encoder.
>  - This is not much work on top of the new abstraction and will result in large blocks and hence in efficient compression.
> Alternatively, we can support directly encoding the stream to the "in progress files" via Parquet/ORC, if users know that their combination of data rate and checkpoint interval will result in large enough chunks of data per checkpoint interval.


