Posted to issues@flink.apache.org by "Huyen Levan (JIRA)" <ji...@apache.org> on 2018/10/07 07:56:00 UTC
[jira] [Commented] (FLINK-9753) Support Parquet for StreamingFileSink
[ https://issues.apache.org/jira/browse/FLINK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640990#comment-16640990 ]
Huyen Levan commented on FLINK-9753:
------------------------------------
[~kkl0u] In the description section you mentioned two approaches: (1) encode only when publishing, and (2) encode directly while writing to the in-progress files. May I ask which approach you chose in the end? If you provided both, how can the user choose between them?
Thanks!
> Support Parquet for StreamingFileSink
> -------------------------------------
>
> Key: FLINK-9753
> URL: https://issues.apache.org/jira/browse/FLINK-9753
> Project: Flink
> Issue Type: Sub-task
> Components: Streaming Connectors
> Reporter: Stephan Ewen
> Assignee: Kostas Kloudas
> Priority: Major
> Fix For: 1.6.0
>
>
> Formats like Parquet and ORC are great at compressing data and making it fast to scan/filter/project the data.
> However, these formats are only efficient if they can columnarize and compress a significant amount of data at once. If they compress only a few rows at a time, they produce many short column vectors and are thus much less efficient.
> The Bucketing Sink has the requirement that data is persistent on the target FileSystem on each checkpoint.
> Pushing data through a Parquet or ORC encoder and flushing on each checkpoint means that for frequent checkpoints, the amount of data compressed/columnarized in a block is small. Hence, the result is an inefficiently compressed file.
> Making this efficient independently of the checkpoint interval would mean that the sink needs to first collect (and persist) a good amount of data and then push it through the Parquet/ORC writers.
> I would suggest approaching this as follows:
> - When writing to the "in progress files" write the raw records (TypeSerializer encoding)
> - When the "in progress file" is rolled over (published), the sink pushes the data through the encoder.
> - This is not much work on top of the new abstraction and will result in large blocks and hence in efficient compression.
> Alternatively, we can support directly encoding the stream to the "in progress files" via Parquet/ORC, if users know that their combination of data rate and checkpoint interval will result in large enough chunks of data per checkpoint interval.
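
The trade-off between the two approaches in the description can be illustrated with a minimal sketch. It is not Flink code: zlib stands in for a Parquet/ORC encoder, and the function names, record shape, and checkpoint size are all hypothetical, chosen only to compare the encoded size when each checkpoint flushes its own small block against encoding one large block at roll-over.

```python
import json
import zlib


def encode_block(records):
    """Stand-in for a Parquet/ORC encoder (assumption): serialize and compress one block."""
    raw = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")
    return zlib.compress(raw)


def encode_per_checkpoint(records, records_per_checkpoint):
    """Direct encoding: every checkpoint flushes its own encoded block."""
    blocks = []
    for start in range(0, len(records), records_per_checkpoint):
        chunk = records[start:start + records_per_checkpoint]
        blocks.append(encode_block(chunk))
    return blocks


def encode_on_roll(records, records_per_checkpoint):
    """Deferred encoding: keep raw records per checkpoint, encode once at roll-over."""
    in_progress = []  # models the raw "in progress file"
    for start in range(0, len(records), records_per_checkpoint):
        # Checkpoint: raw records are persisted, nothing is encoded yet.
        in_progress.extend(records[start:start + records_per_checkpoint])
    # Roll-over (publish): push everything through the encoder as one block.
    return [encode_block(in_progress)]


# Hypothetical workload: 10,000 small, repetitive records, 10 records per checkpoint.
records = [
    {"user": f"user-{i % 50}", "event": "click", "value": i % 7}
    for i in range(10_000)
]

small_blocks = encode_per_checkpoint(records, records_per_checkpoint=10)
large_blocks = encode_on_roll(records, records_per_checkpoint=10)

per_checkpoint_bytes = sum(len(b) for b in small_blocks)
on_roll_bytes = sum(len(b) for b in large_blocks)

print("blocks:", len(small_blocks), "vs", len(large_blocks))
print("encoded bytes:", per_checkpoint_bytes, "vs", on_roll_bytes)
```

With frequent checkpoints, the per-checkpoint variant produces many small blocks that each compress poorly, while the roll-over variant produces a single well-compressed block. When the data rate per checkpoint interval is high enough, the gap narrows, which is the case the alternative (direct encoding) is aimed at.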