Posted to dev@flink.apache.org by "Wenlong Lyu (JIRA)" <ji...@apache.org> on 2016/12/08 10:01:09 UTC

[jira] [Created] (FLINK-5284) Make output of bucketing sink compatible with other processing framework like mapreduce

Wenlong Lyu created FLINK-5284:
----------------------------------

             Summary: Make output of bucketing sink compatible with other processing framework like mapreduce
                 Key: FLINK-5284
                 URL: https://issues.apache.org/jira/browse/FLINK-5284
             Project: Flink
          Issue Type: Improvement
          Components: filesystem-connector
            Reporter: Wenlong Lyu
            Assignee: Wenlong Lyu


Currently, the bucketing sink cannot move in-progress and pending files to their final location when the stream finishes. In addition, after recovery the current output file may contain invalid content, which can only be identified through the file-length meta file. Together these make the job's final output incompatible with other processing frameworks such as MapReduce. There are two things to do to solve the problem:
1. Add a direct output option to the bucketing sink, which writes output directly to the final file and deletes or truncates that same file on failover. Direct output will be especially useful for finite stream jobs, since it lets users migrate their batch jobs to streaming and take advantage of features such as checkpointing.
2. Add a truncate-by-copy option that lets the bucketing sink resize an output file by copying the valid content of the current file instead of creating a length meta file. Truncate by copy costs some extra IO, but leaves the output cleaner.
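To make the two recovery strategies above concrete, here is a minimal sketch of what "truncate" and "truncate by copy" could look like. It is illustrative only: it works on the local filesystem through java.nio, whereas the bucketing sink writes through Hadoop's FileSystem API. The class name TruncateSketch, both method names, and the ".truncating" temporary suffix are hypothetical and not part of Flink. The valid length is assumed to come from the restored checkpoint state.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Flink: shows the two ways of discarding
// the invalid tail of an output file after a failover.
class TruncateSketch {

    // Option 1 (direct output): cut the file back to the last valid length
    // in place. Only possible when the filesystem supports truncation.
    static void truncateInPlace(Path file, long validLength) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            channel.truncate(validLength);
        }
    }

    // Option 2 (truncate by copy): copy the first validLength bytes into a
    // temporary sibling file, then replace the original with it. This costs
    // extra IO but needs no truncate support and no length meta file.
    static void truncateByCopy(Path file, long validLength) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".truncating");
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = Files.newOutputStream(tmp)) {
            byte[] buffer = new byte[8192];
            long remaining = validLength;
            while (remaining > 0) {
                int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) {
                    throw new IOException("File is shorter than the valid length " + validLength);
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

In both cases the resulting file contains only valid records, so a downstream MapReduce job could read it without consulting a separate length file.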



