You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "zhangminglei (JIRA)" <ji...@apache.org> on 2018/07/02 06:16:00 UTC
[jira] [Updated] (FLINK-9609) Add bucket ready mechanism for BucketingSink when checkpoint complete

     [ https://issues.apache.org/jira/browse/FLINK-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhangminglei updated FLINK-9609:
--------------------------------
    Description: 
Currently, BucketingSink only support {{notifyCheckpointComplete}}. However, users want to do some extra work when a bucket is ready. It would be nice if we can support {{BucketReady}} mechanism for users or we can tell users when a bucket is ready for use. For example, One bucket is created for every 5 minutes, at the end of 5 minutes before creating the next bucket, the user might need to do something as the previous bucket ready, like sending the timestamp of the bucket ready time to a server or do some other stuff.

Here, Bucket ready means all the part files suffix name under a bucket neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is ready for user use. Like a watermark means no elements with a timestamp older or equal to the watermark timestamp should arrive at the window. We can also refer to the concept of watermark here, or we can call this *BucketWatermark* if we could.

Recently, I found a user who wants this functionality which I would think.
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html

Below is what he said:

My user case is we read data from message queue, write to HDFS, and our ETL team will use the data in HDFS. *In the case, ETL need to know if all data is ready to be read accurately*, so we use a counter to count how many data has been wrote, if the counter is equal to the number we received, we think HDFS file is ready. We send the counter message in a custom sink so ETL can know how many data has been wrote, but if use current BucketingSink, even through HDFS file is flushed, ETL may still cannot read the data. If we can close file during checkpoint, then the result is accurately. And for the HDFS small file problem, it can be controller by use bigger checkpoint interval. 

  was:
Currently, BucketingSink only support {{notifyCheckpointComplete}}. However, users want to do some extra work when a bucket is ready. It would be nice if we can support {{BucketReady}} mechanism for users or we can tell users when a bucket is ready for use. For example, One bucket is created for every 5 minutes, at the end of 5 minutes before creating the next bucket, the user might need to do something as the previous bucket ready, like sending the timestamp of the bucket ready time to a server or do some other stuff.

Here, Bucket ready means all the part files suffix name under a bucket neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is ready for user use. Like a watermark means no elements with a timestamp older or equal to the watermark timestamp should arrive at the window. We can also refer to the concept of watermark here, or we can call this *BucketWatermark* if we could.

Recently, I found a user who wants this functionality which I think.
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html

Below is what he said:

My user case is we read data from message queue, write to HDFS, and our ETL team will use the data in HDFS. *In the case, ETL need to know if all data is ready to be read accurately*, so we use a counter to count how many data has been wrote, if the counter is equal to the number we received, we think HDFS file is ready. We send the counter message in a custom sink so ETL can know how many data has been wrote, but if use current BucketingSink, even through HDFS file is flushed, ETL may still cannot read the data. If we can close file during checkpoint, then the result is accurately. And for the HDFS small file problem, it can be controller by use bigger checkpoint interval. 


> Add bucket ready mechanism for BucketingSink when checkpoint complete
> ---------------------------------------------------------------------
>
>                 Key: FLINK-9609
>                 URL: https://issues.apache.org/jira/browse/FLINK-9609
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector, Streaming Connectors
>    Affects Versions: 1.5.0, 1.4.2
>            Reporter: zhangminglei
>            Assignee: zhangminglei
>            Priority: Major
>
> Currently, BucketingSink only support {{notifyCheckpointComplete}}. However, users want to do some extra work when a bucket is ready. It would be nice if we can support {{BucketReady}} mechanism for users or we can tell users when a bucket is ready for use. For example, One bucket is created for every 5 minutes, at the end of 5 minutes before creating the next bucket, the user might need to do something as the previous bucket ready, like sending the timestamp of the bucket ready time to a server or do some other stuff.
> Here, Bucket ready means all the part files suffix name under a bucket neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is ready for user use. Like a watermark means no elements with a timestamp older or equal to the watermark timestamp should arrive at the window. We can also refer to the concept of watermark here, or we can call this *BucketWatermark* if we could.
> Recently, I found a user who wants this functionality which I would think.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html
> Below is what he said:
> My user case is we read data from message queue, write to HDFS, and our ETL team will use the data in HDFS. *In the case, ETL need to know if all data is ready to be read accurately*, so we use a counter to count how many data has been wrote, if the counter is equal to the number we received, we think HDFS file is ready. We send the counter message in a custom sink so ETL can know how many data has been wrote, but if use current BucketingSink, even through HDFS file is flushed, ETL may still cannot read the data. If we can close file during checkpoint, then the result is accurately. And for the HDFS small file problem, it can be controller by use bigger checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)