Posted to user@flink.apache.org by 徐涛 <ha...@gmail.com> on 2018/10/12 09:20:46 UTC

Questions in sink exactly once implementation

Hi 
	I am reading the book “Introduction to Apache Flink”, which mentions two ways to achieve exactly-once at the sink:
	1. The first way is to buffer all output at the sink and commit it atomically when the sink receives a checkpoint record.
	2. The second way is to eagerly write data to the output, keeping in mind that some of this data might be “dirty” and replayed after a failure. If there is a failure, we need to roll back the output, overwriting and effectively deleting the dirty data that has already been written.

	I read the code of the Elasticsearch sink and found a flushOnCheckpoint option: if set to true, changes accumulate until a checkpoint is made. I guess this only guarantees at-least-once delivery, because although it uses batch flushes, the flush is not an atomic action, so it cannot guarantee exactly-once delivery.

	My questions are:
	1. As many sinks do not support transactions, in those cases I have to choose way 2 to achieve exactly-once. I would then have to buffer all the records between checkpoints and delete them on rollback, which is a rather heavy operation.
	2. I guess a MySQL sink should support exactly-once delivery because it supports transactions, but then I have to size each batch by the number of records between checkpoints rather than by a fixed number such as 100. When there are a lot of records between checkpoints, that is also a heavy operation.

Best
Henry

Re: Questions in sink exactly once implementation

Posted by Hequn Cheng <ch...@gmail.com>.
Hi Henry,

> 1. I have heard of an idempotent way but I do not know how to implement
> it; would you please enlighten me with an example?
It's a property of the result data. For example, you can overwrite old
values with new ones using a primary key.

> 2. If dirty data are *added* but not updated
This works against idempotence. Idempotence ensures that the result is
consistent in the end.

> 3. If using two-phase commit, the sink must support transactions.
I think the answer is yes.

Best, Hequn


Re: Questions in sink exactly once implementation

Posted by 徐涛 <ha...@gmail.com>.
Hi Hequn,
	Thanks a lot for your response. I have a few questions about this topic. Would you please help me with them?
	1. I have heard of an idempotent way but I do not know how to implement it; would you please enlighten me with an example?
	2. If dirty data are added but not updated, then overwriting alone is not enough, I think.
	3. If using two-phase commit, the sink must support transactions.
		3.1 If the sink does not support transactions, for example Elasticsearch, do I have to use the idempotent way to implement exactly-once?
		3.2 If the sink supports transactions, for example MySQL, then both the idempotent way and two-phase commit are OK. But as you say, if there are a lot of records between checkpoints, the batch insert is a heavy operation, so I still have to use the idempotent way to implement exactly-once.


Best
Henry



Re: Questions in sink exactly once implementation

Posted by Hequn Cheng <ch...@gmail.com>.
Hi Henry,

Yes, exactly-once using the atomic way is heavy for MySQL. However, you don't
have to buffer data if you choose option 2. You can simply overwrite old
records with new ones if the result data is idempotent, and this also
achieves exactly-once.
There is a document about End-to-End Exactly-Once Processing in Apache
Flink[1], which may be helpful for you.

Best, Hequn

[1]
https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html


