You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by vijikarthi <vi...@yahoo.com> on 2019/05/22 00:30:37 UTC

Checkpoint / Two Phase Commit

Question regarding end-to-end exactly once guarantee implementation using
2PC? 

As I understand how it operates, the pre-phase state is when the checkpoint
is initiated and the checkpoint barrier advances from source to sink. Once
the pre-phase is complete (and successful), then the next step in the
process is where the "Sink" operator is expected to "Commit" the transaction
(the data that was part of the checkpointed state). The datasource backed by
the Sink operator is expected to commit the transaction.

Say if the transaction times out and data cannot be committed, is there an
option to roll back the checkpointed state without incurring the data loss?
As of now, it does not work this way but I am trying to understand the
challenges/limitations with respect to discarding the checkpoint in
question?

Once reason could be with respect to (what to do with) the subsequent
checkpoints that might have advanced during this scenario? Anything else?

Regards
Vijay



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Re: Checkpoint / Two Phase Commit

Posted by vijikarthi <vi...@yahoo.com>.

Pinging again on this thread to see if anyone has any recommendations?
Particularly, I am interested to understand whether this scenario is also
applicable for Kafka Connector?



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Re: Checkpoint / Two Phase Commit

Posted by Piotr Nowojski <pi...@ververica.com>.

Hi,

Sorry for late answer.

> As I understand how it operates, the pre-phase state is when the checkpoint
> is initiated and the checkpoint barrier advances from source to sink. Once
> the pre-phase is complete (and successful), then the next step in the
> process is where the "Sink" operator is expected to "Commit" the transaction
> (the data that was part of the checkpointed state). The datasource backed by
> the Sink operator is expected to commit the transaction.

Yes you are correct. However keep in mind, that before we move to “Commit” phase, all of the participants of the pre commit must have completed it successfully.

> Say if the transaction times out and data cannot be committed, is there an
> option to roll back the checkpointed state without incurring the data loss?
> As of now, it does not work this way but I am trying to understand the
> challenges/limitations with respect to discarding the checkpoint in
> question?


No, it is impossible to rollback the data in case of timeout, because some of the other actors (different parallel instances of the sink) could have already committed the data. Once everybody acknowledged “pre-commit” AND the job master started “commit” phase, there is no going back, all commits must succeed eventually (intermittent failures are allowed) or there will be a data loss.

> 
> Once reason could be with respect to (what to do with) the subsequent
> checkpoints that might have advanced during this scenario? Anything else?

Sorry, I’m not sure if understood the question. Depending on an external system, but generally if a transaction has timed out’ed, the data is lost, regardless of whether there was a more recent batch of records written in some more recent transaction, that could have be committed successfully.

Piotrek

> On 22 May 2019, at 02:30, vijikarthi <vi...@yahoo.com> wrote:
> 
> Question regarding end-to-end exactly once guarantee implementation using
> 2PC? 
> 
> As I understand how it operates, the pre-phase state is when the checkpoint
> is initiated and the checkpoint barrier advances from source to sink. Once
> the pre-phase is complete (and successful), then the next step in the
> process is where the "Sink" operator is expected to "Commit" the transaction
> (the data that was part of the checkpointed state). The datasource backed by
> the Sink operator is expected to commit the transaction.
> 
> Say if the transaction times out and data cannot be committed, is there an
> option to roll back the checkpointed state without incurring the data loss?
> As of now, it does not work this way but I am trying to understand the
> challenges/limitations with respect to discarding the checkpoint in
> question?
> 
> Once reason could be with respect to (what to do with) the subsequent
> checkpoints that might have advanced during this scenario? Anything else?
> 
> Regards
> Vijay
> 
> 
> 
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/