Posted to user@flink.apache.org by "F.Amara" <fa...@wso2.com> on 2017/04/17 04:03:48 UTC

Data duplication on a High Availability activated cluster after a Task Manager failure recovery

Hi all,

I'm using Flink 1.2.0. I have a distributed system where Flink High
Availability feature is activated. Data is produced using a Kafka broker and
on a TM failure scenario, the cluster restarts. Checkpointing is enabled
with exactly once processing.
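For reference, my checkpointing setup looks roughly like the following (a simplified sketch; the interval shown is a placeholder, not my actual value):

```java
// Flink 1.2 job setup (sketch): enable checkpointing in exactly-once mode
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5 s (placeholder interval)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
```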
The problem is that, at the end of data processing, I receive duplicated
data and some data is also missing (e.g., if 2000 events are sent, around
800 events are lost and some events arrive duplicated at the receiving end).

Is this an issue with this version of Flink, or could it be an issue in my
program logic?



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Data-duplication-on-a-High-Availability-activated-cluster-after-a-Task-Manager-failure-recovery-tp12627.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Data duplication on a High Availability activated cluster after a Task Manager failure recovery

Posted by "F.Amara" <fa...@wso2.com>.
Hi Gordon,

Appreciate your prompt reply. Thanks a lot for pointing out that the Kafka
producer has only an at-least-once guarantee of message delivery. That seems
to be the reason I encountered duplicated data in the Flink failure recovery
scenario.




Re: Data duplication on a High Availability activated cluster after a Task Manager failure recovery

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi,

A few things to clarify first:

1. What is the sink you are using? Checkpointing in Flink allows for exactly-once state updates. Whether or not end-to-end exactly-once delivery can be achieved depends on the sink. For data store sinks such as Cassandra / Elasticsearch, this can be made effectively exactly-once using idempotent writes (depending on the application logic). For a Kafka topic as a sink, currently the delivery is only at-least-once. You can check out [1] for an overview.
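To illustrate the idempotent-writes point, here is a plain-Java sketch (not Flink API; the store, event IDs, and values are made up for illustration). Because each record is upserted under a stable key, replaying the same record after a restart leaves the store unchanged, so at-least-once delivery becomes effectively exactly-once in the store:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentSinkDemo {
    // Simulated data store keyed by a stable event ID.
    static Map<String, Integer> store = new HashMap<>();

    // Upsert by key: writing the same record twice has no extra effect.
    static void idempotentWrite(String eventId, int value) {
        store.put(eventId, value);
    }

    public static void main(String[] args) {
        String[][] deliveries = {
            {"evt-1", "10"}, {"evt-2", "20"},
            {"evt-2", "20"}, // duplicate delivery after a restart
            {"evt-1", "10"}  // duplicate delivery after a restart
        };
        for (String[] d : deliveries) {
            idempotentWrite(d[0], Integer.parseInt(d[1]));
        }
        // 4 writes, but only 2 distinct records end up in the store
        System.out.println(store.size());
    }
}
```

This is essentially what a Cassandra INSERT (which is an upsert) gives you, provided the application chooses a deterministic key per event.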

2. Also note that if there are already duplicates in the consumed Kafka topic (which can occur, since Kafka producing does not support any kind of transactions at the moment), then they will all be consumed and processed by Flink.
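One common mitigation for duplicates already present in the topic is to deduplicate by a stable event ID inside the job (in Flink this would typically use keyed state; the plain-Java sketch below only shows the idea, and the IDs and payloads are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupByEventId {
    // Forward each payload only the first time its event ID is seen.
    public static List<String> dedup(List<String[]> events) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String[] e : events) {
            String id = e[0], payload = e[1];
            if (seen.add(id)) { // add() returns false if id was already seen
                out.add(payload);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> in = Arrays.asList(
            new String[]{"e1", "a"},
            new String[]{"e2", "b"},
            new String[]{"e1", "a"}); // duplicate from the source topic
        System.out.println(dedup(in)); // only first occurrences survive
    }
}
```

In a real job the "seen" set would live in Flink keyed state so that it is checkpointed along with the rest of the job, and it would need some expiry policy to keep it bounded.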

However, this does not explain the missing data; that should not happen.
So for that, yes, I would check whether there's an issue with the application logic, or whether the events simply were never in the consumed Kafka topic in the first place.

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html

