Posted to users@kafka.apache.org by Aniket Bhatnagar <an...@gmail.com> on 2013/10/24 12:08:35 UTC

Understanding how producers and consumers behave in case of node failures in 0.8

I am trying to understand and document how producers and consumers
will/should behave in case of node failures in 0.8. I know there are
various other threads that discuss this, but I wanted to bring all the
information together in one post. This should also help people building
producers and consumers in other languages. Here is my understanding
of how Kafka behaves when nodes fail:

Case 1: A node fails that wasn't the leader for any partition
No impact on producers or consumers.

Case 2: A leader node fails, but another in-sync node can become the
leader
All publishing to and consumption from the partition whose leader failed
will momentarily stop until a new leader is elected. Producers should
implement retry logic for this case (and in fact for all kinds of errors
from Kafka). Consumers can, depending on the use case, either move on
to other partitions after a reasonable number of retries (if they are
fetching from partitions in round-robin fashion) or keep retrying until
a leader is available.
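
As a rough illustration of the first consumer option in Case 2, the sketch
below makes one round-robin pass and moves on from a partition after a fixed
number of failed fetches. It does not use any real Kafka client: the
LeaderUnavailable error, the fetch callable and the max_retries parameter are
all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List


class LeaderUnavailable(Exception):
    """Hypothetical error: the partition currently has no reachable leader."""


def fetch_round_robin(
    partitions: List[int],
    fetch: Callable[[int], List[str]],
    max_retries: int = 3,
) -> Dict[int, List[str]]:
    """Make one pass over the partitions, skipping any that keep failing."""
    results: Dict[int, List[str]] = {}
    for partition in partitions:
        for _attempt in range(max_retries):
            try:
                results[partition] = fetch(partition)
                break
            except LeaderUnavailable:
                # A real client would back off and refresh metadata here.
                continue
        # If every attempt failed, the partition is absent from results and
        # the pass continues instead of blocking the whole consumer.
    return results


def fake_fetch(partition: int) -> List[str]:
    """Stand-in fetch: partition 1 has lost its leader, the others work."""
    if partition == 1:
        raise LeaderUnavailable()
    return [f"msg-from-{partition}"]


print(fetch_round_robin([0, 1, 2], fake_fetch))
```

The alternative (keep retrying the same partition until a leader appears)
amounts to replacing the bounded loop with an unbounded one, at the cost of
stalling the other partitions in the meantime.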

Case 3: A leader node goes down and no other in-sync node is available
In this case, publishing to and consumption from the partition will halt
and will not resume until the faulty leader node recovers. Producers
should fail the publish request after a reasonable number of retries
and invoke a callback so that the client of the producer can take
corrective action. Consumers again have a choice: move on to other
partitions after a reasonable number of retries (if they are fetching
from partitions in round-robin fashion) or keep retrying until a leader
is available. In the latter case, the entire consumer process will halt
until the faulty node recovers.
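
The producer side of Case 3 (retry a bounded number of times, then hand the
failure to the caller) could look roughly like the sketch below. It is only
an illustration under assumptions: SendFailed, the send callable, on_failure
and the retry/backoff parameters are hypothetical and not part of any Kafka
client API.

```python
import time
from typing import Callable, Optional


class SendFailed(Exception):
    """Hypothetical error for any failed publish attempt."""


def publish_with_retries(
    send: Callable[[bytes], None],
    message: bytes,
    on_failure: Callable[[bytes, Optional[Exception]], None],
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> bool:
    """Try to publish; after max_retries failures, notify the caller."""
    last_error: Optional[Exception] = None
    for _attempt in range(max_retries):
        try:
            send(message)
            return True
        except SendFailed as error:
            last_error = error
            time.sleep(backoff_seconds)  # give the cluster time to recover
    # All attempts failed: let the client decide on corrective action.
    on_failure(message, last_error)
    return False


failed_messages = []


def always_down(message: bytes) -> None:
    """Stand-in send for a partition whose leader never comes back."""
    raise SendFailed("no leader for partition")


ok = publish_with_retries(
    always_down, b"event", lambda msg, err: failed_messages.append(msg)
)
```

Here ok ends up False and the message lands in failed_messages, where the
application could log it, buffer it, or route it elsewhere.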

Do I have this right?

Re: Understanding how producers and consumers behave in case of node failures in 0.8

Posted by Aniket Bhatnagar <an...@gmail.com>.

I am planning to use the Kafka 0.8 spout, and after studying the source code
I found that it doesn't handle errors. There is a fork that adds a try/catch
around the use of fetchResponse, but my guess is that this will lead to the
spout retrying the same partition indefinitely until a leader is elected or
comes back online. I will try to submit a pull request to the kafka 0.8 plus
spout to fix this issue in a couple of days.
 On 25 Oct 2013 02:58, "Chris Bedford" <ch...@buildlackey.com> wrote:

> Thank you for posting these guidelines. I'm wondering if anyone out there
> who is using the Kafka spout (for Storm) knows whether or not the Kafka
> spout takes care of these types of details?
>
>  regards
>  -chris

Re: Understanding how producers and consumers behave in case of node failures in 0.8

Posted by Chris Bedford <ch...@buildlackey.com>.

Thank you for posting these guidelines. I'm wondering if anyone out there
who is using the Kafka spout (for Storm) knows whether or not the Kafka
spout takes care of these types of details?

 regards
 -chris



On Thu, Oct 24, 2013 at 2:05 PM, Neha Narkhede <ne...@gmail.com> wrote:

> Yes, when a leader dies, the preference is to pick a leader from the ISR.
> If not, the leader is picked from any other available replica. But if no
> replicas are alive, the partition goes offline and all production and
> consumption halts, until at least one replica is brought online.
>
> Thanks,
> Neha



-- 
Chris Bedford

Founder & Lead Lackey
Build Lackey Labs:  http://buildlackey.com
Go Grails!: http://blog.buildlackey.com

Re: Understanding how producers and consumers behave in case of node failures in 0.8

Posted by Neha Narkhede <ne...@gmail.com>.

Yes, when a leader dies, the preference is to pick a leader from the ISR.
If not, the leader is picked from any other available replica. But if no
replicas are alive, the partition goes offline and all production and
consumption halts, until at least one replica is brought online.

Thanks,
Neha
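
The election preference Neha describes above can be sketched as a small
function. This is only an illustration of the stated order of preference
(ISR first, then any other live replica, otherwise offline), not the broker's
actual code; pick_leader and its parameters are hypothetical, and taking the
first candidate in each list is an assumption made for simplicity.

```python
from typing import List, Optional, Set


def pick_leader(
    assigned_replicas: List[int],
    isr: List[int],
    live_brokers: Set[int],
) -> Optional[int]:
    """Return the broker id of the new leader, or None if the partition is offline."""
    # Preference 1: a live replica that is in the in-sync replica set (ISR).
    live_isr = [broker for broker in isr if broker in live_brokers]
    if live_isr:
        return live_isr[0]
    # Preference 2: any other assigned replica that is still alive.
    live_replicas = [b for b in assigned_replicas if b in live_brokers]
    if live_replicas:
        return live_replicas[0]
    # No replica is alive: the partition stays offline until one returns.
    return None


print(pick_leader([1, 2, 3], isr=[1, 2], live_brokers={2, 3}))
print(pick_leader([1, 2, 3], isr=[1], live_brokers={3}))
print(pick_leader([1, 2, 3], isr=[1], live_brokers=set()))
```

The three calls cover the three situations in the thread: an ISR member takes
over, a non-ISR replica takes over, and no replica is left.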


On Thu, Oct 24, 2013 at 11:57 AM, Kane Kane <ka...@gmail.com> wrote:

> > publishing to and consumption from the partition will halt
> > and will not resume until the faulty leader node recovers
>
> Can you confirm that's the case? I think they won't wait for the leader
> to recover and will instead try to elect a new leader from the existing
> non-ISR replicas. And if they do wait, what happens if the faulty leader
> never comes back?

Re: Understanding how producers and consumers behave in case of node failures in 0.8

Posted by Kane Kane <ka...@gmail.com>.

> publishing to and consumption from the partition will halt
> and will not resume until the faulty leader node recovers

Can you confirm that's the case? I think they won't wait for the leader
to recover and will instead try to elect a new leader from the existing
non-ISR replicas. And if they do wait, what happens if the faulty leader
never comes back?


On Thu, Oct 24, 2013 at 6:24 AM, Aniket Bhatnagar <
aniket.bhatnagar@gmail.com> wrote:

> Thanks Neha

Re: Understanding how producers and consumers behave in case of node failures in 0.8

Posted by Aniket Bhatnagar <an...@gmail.com>.

Thanks Neha


On 24 October 2013 18:11, Neha Narkhede <ne...@gmail.com> wrote:

> Yes. And during retries, the producer and consumer refetch metadata.
>
> Thanks,
> Neha

Re: Understanding how producers and consumers behave in case of node failures in 0.8

Posted by Neha Narkhede <ne...@gmail.com>.

Yes. And during retries, the producer and consumer refetch metadata.

Thanks,
Neha
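
For anyone writing a client in another language, the "refetch metadata during
retries" behaviour mentioned above might be sketched as follows. Nothing here
is a real Kafka API: NotLeader, fetch_metadata, send_to and the leaders cache
are hypothetical names used only to show the shape of the loop.

```python
from typing import Callable, Dict


class NotLeader(Exception):
    """Hypothetical error: the contacted broker no longer leads the partition."""


def send_with_metadata_refresh(
    partition: int,
    message: bytes,
    leaders: Dict[int, str],
    fetch_metadata: Callable[[], Dict[int, str]],
    send_to: Callable[[str, int, bytes], None],
    max_retries: int = 3,
) -> bool:
    """Send to the cached leader; on failure, refresh the cache and retry."""
    for _attempt in range(max_retries):
        broker = leaders.get(partition)
        if broker is not None:
            try:
                send_to(broker, partition, message)
                return True
            except NotLeader:
                pass  # cached leader is stale, fall through to the refresh
        # Replace the cached partition-to-leader map with fresh metadata.
        leaders.clear()
        leaders.update(fetch_metadata())
    return False


delivered = []


def fake_send_to(broker: str, partition: int, message: bytes) -> None:
    """Stand-in transport: only broker-b currently leads partition 0."""
    if broker != "broker-b":
        raise NotLeader(broker)
    delivered.append((broker, partition, message))


cache = {0: "broker-a"}  # stale: broker-a has failed
sent_ok = send_with_metadata_refresh(
    0, b"event", cache, lambda: {0: "broker-b"}, fake_send_to
)
```

In this run the first attempt hits the stale leader, the cache is refreshed,
and the second attempt succeeds against broker-b.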