Posted to dev@samza.apache.org by Chinmay Soman <ch...@gmail.com> on 2015/02/19 19:44:55 UTC

Question on hello-samza (Kafka startup and shutdown)

Sending to a wider audience to know if anyone is also seeing this issue.

It seems Kafka gets into a weird state every time I do bin/grid stop all (and
then start all).

I keep getting a LeaderNotAvailable exception on the producer side. It
seems this happens every time Kafka hasn't been shut down properly. This
issue goes away if I use the following sequence:

* bin/grid stop kafka
* bin/grid stop zookeeper (after like 5 seconds).

(and then start everything).
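
The sequence above can be sketched as a small script. This is only an
illustration and not part of hello-samza: the function names, the `run` and
`sleep` parameters, and the 5-second default are assumptions; only the two
`bin/grid stop` commands come from this message.

```python
import subprocess
import time


def _run_checked(cmd):
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


def stop_grid_in_order(run=_run_checked, sleep=time.sleep, delay_seconds=5.0):
    """Stop Kafka first, pause, then stop ZooKeeper.

    `run` receives each command as a list of strings, so a recording or
    dry-run callable can be swapped in.
    """
    # Kafka is stopped first, while ZooKeeper is still running.
    run(["bin/grid", "stop", "kafka"])
    # Hypothetical grace period; the message only says "like 5 seconds".
    sleep(delay_seconds)
    run(["bin/grid", "stop", "zookeeper"])
```

Calling `stop_grid_in_order()` from the hello-samza checkout directory would
run both commands for real; passing `run=print` gives a dry run.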

Has anyone else seen this?

-- 
Thanks and regards

Chinmay Soman

Re: Question on hello-samza (Kafka startup and shutdown)

Posted by Chinmay Soman <ch...@gmail.com>.

That might be it - yes! I do see infinite retries on the Zookeeper
connections. Thanks for pointing it out.

On Fri, Feb 20, 2015 at 10:31 AM, Aditya Auradkar <
aauradkar@linkedin.com.invalid> wrote:

> Hey Chinmay,
>
> I remember someone else having this issue with Kafka + Zookeeper. IIRC,
> the cause was ZkClient blocking indefinitely.
>
> You may find this useful.
> https://issues.apache.org/jira/browse/KAFKA-1907
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201501.mbox/browser
>
> Aditya



-- 
Thanks and regards

Chinmay Soman

RE: Question on hello-samza (Kafka startup and shutdown)

Posted by Aditya Auradkar <aa...@linkedin.com.INVALID>.

Hey Chinmay,

I remember someone else having this issue with Kafka + Zookeeper. IIRC, the cause was ZkClient blocking indefinitely.

You may find this useful.
https://issues.apache.org/jira/browse/KAFKA-1907
http://mail-archives.apache.org/mod_mbox/kafka-dev/201501.mbox/browser

Aditya

________________________________________
From: Chinmay Soman [chinmay.cerebro@gmail.com]
Sent: Friday, February 20, 2015 10:15 AM
To: dev@samza.apache.org
Subject: Re: Question on hello-samza (Kafka startup and shutdown)

I haven't really figured it out. But just to clarify - I'm not starting and
stopping within 5 seconds of each other - it's more like a couple of hours.

The Kafka process is indeed still running even after stop all: it seems to be
waiting on Zookeeper (doing a lot of retries). If I bring up Zookeeper
again, then the Kafka process shuts down cleanly :) But yes - in most
cases I'm using SIGKILL and not SIGTERM to resolve this.

This is not really an urgent issue - I was just curious - what's really
happening?




--
Thanks and regards

Chinmay Soman

Re: Question on hello-samza (Kafka startup and shutdown)

Posted by Chinmay Soman <ch...@gmail.com>.

I haven't really figured it out. But just to clarify - I'm not starting and
stopping within 5 seconds of each other - it's more like a couple of hours.

The Kafka process is indeed still running even after stop all: it seems to be
waiting on Zookeeper (doing a lot of retries). If I bring up Zookeeper
again, then the Kafka process shuts down cleanly :) But yes - in most
cases I'm using SIGKILL and not SIGTERM to resolve this.
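
One way to avoid stopping Kafka while ZooKeeper is already gone is to probe
ZooKeeper first. A minimal sketch, assuming ZooKeeper's `ruok` four-letter
command is enabled (newer ZooKeeper releases may require it to be whitelisted)
and that ZooKeeper listens on its default client port 2181; the function name
`zookeeper_is_ok` is made up for illustration.

```python
import socket


def zookeeper_is_ok(host="localhost", port=2181, timeout=2.0):
    """Send the `ruok` command; return True only if the reply is `imok`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            reply = b""
            # Read until the 4-byte answer is complete or the peer closes.
            while len(reply) < 4:
                chunk = conn.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
            return reply == b"imok"
    except OSError:
        # Connection refused, timed out, or reset: treat as "not OK".
        return False
```

If this returns False before a `bin/grid stop kafka`, that would be consistent
with the hang described above, where the broker keeps retrying its Zookeeper
connection instead of exiting.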

This is not really an urgent issue - I was just curious - what's really
happening?

On Fri, Feb 20, 2015 at 8:47 AM, Chris Riccomini <cr...@apache.org>
wrote:

> Hey Chinmay,
>
> It seems controlled.shutdown.enable=true is the default. Chinmay, did you
> figure this out? I haven't seen this before, but I don't usually stop/start
> within 5s of each other.
>
> One thing that you might have a look at is whether the Kafka or ZK
> processes are living past bin/grid stop all. I have seen procs (NM and
> Kafka usually) continue to be alive after `stop all` is executed. I think
> this is because the stop scripts send SIGTERM and return immediately. This
> allows procs to do a cleaner shutdown. But if you stop/start quickly, you
> might get some weirdness there. Try jps'ing in between the stop/start, and
> check to make sure there's nothing still alive (wait in a loop until
> everything shuts down cleanly, and kill -9 if it takes more than 60s, or
> something).
>
> Cheers,
> Chris



-- 
Thanks and regards

Chinmay Soman

Re: Question on hello-samza (Kafka startup and shutdown)

Posted by Chris Riccomini <cr...@apache.org>.

Hey Chinmay,

It seems controlled.shutdown.enable=true is the default. Chinmay, did you
figure this out? I haven't seen this before, but I don't usually stop/start
within 5s of each other.

One thing that you might have a look at is whether the Kafka or ZK
processes are living past bin/grid stop all. I have seen procs (NM and
Kafka usually) continue to be alive after `stop all` is executed. I think
this is because the stop scripts send SIGTERM and return immediately. This
allows procs to do a cleaner shutdown. But if you stop/start quickly, you
might get some weirdness there. Try jps'ing in between the stop/start, and
check to make sure there's nothing still alive (wait in a loop until
everything shuts down cleanly, and kill -9 if it takes more than 60s, or
something).
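
The wait loop described above might look roughly like this. It is a sketch
under assumptions: the helper names are made up, and the JVM names you pass in
(for example "Kafka", "QuorumPeerMain", "NodeManager") should be checked
against what `jps` actually prints on your machine.

```python
import subprocess
import time


def parse_jps(output):
    """Map pid -> main class name from `jps` lines such as '1234 Kafka'."""
    procs = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0].isdigit():
            procs[int(parts[0])] = parts[1].strip() if len(parts) > 1 else ""
    return procs


def list_jvms():
    """Run the JDK's `jps` tool and parse what it prints."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return parse_jps(result.stdout)


def wait_for_exit(names, list_procs=list_jvms, timeout=60.0, poll=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until no JVM named in `names` is left, or until the timeout.

    Returns {pid: name} for anything still alive; empty means a clean stop.
    """
    deadline = clock() + timeout
    while True:
        alive = {pid: name for pid, name in list_procs().items() if name in names}
        if not alive or clock() >= deadline:
            return alive
        sleep(poll)
```

Any pid still returned after the 60s timeout is a candidate for
`os.kill(pid, signal.SIGKILL)`, which is the `kill -9` fallback mentioned
above.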

Cheers,
Chris

On Thu, Feb 19, 2015 at 2:01 PM, Neha Narkhede <ne...@gmail.com>
wrote:

> Depending on the version of Kafka you're on, "controlled.shutdown.enable"
> should be set to true. If that's true and you always shut down the broker
> cleanly (kill -15, not kill -9) and there is more than one replica
> available, you should not see LeaderNotAvailable exceptions. If you kill
> the broker (kill -9), then Kafka does not get a chance to move the leaders
> away from the broker being shut down, and the leader re-election can take
> some time, leading to many LeaderNotAvailable exceptions.
>
> You can verify the replica availability as well as leader movement through
> the kafka-topics command before shutting down zookeeper.
>
> Thanks
> Neha

Re: Question on hello-samza (Kafka startup and shutdown)

Posted by Neha Narkhede <ne...@gmail.com>.

Depending on the version of Kafka you're on, "controlled.shutdown.enable"
should be set to true. If that's true and you always shut down the broker
cleanly (kill -15, not kill -9) and there is more than one replica
available, you should not see LeaderNotAvailable exceptions. If you kill
the broker (kill -9), then Kafka does not get a chance to move the leaders
away from the broker being shut down, and the leader re-election can take
some time, leading to many LeaderNotAvailable exceptions.

You can verify the replica availability as well as leader movement through
the kafka-topics command before shutting down zookeeper.
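
As a rough aid for that check, the output of `kafka-topics --describe` can be
scanned for partitions that currently have no leader. This is a sketch, not a
Kafka API: the line layout in the comment is an assumption about what the tool
prints (it can differ between versions), "example-topic" is a hypothetical
topic name, and the function name is made up.

```python
import re

# Assumed shape of a partition line printed by `kafka-topics --describe`:
#   Topic: example-topic  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
_PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>\S+)\s+"
    r"Isr:\s*(?P<isr>\S*)"
)


def partitions_without_leader(describe_output):
    """Return (topic, partition) pairs whose leader is shown as -1 or none."""
    missing = []
    for line in describe_output.splitlines():
        match = _PARTITION_LINE.search(line)
        if match and match.group("leader").lower() in ("-1", "none"):
            missing.append((match.group("topic"), int(match.group("partition"))))
    return missing
```

An empty result before shutting down zookeeper would suggest that every
partition still has a leader.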

Thanks
Neha

On Thu, Feb 19, 2015 at 10:51 AM, Felix GV <fv...@linkedin.com.invalid>
wrote:

> I'm not 100% sure, but I think this happens when ZK ephemeral znodes have
> not had time to expire properly. When Kafka shuts down gracefully, it
> should clean up its ephemeral nodes immediately (presumably, but that is
> also an assumption... maybe it does have a shortcoming in its graceful
> shutdown logic). If Kafka gets killed improperly and bounced back up right
> away, it cannot assume leadership properly because the ephemeral znodes of
> the previous run are still there in ZK.
>
> I imagine Kafka could have some logic to deal with that better when it
> gets fast-bounced... Alternatively, you may just have to wait a bit before
> restarting Kafka after killing it.
>
> If anyone knows better, please correct me if I'm wrong.
>
> --
>
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
>
> fgv@linkedin.com
> linkedin.com/in/felixgv

RE: Question on hello-samza (Kafka startup and shutdown)

Posted by Felix GV <fv...@linkedin.com.INVALID>.

I'm not 100% sure, but I think this happens when ZK ephemeral znodes have not had time to expire properly. When Kafka shuts down gracefully, it should clean up its ephemeral nodes immediately (presumably, but that is also an assumption... maybe it does have a shortcoming in its graceful shutdown logic). If Kafka gets killed improperly and bounced back up right away, it cannot assume leadership properly because the ephemeral znodes of the previous run are still there in ZK.

I imagine Kafka could have some logic to deal with that better when it gets fast-bounced... Alternatively, you may just have to wait a bit before restarting Kafka after killing it.
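
The "wait a bit" option could be sketched as a poll on the broker registration
znodes, which Kafka creates as ephemeral nodes under /brokers/ids/<broker id>.
This is a hedged illustration only: the function name, the timeout values, and
the `exists` callable are assumptions, and whether stale znodes are really the
cause here is unconfirmed.

```python
import time


def wait_for_broker_znodes_gone(exists, broker_ids, timeout=30.0, poll=1.0,
                                sleep=time.sleep, clock=time.monotonic):
    """Poll until none of /brokers/ids/<id> exists, or until the timeout.

    `exists(path)` is any callable that returns a truthy value while the
    znode is present. Returns the paths that were still present at the end.
    """
    paths = ["/brokers/ids/%s" % broker_id for broker_id in broker_ids]
    deadline = clock() + timeout
    while True:
        remaining = [path for path in paths if exists(path)]
        if not remaining or clock() >= deadline:
            return remaining
        sleep(poll)
```

With a ZooKeeper client such as kazoo, `exists` could be the client's own
`exists` method; an empty return value would mean the previous run's znodes
have expired and a restart should not collide with them.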

If anyone knows better, please correct me if I'm wrong.

--

Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

fgv@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Chinmay Soman [chinmay.cerebro@gmail.com]
Sent: Thursday, February 19, 2015 10:44 AM
To: dev@samza.apache.org
Subject: Question on hello-samza (Kafka startup and shutdown)

Sending to a wider audience to know if anyone is also seeing this issue.

It seems Kafka gets into a weird state every time I do bin/grid stop all (and
then start all).

I keep getting a LeaderNotAvailable exception on the producer side. It
seems this happens every time Kafka hasn't been shut down properly. This
issue goes away if I use the following sequence:

* bin/grid stop kafka
* bin/grid stop zookeeper (after like 5 seconds).

(and then start everything).

Has anyone else seen this?

--
Thanks and regards

Chinmay Soman