Posted to users@kafka.apache.org by Karol Nowak <gr...@gmail.com> on 2014/12/01 15:31:06 UTC

Failed partition reassignment

Hi,

I observed some error messages / exceptions while running a partition
reassignment on a Kafka 0.8.1.1 cluster. Being fairly new to this system, I'm
not sure whether these indicate serious failures or transient problems, or
whether manual intervention is needed.

I used kafka-reassign-partitions.sh to reassign partitions from brokers
{143,155,155,93} to {143,155,115,68} on a healthy (?) cluster. Right now
one partition has just two replicas in the ISR, and a number of partitions
are left with 4 replicas in the ISR even though the replication factor is 3.
Logs show a few ZooKeeper timeouts, but there were no GC pauses anywhere near
the session timeout. ZooKeeper itself seems healthy and not overloaded,
with the exception of regular CPU spikes, probably related to snapshots.
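
To make the symptom above easier to track over time, here is a minimal sketch
that flags partitions whose ISR size differs from the expected replication
factor. It assumes the text comes from `kafka-topics.sh --describe` and that
partition lines look like `Topic: t  Partition: 0  Leader: 1  Replicas: 1,2,3
Isr: 1,2,3`; the function name and the tuple layout are made up for
illustration and are not part of any Kafka tool.

```python
import re

# Assumed layout of one partition line in `kafka-topics.sh --describe` output.
# The per-topic summary line (PartitionCount:...) deliberately does not match.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_ids(csv):
    """Turn '143,155,115' into [143, 155, 115]; an empty string gives []."""
    return [int(x) for x in csv.split(",") if x]


def find_isr_anomalies(describe_output, replication_factor):
    """Return (topic, partition, replicas, isr) for every partition whose
    ISR size is not equal to the expected replication factor."""
    anomalies = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # summary lines, blank lines, anything unrecognized
        replicas = parse_ids(match.group("replicas"))
        isr = parse_ids(match.group("isr"))
        if len(isr) != replication_factor:
            anomalies.append(
                (match.group("topic"), int(match.group("partition")), replicas, isr)
            )
    return anomalies
```

A partition reported with more ISR members than the replication factor usually
also shows the extra broker in its replica list, which would be consistent with
a reassignment that has not finished yet, although that is only a guess from
the symptoms described here.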

I cleaned the log lines a little bit for brevity.

First example: https://gist.github.com/knowak/a682afc1545fdeb836a1
Second one with two similar stack traces:
https://gist.github.com/knowak/6398be433d869d8141e5
Third one, many many of these:
https://gist.github.com/knowak/e78301259b74841702ae
Fourth: https://gist.github.com/knowak/1fbde5ca90d8f1924141
Fifth: https://gist.github.com/knowak/57fdcb75b3dc7c626893

Hints?


Thanks,
Karol

Re: Failed partition reassignment

Posted by Jun Rao <ju...@gmail.com>.

You can do the following: (1) check whether there are any errors in the
controller log and the state-change log, and (2) use the per-partition offset
lag JMX metric on the follower to see whether it is making good progress.
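
For step (1), a rough sketch of pulling error entries (with their stack
traces) out of a broker log. It assumes the log lines follow the pattern
`[timestamp] LEVEL message`, which is what the stock Kafka log4j configuration
produces as far as I know; the function name and the dictionary keys are
invented for this example.

```python
import re

# Assumed shape of the first line of a log entry: "[<timestamp>] <LEVEL> <message>".
ENTRY_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def error_entries(lines, levels=("ERROR", "FATAL")):
    """Collect log entries at the given levels, keeping continuation lines
    (typically stack traces) attached to the entry they follow."""
    entries = []
    current = None  # entry currently collecting continuation lines, if any
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_RE.match(line)
        if match:
            current = None  # a new entry starts, stop collecting the old one
            if match.group("level") in levels:
                current = {
                    "ts": match.group("ts"),
                    "level": match.group("level"),
                    "text": [match.group("msg")],
                }
                entries.append(current)
        elif current is not None:
            current["text"].append(line)
    return entries
```

The function accepts any iterable of lines, so an open file handle for the
controller log or the state-change log (usually named controller.log and
state-change.log in the broker's log directory) can be passed in directly.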

Thanks,

Jun

On Tue, Dec 2, 2014 at 3:13 PM, Karol Nowak <gr...@gmail.com> wrote:

> I don't have it reproduced in a sandbox environment, but it's already
> happened twice on that cluster, so it's a safe bet to say it's reproducible
> in that setup. Are there special metrics / events that I should capture to
> make debugging this easier?
>
>
> Thanks,
> Karol

Re: Failed partition reassignment

Posted by Karol Nowak <gr...@gmail.com>.

I haven't reproduced it in a sandbox environment, but it has already
happened twice on that cluster, so it's a safe bet that it's reproducible
in that setup. Are there specific metrics / events that I should capture to
make debugging this easier?


Thanks,
Karol

On Tue, Dec 2, 2014 at 11:20 PM, Jun Rao <ju...@gmail.com> wrote:

> Is there an easy way to reproduce the issues that you saw?
>
> Thanks,
>
> Jun



-- 
Regards
Karol Nowak
http://knowak.wordpress.com

Re: Failed partition reassignment

Posted by Jun Rao <ju...@gmail.com>.

Is there an easy way to reproduce the issues that you saw?

Thanks,

Jun
