Posted to users@kafka.apache.org by Tom Raney <to...@urbanairship.com> on 2017/01/27 18:16:05 UTC

stuck re-balance

After adding a new Kafka node, I ran the kafka-reassign-partitions.sh tool
to redistribute topics onto the new machine. Some of the migrations seemed
stuck for over 24 hours, so I cancelled the reassignment by deleting the
ZooKeeper node (/admin/reassign_partitions) and ran
kafka-preferred-replica-election.sh to try to resolve it. It didn't work.

Now some partitions are in a weird state. For example, one partition still
lists broker 1003 as a replica, but it shouldn't be there. The partition
directory on 1003 is still growing, but it is far behind the leader (1004)
and the other in-sync replica on 1001.

Topic: foo Partition: 2 Leader: 1004 Replicas: 1003,1004,1001 Isr: 1004,1001
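A quick way to spot replicas like 1003 is to compare each partition's
replica list with its ISR. Below is a minimal Python sketch, assuming the
describe output has the "Topic: ... Partition: ... Leader: ... Replicas: ...
Isr: ..." layout shown above (fields separated by spaces or tabs). The
function names are hypothetical helpers, not part of Kafka's tooling.

```python
from __future__ import annotations

import re

# One partition line of topic-describe output, e.g.
# "Topic: foo Partition: 2 Leader: 1004 Replicas: 1003,1004,1001 Isr: 1004,1001"
# (\s+ accepts either spaces or tabs between fields).
DESCRIBE_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _broker_ids(field: str) -> list[int]:
    """Turn a comma-separated broker list such as "1004,1001" into ints."""
    return [int(x) for x in field.split(",") if x]


def parse_describe_line(line: str) -> dict:
    """Parse one partition line into topic, partition, leader, replicas and isr."""
    m = DESCRIBE_LINE.search(line)
    if m is None:
        raise ValueError(f"unrecognized describe line: {line!r}")
    return {
        "topic": m.group("topic"),
        "partition": int(m.group("partition")),
        "leader": int(m.group("leader")),
        "replicas": _broker_ids(m.group("replicas")),
        "isr": _broker_ids(m.group("isr")),
    }


def out_of_sync_replicas(state: dict) -> list[int]:
    """Assigned replicas that are missing from the ISR, in assignment order."""
    isr = set(state["isr"])
    return [r for r in state["replicas"] if r not in isr]
```

For the partition line above, out_of_sync_replicas(parse_describe_line(line))
returns [1003].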

When I force a leader election for that partition, it fails because 1003
is not in sync.

kafka.common.StateChangeFailedException: encountered error while electing
leader for partition [foo,2] due to: Preferred replica 1003 for partition
[foo,2] is either not alive or not in the isr. Current leader and ISR:
[{"leader":1004,"leader_epoch":11,"isr":[1004,1001]}].
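The error spells out the precondition for a preferred-replica election: the
preferred replica, which is the first broker in the assigned replica list
(1003 here), has to be alive and in the ISR. A small sketch of that check
follows; it only illustrates the condition stated in the error message, it is
not the controller's actual code, and the function name is hypothetical.

```python
from __future__ import annotations


def preferred_election_blocker(
    replicas: list[int], isr: list[int], live_brokers: set[int]
) -> str | None:
    """Return None if the preferred replica could take leadership, else why not.

    The preferred replica is the first broker in the assigned replica list.
    As in the error message, it must be alive and in the ISR.
    """
    preferred = replicas[0]
    if preferred not in live_brokers:
        return f"preferred replica {preferred} is not alive"
    if preferred not in isr:
        return f"preferred replica {preferred} is not in the ISR"
    return None
```

With the state of [foo,2] (replicas 1003,1004,1001 and ISR 1004,1001) the
check reports that 1003 is not in the ISR, which is why forcing the election
cannot succeed until 1003 catches up or is dropped from the assignment.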

When I try to reassign the partition with this config:

{"version":1,"partitions":[{"topic":"foo","partition":2,"replicas":[1004,1001]}]}

the reassignment never completes:

Status of partition reassignment:
Reassignment of partition [foo,2] is still in progress

I would expect it to complete, since 1001 is already in the ISR and the
leader is already 1004.
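That expectation can be made concrete. The sketch below rebuilds the same
version-1 reassignment document and checks whether every target replica is
already in the ISR, assuming that is the condition a healthy controller waits
for before it finishes a reassignment and drops replicas outside the target
list. Both helpers are hypothetical and are not part of
kafka-reassign-partitions.sh.

```python
from __future__ import annotations

import json


def reassignment_json(targets: dict[tuple[str, int], list[int]]) -> str:
    """Build a version-1 reassignment document in the compact form shown above."""
    partitions = [
        {"topic": topic, "partition": partition, "replicas": replicas}
        for (topic, partition), replicas in sorted(targets.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions}, separators=(",", ":"))


def target_caught_up(target: list[int], isr: list[int]) -> bool:
    """True when every broker in the target assignment is already in the ISR.

    ASSUMPTION: this is the condition a healthy controller waits for before it
    completes a reassignment and removes replicas outside the target list.
    """
    return set(target) <= set(isr)
```

For [foo,2], target_caught_up([1004, 1001], [1004, 1001]) is True, so a
status of "still in progress" suggests the controller is not acting on the
new request rather than that replication is lagging.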

How do I resolve this?

Re: stuck re-balance

Posted by Tom Raney <to...@urbanairship.com>.

Thanks, Todd!  Deleting the /controller znode worked.

On Fri, Jan 27, 2017 at 10:20 AM, Todd Palino <tp...@gmail.com> wrote:

> Did you move the controller (by deleting the /controller znode) after
> removing the reassign_partitions znode? If not, the controller is probably
> still trying to do that move, and is not going to accept a new move
> request.

Re: stuck re-balance

Posted by Todd Palino <tp...@gmail.com>.

Did you move the controller (by deleting the /controller znode) after
removing the reassign_partitions znode? If not, the controller is probably
still trying to do that move, and is not going to accept a new move request.
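The sequence described here (remove the pending reassignment, then move the
controller) can be sketched in Python against ZooKeeper. This assumes a
client object with kazoo-style get/exists/delete methods, for example a
started kazoo KazooClient pointed at the cluster's ZooKeeper (hosts string
not given here), and assumes the /controller znode holds JSON with a
"brokerid" field. It further assumes the newly elected controller only
resumes reassignments still listed under /admin/reassign_partitions. The
function names are hypothetical.

```python
import json

REASSIGN_PATH = "/admin/reassign_partitions"
CONTROLLER_PATH = "/controller"


def current_controller(zk) -> int:
    """Read the active controller's broker id from the /controller znode.

    ASSUMPTION: the znode data is JSON containing a "brokerid" field.
    """
    data, _stat = zk.get(CONTROLLER_PATH)
    return json.loads(data.decode("utf-8"))["brokerid"]


def cancel_reassignment_and_move_controller(zk) -> int:
    """Delete any pending reassignment, then force a controller re-election.

    Returns the broker id that was controller before the move.
    """
    old = current_controller(zk)
    # Step 1: drop the pending reassignment request, if it is still there.
    if zk.exists(REASSIGN_PATH):
        zk.delete(REASSIGN_PATH)
    # Step 2: delete the ephemeral /controller znode; live brokers race to
    # recreate it and the winner becomes the new controller. This does not
    # choose which broker wins.
    zk.delete(CONTROLLER_PATH)
    return old
```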

-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming

linkedin.com/in/toddpalino