Posted to dev@kafka.apache.org by Maysam Yabandeh <my...@dropbox.com> on 2016/05/12 00:24:02 UTC

Avoid recreating reassign partition path in zk if it is already deleted

Hi

I am wondering if it makes sense to remove
{code}
          case nne: ZkNoNodeException =>
            createPersistentPath(zkPath, jsonData)
            debug("Created path %s with %s for partition reassignment".format(zkPath, jsonData))
{code}
from ZKUtils::updatePartitionReassignmentData, which has caused an incident
for us.

The code does not seem to do anything in the normal case: if the reassign
path does not exist when removePartitionFromReassignedPartitions starts,
there is nothing to write back to zk anyway. The only time the code kicks
in is when the admin manually deletes the zk path in the middle of an
update, in which case recreating the path essentially cancels the admin's
attempt to stop a bad partition assignment.

The incident in our case involved a very large json file that an admin
mistakenly used for partition assignment. The controller zk thread was in a
busy loop removing partitions from this json data stored in zk, one by one.
We attempted to stop the assignment by i) removing the zk path and ii)
changing the controller. However, because of the many zk update operations
issued by the active controller, the path was recreated over and over.
Changing the controller did not help either, since the new controller
resumes the badly started reassignment job by picking it up from zk.

Simply removing createPersistentPath from the catch clause should avoid such
problems, and it does not seem to change the intended semantics of
removePartitionFromReassignedPartitions.
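
To make the race concrete, here is a minimal sketch that contrasts the two
behaviours. It is not the actual Kafka code: FakeZk, NoNodeException,
updateWithRecreate, updateWithoutRecreate and pathSurvivesAdminDelete are
hypothetical names, and an in-memory map stands in for zk. The path
/admin/reassign_partitions is where Kafka keeps the reassignment json.

```scala
import scala.collection.mutable

// Hypothetical stand-in for ZkNoNodeException.
final class NoNodeException(path: String)
  extends RuntimeException(s"no node: $path")

// Hypothetical in-memory stand-in for zk; not the real ZkClient API.
final class FakeZk {
  private val nodes = mutable.Map.empty[String, String]
  def create(path: String, data: String): Unit = nodes(path) = data
  def update(path: String, data: String): Unit =
    if (nodes.contains(path)) nodes(path) = data
    else throw new NoNodeException(path)
  def delete(path: String): Unit = { nodes.remove(path); () }
  def exists(path: String): Boolean = nodes.contains(path)
}

object ReassignSketch {
  val ReassignPath = "/admin/reassign_partitions"

  // Behaviour described above: a failed update recreates the path.
  def updateWithRecreate(zk: FakeZk, json: String): Unit =
    try zk.update(ReassignPath, json)
    catch { case _: NoNodeException => zk.create(ReassignPath, json) }

  // Proposed behaviour: a missing path is left alone.
  def updateWithoutRecreate(zk: FakeZk, json: String): Unit =
    try zk.update(ReassignPath, json)
    catch { case _: NoNodeException => () }

  // Simulates: reassignment starts, admin deletes the path, and the
  // controller then writes back the remaining partitions.
  def pathSurvivesAdminDelete(update: (FakeZk, String) => Unit): Boolean = {
    val zk = new FakeZk
    zk.create(ReassignPath, "[p0,p1,p2]")
    zk.delete(ReassignPath)
    update(zk, "[p1,p2]")
    zk.exists(ReassignPath)
  }
}
```

In this model, pathSurvivesAdminDelete(updateWithRecreate) is true (the
deleted path comes back), while pathSurvivesAdminDelete(updateWithoutRecreate)
is false (the admin's delete sticks).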

Thoughts?

Thanks
Maysam

Re: Avoid recreating reassign partition path in zk if it is already deleted

Posted by Maysam Yabandeh <my...@dropbox.com>.
For the benefit of future readers, a simple workaround for this issue
is to:
# change the controller to a non-existing broker,
# delete the current assignment from zk, and
# then change the controller to an existing broker

Maysam
