Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2020/11/05 23:00:00 UTC
[jira] [Commented] (KAFKA-10678) Re-deploying Streams app causes rebalance and task migration

    [ https://issues.apache.org/jira/browse/KAFKA-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227067#comment-17227067 ] 

A. Sophie Blee-Goldman commented on KAFKA-10678:
------------------------------------------------

Thanks for opening a separate ticket for this. There seem to be two main problems/unanswered questions here:

1) Why was there a rebalance at all if static membership was enabled?
2) Why did the rebalance result in a large shuffling of tasks?

For 1) it's difficult to say with only the broker side logs, since they won't tell us _why_ the client triggered a new rebalance after it was bounced. Would it be possible to collect logs from the client covering the period immediately after it was bounced, when it apparently tried to trigger a rebalance?

I was discussing question 2) with [~cadonna] and it seems to be a combination of a few things: first, the "eventual" assignment is currently performed without regard to the previous placement of tasks. It just tries to distribute tasks as evenly as possible, using intermediate assignments and probing rebalances as needed. [~vvcephei] wrote up some thoughts on this in KAFKA-10121. We're aware of this limitation but haven't addressed it since the assignor is deterministic and therefore no-op group changes – such as an existing member being bounced – shouldn't result in a different eventual assignment than the stable one pre-bounce.

Unfortunately this assignment identifies clients based on the encoded processId, which is actually randomly generated during StreamThread startup. So the processId identifier would change after a bounce, meaning different initial conditions to the assignor function and therefore a different final result :/
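
To make that concrete, here is a rough sketch of the effect. It is not the actual Streams assignor: {{assign}} is a hypothetical stand-in that just sorts the client ids and hands out tasks round-robin, and the ids "aaa", "bbb" and "ccc" are made-up placeholders for randomly generated processIds. The point is only that a deterministic function over the ids gives a different answer once one id changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AssignmentSketch {

    // Hypothetical stand-in for a deterministic assignor (NOT the real
    // Streams logic): sort the client ids, then deal tasks out round-robin.
    static Map<String, List<Integer>> assign(List<String> processIds, int numTasks) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String id : processIds) {
            assignment.put(id, new ArrayList<>());
        }
        List<String> sorted = new ArrayList<>(assignment.keySet());
        if (sorted.isEmpty()) {
            return assignment; // nothing to assign to
        }
        for (int task = 0; task < numTasks; task++) {
            assignment.get(sorted.get(task % sorted.size())).add(task);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Before the bounce: instance A has id "aaa", instance B has id "bbb".
        System.out.println("before: " + assign(List.of("aaa", "bbb"), 4));

        // After the bounce: instance A comes back with a fresh id ("ccc"),
        // instance B is unchanged. Same members, same inputs otherwise.
        System.out.println("after:  " + assign(List.of("ccc", "bbb"), 4));
    }
}
```

In this toy setup instance B holds tasks 1 and 3 before the bounce and tasks 0 and 2 afterwards, even though B itself never restarted. The group is the same size and the function is the same; only the identifier of the bounced member moved in the sort order.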

I think that if the shuffling of tasks weren't so bad, a rebalance would hardly be noticeable even if you still got one despite static membership (given that the app can continue to actively process during a cooperative rebalance). We could probably improve the majority of cases just by fixing the processId thing, but I feel like we might as well skip that and go straight to implementing KAFKA-10121, which would improve it for all cases.

> Re-deploying Streams app causes rebalance and task migration
> ------------------------------------------------------------
>
>                 Key: KAFKA-10678
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10678
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.0, 2.6.1
>            Reporter: Bradley Peterson
>            Priority: Major
>         Attachments: after, before, broker
>
>
> Re-deploying our Streams app causes a rebalance, even when using static group membership. Worse, the rebalance creates standby tasks, even when the previous task assignment was balanced and stable.
> Our app is currently using Streams 2.6.1-SNAPSHOT (due to [KAFKA-10633]) but we saw the same behavior in 2.6.0. The app runs on 4 EC2 instances, each with 4 streams threads, and data stored on persistent EBS volumes. During a redeploy, all EC2 instances are stopped, new instances are launched, and the EBS volumes are attached to the new instances. We do not use interactive queries. {{session.timeout.ms}} is set to 30 minutes, and the deployment finishes well under that. {{num.standby.replicas}} is 0.
> h2. Expected Behavior
> Given a stable and balanced task assignment prior to deploying, we expect to see the same task assignment after deploying. Even if a rebalance is triggered, we do not expect to see new standby tasks.
> h2. Observed Behavior
> Attached are the "Assigned tasks to clients" log lines from before and after deploying. The "before" is from over 24 hours ago; the task assignment is well balanced and "Finished stable assignment of tasks, no followup rebalances required." is logged. The "after" log lines show the same assignment of active tasks, but some additional standby tasks. There are additional log lines about adding and removing active tasks, which I don't quite understand.
> I've also included logs from the broker showing the rebalance was triggered for "Updating metadata".


