Posted to dev@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2019/12/30 22:20:00 UTC
[jira] [Resolved] (KAFKA-1342) Slow controlled shutdowns can result
in stale shutdown requests
[ https://issues.apache.org/jira/browse/KAFKA-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson resolved KAFKA-1342.
------------------------------------
Resolution: Fixed
> Slow controlled shutdowns can result in stale shutdown requests
> ---------------------------------------------------------------
>
> Key: KAFKA-1342
> URL: https://issues.apache.org/jira/browse/KAFKA-1342
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Joel Jacob Koshy
> Assignee: Joel Jacob Koshy
> Priority: Critical
> Labels: newbie++, reliability
>
> I don't think this is a bug introduced in 0.8.1, but it is triggered by the
> fact that controlled shutdown seems to have become slower in 0.8.1 (will
> file a separate ticket to investigate that). When doing a rolling bounce,
> it is possible for a bounced broker to stop all its replica fetchers
> because the previous PID's shutdown requests are still being processed.
> - 515 is the controller
> - Controlled shutdown initiated for 503
> - Controller starts controlled shutdown for 503
> - The controlled shutdown takes a long time moving leaders and moving
> follower replicas on 503 to the offline state.
> - So 503's read from the shutdown channel times out and a new channel is
> created. It issues another shutdown request. This request (since it is on a
> new channel) is accepted at the controller's socket server but then waits
> on the broker shutdown lock held by the previous controlled shutdown,
> which is still in progress.
> - The above step repeats for the remaining retries (six more requests).
> - 503 hits SocketTimeout exception on reading the response of the last
> shutdown request and proceeds to do an unclean shutdown.
> - The controller's onBrokerFailure call-back fires and moves 503's replicas
> to offline (not too important in this sequence).
> - 503 is brought back up.
> - The controller's onBrokerStartup call-back fires and moves its replicas
> (and partitions) to online state. 503 starts its replica fetchers.
> - Unfortunately, the (phantom) shutdown requests are still being handled and
> the controller sends StopReplica requests to 503.
> - The first shutdown request finally finishes (after 76 minutes in my case!).
> - The remaining shutdown requests also execute and do the same thing (send
> StopReplica requests for all partitions to 503).
> - The remaining requests complete quickly because they end up not having to
> touch ZooKeeper paths - there are no leaders left on the broker and no
> need to shrink the ISR in ZooKeeper, since that has already been done by
> the first shutdown request.
> - So in the end-state 503 is up, but effectively idle due to the previous
> PID's shutdown requests.
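The failure mode in the sequence above can be sketched in a few lines of Python. This is an illustrative simulation only (no Kafka APIs, all names are made up for the sketch): a broker that re-sends its shutdown request on a fresh channel after every read timeout, while the controller's socket server accepts each copy and queues it behind a lock, accumulating phantom requests it will process after the broker has already restarted.

```python
# Illustrative sketch (not Kafka code): each client-side read timeout tears
# down the channel and opens a new one, so the server accepts another copy of
# the same controlled-shutdown request. The server serializes them all behind
# the broker-shutdown lock, so the duplicates stay queued.

import queue

server_queue = queue.Queue()  # requests accepted by the controller's socket server

def send_shutdown_with_retries(broker_id, retries):
    for attempt in range(retries):
        # Every attempt's read times out (the first request still holds the
        # broker-shutdown lock), so a fresh copy is accepted on a new channel.
        server_queue.put(("controlled-shutdown", broker_id, attempt))
    # After the final timeout the broker gives up and shuts down uncleanly.

# The original request plus six retries, per the sequence above.
send_shutdown_with_retries(503, retries=7)

# All seven copies remain queued and will execute after 503 comes back up,
# sending StopReplica requests to the now-live broker.
assert server_queue.qsize() == 7
```

The sketch shows why simply retrying on timeout is not enough: nothing ties a queued request to the incarnation of the broker that sent it.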
> There are some obvious fixes that can be made to controlled shutdown to
> help address the above issue. E.g., we don't really need to move follower
> partitions to Offline. We did that as an "optimization" so the broker
> falls out of ISR sooner - which is helpful when producers set
> required.acks to -1. However, it adds a lot of latency to controlled
> shutdown. Also (more importantly), we should have a mechanism to abort
> any stale shutdown process.
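One way to realize the "abort any stale shutdown process" idea is epoch fencing: the controller tracks the latest incarnation of each broker and rejects shutdown requests tagged with an older one. The sketch below is a minimal, hypothetical illustration of that idea (the class, method, and error names are invented for this example and are not Kafka's actual API):

```python
# Hypothetical sketch of epoch fencing for controlled shutdown. The
# controller records the latest known epoch per broker; a shutdown request
# carrying an older epoch came from a previous incarnation and is aborted
# instead of stopping the live broker's replicas.

class Controller:
    def __init__(self):
        self.broker_epoch = {}      # broker id -> latest known epoch
        self.stopped = []           # brokers whose replicas we actually stopped

    def on_broker_startup(self, broker_id, epoch):
        # A restarted broker registers with a new, strictly higher epoch.
        self.broker_epoch[broker_id] = epoch

    def handle_controlled_shutdown(self, broker_id, epoch):
        # Fence stale requests from an older incarnation of the broker.
        if epoch < self.broker_epoch.get(broker_id, epoch):
            return "STALE_BROKER_EPOCH"
        self.stopped.append(broker_id)
        return "OK"

controller = Controller()
controller.on_broker_startup(503, epoch=1)   # first incarnation
# 503's phantom shutdown request (epoch 1) is queued, then 503 is bounced
# and re-registers with a new epoch.
controller.on_broker_startup(503, epoch=2)
# The phantom request is now rejected instead of idling the live broker...
assert controller.handle_controlled_shutdown(503, epoch=1) == "STALE_BROKER_EPOCH"
# ...while a request from the live incarnation is honored.
assert controller.handle_controlled_shutdown(503, epoch=2) == "OK"
```

With fencing in place, the 76-minute phantom request in the sequence above would be discarded the moment it finally reaches the shutdown logic, because the broker's registered epoch has moved on.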
--
This message was sent by Atlassian Jira
(v8.3.4#803005)