Posted to dev@kafka.apache.org by "Brian Sung-jin Hong (JIRA)" <ji...@apache.org> on 2015/09/23 10:55:04 UTC

[jira] [Commented] (KAFKA-2459) Connection backoff/blackout period should start when a connection is disconnected, not when the connection attempt was initiated

    [ https://issues.apache.org/jira/browse/KAFKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904194#comment-14904194 ] 

Brian Sung-jin Hong commented on KAFKA-2459:
--------------------------------------------

What is the status of this issue? We run Kafka on AWS EC2 and hit this problem whenever a Kafka instance is terminated.

I tried to fix this myself, but from reading the code, it looks like it needs a lot of refactoring to work. If no one is working on this issue, could anyone give me some guidance on how to proceed?

> Connection backoff/blackout period should start when a connection is disconnected, not when the connection attempt was initiated
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2459
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, producer 
>    Affects Versions: 0.8.2.1
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Manikumar Reddy
>
> Currently the connection code for new clients marks the time when a connection was initiated (NodeConnectionState.lastConnectMs) and then uses this to compute blackout periods for nodes, during which connections will not be attempted and the node is not considered a candidate for leastLoadedNode.
> However, when a connection attempt takes longer than the blackout/backoff period (default 10ms), this results in incorrect behavior. If a broker is unavailable and does not explicitly reject the connection, so that the client instead waits for a connection timeout (e.g. due to firewall settings), the backoff period will already have elapsed by the time the attempt fails. The node is then immediately considered ready for a new connection attempt and eligible for selection by leastLoadedNode for metadata updates. I think it should be easy to reproduce and verify this problem manually by using tc to introduce enough latency to make connection failures take > 10ms.
> The correct behavior would use the disconnection event to mark the end of the last connection attempt and then wait for the backoff period to elapse after that.
> See http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E for the original description of the problem.
> This is related to KAFKA-1843 because leastLoadedNode currently will consistently choose the same node if this blackout period is not handled correctly, but is a much smaller issue.
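
To make the difference described in the issue concrete, here is a minimal sketch that compares a blackout measured from the start of the connection attempt with one measured from the disconnect. It is not the actual Kafka client code: the class ConnectionBackoffSketch, its methods, and the node id "broker-1" are all hypothetical names made up for illustration, and the 500 ms timeout is an assumed value. Only the 10 ms backoff default comes from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names and structure are assumptions,
// not the real Kafka client implementation.
class ConnectionBackoffSketch {
    private final long reconnectBackoffMs;
    // When each connection attempt was initiated (the idea behind lastConnectMs in the issue).
    private final Map<String, Long> lastAttemptMs = new HashMap<>();
    // When each node was last seen to disconnect (the reference point the issue proposes).
    private final Map<String, Long> lastDisconnectMs = new HashMap<>();

    ConnectionBackoffSketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Record the start of a connection attempt and clear any earlier disconnect time.
    void connecting(String nodeId, long nowMs) {
        lastAttemptMs.put(nodeId, nowMs);
        lastDisconnectMs.remove(nodeId);
    }

    // Record that the connection (or the attempt) ended.
    void disconnected(String nodeId, long nowMs) {
        lastDisconnectMs.put(nodeId, nowMs);
    }

    // Behavior described as current in the issue: blackout counted from attempt start.
    boolean isBlackedOutFromAttempt(String nodeId, long nowMs) {
        Long start = lastAttemptMs.get(nodeId);
        return start != null && nowMs - start < reconnectBackoffMs;
    }

    // Behavior the issue proposes: blackout counted from the disconnect event.
    boolean isBlackedOutFromDisconnect(String nodeId, long nowMs) {
        Long end = lastDisconnectMs.get(nodeId);
        return end != null && nowMs - end < reconnectBackoffMs;
    }

    public static void main(String[] args) {
        ConnectionBackoffSketch state = new ConnectionBackoffSketch(10); // 10 ms backoff
        state.connecting("broker-1", 0);
        state.disconnected("broker-1", 500); // assumed: attempt times out after 500 ms

        // 1 ms after the failure, the attempt-based check has long expired...
        System.out.println(state.isBlackedOutFromAttempt("broker-1", 501));    // false
        // ...while the disconnect-based check still holds the node back.
        System.out.println(state.isBlackedOutFromDisconnect("broker-1", 501)); // true
        // Once the full backoff has elapsed after the disconnect, the node is eligible again.
        System.out.println(state.isBlackedOutFromDisconnect("broker-1", 510)); // false
    }
}
```

In this sketch the attempt-based check lets the client retry the unreachable node immediately after every timeout, which matches the symptom reported above; the disconnect-based check enforces the full backoff after each failure.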


