Posted to dev@kafka.apache.org by "Neha Narkhede (JIRA)" <ji...@apache.org> on 2013/08/28 07:23:51 UTC

[jira] [Commented] (KAFKA-956) High-level consumer fails to check topic metadata response for errors

    [ https://issues.apache.org/jira/browse/KAFKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752116#comment-13752116 ] 

Neha Narkhede commented on KAFKA-956:
-------------------------------------

The root cause of this issue is that a broker that is not yet ready to serve requests ends up serving them anyway and misleading clients. When a broker starts up, it expects to receive an UpdateMetadata request from the controller. Until that happens, the broker should explicitly return a BrokerNotReady error code. Upon receiving this error code, the client should try to send the request to another broker. In the specific example of rebalance, the consumer will get the BrokerNotReady error code and will try fetching metadata from all the brokers at least once before giving up.

A similar problem exists on the producer side. If you do a rolling bounce of a Kafka cluster while several thousand producer clients are connected and auto-creation of topics is turned on, it creates a storm of topic metadata requests that turn into create-topic requests to the brokers. The brokers spend a lot of time trying to create topics since they don't yet know that the topics already exist.
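
The proposed client-side behavior could be sketched as follows. Note that BrokerNotReady is a proposed error code, not one that exists in 0.8, and the names below are illustrative rather than actual Kafka APIs:

```scala
// Hypothetical sketch: on a BrokerNotReady error, retry the metadata
// request against the next broker before giving up. The error code and
// all names here are illustrative, not part of the 0.8 client.
object MetadataRetrySketch {
  sealed trait MetadataResult
  case object BrokerNotReady extends MetadataResult            // proposed error code
  case class Metadata(partitions: Seq[Int]) extends MetadataResult

  // Try each broker in turn; stop at the first one that is ready.
  def fetchWithFailover(brokers: Seq[String],
                        fetch: String => MetadataResult): Option[Metadata] =
    brokers.iterator.map(fetch).collectFirst { case m: Metadata => m }
}
```

The key point is that a single BrokerNotReady response is not treated as fatal; the client only gives up after every broker has been tried at least once.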

You could argue that a broker that is not ready should not accept connections, and probably should not even start the socket server until it is ready to serve requests. But since the broker currently uses the same socket server to communicate with the controller, this is not an easy fix to put into 0.8.
                
> High-level consumer fails to check topic metadata response for errors
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-956
>                 URL: https://issues.apache.org/jira/browse/KAFKA-956
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.8
>            Reporter: Sam Meder
>            Assignee: Neha Narkhede
>            Priority: Blocker
>             Fix For: 0.8
>
>         Attachments: consumer_metadata_fetch.patch
>
>
> In our environment we noticed that consumers would sometimes hang when started too close to starting the Kafka server. I tracked this down and it seems to be related to some code in rebalance (ZookeeperConsumerConnector.scala). In particular the following code seems problematic:
>       val topicsMetadata = ClientUtils.fetchTopicMetadata(myTopicThreadIdsMap.keySet,
>                                                           brokers,
>                                                           config.clientId,
>                                                           config.socketTimeoutMs,
>                                                           correlationId.getAndIncrement).topicsMetadata
>       val partitionsPerTopicMap = new mutable.HashMap[String, Seq[Int]]
>       topicsMetadata.foreach(m => {
>         val topic = m.topic
>         val partitions = m.partitionsMetadata.map(m1 => m1.partitionId)
>         partitionsPerTopicMap.put(topic, partitions)
>       })
> The response is never checked for errors, so it may not actually contain any partition info! Rebalance goes on its merry way, but doesn't know about any partitions, so it never assigns them...
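
The missing check described above could look roughly like this. This is a hedged sketch, not the attached patch: the `errorCode` field mirrors Kafka 0.8's TopicMetadata, but the case classes and the `NoError` constant are modeled here purely for illustration:

```scala
// Illustrative stand-ins for Kafka 0.8's metadata types; real code would
// use kafka.api.TopicMetadata and kafka.common.ErrorMapping.NoError.
case class PartitionMetadata(partitionId: Int)
case class TopicMetadata(topic: String,
                         partitionsMetadata: Seq[PartitionMetadata],
                         errorCode: Short)

val NoError: Short = 0

// Build the partition map only from topics whose metadata came back clean,
// instead of blindly trusting every entry in the response.
def partitionsPerTopic(topicsMetadata: Seq[TopicMetadata]): Map[String, Seq[Int]] =
  topicsMetadata
    .filter(_.errorCode == NoError)   // skip topics whose metadata fetch failed
    .map(m => m.topic -> m.partitionsMetadata.map(_.partitionId))
    .toMap
```

With a check like this, a topic whose metadata fetch failed is simply absent from the map, so the rebalance logic can detect the gap and retry rather than silently assigning nothing.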

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira