Posted to dev@kafka.apache.org by "Ashish K Singh (JIRA)" <ji...@apache.org> on 2015/06/25 05:11:05 UTC

[jira] [Commented] (KAFKA-972) MetadataRequest returns stale list of brokers

    [ https://issues.apache.org/jira/browse/KAFKA-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600602#comment-14600602 ] 

Ashish K Singh commented on KAFKA-972:
--------------------------------------

Hey Guys,

I spent some time reproducing the issue and finding the root cause. Turns out KAFKA-1367 is not the issue here. Below is the problem and my suggested solution.

Problem: The list of alive brokers is not propagated to all brokers by the coordinator. When a broker starts, it writes to the ZK brokers path. The coordinator watches that path and notices the new broker. On noticing a new broker, the coordinator sends an UpdateMetadataRequest only to the broker that just started up. The other brokers in the cluster never learn that new brokers have joined.

Effect of KAFKA-1367: Once KAFKA-1367 goes in, correct alive-broker information will be propagated to all live brokers after an ISR change at any broker. However, if there are no topics/partitions, KAFKA-1367 will not help and this issue will remain.

Solution: Instead of sending the UpdateMetadataRequest only to the new broker, send it to all live brokers in the cluster.
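
To make the difference concrete, here is a toy model of the two behaviours. It is only a sketch, not the actual Kafka code: the Broker and Coordinator classes, the notify_all flag and the simulate helper are hypothetical names made up for illustration, and handle_update_metadata merely stands in for processing an UpdateMetadataRequest.

```python
# Toy model only: all names here are hypothetical, not Kafka's real classes.
class Broker:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.known_alive = set()  # this broker's cached view of alive brokers

    def handle_update_metadata(self, alive_ids):
        # Stand-in for processing an UpdateMetadataRequest.
        self.known_alive = set(alive_ids)


class Coordinator:
    def __init__(self, notify_all):
        self.notify_all = notify_all  # False = current behaviour, True = proposal
        self.live = {}  # broker id -> Broker

    def on_broker_startup(self, broker):
        # Called when the watch on the ZK brokers path reports a new broker.
        self.live[broker.broker_id] = broker
        targets = list(self.live.values()) if self.notify_all else [broker]
        for target in targets:
            target.handle_update_metadata(self.live.keys())


def simulate(notify_all):
    # Start broker 0, then broker 1, and report what each one believes is alive.
    coordinator = Coordinator(notify_all)
    brokers = [Broker(0), Broker(1)]
    for broker in brokers:
        coordinator.on_broker_startup(broker)
    return {b.broker_id: sorted(b.known_alive) for b in brokers}


print(simulate(notify_all=False))  # broker 0 never hears about broker 1
print(simulate(notify_all=True))   # both brokers see the full list
```

In this model, with notify_all=False broker 0 keeps the view it received at its own startup, which is the stale list described in the problem statement. With notify_all=True every live broker is refreshed on each startup, even when there are no topics or partitions.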

[~junrao], [~nehanarkhede], [~granthenke], [~gwenshap], [~charmalloc], [~jjkoshy] please provide your thoughts. I have a patch ready which I will post if you guys think this is indeed the correct approach.

> MetadataRequest returns stale list of brokers
> ---------------------------------------------
>
>                 Key: KAFKA-972
>                 URL: https://issues.apache.org/jira/browse/KAFKA-972
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: Vinicius Carvalho
>            Assignee: Ashish K Singh
>         Attachments: BrokerMetadataTest.scala
>
>
> When we issue a MetadataRequest to the cluster, the list of brokers returned is stale: even when a broker is down, it is still returned to the client. The following are two example invocations, the first with both brokers online and the second with one broker down:
> {
>     "brokers": [
>         {
>             "nodeId": 0,
>             "host": "10.139.245.106",
>             "port": 9092,
>             "byteLength": 24
>         },
>         {
>             "nodeId": 1,
>             "host": "localhost",
>             "port": 9093,
>             "byteLength": 19
>         }
>     ],
>     "topicMetadata": [
>         {
>             "topicErrorCode": 0,
>             "topicName": "foozbar",
>             "partitions": [
>                 {
>                     "replicas": [
>                         0
>                     ],
>                     "isr": [
>                         0
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 0,
>                     "leader": 0,
>                     "byteLength": 26
>                 },
>                 {
>                     "replicas": [
>                         1
>                     ],
>                     "isr": [
>                         1
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 1,
>                     "leader": 1,
>                     "byteLength": 26
>                 },
>                 {
>                     "replicas": [
>                         0
>                     ],
>                     "isr": [
>                         0
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 2,
>                     "leader": 0,
>                     "byteLength": 26
>                 },
>                 {
>                     "replicas": [
>                         1
>                     ],
>                     "isr": [
>                         1
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 3,
>                     "leader": 1,
>                     "byteLength": 26
>                 },
>                 {
>                     "replicas": [
>                         0
>                     ],
>                     "isr": [
>                         0
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 4,
>                     "leader": 0,
>                     "byteLength": 26
>                 }
>             ],
>             "byteLength": 145
>         }
>     ],
>     "responseSize": 200,
>     "correlationId": -1000
> }
> {
>     "brokers": [
>         {
>             "nodeId": 0,
>             "host": "10.139.245.106",
>             "port": 9092,
>             "byteLength": 24
>         },
>         {
>             "nodeId": 1,
>             "host": "localhost",
>             "port": 9093,
>             "byteLength": 19
>         }
>     ],
>     "topicMetadata": [
>         {
>             "topicErrorCode": 0,
>             "topicName": "foozbar",
>             "partitions": [
>                 {
>                     "replicas": [
>                         0
>                     ],
>                     "isr": [],
>                     "partitionErrorCode": 5,
>                     "partitionId": 0,
>                     "leader": -1,
>                     "byteLength": 22
>                 },
>                 {
>                     "replicas": [
>                         1
>                     ],
>                     "isr": [
>                         1
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 1,
>                     "leader": 1,
>                     "byteLength": 26
>                 },
>                 {
>                     "replicas": [
>                         0
>                     ],
>                     "isr": [],
>                     "partitionErrorCode": 5,
>                     "partitionId": 2,
>                     "leader": -1,
>                     "byteLength": 22
>                 },
>                 {
>                     "replicas": [
>                         1
>                     ],
>                     "isr": [
>                         1
>                     ],
>                     "partitionErrorCode": 0,
>                     "partitionId": 3,
>                     "leader": 1,
>                     "byteLength": 26
>                 },
>                 {
>                     "replicas": [
>                         0
>                     ],
>                     "isr": [],
>                     "partitionErrorCode": 5,
>                     "partitionId": 4,
>                     "leader": -1,
>                     "byteLength": 22
>                 }
>             ],
>             "byteLength": 133
>         }
>     ],
>     "responseSize": 188,
>     "correlationId": -1000
> }
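
The symptom in the two responses quoted above can be spotted mechanically: in the second one, broker 0 is still listed under "brokers" although every partition it used to lead now reports leader -1 with partitionErrorCode 5. The sketch below assumes the response has already been parsed into a Python dict with the field names shown above. The helper name brokers_leading_nothing is hypothetical, and the sample data is a trimmed copy of the second response. A listed broker that leads nothing is only a hint of staleness, not proof that the broker is down.

```python
def brokers_leading_nothing(response):
    """Return ids of brokers that are listed but lead no partition."""
    listed = {b["nodeId"] for b in response["brokers"]}
    leaders = {
        p["leader"]
        for topic in response["topicMetadata"]
        for p in topic["partitions"]
    }
    return sorted(listed - leaders)


# Trimmed copy of the second response above (only the fields used here).
second_response = {
    "brokers": [{"nodeId": 0}, {"nodeId": 1}],
    "topicMetadata": [
        {
            "topicName": "foozbar",
            "partitions": [
                {"partitionId": 0, "leader": -1, "partitionErrorCode": 5},
                {"partitionId": 1, "leader": 1, "partitionErrorCode": 0},
                {"partitionId": 2, "leader": -1, "partitionErrorCode": 5},
                {"partitionId": 3, "leader": 1, "partitionErrorCode": 0},
                {"partitionId": 4, "leader": -1, "partitionErrorCode": 5},
            ],
        }
    ],
}

print(brokers_leading_nothing(second_response))
```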


