Posted to dev@kafka.apache.org by "Piotr Trzpil (JIRA)" <ji...@apache.org> on 2016/06/24 16:09:16 UTC

[jira] [Commented] (KAFKA-3296) All consumer reads hang indefinitely

    [ https://issues.apache.org/jira/browse/KAFKA-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348485#comment-15348485 ] 

Piotr Trzpil commented on KAFKA-3296:
-------------------------------------

I can successfully reproduce this issue on AWS EC2, running a Kafka 0.10.0.0 cluster with 3 brokers that use auto-generated IDs.
The problem occurs if I restart all of the brokers (so the restarted brokers get new IDs) before reassigning the __consumer_offsets topic partitions.

1. Initial state, obtained using kafkacat (see the command sketch after this listing):

{noformat}
3 brokers:
 broker 1015 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
 broker 1016 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
 broker 1017 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092

topic "__consumer_offsets" with 50 partitions:
    partition 23, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
    partition 41, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
    partition 32, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
    partition 8, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017

{noformat}
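For reference, a kafkacat metadata query along the following lines produces a listing like the one above (the broker hostname is a placeholder, so treat this as a sketch rather than the exact command used):

{noformat}
# List cluster metadata (brokers, partition leaders, replicas, ISRs) for the offsets topic.
kafkacat -L -b ec2-<removed>.eu-west-1.compute.amazonaws.com:9092 -t __consumer_offsets
{noformat}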

2. Then, after restarting all of the brokers, the leaders are not available, as expected (see the note on auto-generated broker IDs after this listing):
    
{noformat}
3 brokers:
 broker 1019 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
 broker 1018 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
 broker 1020 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
...
topic "__consumer_offsets" with 50 partitions:
 partition 23, leader -1, replicas: , isrs: , Broker: Leader not available
 partition 32, leader -1, replicas: , isrs: , Broker: Leader not available
 partition 41, leader -1, replicas: , isrs: , Broker: Leader not available
 partition 17, leader -1, replicas: , isrs: , Broker: Leader not available
 ...
{noformat}
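The new broker IDs (1018-1020) come from Kafka's broker ID auto-generation. A minimal server.properties sketch of the configuration assumed here (defaults shown; the exact settings on these brokers were not captured):

{noformat}
# With broker.id.generation.enable=true (the default) and no fixed broker.id set,
# a broker requests a new ID above reserved.broker.max.id (default 1000) from ZooKeeper.
# If the log directories (and the meta.properties file caching the generated ID) do not
# survive the restart, each broker comes back with an ID the old metadata never references.
broker.id.generation.enable=true
reserved.broker.max.id=1000
# broker.id intentionally left unset
{noformat}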

3. However, after generating a reassignment plan for the new brokers and executing it, the ISR lists are empty and the leaders are still not elected (the reassignment commands are sketched after this listing):

{noformat}
 topic "__consumer_offsets" with 50 partitions:
  partition 23, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
  partition 32, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
  partition 41, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
  partition 17, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
{noformat}
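The reassignment was done with the standard tool, roughly as follows (the ZooKeeper address and file names are placeholders, so this is a sketch rather than the exact invocation):

{noformat}
# topics-to-move.json contains: {"version":1,"topics":[{"topic":"__consumer_offsets"}]}
# Generate a candidate assignment onto the new broker IDs...
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1018,1019,1020" --generate
# ...save the proposed assignment as reassignment.json, then execute and verify it.
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json --execute
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json --verify
{noformat}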

4. Only after restarting one of the brokers (chosen at random), keeping its ID unchanged, is the situation fixed (see the verification sketch after this listing):

{noformat}
topic "__consumer_offsets" with 50 partitions:
  partition 23, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
  partition 32, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
  partition 41, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
  partition 17, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
{noformat}
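The recovered leader and ISR state can be confirmed with kafkacat as in step 1, or with the stock topic tool, for example (the ZooKeeper address is a placeholder):

{noformat}
# Describe the offsets topic to confirm that leaders are elected and ISRs are repopulated.
kafka-topics.sh --zookeeper zk:2181 --describe --topic __consumer_offsets
{noformat}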

However, this issue does not occur if only one broker is restarted with a new ID before executing the partition reassignment.

> All consumer reads hang indefinitely
> ------------------------------------
>
>                 Key: KAFKA-3296
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3296
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.0, 0.9.0.1
>            Reporter: Simon Cooper
>            Priority: Critical
>         Attachments: controller.zip, kafkalogs.zip
>
>
> We've got several integration tests that bring up systems on VMs for testing. We've recently upgraded to 0.9, and very occasionally we see an issue where every consumer that tries to read from the broker hangs, spamming the following in its logs:
> {code}2016-02-26T12:25:37,856 | DEBUG | o.a.k.c.NetworkClient [pool-10-thread-1] | Sending metadata request ClientRequest(expectResponse=true, callback=null, request=RequestSend(header={api_key=3,api_version=0,correlation_id=21905,client_id=consumer-1}, body={topics=[Topic1]}), isInitiatedByNetworkClient, createdTimeMs=1456489537856, sendTimeMs=0) to node 1
> 2016-02-26T12:25:37,856 | DEBUG | o.a.k.c.Metadata [pool-10-thread-1] | Updated cluster metadata version 10954 to Cluster(nodes = [Node(1, server.internal, 9092)], partitions = [Partition(topic = Topic1, partition = 0, leader = 1, replicas = [1,], isr = [1,]])
> 2016-02-26T12:25:37,856 | DEBUG | o.a.k.c.c.i.AbstractCoordinator [pool-10-thread-1] | Issuing group metadata request to broker 1
> 2016-02-26T12:25:37,857 | DEBUG | o.a.k.c.c.i.AbstractCoordinator [pool-10-thread-1] | Group metadata response ClientResponse(receivedTimeMs=1456489537857, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@28edb273, request=RequestSend(header={api_key=10,api_version=0,correlation_id=21906,client_id=consumer-1}, body={group_id=}), createdTimeMs=1456489537856, sendTimeMs=1456489537856), responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}})
> 2016-02-26T12:25:37,956 | DEBUG | o.a.k.c.NetworkClient [pool-10-thread-1] | Sending metadata request ClientRequest(expectResponse=true, callback=null, request=RequestSend(header={api_key=3,api_version=0,correlation_id=21907,client_id=consumer-1}, body={topics=[Topic1]}), isInitiatedByNetworkClient, createdTimeMs=1456489537956, sendTimeMs=0) to node 1
> 2016-02-26T12:25:37,956 | DEBUG | o.a.k.c.Metadata [pool-10-thread-1] | Updated cluster metadata version 10955 to Cluster(nodes = [Node(1, server.internal, 9092)], partitions = [Partition(topic = Topic1, partition = 0, leader = 1, replicas = [1,], isr = [1,]])
> 2016-02-26T12:25:37,956 | DEBUG | o.a.k.c.c.i.AbstractCoordinator [pool-10-thread-1] | Issuing group metadata request to broker 1
> 2016-02-26T12:25:37,957 | DEBUG | o.a.k.c.c.i.AbstractCoordinator [pool-10-thread-1] | Group metadata response ClientResponse(receivedTimeMs=1456489537957, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@40cee8cc, request=RequestSend(header={api_key=10,api_version=0,correlation_id=21908,client_id=consumer-1}, body={group_id=}), createdTimeMs=1456489537956, sendTimeMs=1456489537956), responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}})
> 2016-02-26T12:25:38,056 | DEBUG | o.a.k.c.NetworkClient [pool-10-thread-1] | Sending metadata request ClientRequest(expectResponse=true, callback=null, request=RequestSend(header={api_key=3,api_version=0,correlation_id=21909,client_id=consumer-1}, body={topics=[Topic1]}), isInitiatedByNetworkClient, createdTimeMs=1456489538056, sendTimeMs=0) to node 1
> 2016-02-26T12:25:38,056 | DEBUG | o.a.k.c.Metadata [pool-10-thread-1] | Updated cluster metadata version 10956 to Cluster(nodes = [Node(1, server.internal, 9092)], partitions = [Partition(topic = Topic1, partition = 0, leader = 1, replicas = [1,], isr = [1,]])
> 2016-02-26T12:25:38,056 | DEBUG | o.a.k.c.c.i.AbstractCoordinator [pool-10-thread-1] | Issuing group metadata request to broker 1
> 2016-02-26T12:25:38,057 | DEBUG | o.a.k.c.c.i.AbstractCoordinator [pool-10-thread-1] | Group metadata response ClientResponse(receivedTimeMs=1456489538057, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@439e25fb, request=RequestSend(header={api_key=10,api_version=0,correlation_id=21910,client_id=consumer-1}, body={group_id=}), createdTimeMs=1456489538056, sendTimeMs=1456489538056), responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}}){code}
> This persists for any 0.9 consumer trying to read from the topic (we haven't confirmed whether this affects a single topic or any topic on the broker). 0.8 consumers can read from the broker without issues. The issue is fixed by a broker restart.
> This was observed on a single-broker cluster. There were no suspicious log messages on the server.


