Posted to jira@kafka.apache.org by "mlowicki (Jira)" <ji...@apache.org> on 2021/05/21 15:26:00 UTC

[jira] [Created] (KAFKA-12836) Possible race in handling IP address change

mlowicki created KAFKA-12836:
--------------------------------

             Summary: Possible race in handling IP address change
                 Key: KAFKA-12836
                 URL: https://issues.apache.org/jira/browse/KAFKA-12836
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 2.4.1
            Reporter: mlowicki


We've experienced a few incidents where Kafka producers suddenly weren't able to send messages and were simply timing out. After some investigation I have a repro now. We're running a cluster of 3 brokers on K8s, and it's enough to kill the underlying AWS VM via the AWS Console, which also causes an IP change (in our case 10.128.132.196 → 10.128.135.90). It needs to be the VM hosting the broker that is the leader of the partition (node 10000).
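For anyone trying to reproduce: a minimal producer loop along these lines is enough to observe the timeouts once the leader's VM is replaced (bootstrap servers, topic name and timeouts below are placeholders, not our actual config):

{code:java}
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReproProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap servers; in our setup these point at the 3 brokers on K8s.
        props.put("bootstrap.servers", "broker-0:9092,broker-1:9092,broker-2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                try {
                    // After the partition leader's VM is killed and its IP changes,
                    // these sends start timing out on affected clients and never recover.
                    producer.send(new ProducerRecord<>("test-topic", "key", "value"))
                            .get(30, TimeUnit.SECONDS);
                } catch (Exception e) {
                    System.out.println("send failed: " + e);
                }
                TimeUnit.SECONDS.sleep(1);
            }
        }
    }
}
{code}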
 
Shortly before the kill there were 2 events:
 
{quote}About to close the idle connection from 10001 due to being idle for 38545 millis
About to close the idle connection from 10002 due to being idle for 34946 millis{quote}
 
And then the client picks the node to connect to:
 
{quote}Removing node 10.128.140.202:9092 (id: 10002 rack: us-east-1c) from least loaded node selection since it is neither ready for sending or connecting
Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30529 millis
Node 10000 disconnected.
Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196{quote}
 
and then it repeats:
 
{quote}Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30027 millis
Node 10000 disconnected.
Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196{quote}
 
{quote}Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30027 millis
Node 10000 disconnected.
Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196{quote}
 
{quote}Found least loaded connecting node 10.128.132.196:9092 (id: 10000 rack: us-east-1a)
About to close the idle connection from 10000 due to being idle for 30027 millis
Node 10000 disconnected.
Initiating connection to node 10.128.132.196:9092 (id: 10000 rack: us-east-1a) using address /10.128.132.196{quote}
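The "Found least loaded connecting node" lines come from the client's least-loaded-node selection. Roughly paraphrased (a simplified sketch, not the actual 2.4.x NetworkClient source), that selection prefers a node it believes is already connecting over one it would have to connect to from scratch, so node 10000, stuck connecting to the stale IP, keeps winning and the metadata request has nowhere else to go:

{code:java}
import java.util.List;

// Heavily simplified paraphrase of NetworkClient.leastLoadedNode() as I read the
// 2.4.x client; not the actual source. The relevant behaviour: a node that is
// "connecting" is preferred over nodes that would merely allow a new connection,
// so a node stuck connecting to a stale IP keeps being chosen and the metadata
// request is never sent to the other brokers.
public class LeastLoadedSketch {

    static class NodeView {
        final String id;
        final boolean ready;       // connected and able to take requests
        final boolean connecting;  // a connection attempt is in flight
        final boolean canConnect;  // disconnected but allowed to (re)connect now

        NodeView(String id, boolean ready, boolean connecting, boolean canConnect) {
            this.id = id;
            this.ready = ready;
            this.connecting = connecting;
            this.canConnect = canConnect;
        }
    }

    static NodeView leastLoadedNode(List<NodeView> nodes) {
        NodeView foundConnecting = null;
        NodeView foundCanConnect = null;
        for (NodeView node : nodes) {
            if (node.ready) {
                return node; // a ready node wins (the real code picks the one with fewest in-flight requests)
            } else if (node.connecting) {
                foundConnecting = node;
            } else if (node.canConnect) {
                foundCanConnect = node;
            }
            // otherwise: "Removing node ... since it is neither ready for sending or connecting"
        }
        // A "connecting" node is preferred over one we would have to connect from scratch.
        return foundConnecting != null ? foundConnecting : foundCanConnect;
    }

    public static void main(String[] args) {
        // Node 10000 is stuck connecting to the stale IP; 10001 could be reconnected,
        // 10002 is removed from the selection. The selection still returns 10000 every time.
        NodeView stale = new NodeView("10000", false, true, false);
        NodeView other1 = new NodeView("10001", false, false, true);
        NodeView other2 = new NodeView("10002", false, false, false);
        System.out.println("picked node " + leastLoadedNode(List.of(stale, other1, other2)).id);
    }
}
{code}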
 
Affected clients never request METADATA from the other brokers (10001 or 10002) to discover the new IP. It doesn't happen to all clients, as some handle the process gracefully; those log:
 
{quote}log.info("Hostname for node {} changed from {} to {}.", id, connectionState.host(), host);{quote}
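That log statement sits in the client's connection-state tracking (ClusterConnectionStates as of 2.4.x). Very roughly, and simplified rather than verbatim, the logic looks like the sketch below: the "Hostname for node ... changed" branch is only taken when connecting() is called with a host that differs from the one already tracked for that node id, i.e. only once fresh metadata has delivered the new address. Affected clients keep re-entering the first branch with the stale host and therefore keep the stale resolved IP:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Heavily simplified paraphrase of the connection-state handling; names follow the
// Kafka client source but the bodies are not verbatim.
public class ConnectionStateSketch {
    enum ConnectionState { CONNECTING, DISCONNECTED }

    static class NodeConnectionState {
        ConnectionState state;
        long lastConnectAttemptMs;
        final String host; // addresses are resolved when this state object is created

        NodeConnectionState(ConnectionState state, long now, String host) {
            this.state = state;
            this.lastConnectAttemptMs = now;
            this.host = host;
        }

        String host() {
            return host;
        }
    }

    private final Map<String, NodeConnectionState> nodeState = new HashMap<>();

    void connecting(String id, long now, String host) {
        NodeConnectionState connectionState = nodeState.get(id);
        if (connectionState != null && connectionState.host().equals(host)) {
            // Same hostname as before: reuse the existing entry and its cached addresses.
            connectionState.lastConnectAttemptMs = now;
            connectionState.state = ConnectionState.CONNECTING;
            return;
        } else if (connectionState != null) {
            // Only reached once metadata has handed the client a new host for this id.
            System.out.printf("Hostname for node %s changed from %s to %s.%n",
                    id, connectionState.host(), host);
        }
        // New entry: this is where a fresh resolution for the new host would happen.
        nodeState.put(id, new NodeConnectionState(ConnectionState.CONNECTING, now, host));
    }
}
{code}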


