Posted to jira@kafka.apache.org by "Tom Bentley (Jira)" <ji...@apache.org> on 2020/06/02 11:22:00 UTC

[jira] [Commented] (KAFKA-10075) Kafka client stucks after Kafka-cluster unavailability

    [ https://issues.apache.org/jira/browse/KAFKA-10075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123656#comment-17123656 ] 

Tom Bentley commented on KAFKA-10075:
-------------------------------------

Is the JVM in which you're running the client(s) caching DNS lookups? When the brokers get rescheduled on different pods (as can happen during an upgrade) their resolved IPs can change. There's a Java security property (nb, not a system property) which you can use to configure the DNS caching explicitly. See https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html for an example.
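
For example, the TTL can be set programmatically before the first lookup happens. A minimal sketch (the class name and the 60s/10s values below are just illustrations, not anything this ticket prescribes):

{code:java}
import java.security.Security;

public class ClientBootstrap {
    public static void main(String[] args) {
        // These are security properties (not system properties); they must be set
        // before the first DNS lookup, i.e. before any Kafka client is created.
        Security.setProperty("networkaddress.cache.ttl", "60");          // cache successful lookups for 60s
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // cache failed lookups for 10s

        // ... create the Kafka producers/consumers/streams app here
    }
}
{code}

The same properties can also be set in the JVM's java.security file instead of in code.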

> Kafka client stucks after Kafka-cluster unavailability
> ------------------------------------------------------
>
>                 Key: KAFKA-10075
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10075
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.4.0
>         Environment: Kafka v2.3.1 deployed by https://strimzi.io/ to Kubernetes cluster
> openjdk version "1.8.0_242"
> OpenJDK Runtime Environment (build 1.8.0_242-b08)
> OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
>            Reporter: Dmitry Mischenko
>            Priority: Minor
>
> We have hit this issue with the Kafka client several times.
> What happened:
> We have Kafka v2.3.1 deployed by [https://strimzi.io/] to a Kubernetes cluster (Amazon EKS).
>  # Kafka brokers were unavailable (due to a cluster upgrade) and couldn't be resolved by their internal hostnames:
> {code:java}
> 2020-05-28 17:19:50 WARN  NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
> 2020-05-28 17:19:50 WARN  NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
>     at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:289)
>     at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
>     at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:538)
>     at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
>     at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:335)
>     at java.base/java.net.InetAddress.getAllByName(Unknown Source)
>     at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
>     at java.base/java.net.InetAddress.getAllByName(Unknown Source)
>     at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
>     at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104)
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
>     at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
>     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:444)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
>     at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
> 2020-05-28 17:19:50 WARN  NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-1.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -2 rack: null)
>     at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:955)
>     at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
>     at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363){code}
> But after the cluster was restored, the kafka-admin-client couldn't re-establish its connection and kept throwing timeout exceptions every 120s for a long time.
>  
> {code:java}
> 2020-05-28 17:21:14 INFO StreamThread:219 - stream-thread [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1] State transition from CREATED to STARTING
>  2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'admin.retry.backoff.ms' was supplied but isn't a known config.
>  2020-05-28 17:21:14 INFO AppInfoParser:118 - Kafka commitId: 77a89fcf8d7fa018
>  2020-05-28 17:21:14 INFO AppInfoParser:117 - Kafka version: 2.4.0
>  2020-05-28 17:21:14 INFO KafkaConsumer:1032 - [Consumer clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1-consumer, groupId=consumer_group-101.public.user_storage] Subscribed to pattern: 'postgres_101.public.user_storage'
>  2020-05-28 17:21:14 INFO KafkaStreams:276 - stream-client [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7] State transition from CREATED to REBALANCING
>  2020-05-28 17:21:14 INFO StreamThread:664 - stream-thread [consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1] Starting
>  2020-05-28 17:21:14 INFO AppInfoParser:119 - Kafka startTimeMs: 1590686474110
>  2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'schema.registry.url' was supplied but isn't a known config.
>  2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration 'admin.retries' was supplied but isn't a known config.
>  "org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
>  "
>  2020-05-28 17:23:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
>  2020-05-28 17:25:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
>  "org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call.
>  "
>  2020-05-28 17:27:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
>  "org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
>  "
>  2020-05-28 17:29:11 INFO AdminMetadataManager:238 - [AdminClient clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin] Metadata update failed
>  "org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.{code}
> After an app restart everything works fine.
>  The problem is that we can neither catch this exception to detect the problem and automatically restart the app, nor can the client self-heal in this situation.
>  Why could this happen?
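>  For reference, a minimal watchdog sketch (assuming the app drives a plain KafkaStreams instance, as the logs suggest; the class name and the 10-minute deadline below are purely illustrative) that exits the JVM when the client hits an unhandled error or never reaches RUNNING, so the orchestrator can restart the pod:
> {code:java}
> import java.time.Duration;
> import org.apache.kafka.streams.KafkaStreams;
>
> public class StreamsWatchdog {
>     // Illustrative deadline: how long we tolerate the client staying outside RUNNING.
>     private static final Duration READY_TIMEOUT = Duration.ofMinutes(10);
>
>     public static void start(KafkaStreams streams) {
>         // Stream-thread failures are otherwise only logged; exit so the pod gets restarted.
>         streams.setUncaughtExceptionHandler((thread, error) -> System.exit(1));
>         streams.setStateListener((newState, oldState) -> {
>             if (newState == KafkaStreams.State.ERROR) {
>                 System.exit(1);
>             }
>         });
>         streams.start();
>
>         // Background check: if the client is still not RUNNING after the deadline
>         // (e.g. stuck in REBALANCING with repeated metadata timeouts), give up.
>         Thread watchdog = new Thread(() -> {
>             long deadline = System.currentTimeMillis() + READY_TIMEOUT.toMillis();
>             while (System.currentTimeMillis() < deadline) {
>                 if (streams.state() == KafkaStreams.State.RUNNING) {
>                     return;
>                 }
>                 try {
>                     Thread.sleep(5_000);
>                 } catch (InterruptedException e) {
>                     return;
>                 }
>             }
>             System.exit(1);
>         }, "streams-watchdog");
>         watchdog.setDaemon(true);
>         watchdog.start();
>     }
> }
> {code}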
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)