Posted to issues@flink.apache.org by "Nico Kruber (JIRA)" <ji...@apache.org> on 2018/10/02 13:52:00 UTC

[jira] [Commented] (FLINK-10455) Potential Kafka producer leak in case of failures

    [ https://issues.apache.org/jira/browse/FLINK-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635519#comment-16635519 ] 

Nico Kruber commented on FLINK-10455:
-------------------------------------

The following symptom may actually be a follow-up of this leak:

{code}
2018-10-02 12:17:30,077 ERROR org.apache.kafka.common.utils.KafkaThread                     - Uncaught exception in kafka-producer-network-thread | producer-26: 
java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1
	at org.apache.kafka.clients.NetworkClient.processDisconnection(NetworkClient.java:583)
	at org.apache.kafka.clients.NetworkClient.handleDisconnections(NetworkClient.java:705)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:443)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:224)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.clients.NetworkClient$1
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:129)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 6 more
{code}

-> This may happen if we do not properly clean up and something, e.g. a producer, outlives the task it belongs to. Since we close the {{URLClassLoader}} together with the task, loading any further classes then fails - see https://heapanalytics.com/blog/engineering/missing-scala-class-noclassdeffounderror for more info on this.
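
The mechanics can be sketched in isolation. In the following minimal sketch, the jar path and the {{com.example.kafka}} class names are hypothetical placeholders, not Flink or Kafka internals; the point is only that a thread outliving the task cannot load new classes once the task's {{URLClassLoader}} has been closed:

{code}
import java.net.URL;
import java.net.URLClassLoader;

public class ClosedClassLoaderDemo {

    public static void main(String[] args) throws Exception {
        URLClassLoader taskClassLoader = new URLClassLoader(
                new URL[] { new URL("file:///path/to/user-code.jar") });

        // While the task is running, classes resolve fine:
        taskClassLoader.loadClass("com.example.kafka.Producer");

        // The task finishes or fails and Flink closes its class loader ...
        taskClassLoader.close();

        // ... but a leaked network thread later triggers loading of a class
        // that was not needed before (like the anonymous NetworkClient$1):
        Thread leakedNetworkThread = new Thread(() -> {
            try {
                taskClassLoader.loadClass("com.example.kafka.Producer$1");
            } catch (ClassNotFoundException e) {
                // In the producer this surfaces as a NoClassDefFoundError
                // caused by a ClassNotFoundException, as in the log above.
                e.printStackTrace();
            }
        });
        leakedNetworkThread.start();
        leakedNetworkThread.join();
    }
}
{code}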

> Potential Kafka producer leak in case of failures
> -------------------------------------------------
>
>                 Key: FLINK-10455
>                 URL: https://issues.apache.org/jira/browse/FLINK-10455
>             Project: Flink
>          Issue Type: Bug
>          Components: Kafka Connector
>    Affects Versions: 1.5.2
>            Reporter: Nico Kruber
>            Priority: Major
>
> If the Kafka brokers' timeout is too low for our checkpoint interval [1], we may get a {{ProducerFencedException}}. The documentation around {{ProducerFencedException}} explicitly states that we should close the producer after encountering it (see the sketch below).
> Looking at the code, this does not actually seem to be done in {{FlinkKafkaProducer011}}. Also, in case one transaction's commit in {{TwoPhaseCommitSinkFunction#notifyCheckpointComplete}} fails with an exception, we don't clean up (nor try to commit) any of the other transactions.
> -> From what I see, {{TwoPhaseCommitSinkFunction#notifyCheckpointComplete}} simply iterates over {{pendingCommitTransactions}}, which is not touched during {{close()}}.
> Now if we restart the failing job on the same Flink cluster, any resources from the previous attempt will still linger around.
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/connectors/kafka.html#kafka-011
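
For reference, the cleanup that the Kafka documentation demands can be sketched with a plain {{KafkaProducer}} (broker address and transactional id below are placeholders; {{FlinkKafkaProducer011}} would have to do the equivalent inside its own transaction handling):

{code}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.errors.ProducerFencedException;

public class FencedProducerCleanup {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("transactional.id", "my-transactional-id"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // ... send records ...
            producer.commitTransaction();
        } catch (ProducerFencedException e) {
            // A newer producer with the same transactional.id fenced us off;
            // this instance is unusable and must be closed to release its
            // network thread and buffers - otherwise it lingers around.
            producer.close();
        }
    }
}
{code}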



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)