Posted to dev@kafka.apache.org by Hui Yang <hu...@expedia.com> on 2017/01/21 01:27:54 UTC

Kafka 10 Stability Issue

Hi, Kafka Team

This is Hui Yang from the Expedia engineering team. I would like to ask a question about an issue with Kafka 0.10.
Our team uses Kafka as core infrastructure. We recently upgraded from Kafka 0.8.2.2 to Kafka 0.10.1.0 and ran into an issue after the upgrade.

The issue is as follows:
Kafka 0.10 worked well for a couple of days after the upgrade, but then we started to see "java.io.IOException: Connection to 3 was disconnected before the response was read" on each Kafka broker when it tried to communicate with the controller (as you may know, one of the Kafka brokers acts as the controller and handles topic/partition assignment and state change tasks; in our case, it is broker 3).
Even in the controller log, I found "[Controller-3-to-broker-3-send-thread], Controller 3 epoch 3 fails to send request,java.io.IOException: Connection to 3 was disconnected before the response was read", so it looks like the controller cannot even send messages to itself.
After we had seen those exceptions on the brokers for a while, we started to see timeout exceptions on the producer side: our producers were not able to send messages to the brokers.

When I checked the JMX metrics, I found that the controller's CPU usage has always been higher than that of the other brokers since we upgraded to Kafka 0.10 (on Kafka 0.8, all brokers had similar CPU usage), and that memory usage spiked specifically on the controller during the issue. I assume the controller may not have had enough memory left to create new connections for the producers and the other brokers.

One more thing to mention: we use the Kafka 0.8 protocol and message format on the Kafka 0.10 brokers so that we can still use 0.8 clients.
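
For readers following along: pinning the older protocol and message format on upgraded brokers is normally done with two broker settings, inter.broker.protocol.version and log.message.format.version. The post does not show the actual server.properties, so the values in the sample below are an assumption, and pinned_versions is a hypothetical helper name used only for this sketch. It simply reads the two pins back out of properties-style text so they can be checked on each broker.

```python
# Sketch: read the two version pins from server.properties-style text.
# The sample values are ASSUMED; the original post does not show its config.
SAMPLE_PROPERTIES = """
# assumed values for a 0.8.2 -> 0.10.1 rolling upgrade
inter.broker.protocol.version=0.8.2
log.message.format.version=0.8.2
"""

# Broker settings that control the wire protocol and on-disk message format.
PIN_KEYS = ("inter.broker.protocol.version", "log.message.format.version")


def pinned_versions(properties_text):
    """Return {key: value} for the version-pin keys present in the text."""
    pins = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not key=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in PIN_KEYS:
            pins[key.strip()] = value.strip()
    return pins


if __name__ == "__main__":
    for key, value in pinned_versions(SAMPLE_PROPERTIES).items():
        print(f"{key} = {value}")
```

Running the same check against every broker's configuration would show whether all six brokers carry identical pins.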

Details of the exceptions:
"WARN [ReplicaFetcherThread-0-3], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@87d8e00 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"

"WARN [Controller-3-to-broker-3-send-thread], Controller 3 epoch 1 fails to send request
java.io.IOException: Connection to 2 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:190)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"
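
The two traces above name different targets ("Connection to 3" and "Connection to 2"), so it can help to tally the disconnects by target broker id across all broker logs. The following is a minimal sketch, assuming the log lines carry the exception message exactly as quoted above; count_disconnects is a hypothetical helper name, not part of Kafka.

```python
import re
from collections import Counter

# Matches the exception message exactly as it is quoted in the post.
DISCONNECT_RE = re.compile(
    r"Connection to (\d+) was disconnected before the response was read"
)


def count_disconnects(log_lines):
    """Count disconnect errors per target broker id."""
    counts = Counter()
    for line in log_lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the traces above.
    sample = [
        "java.io.IOException: Connection to 3 was disconnected before the response was read",
        "java.io.IOException: Connection to 2 was disconnected before the response was read",
        "at scala.Option.foreach(Option.scala:257)",
    ]
    for broker_id, n in sorted(count_disconnects(sample).items()):
        print(f"broker {broker_id}: {n} disconnect(s)")
```

If most of the disconnects point at the controller's id, that would be consistent with the description above; a spread across all ids would suggest a wider network or load problem.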

In production, we run 6 Kafka brokers with 3 ZooKeeper nodes on AWS, using the c3.xlarge instance type.
Our JVM settings are as follows: -Xmx1G -Xms1G -server -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark.
Our traffic is 500 TPS, and the average message size is 100 KB.
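
To put those traffic figures in perspective, the aggregate ingress can be worked out directly. The sketch below assumes that "TPS" means messages per second, that 1 MB = 1000 KB, and that partition leaders are spread evenly over the 6 brokers; it ignores replication traffic because the replication factor is not stated. The function names are hypothetical.

```python
# Figures taken from the post.
MESSAGES_PER_SECOND = 500   # "500 TPS", read here as messages per second (assumption)
AVG_MESSAGE_KB = 100        # average message size
BROKER_COUNT = 6


def ingress_mb_per_second(messages_per_second, avg_message_kb):
    """Aggregate producer ingress in MB/s, using 1 MB = 1000 KB."""
    return messages_per_second * avg_message_kb / 1000


def per_broker_mb_per_second(total_mb_per_second, broker_count):
    """Per-broker share, assuming leaders are spread evenly (an assumption)."""
    return total_mb_per_second / broker_count


if __name__ == "__main__":
    total = ingress_mb_per_second(MESSAGES_PER_SECOND, AVG_MESSAGE_KB)
    share = per_broker_mb_per_second(total, BROKER_COUNT)
    print(f"aggregate ingress: {total:.1f} MB/s")   # 50.0 MB/s
    print(f"per broker (even spread): {share:.2f} MB/s")
```

Under these assumptions the cluster takes in about 50 MB/s of producer data before replication, which is the load the 1 GB heaps listed above have to serve.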

Thank you for your time; we would appreciate any help or suggestions on this issue!

Best,

Hui

Re: Kafka 10 Stability Issue

Posted by Jason Gustafson <ja...@confluent.io>.
Hi there,

This sounds similar to https://issues.apache.org/jira/browse/KAFKA-4477.
Have you tried 0.10.1.1?

-Jason
