Posted to users@kafka.apache.org by Tommy Becker <to...@tivo.com> on 2016/09/06 18:36:21 UTC

Replica fetcher continually disconnecting after broker replacement

We had a hardware failure on broker 1 of a 3-broker cluster over the weekend. The broker was replaced, and when the replacement broker came up it started to replicate partitions from the other 2 brokers as you'd expect. But while broker 1 (the replacement) was able to fetch properly from broker 2, it continually disconnects when trying to do so from broker 0, resulting in the following error message every 30s:

[2016-09-06 14:45:50,103] WARN [ReplicaFetcherThread-0-0], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@73e7afd8 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:87)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:84)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:84)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:80)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$2(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:80)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:244)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:229)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:107)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:98)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)


We are running Kafka version 0.10.0.0. Any ideas of what to check for? Network connectivity between all brokers is fine.
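
Incidentally, the 30-second cadence of the error lines up with the default replica.socket.timeout.ms of 30000 ms in 0.10.0.0, so the fetcher may be timing requests out and dropping the connection rather than being refused outright. As a quick sanity check from broker 1 (the host name, port, and file path below are placeholders, not our actual values):

    # Confirm the Kafka listener on broker 0 is reachable from broker 1
    nc -vz broker0.example.com 9092

    # See whether the broker config overrides the replica fetch socket timeout
    grep replica.socket.timeout.ms /etc/kafka/server.properties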

--
Tommy Becker
Senior Software Engineer

Digitalsmiths
A TiVo Company

www.digitalsmiths.com
tobecker@tivo.com


Re: Replica fetcher continually disconnecting after broker replacement

Posted by Tommy Becker <to...@tivo.com>.
Thanks for the response; we found the issue. We run in AWS, and inexplicably, the new instance we launched to replace the dead one had exceedingly low network bandwidth to exactly one of the remaining brokers, resulting in timeouts. After rolling the dice again things are replicating normally.
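
In case it helps anyone else: a rough way to spot this kind of asymmetric degradation is to measure throughput between each pair of brokers directly, for example with iperf3 (assuming it is installed; the host name below is a placeholder):

    # On broker 0, start a throughput test server
    iperf3 -s

    # On broker 1 (the replacement), measure throughput to broker 0 for 10 seconds
    iperf3 -c broker0.example.com -t 10

Running the client side against each remaining broker should make a single degraded path stand out.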

On 09/07/2016 03:56 PM, Ryan Pridgeon wrote:

One possibility is that broker 0 has exhausted its available file
descriptors. If this is the case, it will be able to maintain its existing
connections, giving the appearance that it is operating normally while
refusing new ones.

I don't recall the exact exception message, but something along the lines of
'too many open files' may be present in the logs.



--
Tommy Becker
Senior Software Engineer

Digitalsmiths
A TiVo Company

www.digitalsmiths.com
tobecker@tivo.com


Re: Replica fetcher continually disconnecting after broker replacement

Posted by Ryan Pridgeon <ry...@confluent.io>.
One possibility is that broker 0 has exhausted its available file
descriptors. If this is the case, it will be able to maintain its existing
connections, giving the appearance that it is operating normally while
refusing new ones.

I don't recall the exact exception message, but something along the lines of
'too many open files' may be present in the logs.
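
A quick way to check on the broker host (assuming Linux; <broker-pid> is a placeholder for the Kafka process id, and these need to run as the broker's user or root):

    # File descriptors the broker currently has open
    ls /proc/<broker-pid>/fd | wc -l

    # The open-file limit the broker process is actually running with
    grep 'Max open files' /proc/<broker-pid>/limits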

