Posted to users@kafka.apache.org by Thomas DeVoe <td...@dataminr.com> on 2016/12/06 20:59:37 UTC

Re: One Kafka Broker Went Rogue

Hi All,

This happened again on our Kafka cluster - a single broker seems to
"forget" that the rest of the cluster exists and shrinks the ISR of every
partition it leads down to just itself. The other two nodes get stuck in a
loop trying to connect to this rogue node and never register that it is no
longer behaving as part of the cluster. Strangely, the network connection
between all of these nodes is fine at the time, and restarting the rogue
node resolves it (though with some data loss due to unclean leader
elections).
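
For reference, this is roughly how the divergent view shows up from the
outside, plus the setting that avoids the unclean-election data loss at the
cost of availability. It is only a sketch - the ZooKeeper connect string
and paths below are placeholders for our setup:

    # which broker ids are currently registered in ZooKeeper?
    bin/zookeeper-shell.sh zk1:2181
    # then, at the zk prompt:
    ls /brokers/ids
    get /brokers/ids/1002

    # which partitions are under-replicated according to the ISR state in ZooKeeper?
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

    # to refuse unclean elections entirely (affected partitions stay offline
    # until the old leader comes back, instead of losing data), the
    # broker/topic-level setting is:
    #   unclean.leader.election.enable=false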

Anyone have any ideas? Help would be greatly appreciated.

Thanks,



*Tom DeVoe*
Software Engineer, Data
Dataminr <http://dataminr.com/>

6 East 32nd Street, 2nd Floor
New York, NY 10016


On Tue, Nov 29, 2016 at 1:29 PM, Thomas DeVoe <td...@dataminr.com> wrote:

> Hi,
>
> I encountered a strange issue in our Kafka cluster, where a single broker
> randomly entered a state in which it seemed to think it was the only
> broker in the cluster (it shrank the ISR of every partition it leads down
> to just itself). Some details about the Kafka cluster:
>
> - running in an EC2 VPC on AWS
> - 3 nodes (d2.xlarge)
> - Kafka version: 0.10.1.0
>
> More information about the incident:
>
> Around 19:57 yesterday, one of the nodes somehow lost its connection to
> the cluster and started reporting messages like this for what seemed to be
> all of its hosted topic partitions:
>
> [2016-11-28 19:57:05,426] INFO Partition [arches_stage,0] on broker 1002:
>> Shrinking ISR for partition [arches_stage,0] from 1003,1002,1001 to 1002
>> (kafka.cluster.Partition)
>> [2016-11-28 19:57:05,466] INFO Partition [connect-offsets,13] on broker
>> 1002: Shrinking ISR for partition [connect-offsets,13] from 1003,1002,1001
>> to 1002 (kafka.cluster.Partition)
>> [2016-11-28 19:57:05,489] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
>> 1003,1002,1001 to 1002 (kafka.cluster.Partition)
>> ...
>>
>
> It then added the replicas on the other brokers back into the ISRs:
>
> [2016-11-28 19:57:18,013] INFO Partition [arches_stage,0] on broker 1002:
>> Expanding ISR for partition [arches_stage,0] from 1002 to 1002,1003
>> (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,015] INFO Partition [connect-offsets,13] on broker
>> 1002: Expanding ISR for partition [connect-offsets,13] from 1002 to
>> 1002,1003 (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,018] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
>> 1002 to 1002,1003 (kafka.cluster.Partition)
>> ...
>> [2016-11-28 19:57:18,222] INFO Partition [arches_stage,0] on broker 1002:
>> Expanding ISR for partition [arches_stage,0] from 1002,1003 to
>> 1002,1003,1001 (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,224] INFO Partition [connect-offsets,13] on broker
>> 1002: Expanding ISR for partition [connect-offsets,13] from 1002,1003 to
>> 1002,1003,1001 (kafka.cluster.Partition)
>> [2016-11-28 19:57:18,227] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Expanding ISR for partition [lasagna_prod_memstore,2] from
>> 1002,1003 to 1002,1003,1001 (kafka.cluster.Partition)
>
>
> and eventually removed them again before going on its merry way:
>
> [2016-11-28 19:58:05,408] INFO Partition [arches_stage,0] on broker 1002:
>> Shrinking ISR for partition [arches_stage,0] from 1002,1003,1001 to 1002
>> (kafka.cluster.Partition)
>> [2016-11-28 19:58:05,415] INFO Partition [connect-offsets,13] on broker
>> 1002: Shrinking ISR for partition [connect-offsets,13] from 1002,1003,1001
>> to 1002 (kafka.cluster.Partition)
>> [2016-11-28 19:58:05,416] INFO Partition [lasagna_prod_memstore,2] on
>> broker 1002: Shrinking ISR for partition [lasagna_prod_memstore,2] from
>> 1002,1003,1001 to 1002 (kafka.cluster.Partition)
>
>
> Node 1002 continued running normally from that point on (aside from the
> fact that all of its partitions were under-replicated). There were also no
> WARN/ERROR messages logged on that node before or after this.
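>
> For context on the timing: as I understand it, the leader drops a follower
> from the ISR once that follower has not fetched or caught up within
> replica.lag.time.max.ms (default 10000 ms in 0.10.1.0), and the 30000 ms
> connect timeout in the fetcher errors further down looks like
> replica.socket.timeout.ms (default 30000 ms). A quick way to check whether
> either one is overridden (the config path is a placeholder for our install):
>
>     grep -E 'replica\.(lag\.time\.max|socket\.timeout)\.ms' /etc/kafka/server.properties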
>
>
> The other two nodes were not so happy, however, with both failing to
> connect to the node in question via the ReplicaFetcherThread. They
> reported this around the same time as the ISR shrink:
>
>> [2016-11-28 19:57:16,087] WARN [ReplicaFetcherThread-0-1002], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@6eb44718 (kafka.server.ReplicaFetcherThread)
>> java.io.IOException: Connection to 1002 was disconnected before the response was read
>>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>>         at scala.Option.foreach(Option.scala:257)
>>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>>         at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>>         at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>>         at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>>         at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>>         at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>
> and then got stuck trying this every 30 seconds until I restarted node
> 1002:
>
>> [2016-11-28 20:02:04,513] WARN [ReplicaFetcherThread-0-1002], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1cd61a02 (kafka.server.ReplicaFetcherThread)
>> java.net.SocketTimeoutException: Failed to connect within 30000 ms
>>         at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>
>
> I restarted the node when I noticed this; however, because the replicas
> were out of sync, we ended up with an unclean leader election and
> ultimately lost data for the partitions on that machine. Some potentially
> interesting things about the cluster state at that time:
>
> - I *was* able to telnet to port 9092 on the machine that went rogue from
> each of the other two Kafka brokers (even while they were reporting that
> they failed to connect).
> - The number of open file descriptors on that machine increased linearly
> for the entire 1.5 hours the cluster was in this state, eventually reaching
> ~4x the usual count, and went back to normal after the restart (see the
> sketch after this list).
> - The heap size on the node in question started fluctuating very rapidly.
> Usually the heap grows slowly over a period of ~10 hours, then (I assume) a
> large GC occurs and the cycle starts again. On the affected node the period
> of that cycle dropped to ~5 minutes.
> - The heap size also spiked far higher than normal.
> - While the node was in this state, the system/process CPU dropped to
> ~1/8th of its usual level.
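>
> Here is roughly what I plan to capture on the rogue broker next time,
> before restarting it - just a sketch; the pid lookup assumes the standard
> kafka.Kafka main class and will need adjusting for other setups:
>
>     # broker pid
>     PID=$(jps -l | awk '/kafka\.Kafka/ {print $1}')
>
>     # open file descriptor count (sample periodically, as the broker user or root)
>     ls /proc/$PID/fd | wc -l
>
>     # state of the TCP connections on the broker port, to compare with the
>     # fetcher connection errors on the other nodes
>     ss -tnp | grep 9092
>
>     # GC activity every 5 s, to compare against the ~5 minute heap cycle
>     jstat -gcutil $PID 5000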
>
> I have the full logs and more metrics collected for all 3 nodes for that
> time period and would be happy to share them, but I wasn't sure if the user
> list supported attachments.
>
> Any help would be greatly appreciated.
>
> Thanks,
>
>
> *Tom DeVoe*
> Software Engineer, Data
> Dataminr <http://dataminr.com/>
>
> 6 East 32nd Street, 2nd Floor
> New York, NY 10016
>
>

Re: One Kafka Broker Went Rogue

Posted by Tony Liu <ji...@zuora.com>.
Hi,

I just found a reported issue, https://issues.apache.org/jira/browse/KAFKA-4477;
hopefully it's useful for you.
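
If it happens again, a thread dump from the stuck broker, taken before
restarting it, is probably the most useful thing to attach to that JIRA.
A minimal sketch (the pid lookup assumes the standard kafka.Kafka main
class):

    PID=$(jps -l | awk '/kafka\.Kafka/ {print $1}')
    jstack $PID > /tmp/kafka-broker-$PID-$(date +%s).threads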


Re: One Kafka Broker Went Rogue

Posted by Tony Liu <ji...@zuora.com>.
Hi Thomas (or anyone),

I also encountered the same issue you reported. The only workaround is to
restart the broken node, but I have not found the root cause yet, so I
wonder whether you have made any progress on solving it?

At first I thought this issue was caused by `ulimit`, but after I increased
it to 100k the random error still happened after several days, so I think
the issue may be related to Kafka itself.
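
For anyone checking the same thing: the limit that matters is the one on
the running broker process, not just the login shell. A rough way to verify
it (the pid lookup assumes the standard kafka.Kafka main class, and the
systemd note only applies if the broker runs as a service):

    PID=$(jps -l | awk '/kafka\.Kafka/ {print $1}')
    grep 'open files' /proc/$PID/limits

    # to raise it persistently: /etc/security/limits.conf for the kafka user,
    # or LimitNOFILE= in the unit file when running under systemd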

thanks.
