Posted to users@kafka.apache.org by Dillian Murphey <cr...@gmail.com> on 2016/01/13 00:05:23 UTC

Frequent ZK session timeouts

Our 2 node Kafka cluster has become unhealthy.  We're running ZooKeeper as
a 3 node system, with very light load.

What seems to be happening is that in the controller log we get a ZK session
expire message, and in the process of re-assigning the leader for the
partitions (if I'm understanding this right, please correct me), the broker
goes offline and it interrupts our applications that are publishing
messages.

We don't see this in production, where Kafka has been stable for months,
since September.

I've searched a lot and found some similar complaints but no real
solutions.

I'm running 0.8.2 and JVM 1.6.x on Ubuntu.

Thanks for any ideas at all.

Re: Frequent ZK session timeouts

Posted by Dillian Murphey <cr...@gmail.com>.
Last comment: I upgraded to Java 1.7 and restarted Kafka, and it's now
stable. But I have not poked at it; I'm just letting it sit for now. Could
running Java 1.6 with 0.8.2.1 have been related to the problem, even if it
wasn't apparent in the logs?


Re: Frequent ZK session timeouts

Posted by Dillian Murphey <cr...@gmail.com>.
[2016-01-12 22:16:59,629] TRACE [Controller 925537]: leader imbalance ratio
for broker 925537 is 0.000000 (kafka.controller.KafkaController)

[2016-01-12 22:21:07,167] INFO [SessionExpirationListener on 925537], ZK
expired; shut down all controller components and try to re-elect
(kafka.controller.KafkaController$SessionExpirationListener)

[2016-01-12 22:21:07,167] INFO [delete-topics-thread-925537], Shutting down
(kafka.controller.TopicDeletionManager$DeleteTopicsThread)

[2016-01-12 22:21:07,169] INFO [delete-topics-thread-925537], Shutdown
completed (kafka.controller.TopicDeletionManager$DeleteTopicsThread)

[2016-01-12 22:21:07,169] INFO [delete-topics-thread-925537], Stopped
(kafka.controller.TopicDeletionManager$DeleteTopicsThread)

This occurs very frequently, even after clean-slating Kafka.  This is
something that never occurs in our production env. I've read here and there
that it could be a GC issue. Here is the tail end of a recent GC log:


20534K(8354560K), 52.5293140 secs] [Times: user=209.09 sys=0.06, real=52.53
secs]

2016-01-11T23:16:05.149+0000: 784.219: [GC 784.219: [ParNew:
274263K->1685K(306688K), 54.8993730 secs] 793174K->520803K(8354560K),
54.8994450 secs] [Times: user=218.86 sys=0.03, real=54.90 secs]

2016-01-11T23:17:01.095+0000: 840.165: [GC 840.165: [ParNew:
274325K->1896K(306688K), 56.4208930 secs] 793443K->521139K(8354560K),
56.4209750 secs] [Times: user=224.88 sys=0.05, real=56.42 secs]

2016-01-11T23:17:59.024+0000: 898.093: [GC 898.093: [ParNew:
274536K->1705K(306688K), 58.1100630 secs] 793779K->521093K(8354560K),
58.1101400 secs] [Times: user=231.75 sys=0.05, real=58.12 secs]

2016-01-11T23:18:58.240+0000: 957.310: [GC 957.310: [ParNew:
274345K->1483K(306688K), 64.2820420 secs] 793733K->521047K(8354560K),
64.2821180 secs] [Times: user=241.93 sys=0.06, real=64.28 secs]

2016-01-11T23:20:03.571+0000: 1022.640: [GC 1022.640: [ParNew:
274123K->1379K(306688K), 61.5305280 secs] 793687K->521097K(8354560K),
61.5305990 secs] [Times: user=245.72 sys=0.01, real=61.53 secs]

2016-01-11T23:21:06.194+0000: 1085.263: [GC 1085.263: [ParNew:
274019K->1508K(306688K), 63.4433440 secs] 793737K->521372K(8354560K),
63.4434240 secs] [Times: user=253.33 sys=0.02, real=63.44 secs]

2016-01-11T23:22:10.413+0000: 1149.482: [GC 1149.483: [ParNew:
274148K->1313K(306688K), 65.6956010 secs] 794012K->521330K(8354560K),
65.6956660 secs] [Times: user=262.01 sys=0.05, real=65.69 secs]

Heap

 par new generation   total 306688K, used 132112K [0x00000005f5a00000,
0x000000060a6c0000, 0x000000060a6c0000)

  eden space 272640K,  47% used [0x00000005f5a00000, 0x00000005fd9bbba0,
0x0000000606440000)

  from space 34048K,   3% used [0x0000000606440000, 0x00000006065884a8,
0x0000000608580000)

  to   space 34048K,   0% used [0x0000000608580000, 0x0000000608580000,
0x000000060a6c0000)

 concurrent mark-sweep generation total 8047872K, used 520016K
[0x000000060a6c0000, 0x00000007f5a00000, 0x00000007f5a00000)

 concurrent-mark-sweep perm gen total 38760K, used 25768K
[0x00000007f5a00000, 0x00000007f7fda000, 0x0000000800000000)
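The ParNew entries above show stop-the-world pauses of roughly 50-65 seconds of real time, which is longer than any plausible ZooKeeper session timeout, so the session expirations would follow directly from GC. A minimal sketch (not from the thread; the regex and the 6 s timeout are assumptions, 6000 ms being Kafka's default `zookeeper.session.timeout.ms`) for flagging such pauses:

```python
import re

# Matches the "real=NN.NN secs" field of a GC log entry, e.g.
# "[Times: user=218.86 sys=0.03, real=54.90 secs]"
REAL_PAUSE = re.compile(r"real=(\d+\.\d+) secs")

def pauses_exceeding(gc_log_text, session_timeout_secs):
    """Return the real-time GC pauses longer than the ZK session timeout."""
    pauses = [float(m.group(1)) for m in REAL_PAUSE.finditer(gc_log_text)]
    return [p for p in pauses if p > session_timeout_secs]

sample = """
2016-01-11T23:16:05.149+0000: 784.219: [GC 784.219: [ParNew:
274263K->1685K(306688K), 54.8993730 secs] 793174K->520803K(8354560K),
54.8994450 secs] [Times: user=218.86 sys=0.03, real=54.90 secs]
2016-01-11T23:22:10.413+0000: 1149.482: [GC 1149.483: [ParNew:
274148K->1313K(306688K), 65.6956010 secs] 794012K->521330K(8354560K),
65.6956660 secs] [Times: user=262.01 sys=0.05, real=65.69 secs]
"""

# Any pause longer than the 6 s default session timeout will expire
# the broker's ZK session.
print(pauses_exceeding(sample, 6.0))  # -> [54.9, 65.69]
```

Every pause in the pasted log clears that bar by nearly an order of magnitude.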



On Tue, Jan 12, 2016 at 6:34 PM, Mayuresh Gharat <gharatmayuresh15@gmail.com
> wrote:

> Can you paste the logs?
>
> Thanks,
>
> Mayuresh
>
> On Tue, Jan 12, 2016 at 4:58 PM, Dillian Murphey <cr...@gmail.com>
> wrote:
>
> > Possibly running more stable with 1.7 JVM.
> >
> > Can someone explain the Zookeeper session?  SHould it never expire,
> unless
> > the broker becomes unresponsive?  I set a massive timeout value in the
> > broker config far beyond the amount of time I see the zk expiration. Is
> > this entirely on the kafka side, or could zookeeper be doing something?
> > From my zk logs I didn't see anything unusual, just exceptions as a
> result
> > of the zk session expiring (my guess).
> >
> > tnx
> >
> > On Tue, Jan 12, 2016 at 3:05 PM, Dillian Murphey <
> crackshotmule@gmail.com>
> > wrote:
> >
> > > Our 2 node kafka cluster has become unhealthy.  We're running zookeeper
> > as
> > > a 3 node system, which very light load.
> > >
> > > What seems to be happening is in the controller log we get a ZK session
> > > expire message, and in the process of re-assigning the leader for the
> > > partitions (if I'm understanding this right, please correct me), the
> > broker
> > > goes offline and it interrupts our applications that are publishing
> > > messages.
> > >
> > > We don't see this in production, and kafka has been stable for months,
> > > since september.
> > >
> > > I've searched a lot and found some similiar complaints but no real
> > > solutions.
> > >
> > > I'm running 0.8.2 and JVM 1.6.X on ubuntu.
> > >
> > > Thanks for any ideas at all.
> > >
> > >
> >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>

Re: Frequent ZK session timeouts

Posted by Mayuresh Gharat <gh...@gmail.com>.
Can you paste the logs?

Thanks,

Mayuresh




-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: Frequent ZK session timeouts

Posted by Dillian Murphey <cr...@gmail.com>.
Possibly running more stable with the 1.7 JVM.

Can someone explain the ZooKeeper session?  Should it never expire, unless
the broker becomes unresponsive?  I set a massive timeout value in the
broker config, far beyond the amount of time at which I see the zk
expiration. Is this entirely on the Kafka side, or could ZooKeeper be doing
something? From my zk logs I didn't see anything unusual, just exceptions
as a result of the zk session expiring (my guess).

tnx
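One likely reason a "massive" broker-side timeout doesn't help: the ZooKeeper server negotiates the session timeout, clamping the client's requested value into the server-side range [minSessionTimeout, maxSessionTimeout], which defaults to [2 x tickTime, 20 x tickTime]. A sketch of that negotiation (the 2000 ms tickTime is ZooKeeper's default, assumed here, not taken from the thread):

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000):
    """ZooKeeper clamps a client's requested session timeout to the
    server-side defaults [2 * tickTime, 20 * tickTime]."""
    min_ms = 2 * tick_time_ms
    max_ms = 20 * tick_time_ms
    return max(min_ms, min(requested_ms, max_ms))

# With the default tickTime of 2000 ms, even a requested 300 s timeout
# is silently capped at 40 s -- still shorter than the 50-65 s GC
# pauses in the log above.
print(negotiated_session_timeout(300_000))  # -> 40000
```

So unless maxSessionTimeout is raised on the ZooKeeper servers themselves, setting a huge `zookeeper.session.timeout.ms` in the broker config has no effect past 20 x tickTime.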
