Posted to users@kafka.apache.org by Christofer Hedbrandh <ch...@knewton.com> on 2015/07/09 16:58:16 UTC

Leap Second Troubles

Hi Kafka users,

ZooKeeper in our staging environment was running on a very old Ubuntu
version that was exposed to the "leap second causes spuriously high CPU
usage" bug.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285

As a result, when the leap second arrived, the ZooKeeper CPU usage went up
to 100% and stayed there. In response to this, we restarted one ZooKeeper
process. The ZooKeeper restart unfortunately made the situation much worse
as we hit three different (possibly related) Kafka problems. We are using
Kafka 0.8.2 brokers, consumers and producers.


#1
One of our three brokers was kicked out of the ISR for some (most but not
all) partitions, and was continuously logging "Cached zkVersion [XX] not
equal to that in zookeeper, skip updating ISR" over and over (until I
eventually stopped this broker).

INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition
[topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal to
that in zookeeper, skip updating ISR (kafka.cluster.Partition)
INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition
[topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal to
that in zookeeper, skip updating ISR (kafka.cluster.Partition)
INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition
[topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal to
that in zookeeper, skip updating ISR (kafka.cluster.Partition)
etc.

Searching the users@kafka.apache.org archive and Googling for this log
output gives me similar descriptions, but nothing that exactly matches this.
It is very similar to this thread, but without the "ERROR Conditional
update of path ..." part.
https://www.mail-archive.com/users@kafka.apache.org/msg07044.html
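
To make the log line concrete: the broker only updates the partition state
znode if the zkVersion it cached from its last read still matches what is
in ZooKeeper. Below is a minimal, hypothetical Java sketch of such a
version-checked update using the plain ZooKeeper client API (illustrative
only, not Kafka's actual ZkUtils code); when another writer has bumped the
znode in the meantime, the write fails with BadVersion and the update is
skipped, which is the message we were seeing.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: a conditional (version-checked) znode update.
public class ConditionalUpdateSketch {
    public static boolean tryUpdate(ZooKeeper zk, String path,
                                    byte[] newData, int cachedZkVersion)
            throws KeeperException, InterruptedException {
        try {
            // Succeeds only if the znode's current version still equals
            // cachedZkVersion; otherwise ZooKeeper rejects the write.
            zk.setData(path, newData, cachedZkVersion);
            return true;
        } catch (KeeperException.BadVersionException e) {
            // "Cached zkVersion not equal to that in zookeeper" -> skip.
            return false;
        }
    }
}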


#2
The remaining two brokers were logging the following every five seconds or so:
INFO conflict in /brokers/ids/xxx data:
{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
stored data:
{"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
(kafka.utils.ZkUtils$)
INFO I wrote this conflicted ephemeral node
[{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
at /brokers/ids/xxx a while back in a different session, hence I will
backoff for this node to be deleted by Zookeeper and retry
(kafka.utils.ZkUtils$)

It sounds very much like we hit this bug
https://issues.apache.org/jira/browse/KAFKA-1387
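
For context, a broker registers itself as an ephemeral znode under
/brokers/ids/<id>, and the "backoff ... and retry" in the log is the broker
waiting for the node from its previous (expired) session to disappear. A
rough, hypothetical Java sketch of that pattern using the plain ZooKeeper
client API (the path and sleep time are illustrative; this is not Kafka's
actual ZkUtils code):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: re-register an ephemeral node after a session
// change, backing off while the old session's node still exists.
public class EphemeralRegistrationSketch {
    public static void register(ZooKeeper zk, String path, byte[] data)
            throws KeeperException, InterruptedException {
        while (true) {
            try {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                return; // registered under the current session
            } catch (KeeperException.NodeExistsException e) {
                // The node written by the previous (expired) session has
                // not been cleaned up yet; back off and retry, as the
                // broker log describes.
                Thread.sleep(1000);
            }
        }
    }
}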


#3
The most serious resulting issue was that some consumer groups failed to
claim all partitions. When using the ConsumerOffsetChecker, the owner of
some partitions was listed as "none", the lag was constantly increasing,
and it was clear that no consumers were processing these messages.

It is exactly what Dave Hamilton describes here, but no one in that email
thread seems to know what caused it.
https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
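
For reference, the 0.8.x high-level consumer records partition ownership as
child znodes under /consumers/<group>/owners/<topic>, so a rough way to spot
unowned partitions, independent of ConsumerOffsetChecker, is something like
the following hypothetical Java sketch (the group, topic, and partition
count are placeholders):

import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: list owner znodes and report partitions nobody owns.
public class OwnerCheckSketch {
    public static void printUnownedPartitions(ZooKeeper zk, String group,
                                              String topic, int partitionCount)
            throws KeeperException, InterruptedException {
        String ownersPath = "/consumers/" + group + "/owners/" + topic;
        // Each child znode is a partition id claimed by some consumer.
        List<String> owned = zk.getChildren(ownersPath, false);
        for (int p = 0; p < partitionCount; p++) {
            if (!owned.contains(String.valueOf(p))) {
                System.out.println("partition " + p + " has no owner");
            }
        }
    }
}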

It may be reasonable to assume that the consumer rebalance failures we also
saw have something to do with this. But why the rebalance failed is still
unclear.

ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4
retries
at
kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
at
kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
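
The "after 4 retries" matches the old consumer's rebalance.max.retries
default. Purely as a mitigation (it does not fix the underlying ZooKeeper
trouble), raising the retries and backoff can help a group ride out a slow
or flapping ensemble; a hypothetical Java sketch with illustrative values:

import java.util.Properties;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

// Hypothetical sketch: an 0.8.x high-level consumer configured with more
// generous rebalance retries/backoff than the defaults.
public class ConsumerSetupSketch {
    public static ConsumerConnector connect(String zkConnect, String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zkConnect);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "10000");
        props.put("rebalance.max.retries", "8");   // default is 4
        props.put("rebalance.backoff.ms", "4000"); // wait longer between tries
        return kafka.consumer.Consumer.createJavaConsumerConnector(
                new ConsumerConfig(props));
    }
}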


I am curious to hear whether anyone else has had problems similar to this.

And can anyone say whether these are all known bugs that are being tracked
under some ticket number?


Thanks,
Christofer

P.S. Eventually, after ZooKeeper, Kafka broker, and consumer restarts,
everything returned to normal.

Re: Leap Second Troubles

Posted by Todd Palino <tp...@gmail.com>.
OK, in that case I think you ran into issues that were a natural result of
the ZooKeeper ensemble having very high CPU usage. Unfortunate, but this
would not be an unexpected situation when your ZK ensemble is having
significant problems.

-Todd


On Fri, Jul 10, 2015 at 10:21 AM, Christofer Hedbrandh <
christofer@knewton.com> wrote:

> Todd, the Kafka problems started when one of three ZooKeeper nodes was
> restarted.
>
> On Thu, Jul 9, 2015 at 12:10 PM, Todd Palino <tp...@gmail.com> wrote:
>
> > Did you hit the problems in the Kafka brokers and consumers during the
> > Zookeeper problem, or after you had already cleared it?
> >
> > For us, we decided to skip the leap second problem (even though we're
> > supposedly on a version that doesn't have that bug) by shutting down ntpd
> > everywhere and then allowing it to slowly adjust the time afterwards
> > without sending the leap second.
> >
> > -Todd
> >
> >
> > On Thu, Jul 9, 2015 at 7:58 AM, Christofer Hedbrandh <
> > christofer@knewton.com
> > > wrote:
> >
> > > Hi Kafka users,
> > >
> > > ZooKeeper in our staging environment was running on a very old ubuntu
> > > version, that was exposed to the "leap second causes spuriously high
> CPU
> > > usage" bug.
> > >
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285
> > >
> > > As a result, when the leap second arrived, the ZooKeeper CPU usage went
> > up
> > > to 100% and stayed there. In response to this, we restarted one
> ZooKeeper
> > > process. The ZooKeeper restart unfortunately made the situation much
> > worse
> > > as we hit three different (possibly related) Kafka problems. We are
> using
> > > Kafka 0.8.2 brokers, consumers and producers.
> > >
> > >
> > > #1
> > > One of our three brokers was kicked out or ISR for some (most but not
> > all)
> > > partitions, and was continuously logging "Cached zkVersion [XX] not
> equal
> > > to that in zookeeper, skip updating ISR" over and over (until I
> > eventually
> > > stopped this broker).
> > >
> > > INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition
> > > [topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not
> equal
> > to
> > > that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition
> > > [topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not
> equal
> > to
> > > that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition
> > > [topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
> > > INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not
> equal
> > to
> > > that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > > etc.
> > >
> > > Searching the users@kafka.apache.org archive and Googling for this log
> > > output, gives me similar descriptions but nothing that exactly
> describes
> > > this.
> > > It is very similar to this, but without the "ERROR Conditional update
> of
> > > path ..." part.
> > > https://www.mail-archive.com/users@kafka.apache.org/msg07044.html
> > >
> > >
> > > #2
> > > The remaining two brokers were logging this every five seconds or so.
> > > INFO conflict in /brokers/ids/xxx data:
> > >
> > >
> >
> {"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
> > > stored data:
> > >
> > >
> >
> {"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
> > > (kafka.utils.ZkUtils$)
> > > INFO I wrote this conflicted ephemeral node
> > >
> > >
> >
> [{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
> > > at /brokers/ids/xxx a while back in a different session, hence I will
> > > backoff for this node to be deleted by Zookeeper and retry
> > > (kafka.utils.ZkUtils$)
> > >
> > > It sounds very much like we hit this bug
> > > https://issues.apache.org/jira/browse/KAFKA-1387
> > >
> > >
> > > #3
> > > The most serious issue that resulted was that some consumer groups
> failed
> > > to claim all partitions. When using the ConsumerOffsetChecker, the
> owner
> > of
> > > some partitions was listed as "none", the lag was constantly
> increasing,
> > > and it was clear that no consumers were processing these messages.
> > >
> > > It is exactly what Dave Hamilton is describing here, but from this
> email
> > > chain no one seems to know what caused it.
> > > https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
> > >
> > > It may be reasonable to assume that the consumer rebalance failures we
> > also
> > > saw has something to do with this. But why the rebalance failed is
> still
> > > unclear.
> > >
> > > ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
> > > kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance
> after
> > 4
> > > retries
> > > at
> > >
> > >
> >
> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
> > > at
> > >
> > >
> >
> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
> > >
> > >
> > > I am curious to hear if anyone else had similar problems to this?
> > >
> > > And also if anyone can say if these are all known bugs that are being
> > > tracked with some ticket number?
> > >
> > >
> > > Thanks,
> > > Christofer
> > >
> > > P.S. Eventually after ZooKeeper and Kafka broker and consumer restarts
> > > everything returned to normal.
> > >
> >
>

Re: Leap Second Troubles

Posted by Christofer Hedbrandh <ch...@knewton.com>.
Todd, the Kafka problems started when one of three ZooKeeper nodes was
restarted.

On Thu, Jul 9, 2015 at 12:10 PM, Todd Palino <tp...@gmail.com> wrote:

> Did you hit the problems in the Kafka brokers and consumers during the
> Zookeeper problem, or after you had already cleared it?
>

Re: Leap Second Troubles

Posted by Todd Palino <tp...@gmail.com>.
Did you hit the problems in the Kafka brokers and consumers during the
Zookeeper problem, or after you had already cleared it?

For us, we decided to skip the leap second problem (even though we're
supposedly on a version that doesn't have that bug) by shutting down ntpd
everywhere and then allowing it to slowly adjust the time afterwards
without sending the leap second.

-Todd

