Posted to users@kafka.apache.org by Rajiv Kurian <ra...@signalfx.com> on 2015/12/14 21:32:54 UTC

Fallout from upgrading to kafka 0.9 from 0.8.2.3

I upgraded one of our Kafka clusters (9 nodes) from 0.8.2.3 to 0.9
following the instructions at
http://kafka.apache.org/documentation.html#upgrade

Most things seem to work fine based on our metrics. Something I noticed is
that the network out on 3 of the nodes goes up every 5-6 minutes. I see a
corresponding increase in the Kafka bytes out metric too.

I don't see any corresponding increase in the number of Kafka messages in,
Kafka bytes in, or even network in, so it seems like something inside
Kafka is generating the traffic. We don't know what could cause this
behavior, but it did not happen before the upgrade.

Here is part of our Kafka dashboard before the upgrade:
http://imgur.com/1nNM0f2

The above charts cover a duration of one hour. You can see that the network
in/out (measured by Collectd) and Kafka bytes in/out (measured by
the Kafka broker and extracted using JMX) track the messages in.


Here is our Kafka dashboard after the upgrade:
http://imgur.com/AjehzAD

You can see that every 5 minutes both the Kafka bytes out metric (measured
by the Kafka broker) and the network out metric (measured by Collectd) have
a spike (5x maybe). Has this been seen before?

I have verified through the broker logs that all of our brokers are
running Kafka 0.9 with inter.broker.protocol.version = 0.9.0.0
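For reference, the broker setting from the two-phase rolling upgrade can be sketched as a server.properties fragment (a sketch only; `0.8.2.X` is the placeholder used in the upgrade doc for the exact 0.8.2 version, and everything else in the file is site-specific):

```properties
# Phase 1: run the 0.9 code but keep speaking the 0.8.2 wire protocol
# so brokers that have not been bounced yet can still replicate.
inter.broker.protocol.version=0.8.2.X

# Phase 2: once every broker runs the 0.9 code, bounce each broker
# again with the new protocol version:
# inter.broker.protocol.version=0.9.0.0
```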

Our producers and consumers are older ones, including some that use the Java
wrapper around the SimpleConsumer.

I also see messages (possibly unrelated) of this form in the broker logs:

2015-12-14T20:18:00.018Z INFO  [kafka-request-handler-2            ]
[kafka.server.KafkaApis              ]: [KafkaApi-6] Close connection due
to error handling produce request with correlation id 294218 from client
id  with ack=0

It doesn't give me any details on what the error was exactly, and I don't
recall seeing these messages before the upgrade.
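With ack=0 the broker never returns a response to the producer, so a broker-side log line like the one above is the only trace of a failed send. A minimal sketch of the old (Scala-wrapper) producer's configuration, assuming the standard 0.8-era property names (the broker host names here are hypothetical), showing the one-line change that surfaces such errors to the client:

```java
import java.util.Properties;

public class AckConfigSketch {
    // Build a config for the old Scala-wrapper producer.
    // request.required.acks semantics:
    //   "0"  - broker sends no response; send errors are invisible to the
    //          client, which matches the silent "Close connection" log above.
    //   "1"  - broker acks after the leader write and can return an error
    //          code that the producer logs or throws.
    static Properties producerProps(String acks) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical hosts
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", acks);
        return props;
    }

    public static void main(String[] args) {
        // Switch from "0" to "1" to make failed produce requests visible.
        System.out.println(
            producerProps("1").getProperty("request.required.acks"));
    }
}
```

Switching to ack=1 costs one response per request, but a failed produce comes back with an error code instead of a silent connection close.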


So the problem (periodic bursts of network out / Kafka bytes out) only
happens on 3 of the 9 brokers that we upgraded. The CPU on the 3
brokers that exhibit the problem has also gone up:

https://imgur.com/nezPmQ6


Thanks,

Rajiv

Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Jun Rao <ju...@confluent.io>.
Rajiv,

Upgrading from 0.8.2.1 to 0.9.0.0 should also be fine. If you can reproduce
this issue in a test environment, that would be great. The following may be
helpful in figuring out the issue.

1. Using ack=1 instead of ack=0 will allow the producer to see the error
code when a send fails.
2. If the producer sends oversized messages, the network in rate will go
up, but the message in rate won't. You may want to check the
BytesRejectedPerSec JMX metric.
3. If the CPU is high on a broker, do a bit of profiling and figure out
where the CPU time is spent.
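Points 2 and 3 can be sketched as shell commands, assuming the broker was started with JMX enabled (e.g. JMX_PORT=9999) and `<broker-pid>` stands in for the broker's process id; the metric name follows the 0.9 broker's JMX naming:

```shell
# 2. Watch the rejected-bytes rate (rises when oversized messages are refused).
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec' \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --reporting-interval 5000

# 3. Rough CPU profile: find the hottest broker threads, then match their
# thread ids (shown as hex nids) against a stack dump.
top -H -p <broker-pid>            # per-thread CPU usage
printf '%x\n' <hot-thread-id>     # convert the busy TID to hex
jstack <broker-pid> | grep -A 20 'nid=0x<hex-tid>'
```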

Thanks,

Jun



On Thu, Dec 17, 2015 at 10:35 PM, Rajiv Kurian <ra...@signalfx.com> wrote:

> I was mistaken about the version. We were actually using 0.8.2.1 before
> upgrading to 0.9.
>
> On Thu, Dec 17, 2015 at 6:13 PM, Dana Powers <da...@gmail.com> wrote:
>
> > I don't have much to add on this, but q: what is version 0.8.2.3? I
> > thought the latest in the 0.8 series was 0.8.2.2?
> >
> > -Dana
> > On Dec 17, 2015 5:56 PM, "Rajiv Kurian" <ra...@signalfx.com> wrote:
> >
> > > Yes we are in the process of upgrading to the new producers. But the
> > > problem seems deeper than a compatibility issue. We have one
> > > environment where the old producers work with the new 0.9 broker.
> > > Further, when we reverted our messed up 0.9 environment to 0.8.2.3
> > > the problem with those topics didn't go away.
> > >
> > > Didn't see any ZK issues on the brokers. There were other topics on
> > > the very same brokers that didn't seem to be affected.
> > >
> > > On Thu, Dec 17, 2015 at 5:46 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Yes, the new java producer is available in 0.8.2.x and we
> > > > recommend people use that.
> > > >
> > > > Also, when those producers had the issue, were there any other
> > > > things weird in the broker (e.g., broker's ZK session expires)?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Dec 17, 2015 at 2:37 PM, Rajiv Kurian <ra...@signalfx.com>
> > > > wrote:
> > > >
> > > > > I can't think of anything special about the topics besides the
> > > > > clients being very old (Java wrappers over Scala).
> > > > >
> > > > > I do think it was using ack=0. But my guess is that the logging
> > > > > was done by the Kafka producer thread. My application itself was
> > > > > not getting exceptions from Kafka.
> > > > >
> > > > > On Thu, Dec 17, 2015 at 2:31 PM, Jun Rao <ju...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hmm, anything special with those 3 topics? Also, the broker
> > > > > > log shows that the producer uses ack=0, which means the
> > > > > > producer shouldn't get errors like leader not found. Could
> > > > > > you clarify on the ack used by the producer?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Dec 17, 2015 at 12:41 PM, Rajiv Kurian
> > > > > > <rajiv@signalfx.com> wrote:
> > > > > >
> > > > > > > The topic which stopped working had clients that were only
> > > > > > > using the old Java producer that is a wrapper over the
> > > > > > > Scala producer. Again it seemed to work perfectly in
> > > > > > > another of our realms where we have the same topics, same
> > > > > > > producers/consumers etc but with less traffic.
> > > > > > >
> > > > > > > On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao
> > > > > > > <ju...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Are you using the new java producer?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian
> > > > > > > > <rajiv@signalfx.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Jun,
> > > > > > > > > Answers inline:
> > > > > > > > >
> > > > > > > > > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao
> > > > > > > > > <jun@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Rajiv,
> > > > > > > > > >
> > > > > > > > > > Thanks for reporting this.
> > > > > > > > > >
> > > > > > > > > > 1. How did you verify that 3 of the topics are
> > > > > > > > > > corrupted? Did you use DumpLogSegments tool? Also,
> > > > > > > > > > is there a simple way to reproduce the corruption?
> > > > > > > > > >
> > > > > > > > > No I did not. The only reason I had to believe that was
> > > > > > > > > no writers could write to the topic. I have actually no
> > > > > > > > > idea what the problem was. I saw very frequent (much
> > > > > > > > > more than usual) messages of the form:
> > > > > > > > > INFO [kafka-request-handler-2] [kafka.server.KafkaApis]:
> > > > > > > > > [KafkaApi-6] Close connection due to error handling
> > > > > > > > > produce request with correlation id 294218 from client
> > > > > > > > > id  with ack=0
> > > > > > > > > and also messages of the form:
> > > > > > > > > INFO [kafka-network-thread-9092-0]
> > > > > > > > > [kafka.network.Processor]: Closing socket connection
> > > > > > > > > to /some ip
> > > > > > > > > The cluster was actually a critical one so I had no
> > > > > > > > > recourse but to revert the change (which like noted
> > > > > > > > > didn't fix things). I didn't have enough time to debug
> > > > > > > > > further. The only way I could fix it with my limited
> > > > > > > > > Kafka knowledge was (after reverting) deleting the
> > > > > > > > > topic and recreating it.
> > > > > > > > > I had updated a low priority cluster before that worked
> > > > > > > > > just fine. That gave me the confidence to upgrade this
> > > > > > > > > higher priority cluster which did NOT work out. So the
> > > > > > > > > only way for me to try to reproduce it is to try this
> > > > > > > > > on our larger clusters again. But it is critical that
> > > > > > > > > we don't mess up this high priority cluster so I am
> > > > > > > > > afraid to try again.
> > > > > > > > >
> > > > > > > > > > 2. As Lance mentioned, if you are using snappy, make
> > > > > > > > > > sure that you include the right snappy jar (1.1.1.7).
> > > > > > > > > >
> > > > > > > > > Wonder why I don't see Lance's email in this thread.
> > > > > > > > > Either way we are not using compression of any kind on
> > > > > > > > > this topic.
> > > > > > > > >
> > > > > > > > > > 3. For the CPU issue, could you do a bit of profiling
> > > > > > > > > > to see which thread is busy and where it's spending
> > > > > > > > > > time?
> > > > > > > > > >
> > > > > > > > > Since I had to revert I didn't have the time to
> > > > > > > > > profile. Intuitively it would seem like the high number
> > > > > > > > > of client disconnects/errors and the increased network
> > > > > > > > > usage probably has something to do with the high CPU
> > > > > > > > > (total guess). Again our other (lower traffic) cluster
> > > > > > > > > that was upgraded was totally fine so it doesn't seem
> > > > > > > > > like it happens all the time.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian
> > > > > > > > > > <rajiv@signalfx.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > We had to revert to 0.8.3 because three of our
> > > > > > > > > > > topics seem to have gotten corrupted during the
> > > > > > > > > > > upgrade. As soon as we did the upgrade, producers
> > > > > > > > > > > to the three topics I mentioned stopped being able
> > > > > > > > > > > to do writes. The clients complained (occasionally)
> > > > > > > > > > > about leader not found exceptions. We restarted our
> > > > > > > > > > > clients and brokers but that didn't seem to help.
> > > > > > > > > > > Actually even after reverting to 0.8.3 these three
> > > > > > > > > > > topics were broken. To fix it we had to stop all
> > > > > > > > > > > clients, delete the topics, create them again and
> > > > > > > > > > > then restart the clients.
> > > > > > > > > > >
> > > > > > > > > > > I realize this is not a lot of info. I couldn't
> > > > > > > > > > > wait to get more debug info because the cluster was
> > > > > > > > > > > actually being used. Has anyone run into something
> > > > > > > > > > > like this? Are there any known issues with old
> > > > > > > > > > > consumers/producers? The topics that got busted had
> > > > > > > > > > > clients writing to them using the old Java wrapper
> > > > > > > > > > > over the Scala producer.
> > > > > > > > > > >
> > > > > > > > > > > Here are the steps I took to upgrade.
> > > > > > > > > > >
> > > > > > > > > > > For each broker:
> > > > > > > > > > >
> > > > > > > > > > > 1. Stop the broker.
> > > > > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > > > > > inter.broker.protocol.version=0.8.2.X
> > > > > > > > > > > 3. Wait for under replicated partitions to go down
> > > > > > > > > > > to 0.
> > > > > > > > > > > 4. Go to step 1.
> > > > > > > > > > >
> > > > > > > > > > > Once all the brokers were running the 0.9 code with
> > > > > > > > > > > inter.broker.protocol.version=0.8.2.X we restarted
> > > > > > > > > > > them one by one with
> > > > > > > > > > > inter.broker.protocol.version=0.9.0.0
> > > > > > > > > > >
> > > > > > > > > > > When reverting I did the following.
> > > > > > > > > > >
> > > > > > > > > > > For each broker:
> > > > > > > > > > >
> > > > > > > > > > > 1. Stop the broker.
> > > > > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > > > > > inter.broker.protocol.version=0.8.2.X
> > > > > > > > > > > 3. Wait for under replicated partitions to go down
> > > > > > > > > > > to 0.
> > > > > > > > > > > 4. Go to step 1.
> > > > > > > > > > >
> > > > > > > > > > > Once all the brokers were running 0.9 code with
> > > > > > > > > > > inter.broker.protocol.version=0.8.2.X I restarted
> > > > > > > > > > > them one by one with the 0.8.2.3 broker code. This
> > > > > > > > > > > however, like I mentioned, did not fix the three
> > > > > > > > > > > broken topics.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian
> > > > > > > > > > > <rajiv@signalfx.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Now that it has been a bit longer, the spikes I
> > > > > > > > > > > > was seeing are gone but the CPU and network
> > > > > > > > > > > > in/out on the three brokers that were showing the
> > > > > > > > > > > > spikes are still much higher than before the
> > > > > > > > > > > > upgrade. Their CPUs have increased from around
> > > > > > > > > > > > 1-2% to 12-20%. The network in on the same
> > > > > > > > > > > > brokers has gone up from under 2 Mb/sec to 19-33
> > > > > > > > > > > > Mb/sec. The network out has gone up from under 2
> > > > > > > > > > > > Mb/sec to 29-42 Mb/sec. I don't see a
> > > > > > > > > > > > corresponding increase in kafka messages in per
> > > > > > > > > > > > second or kafka bytes in per second JMX metrics.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Rajiv

Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
I was mistaken about the version. We were actually using 0.8.2.1 before
upgrading to 0.9.


Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Dana Powers <da...@gmail.com>.
I don't have much to add on this, but q: what is version 0.8.2.3? I thought
the latest in 0.8 series was 0.8.2.2?

-Dana
> > > > > > > > > 0.8.2.3 broker code. This however like I mentioned did not
> > fix
> > > > the
> > > > > > > three
> > > > > > > > > broken topics.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian <
> > > > rajiv@signalfx.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Now that it has been a bit longer, the spikes I was
> seeing
> > > are
> > > > > gone
> > > > > > > but
> > > > > > > > > > the CPU and network in/out on the three brokers that were
> > > > showing
> > > > > > the
> > > > > > > > > > spikes are still much higher than before the upgrade.
> Their
> > > > CPUs
> > > > > > have
> > > > > > > > > > increased from around 1-2% to 12-20%. The network in on
> the
> > > > same
> > > > > > > > brokers
> > > > > > > > > > has gone up from under 2 Mb/sec to 19-33 Mb/sec. The
> > network
> > > > out
> > > > > > has
> > > > > > > > gone
> > > > > > > > > > up from under 2 Mb/sec to 29-42 Mb/sec. I don't see a
> > > > > corresponding
> > > > > > > > > > increase in kafka messages in per second or kafka bytes
> in
> > > per
> > > > > > second
> > > > > > > > JMX
> > > > > > > > > > metrics.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Rajiv
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
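
For readers skimming the quoted steps above: both the upgrade and the revert hinge on a single broker setting, changed in two rolling phases. A minimal server.properties sketch (values taken verbatim from the steps quoted above; "0.8.2.X" stands for the exact 0.8.2 version in use):

```properties
# Phase 1: every broker runs the 0.9 code but still speaks the old
# inter-broker protocol. Restart brokers one at a time, waiting for
# under-replicated partitions to return to 0 between restarts.
inter.broker.protocol.version=0.8.2.X

# Phase 2: once all brokers run the 0.9 code, switch the protocol
# version and do a second rolling restart.
inter.broker.protocol.version=0.9.0.0
```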

Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
Yes, we are in the process of upgrading to the new producers. But the
problem seems deeper than a compatibility issue: we have one environment
where the old producers work fine with the new 0.9 broker. Furthermore,
when we reverted our broken 0.9 environment to 0.8.2.3, the problem with
those topics didn't go away.

We didn't see any ZK issues on the brokers, and other topics on the very
same brokers seemed unaffected.


Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Jun Rao <ju...@confluent.io>.
Yes, the new Java producer is available in 0.8.2.x and we recommend that
people use it.

Also, when those producers hit the issue, was anything else unusual
happening on the broker (e.g., the broker's ZK session expiring)?

Thanks,

Jun
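
For reference, the new Java producer mentioned here
(org.apache.kafka.clients.producer.KafkaProducer) surfaces send failures
through futures and callbacks instead of dropping them silently. A hedged
sketch of its minimal configuration (host names are placeholders):

```properties
# New Java producer configuration (shipped since 0.8.2)
bootstrap.servers=broker1:9092,broker2:9092
# "acks" replaces the old producer's request.required.acks: 0, 1, or "all"
acks=1
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
```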


Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
I can't think of anything special about the topics, aside from the clients
being very old (Java wrappers over the Scala clients).

I do think the producer was using ack=0, but my guess is that the logging
was done by the Kafka producer thread; my application itself was not
getting exceptions from Kafka.
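
Context for the ack=0 point: the old Scala/Java producer sets its
acknowledgement level with request.required.acks, and with 0 the broker
sends no response at all, so send errors never reach the application and
show up only in broker logs. A sketch of the relevant producer properties
(broker list and serializer are illustrative):

```properties
# Old (Scala-based) producer, 0.8.x era
metadata.broker.list=broker1:9092,broker2:9092
serializer.class=kafka.serializer.StringEncoder
#  0 = fire-and-forget: no broker response, errors invisible to the client
#  1 = wait for the partition leader's acknowledgement
# -1 = wait for all in-sync replicas
request.required.acks=0
```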


Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Jun Rao <ju...@confluent.io>.
Hmm, anything special about those 3 topics? Also, the broker log shows that
the producer uses ack=0, in which case the producer shouldn't get errors
like "leader not found". Could you clarify which ack setting the producer
uses?

Thanks,

Jun
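
The "Close connection due to error handling produce request ... with ack=0"
lines discussed here are logged at INFO level and easy to miss; tallying
them per minute shows whether their rate jumped after the upgrade. A
minimal, self-contained sketch (the log format is assumed to match the
samples quoted in this thread):

```python
import re
from collections import Counter

# Matches broker log lines like the one quoted in this thread:
# 2015-12-14T20:18:00.018Z INFO  [kafka-request-handler-2 ] ...
# Close connection due to error handling produce request ...
CLOSE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2})"  # timestamp, truncated to the minute
    r".*Close connection due to error handling produce request"
)

def closes_per_minute(lines):
    """Count 'Close connection ... produce request' log lines per minute."""
    counts = Counter()
    for line in lines:
        m = CLOSE_RE.match(line)
        if m:
            counts[m.group("minute")] += 1
    return counts

sample = [
    "2015-12-14T20:18:00.018Z INFO  [kafka-request-handler-2] "
    "[kafka.server.KafkaApis]: [KafkaApi-6] Close connection due to error "
    "handling produce request with correlation id 294218 from client id  with ack=0",
    "2015-12-14T20:18:05.100Z INFO  [kafka-network-thread-9092-0] "
    "[kafka.network.Processor]: Closing socket connection to /10.0.0.1",
    "2015-12-14T20:19:01.000Z INFO  [kafka-request-handler-2] "
    "[kafka.server.KafkaApis]: [KafkaApi-6] Close connection due to error "
    "handling produce request with correlation id 294219 from client id  with ack=0",
]
print(closes_per_minute(sample))  # per-minute counts for 20:18 and 20:19
```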

On Thu, Dec 17, 2015 at 12:41 PM, Rajiv Kurian <ra...@signalfx.com> wrote:

> The topic which stopped working had clients that were only using the old
> Java producer that is a wrapper over the Scala producer. Again it seemed to
> work perfectly in another of our realms where we have the same topics, same
> producers/consumers etc but with less traffic.
>
> On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Are you using the new java producer?
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian <ra...@signalfx.com>
> wrote:
> >
> > > Hi Jun,
> > > Answers inline:
> > >
> > > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Rajiv,
> > > >
> > > > Thanks for reporting this.
> > > >
> > > > 1. How did you verify that 3 of the topics are corrupted? Did you use
> > > > DumpLogSegments tool? Also, is there a simple way to reproduce the
> > > > corruption?
> > > >
> > > No I did not. The only reason I had to believe that was no writers
> could
> > > write to the topic. I have actually no idea what the problem was. I saw
> > > very frequent (much more than usual) messages of the form:
> > > INFO  [kafka-request-handler-2            ] [kafka.server.KafkaApis
> > >       ]: [KafkaApi-6] Close connection due to error handling produce
> > > request with correlation id 294218 from client id  with ack=0
> > > and also message of the form:
> > > INFO  [kafka-network-thread-9092-0        ] [kafka.network.Processor

Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
The topics that stopped working had clients that were only using the old
Java producer, which is a wrapper over the Scala producer. Again, everything
seemed to work perfectly in another of our realms where we have the same
topics and the same producers/consumers, but with less traffic.

On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <ju...@confluent.io> wrote:

> Are you using the new java producer?

Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Jun Rao <ju...@confluent.io>.
Are you using the new java producer?

Thanks,

Jun


Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
Hi Jun,
Answers inline:

On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <ju...@confluent.io> wrote:

> Rajiv,
>
> Thanks for reporting this.
>
> 1. How did you verify that 3 of the topics are corrupted? Did you use the
> DumpLogSegments tool? Also, is there a simple way to reproduce the
> corruption?
>
No, I did not. The only reason I had to believe that was that no writers
could write to those topics. I actually have no idea what the problem was. I
saw very frequent (far more than usual) messages of the form:
INFO  [kafka-request-handler-2            ] [kafka.server.KafkaApis
      ]: [KafkaApi-6] Close connection due to error handling produce
request with correlation id 294218 from client id  with ack=0
and also messages of the form:
INFO  [kafka-network-thread-9092-0        ] [kafka.network.Processor
      ]: Closing socket connection to /some ip
The cluster was a critical one, so I had no recourse but to revert
the change (which, as noted, didn't fix things). I didn't have enough time
to debug further. The only way I could fix it, with my limited Kafka
knowledge, was (after reverting) to delete the topics and recreate them.
I had previously upgraded a low-priority cluster, and that worked just fine.
That gave me the confidence to upgrade this higher-priority cluster, which
did NOT work out. So the only way for me to try to reproduce it is to try
this on our larger clusters again. But it is critical that we don't mess up
this high-priority cluster, so I am afraid to try again.

> 2. As Lance mentioned, if you are using snappy, make sure that you include
> the right snappy jar (1.1.1.7).
>
I wonder why I don't see Lance's email in this thread. Either way, we are not
using compression of any kind on this topic.

> 3. For the CPU issue, could you do a bit of profiling to see which thread is
> busy and where it's spending time?
>
Since I had to revert, I didn't have time to profile. Intuitively, it
would seem that the high number of client disconnects/errors and the
increased network usage probably have something to do with the high CPU
(a total guess). Again, our other (lower-traffic) cluster that was upgraded
was totally fine, so it doesn't seem to happen all the time.
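For what it's worth, a low-effort way to do the profiling Jun suggests, should this recur, is the classic top/jstack pairing: `top -H` shows per-thread CPU with decimal thread ids, while jstack tags each thread with a hex `nid`, so the only glue needed is a decimal-to-hex conversion (the pid and tid values below are examples, not from this incident):

```shell
# 1. Find the hottest native thread of the broker JVM (example pid):
#      top -H -p <broker-pid>
# 2. top -H prints thread ids in decimal; jstack tags threads with a hex
#    nid, so convert (12345 is an example tid taken from top -H):
printf 'nid=0x%x\n' 12345
# 3. Grep a thread dump of the broker for that nid to see what the hot
#    thread is doing:
#      jstack <broker-pid> | grep -A 20 'nid=0x3039'
```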

>

Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Jun Rao <ju...@confluent.io>.
Rajiv,

Thanks for reporting this.

1. How did you verify that 3 of the topics are corrupted? Did you use the
DumpLogSegments tool? Also, is there a simple way to reproduce the
corruption?
2. As Lance mentioned, if you are using snappy, make sure that you include
the right snappy jar (1.1.1.7).
3. For the CPU issue, could you do a bit of profiling to see which thread is
busy and where it's spending time?

Jun
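(For question 1, the check can be run against a suspect partition's segment files on the broker, along these lines — the topic name and path below are placeholders; point it at a segment under your own log.dirs:)

```shell
# Dump a suspect partition's log segment to check for corruption.
# /var/kafka-logs/mytopic-0 is a placeholder path; substitute a real
# partition directory under your broker's log.dirs.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration \
  --files /var/kafka-logs/mytopic-0/00000000000000000000.log
```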



Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Lance Laursen <ll...@rubiconproject.com>.
Hey Rajiv,

Are you using snappy compression?


Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
We had to revert to 0.8.2.3 because three of our topics seem to have gotten
corrupted during the upgrade. As soon as we did the upgrade, producers to
the three topics I mentioned stopped being able to write. The clients
complained (occasionally) about leader-not-found exceptions. We restarted
our clients and brokers, but that didn't seem to help. Even after reverting
to 0.8.2.3, these three topics were still broken. To fix it we had to stop
all clients, delete the topics, create them again, and then restart the
clients.

I realize this is not a lot of info. I couldn't wait to get more debug info
because the cluster was actively being used. Has anyone run into something
like this? Are there any known issues with old consumers/producers? The
topics that got busted had clients writing to them using the old Java
wrapper over the Scala producer.

Here are the steps I took to upgrade.

For each broker:

1. Stop the broker.
2. Restart with the 0.9 broker code running with
inter.broker.protocol.version=0.8.2.X
3. Wait for under-replicated partitions to go down to 0.
4. Go back to step 1 for the next broker.

Once all the brokers were running the 0.9 code with
inter.broker.protocol.version=0.8.2.X, we restarted them one by one with
inter.broker.protocol.version=0.9.0.0

When reverting I did the following.

For each broker:

1. Stop the broker.
2. Restart with the 0.9 broker code running with
inter.broker.protocol.version=0.8.2.X
3. Wait for under-replicated partitions to go down to 0.
4. Go back to step 1 for the next broker.

Once all the brokers were running the 0.9 code with
inter.broker.protocol.version=0.8.2.X, I restarted them one by one with the
0.8.2.3 broker code. This, however, as I mentioned, did not fix the three
broken topics.
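(The per-broker loop above can be sketched roughly as follows. This is a hypothetical outline, not real Kafka tooling: broker-ids.txt, set-protocol-version.sh, and under-replicated-count.sh stand in for whatever your own deployment scripts provide.)

```shell
# Hypothetical sketch of the rolling bounce described above. The helper
# scripts stand in for your own deployment tooling; they are not Kafka's.
for broker in $(cat broker-ids.txt); do
  ssh "$broker" 'systemctl stop kafka'
  # Install the 0.9 binaries but keep speaking the old wire protocol:
  ssh "$broker" './set-protocol-version.sh 0.8.2.X && systemctl start kafka'
  # Wait for UnderReplicatedPartitions to drain back to 0 before moving on.
  while [ "$(./under-replicated-count.sh)" -ne 0 ]; do sleep 10; done
done
# Only after every broker is on the 0.9 code: repeat the same loop,
# this time setting inter.broker.protocol.version=0.9.0.0
```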



Re: Fallout from upgrading to kafka 0.9 from 0.8.2.3

Posted by Rajiv Kurian <ra...@signalfx.com>.
Now that it has been a bit longer, the spikes I was seeing are gone, but the
CPU and network in/out on the three brokers that were showing the spikes
are still much higher than before the upgrade. Their CPU usage has increased
from around 1-2% to 12-20%. Network in on the same brokers has gone up
from under 2 Mb/sec to 19-33 Mb/sec. Network out has gone up from under
2 Mb/sec to 29-42 Mb/sec. I don't see a corresponding increase in the Kafka
messages-in or bytes-in per-second JMX metrics.

Thanks,
Rajiv
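(For anyone trying to correlate spikes like these, the broker-side counters can be polled outside of a full metrics pipeline with Kafka's bundled JmxTool, roughly as below. The host/port are placeholders, and this assumes the broker was started with remote JMX enabled, e.g. JMX_PORT=9999.)

```shell
# Poll the broker's bytes-out rate every 5 seconds via the bundled JmxTool.
# localhost:9999 is a placeholder; the broker must have been started with
# JMX enabled (e.g. JMX_PORT=9999).
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec' \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --reporting-interval 5000
```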