Posted to users@kafka.apache.org by Marcin Michalski <mm...@tagged.com> on 2014/04/09 20:18:19 UTC

Upgrading from 0.8.0 to 0.8.1 one broker at a time issues

Hi, has anyone successfully upgraded their Kafka cluster from 0.8.0 to 0.8.1
one broker at a time on a live cluster?

I am seeing strange behavior where many of my Kafka topics become unusable
(by both consumers and producers). When that happens, I see lots of errors
in the server logs that look like this:

[2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with
correlation id 2455 from client ReplicaFetcherThread-15-1007 on partition
[risk,0] failed due to Topic risk either doesn't exist or is in the process
of being deleted (kafka.server.KafkaApis)
[2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with
correlation id 2455 from client ReplicaFetcherThread-7-1007 on partition
[message,0] failed due to Topic message either doesn't exist or is in the
process of being deleted (kafka.server.KafkaApis)

When I try to consume a message from one of the topics that the above
warning says does not exist, I get the exception below:

....topic message --from-beginning
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
[2014-04-09 10:40:30,571] WARN
[console-consumer-90716_dkafkadatahub07.tag-dev.com-1397065229615-7211ba72-leader-finder-thread],
Failed to add leader for partitions [message,0]; will retry
(kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
kafka.common.UnknownTopicOrPartitionException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at java.lang.Class.newInstance0(Class.java:355)
at java.lang.Class.newInstance(Class.java:308)
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:79)
at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:167)
at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:179)
at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:174)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:174)
at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:86)
at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:76)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:76)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
----------
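
For completeness, the consumer I am running above is just the stock console
consumer; a minimal form of the command (the ZooKeeper host below is a
placeholder, not my actual one) is:

bin/kafka-console-consumer.sh --zookeeper zk1.example.com:2181 \
    --topic message --from-beginning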

*More details about my issues:*
The environment where I am testing the upgrade is 4 physical servers running
2 brokers each, with the controlled shutdown feature enabled. When I shut
down the 2 brokers on one of the existing Kafka 0.8.0 machines, upgrade that
machine to 0.8.1, and restart it, all is fine for a bit. Once the new brokers
came up, I ran kafka-preferred-replica-election.sh to make sure that the
restarted brokers become leaders of existing topics. The replication factor
on the topics is set to 4. I tested both producing and consuming messages
against brokers that were leaders on Kafka 0.8.0 and on 0.8.1, and no issues
were encountered.
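
For reference, the per-machine sequence I follow is roughly the sketch below;
the ZooKeeper connect string is a placeholder, and controlled.shutdown.enable
is the 0.8.1 broker property (the 0.8.0 brokers are shut down with whatever
controlled-shutdown mechanism that version provides):

# server.properties on the upgraded (0.8.1) brokers, so that a normal stop
# migrates partition leadership away before the broker exits
controlled.shutdown.enable=true

# per physical machine:
#   1. stop both brokers on the machine (controlled shutdown)
#   2. install the 0.8.1 binaries, keeping broker.id and log.dirs unchanged
#   3. start both brokers again
#   4. move preferred leaders back onto the restarted brokers:
bin/kafka-preferred-replica-election.sh --zookeeper zk1.example.com:2181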

Later, I tried to perform a controlled shutdown of the 2 additional brokers
on the Kafka server that still has 0.8.0 installed, and after those brokers
shut down and new leaders were assigned, all of my server logs started
filling up with the above exceptions and most of my topics are not usable. I
pulled and built the 0.8.1 Kafka code from git last Thursday, so I should be
pretty much up to date. So I am not sure if I am doing something wrong or if
migrating from 0.8.0 to 0.8.1 on a live cluster one server at a time is not
supported. Is there a recommended migration approach that one should take
when migrating a live 0.8.0 cluster to 0.8.1?

The leader of one of the topics that became unusable is the broker that was
successfully upgraded to 0.8.1:
Topic:message   PartitionCount:1        ReplicationFactor:4     Configs:
        Topic: message  Partition: 0   * Leader: 1007 *   Replicas:
1007,8,9,1001 Isr: 1001,1007,8
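
That output comes from describing the topic; with the 0.8.1 tools the
command is along the lines of (ZooKeeper host again a placeholder):

bin/kafka-topics.sh --describe --zookeeper zk1.example.com:2181 --topic message

Note that broker 9 is in the replica list but has already dropped out of the
ISR here.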

Brokers 9 and 1009 were the ones shut down on the physical server that still
had Kafka 0.8.0 installed when these problems started occurring (I was
planning to upgrade them to 0.8.1). The only way I can recover from this
state is to shut down all brokers, delete all of the Kafka topic logs plus
the Kafka directories in ZooKeeper, and start with a new cluster.
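
Concretely, the reset I end up doing looks roughly like this; the data
directory and ZooKeeper paths are placeholders for my own layout (no
chroot), and it obviously destroys all topic data:

# on every broker, after stopping it:
rm -rf /data/kafka-logs/*        # whatever log.dirs points at

# then, inside ZooKeeper's zkCli.sh, remove Kafka's znodes:
rmr /brokers
rmr /controller
rmr /controller_epoch
rmr /admin
rmr /config
rmr /consumers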


Your help in this matter is greatly appreciated.

Thanks,
Martin

Re: Upgrading from 0.8.0 to 0.8.1 one broker at a time issues

Posted by Marcin Michalski <mm...@tagged.com>.
I did not see any ZooKeeper session expirations. However, I was able to
perform a live upgrade on my local Mac OS X machine, where I had three 0.8.0
brokers running, took them down one at a time, upgraded them to 0.8.1, and
did not encounter this issue. However, in my stage environment I have 8
brokers running across 4 nodes with 300 topics, a replication factor of 4,
and various partitioning settings for each topic, so it is much harder to
pinpoint the cause of this issue. I keep running into this problem every
single time I try to upgrade. Maybe this is configuration related? My stage
env has much more complicated broker config files.

Thanks,
Martin



Re: Upgrading from 0.8.0 to 0.8.1 one broker at a time issues

Posted by Jun Rao <ju...@gmail.com>.
One should be able to upgrade from 0.8 to 0.8.1 one broker at a time
online. There are some corner cases that we are trying to patch in 0.8.1.1,
which will be released soon.

As for your issue, I am not sure what happened. Do you see any ZK session
expirations in the broker log?
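
A minimal way to check, assuming the default log layout, is to grep the
broker's server log for the session-expiry message printed by the ZooKeeper
client (the path is a placeholder):

grep -i "expired" /path/to/kafka/logs/server.log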

Thanks,

Jun



Re: Upgrading from 0.8.0 to 0.8.1 one broker at a time issues

Posted by Marcin Michalski <mm...@tagged.com>.
I see that the state-change logs have warning messages of this kind (broker
7 is running the 0.8.1 code, and this is a log snippet from that broker):
s associated leader epoch 11 is old. Current leader epoch is 11
(state.change.logger)
[2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request from
controller 1001 with correlation id 0 epoch 7 for partition
[pets_nec_buygold,0] since its associated leader epoch 12 is old. Current
leader epoch is 12 (state.change.logger)
[2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request from
controller 1001 with correlation id 0 epoch 7 for partition
[cafe_notification,0] since its associated leader epoch 11 is old. Current
leader epoch is 11 (state.change.logger)
[2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower state
change after marking its partition as follower with correlation id 0 from
controller 1001 epoch 6 for partition [set_primary_photo,0] since the new
leader 1008 is the same as the old leader (state.change.logger)
[2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower state
change after marking its partition as follower with correlation id 0 from
controller 1001 epoch 6 for partition [external_url,0] since the new leader
1001 is the same as the old leader (state.change.logger)

And these are the snippets of the broker log of a 0.8.0 node that I shut
down before I tried to upgrade it (this is when most topics became
unusable):

[2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request from
controller 1001 with correlation id 0 epoch 7 for partition
[variant_assign,0] since its associated leader epoch 11 is old. Current
leader epoch is 11 (state.change.logger)
[2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request from
controller 1001 with correlation id 0 epoch 7 for partition
[meetme_new_contact_count,0] since its associated leader epoch 8 is old.
Current leader epoch is 8 (state.change.logger)
[2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower state
change after marking its partition as follower with correlation id 0 from
controller 1001 epoch 6 for partition [m3_auth,0] since the new leader 7 is
the same as the old leader (state.change.logger)
[2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower state
change after marking its partition as follower with correlation id 0 from
controller 1001 epoch 6 for partition [newsfeed_likes,0] since the new
leader 1001 is the same as the old leader (state.change.logger)

In terms of upgrading from 0.8.0 to 0.8.1, is there a recommended approach
that one should follow? Is it possible to migrate from one version to the
next on a live cluster, one server at a time?

Thanks,
Martin



Re: Upgrading from 0.8.0 to 0.8.1 one broker at a time issues

Posted by Jun Rao <ju...@gmail.com>.
Was there any error in the controller and the state-change logs?
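
A quick way to check, assuming the default log4j setup that writes
controller.log and state-change.log alongside server.log (paths are
placeholders):

grep ERROR /path/to/kafka/logs/controller.log \
    /path/to/kafka/logs/state-change.log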

Thanks,

Jun

