Posted to users@kafka.apache.org by Kyle Banker <ky...@gmail.com> on 2015/02/05 18:39:18 UTC

kafka.server.ReplicaManager error

I have a 9-node Kafka cluster, and all of the brokers just started spouting
the following error:

ERROR [Replica Manager on Broker 1]: Error when processing fetch request
for partition [mytopic,57] offset 0 from follower with correlation id
58166. Possible cause: Request for offset 0 but we only have log segments
in the range 39 to 39. (kafka.server.ReplicaManager)

The "mytopic" topic has a replication factor of 3, and metrics are showing
a large number of under replicated partitions.

My assumption is that a log aged out but that the replicas weren't aware of
it.

In any case, this problem isn't fixing itself, and the volume of log
messages of this type is enormous.

What might have caused this? How does one resolve it?
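
For reference, the leader, replica set, and ISR for the affected partitions
can be checked with the topic tool that ships with the broker. A minimal
sketch, assuming a 0.8.x install with the standard bin/ scripts; the
ZooKeeper address is a placeholder:

# Show leader, replicas, and ISR for every partition of the topic
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic mytopic

# List only the partitions whose ISR is smaller than the replica set
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions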

Re: kafka.server.ReplicaManager error

Posted by svante karlsson <sa...@csi.se>.
In our case, unclean leader election was enabled.

Since the cluster should have been empty, I can't say for certain that we
didn't lose any data, but as I wrote earlier, I could not get the log
messages to stop until I took down all brokers at the same time.
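
For reference, this is the setting in question; a minimal sketch of a
server.properties fragment (a broker-level default in 0.8.2, also
overridable per topic), not anyone's actual config:

# server.properties
# false = never elect an out-of-sync replica as leader
# (trades availability for durability)
unclean.leader.election.enable=false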









Re: kafka.server.ReplicaManager error

Posted by Kyle Banker <ky...@gmail.com>.
Thanks for sharing, svante. We're also running 0.8.2.

Our cluster appears to be completely unusable at this point. We tried
restarting the "down" broker with a clean log directory, and it's doing
nothing. It doesn't seem to be able to get topic data, which this ZooKeeper
message appears to confirm:

[ProcessThread(sid:5 cport:-1)::PrepRequestProcessor@645] - Got user-level
KeeperException when processing sessionid:0x54b0e251a5cd0ec type:setData
cxid:0x2b7ab zxid:0x100b9ad88 txntype:-1 reqpath:n/a Error
Path:/brokers/topics/mytopic/partitions/143/state Error:KeeperErrorCode =
BadVersion for /brokers/topics/mytopic/partitions/143/state

It's probably worthwhile to note that we've disabled unclean leader
election.
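
The partition state znode named in that error can also be read directly,
which at least shows which broker ZooKeeper currently records as the leader
and what the ISR is. A sketch using the zookeeper-shell.sh wrapper that
ships with Kafka; the ZooKeeper address is a placeholder and the path is
taken from the error above:

bin/zookeeper-shell.sh zk1:2181
# then, at the shell prompt:
get /brokers/topics/mytopic/partitions/143/state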




Re: kafka.server.ReplicaManager error

Posted by svante karlsson <sa...@csi.se>.
I believe I've had the same problem on the 0.8.2 rc2. We had an idle test
cluster with unknown health status, and I applied rc3 without checking
beforehand that everything was OK. Since that cluster had been doing
nothing for a couple of days and the retention time was 48 hours, it's
reasonable to assume that no actual data was left on the cluster. The same
kind of log messages were emitted in huge volumes and never stopped. I then
rebooted each ZooKeeper node in series: no change. Then I bounced each
broker: no change. Finally, I took down all brokers at the same time.

The logging stopped, but one broker then had no partitions in sync,
including the internal consumer offsets topic that was living (with
replicas=1) on that broker. I bounced this broker once more, and then the
whole cluster was back in sync.

I suspect that something related to zero-size topics caused this, since the
cluster worked fine the week before during testing and also afterwards
during more testing with rc3.
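
For context, the retention referred to here is controlled by broker
settings such as the following; a sketch of a server.properties fragment,
with the 48-hour value taken from the message above:

# server.properties
# log segments older than this become eligible for deletion
log.retention.hours=48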








Re: kafka.server.ReplicaManager error

Posted by Kyle Banker <ky...@gmail.com>.
Digging in a bit more, it appears that the "down" broker had likely
partially failed. Thus, it was still attempting to fetch offsets that no
longer exist. Does this make sense as an explanation of the
above-mentioned behavior?
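
One way to sanity-check the offset range that the error complains about is
to ask the brokers for the earliest and latest offsets per partition. A
minimal sketch using the GetOffsetShell tool bundled with Kafka; the broker
address is a placeholder:

# Earliest retained offset (log start) per partition
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list broker1:9092 --topic mytopic --time -2

# Latest offset (log end) per partition
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list broker1:9092 --topic mytopic --time -1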


Re: kafka.server.ReplicaManager error

Posted by Kyle Banker <ky...@gmail.com>.
Dug into this a bit more, and it turns out that we lost one of our 9
brokers at the exact moment when this started happening. At the time that
we lost the broker, we had no under-replicated partitions. Since the broker
disappeared, we've had a fairly constant number of under-replicated
partitions. This makes some sense, of course.

Still, the log message doesn't.
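
The broker also exposes the under-replicated count as a JMX metric, which
is easier to watch over time than the topic tool. A sketch using the
JmxTool class bundled with Kafka, assuming JMX is enabled on the broker
(e.g. JMX_PORT=9999) and the 0.8.2 MBean naming; the host is a placeholder:

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi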
