Posted to users@kafka.apache.org by "Vanlerberghe, Luc" <Lu...@bvdinfo.com> on 2017/11/08 15:33:25 UTC

0.11.0.1: ReplicaFetcherThread exceptions after clearing corrupted broker data and restarting

Hi,

We have a Kafka setup with 6 brokers; our topics have a single partition and replication factor 3.
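
For context, a topic like that can be created with the 0.11 AdminClient roughly as follows (the topic name and bootstrap address are placeholders, not our real ones):

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address; point this at one of the brokers.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // One partition, replication factor 3, matching our setup.
                NewTopic topic = new NewTopic("my-topic", 1, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }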

After an improper shutdown, we had corrupted index files on two of our production servers, causing "WARN Found a corrupted index file due to requirement failed: Corrupt index found," messages and Kafka to shut down on startup with a "FATAL Exiting Kafka. (kafka.server.KafkaServerStartable)" message.

All topics are still accessible, but unfortunately the most important one has only a single ISR left.
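
To see how many replicas are actually still in the ISR, a describeTopics call works; something along these lines, again with placeholder names:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class CheckIsrSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                        .all().get().get("my-topic");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // With replication factor 3 we expect 3 entries in the ISR.
                    System.out.printf("partition %d: leader=%s, replicas=%d, isr=%d%n",
                            p.partition(), p.leader(), p.replicas().size(), p.isr().size());
                }
            }
        }
    }

On a healthy topic the isr count should equal the replica count (3 in our case); here it stays at 1.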

We decided to clear all Kafka data and restart the brokers, believing they would fetch all needed data back from the leader and become in-sync again, but on startup we see the following messages in the log (repeating at an alarming rate):
WARN [ReplicaFetcherThread-0-3]: Replica 4 for partition <topic>-0 reset its fetch offset from 0 to current leader 3's start offset 0 (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3]: Current offset 0 for partition [<topic>,0] out of range; reset offset to 0 (kafka.server.ReplicaFetcherThread)
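
While these messages repeat, one way to check whether replication makes any progress at all is to poll the broker's UnderReplicatedPartitions gauge over JMX. A minimal sketch, assuming the broker was started with JMX enabled and using a placeholder host/port:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class UnderReplicatedSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder JMX address; the broker must be started with JMX enabled (e.g. JMX_PORT=9999).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Broker-level gauge: partitions led by this broker that are under-replicated.
                ObjectName gauge = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                System.out.println("Under-replicated partitions: " + conn.getAttribute(gauge, "Value"));
            }
        }
    }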

This looks to me like a similar problem to https://issues.apache.org/jira/browse/KAFKA-6003

While trying to reassign a topic that had lost one of its in-sync replicas (I kept the existing ISRs, but removed the failed broker from the assignment and added an existing broker instead), we got the same messages on that existing broker.

[2017-11-08 16:21:30,893] WARN [ReplicaFetcherThread-0-1]: Replica 5 for partition <topic>-0 reset its fetch offset from 0 to current leader 1's start offset 0 (kafka.server.ReplicaFetcherThread)
[2017-11-08 16:21:30,893] ERROR [ReplicaFetcherThread-0-1]: Current offset 0 for partition [<topic>,0] out of range; reset offset to 0 (kafka.server.ReplicaFetcherThread)
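
For reference, the plan file for kafka-reassign-partitions.sh has roughly this shape; the topic name and broker ids below are placeholders, not our actual assignment:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReassignPlanSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder topic and broker ids: keep brokers 1 and 2, replace the failed broker with 5.
            String plan = "{\"version\":1,\"partitions\":["
                    + "{\"topic\":\"my-topic\",\"partition\":0,\"replicas\":[1,2,5]}"
                    + "]}";
            Files.write(Paths.get("reassign.json"), plan.getBytes(StandardCharsets.UTF_8));
            // Applied with:
            //   kafka-reassign-partitions.sh --zookeeper <zk> --reassignment-json-file reassign.json --execute
            System.out.println(plan);
        }
    }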

This is even more annoying since I don't want to shut down that broker as well, and it generates about 800 MB of logs per hour (fortunately only about 100 MB compressed).

Does anybody have a clue what's going on and how to fix it?
If the fix in 0.11.0.2 would solve our issue, how soon can we expect that release (if at all)?

Thanks,

Luc


RE: [Possibly spoofed] Re: 0.11.0.1: ReplicaFetcherThread exceptions after clearing corrupted broker data and restarting

Posted by "Vanlerberghe, Luc" <Lu...@bvdinfo.com>.
Thanks Ismael,

I hope this will solve the issue in the future.
For now, unless we find a workaround soon, we'll probably back up the data we have and rebuild the cluster from scratch...

Luc

-----Original Message-----
From: ismaelj@gmail.com [mailto:ismaelj@gmail.com] On Behalf Of Ismael Juma
Sent: woensdag 8 november 2017 16:57
To: Kafka Users <us...@kafka.apache.org>
Subject: [Possibly spoofed] Re: 0.11.0.1: ReplicaFetcherThread exceptions after clearing corrupted broker data and restarting

Hi Luc,

The first RC for 0.11.0.2 will be released this week.

Ismael


Re: 0.11.0.1: ReplicaFetcherThread exceptions after clearing corrupted broker data and restarting

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Luc,

The first RC for 0.11.0.2 will be released this week.

Ismael
