Posted to users@kafka.apache.org by Anthony Sparks <an...@gmail.com> on 2016/02/24 16:39:39 UTC

Unable to start cluster after crash (0.8.2.2)

Hello,

Our Kafka cluster (3 servers, each running Zookeeper and Kafka) crashed,
and of the six processes only one Zookeeper instance remained alive.  The
logs do not indicate much; the only errors shown were:

2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00,
real=0.01 secs]

These errors appeared in both the Zookeeper and the Kafka logs, and it
appears they have been happening every day (with no impact on Kafka,
except for maybe now?).

The crash is concerning, but not as concerning as what we are encountering
right now.  I am unable to get the cluster back up.  Two of the three nodes
halt with this fatal error:

[2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting because
log truncation is not allowed for topic audit_data, Current leader 0's
latest offset 52844816 is less than replica 1's latest offset 52844835
(kafka.server.ReplicaFetcherThread)

The other node that manages to stay alive is unable to fulfill writes
because we have min.ack set to 2 on the producers (requiring at least two
nodes to be available).  We could change this, but that doesn't fix our
overall problem.
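To make the failure mode concrete, here is a rough sketch (the function name and shape are mine, not the producer's actual internals) of why requiring two acknowledgements leaves the lone surviving broker unable to accept writes:

```python
def ack_satisfied(required_acks: int, live_replicas: int) -> bool:
    """With required_acks = 2, a write needs acknowledgement from at
    least two brokers; with only one broker up, every write fails."""
    return live_replicas >= required_acks

# One of our three brokers survived the crash:
assert not ack_satisfied(required_acks=2, live_replicas=1)
assert ack_satisfied(required_acks=2, live_replicas=2)
```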

In browsing the Kafka code, in ReplicaFetcherThread.scala there is this
little nugget:

// Prior to truncating the follower's log, ensure that doing so is not disallowed by the configuration for unclean leader election.
// This situation could only happen if the unclean election configuration for a topic changes while a replica is down. Otherwise,
// we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
if (!LogConfig.fromProps(brokerConfig.toProps,
    AdminUtils.fetchTopicConfig(replicaMgr.zkClient, topicAndPartition.topic)).uncleanLeaderElectionEnable) {
  // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
  fatal("Halting because log truncation is not allowed for topic %s,".format(topicAndPartition.topic) +
    " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
    .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId, replica.logEndOffset.messageOffset))
  Runtime.getRuntime.halt(1)
}
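In plain terms, the check above boils down to something like the following (a Python paraphrase for illustration, not the actual implementation):

```python
def follower_fetch_check(leader_end_offset: int,
                         replica_end_offset: int,
                         unclean_election_enabled: bool) -> str:
    """Paraphrase of the ReplicaFetcherThread logic quoted above: when
    the leader's log ends before the follower's, the follower would
    have to truncate its own log to match -- but that silently drops
    messages, so it is only permitted when unclean leader election is
    enabled; otherwise the broker halts rather than lose data."""
    if replica_end_offset <= leader_end_offset:
        return "fetch"      # normal case: follower is behind or caught up
    if unclean_election_enabled:
        return "truncate"   # drop the extra messages and follow the leader
    return "halt"           # refuse to lose data: Runtime.getRuntime.halt(1)

# The offsets from our FATAL log line:
assert follower_fetch_check(52844816, 52844835, False) == "halt"
assert follower_fetch_check(52844816, 52844835, True) == "truncate"
```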

For each one of our Kafka instances we have:

unclean.leader.election.enable=false

which hasn't changed at all since we deployed the cluster (verified by
file modification timestamps).  This would indicate the above comment's
assertion is incorrect: we encountered a non-ISR leader being elected even
though the configuration disallows it.
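One thing worth noting about the snippet: LogConfig.fromProps merges the broker's properties with per-topic overrides fetched from ZooKeeper, so the effective value is not necessarily what the server.properties file says. A sketch of that precedence (my own simplification, and assuming the 0.8.x default of true when the key is absent):

```python
def effective_unclean_election(broker_props: dict,
                               topic_overrides: dict) -> bool:
    """Mimics LogConfig.fromProps(brokerConfig.toProps, topicConfig):
    per-topic overrides stored in ZooKeeper win over the broker file,
    so checking server.properties alone is not conclusive."""
    merged = {**broker_props, **topic_overrides}
    return merged.get("unclean.leader.election.enable", "true") == "true"

# Broker file says false; a topic-level override (if one existed) would win:
assert effective_unclean_election(
    {"unclean.leader.election.enable": "false"}, {}) is False
assert effective_unclean_election(
    {"unclean.leader.election.enable": "false"},
    {"unclean.leader.election.enable": "true"}) is True
```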

Any ideas on how to work around this?

Thank you,

Tony Sparks

Re: Unable to start cluster after crash (0.8.2.2)

Posted by Jens Rantil <je...@tink.se>.
Double post. Please keep discussion in the other thread.

Cheers,
Jens

On Wed, Feb 24, 2016 at 4:39 PM, Anthony Sparks <an...@gmail.com>
wrote:

> Hello,
>
> Our Kafka cluster (3 servers, each server has Zookeeper and Kafka installed
> and running) crashed, and actually out of the 6 processes only one
> Zookeeper instance remained alive.  The logs do not indicate much, the only
> errors shown were:
>
> 2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
> 27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
> 139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
>
> These errors were both in the Zookeeper and the Kafka logs, and it appears
> they have been happening everyday (with no impact on Kafka, except for
> maybe now?).
>
> The crash is concerning, but not as concerning as what we are encountering
> right now.  I am unable to get the cluster back up.  Two of the three nodes
> halt with this fatal error:
>
> [2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting because
> log truncation is not allowed for topic audit_data, Current leader 0's
> latest offset 52844816 is less than replica 1's latest offset 52844835
> (kafka.server.ReplicaFetcherThread)
>
> The other node that manages to stay alive is unable to fulfill writes
> because we have min.ack set to 2 on the producers (requiring at least two
> nodes to be available).  We could change this, but that doesn't fix our
> overall problem.
>
> In browsing the Kafka code, in ReplicaFetcherThread.scala there is this
> little nugget:
>
> // Prior to truncating the follower's log, ensure that doing so is not
> disallowed by the configuration for unclean leader election.
> // This situation could only happen if the unclean election configuration
> for a topic changes while a replica is down. Otherwise,
> // we should never encounter this situation since a non-ISR leader cannot
> be elected if disallowed by the broker configuration.
> if (!LogConfig.fromProps(brokerConfig.toProps,
> AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
> topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>     // Log a fatal error and shutdown the broker to ensure that data loss
> does not unexpectedly occur.
>     fatal("Halting because log truncation is not allowed for topic
> %s,".format(topicAndPartition.topic) +
>       " Current leader %d's latest offset %d is less than replica %d's
> latest offset %d"
>       .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId,
> replica.logEndOffset.messageOffset))
>     Runtime.getRuntime.halt(1)
> }
>
> For each one of our Kafka instances we have them set at:
> *unclean.leader.election.enable=false
> *which hasn't changed at all since we deployed the cluster (verified by
> file modification stamps).  This to me would indicate the above comment
> assertion is incorrect; we have encountered a non-ISR leader elected even
> though it is configured not to do so.
>
> Any ideas on how to work around this?
>
> Thank you,
>
> Tony Sparks
>



-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.rantil@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
