Posted to users@kafka.apache.org by Joe Ammann <jo...@pyx.ch> on 2020/09/18 12:18:52 UTC

Problems upgrading from 2.2 to 2.6

Hi all

We're in the process of upgrading from 2.2 to 2.6 and moving from an old
set of nodes to a new set. We have an engineering environment where
everything went smoothly:
- upgrade the software on all nodes
- restart all brokers (we didn't have the requirement to make it rolling)
- make the new nodes join the cluster
- move all topic data over with kafka-reassign-partitions.sh (rough commands below)
- stop the now "empty" brokers on the old nodes
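
For reference, the data move was done roughly like this (broker ids, host
names and the topic name are placeholders here, not our real values):

  # topics-to-move.json lists which topics to migrate, e.g.
  #   {"version": 1, "topics": [{"topic": "XXX_Flat"}]}

  # let the tool propose an assignment onto the new broker ids (4,5,6 as an example)
  bin/kafka-reassign-partitions.sh --bootstrap-server newbroker1:9092 \
    --topics-to-move-json-file topics-to-move.json \
    --broker-list "4,5,6" --generate

  # save the proposed assignment printed by the tool as reassign.json, then run it
  bin/kafka-reassign-partitions.sh --bootstrap-server newbroker1:9092 \
    --reassignment-json-file reassign.json --execute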

Now we tried to do the same in our development environment. The upgrade
was OK and the new nodes joined the cluster. All looked good. But already
the first kafka-reassign-partitions.sh of a very small topic with only a
few thousand messages took forever. It seemed to be stuck trying to move
the partitions from the old nodes to the new nodes. I didn't see anything
peculiar in the logs, so I decided to cancel the reassignment and restart
it.
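
In case it matters, this is roughly how I checked on and then cancelled the
stuck reassignment (I'm assuming here that the 2.6 tool's --cancel option is
the right way to do this; reassign.json is the same file that was passed to
--execute, the host name is again a placeholder):

  # check progress of the running reassignment
  bin/kafka-reassign-partitions.sh --bootstrap-server newbroker1:9092 \
    --reassignment-json-file reassign.json --verify

  # cancel the in-flight reassignment (new in 2.6, as far as I know)
  bin/kafka-reassign-partitions.sh --bootstrap-server newbroker1:9092 \
    --reassignment-json-file reassign.json --cancel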

From there, everything went south. The cluster seemed to accept new
reassignments, but nothing would move. The directories in data/ on the
new target nodes were created, but no data would arrive and the topic
remained under-replicated. We began to restart nodes, taking them all
down and bringing them up again, to no avail.
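
To see what was (still) under-replicated I used kafka-topics.sh, roughly like
this (host name again a placeholder):

  bin/kafka-topics.sh --bootstrap-server newbroker1:9092 --describe \
    --under-replicated-partitions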

Things went really bad, and soon all nodes failed to start properly
(they would shut down automatically after a few seconds) with the message

[2020-09-18 13:56:29,242] ERROR [KafkaServer id=2] Fatal error during
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.IllegalArgumentException: requirement failed: Split operation
is only permitted for segments with overflow

Just before that, there are warnings in the log like

[2020-09-18 13:56:29,235] WARN [Log partition=XXX_Flat-0,
dir=/app/PCY/dyn/pcykafka/data] Found a corrupted index file
corresponding to log file
/app/kafka/data/XXX_Flat-0/00000000000000004938.log due to Corrupt time
index found, time index file
(/app/kafka/data/XXX_Flat-0/00000000000000004938.timeindex) has non-zero
size but the last timestamp is 0 which is less than the first timestamp
1599477219176}, recovering segment and rebuilding index files...
(kafka.log.Log)
[2020-09-18 13:56:29,238] INFO [Log partition=XXX_Flat-0,
dir=/app/kafka/data] Loading producer state till offset 4938 with
message format version 2 (kafka.log.Log)
[2020-09-18 13:56:29,239] INFO [ProducerStateManager
partition=XXX_Flat-0] Writing producer snapshot at offset 4938
(kafka.log.ProducerStateManager)

So now I'm totally stuck and can't start the brokers anymore (neither
the new ones nor the old ones). The problem is not so much the data loss
(I have backups and could go back, but it's probably not even worth the
effort in DEV).
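
In case it helps, this is how I've been looking at the affected segment
(kafka-dump-log.sh from the 2.6 distribution; the paths are taken from the
log lines above):

  # dump the time index the broker complains about
  bin/kafka-dump-log.sh \
    --files /app/kafka/data/XXX_Flat-0/00000000000000004938.timeindex

  # dump the corresponding log segment including record details
  bin/kafka-dump-log.sh --print-data-log \
    --files /app/kafka/data/XXX_Flat-0/00000000000000004938.log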

What I need to understand is what happened here and how to avoid it in
the test and prod environments. I have >6GB of logs, but honestly I don't
know what to look for. Obviously I tried to make sense of the logs from
the time when the problems started, but I have not been able to identify
any cause.
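
So far I've mostly been grepping the broker logs for the obvious suspects,
along these lines:

  grep -iE "split operation|corrupted index|reassign" server.log*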

Anybody got some hints?

CU, Joe