Posted to dev@kafka.apache.org by Igor Soarez <so...@apple.com.INVALID> on 2022/12/01 18:18:29 UTC

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Hi Jun,

Thank you for reviewing the KIP. Please find my replies to
your comments below.

10. Thanks for pointing out this typo; it has been corrected.


11. I agree that the additional delay in switching to the
future replica is undesirable; however, I see a couple of
issues if we forward the request to the controller
as you describe:

   a) If the controller persists the change to the log directory
   assignment before the future replica has caught up, and there
   is then a failure in the original log directory, and the broker
   is the leader for the partition, there will be no failover
   and the partition will become unavailable. It is not safe to
   call AssignReplicasToDirectories before the replica exists
   in the designated log directory.

   b) An existing behavior we'd like to maintain if possible is
   the ability to move partitions between log directories when a
   broker is offline, as it can be very useful to manage storage
   space. ZK brokers will load and accept logs in the new
   location after startup.
   Maintaining this behavior requires that the broker be able to
   override/correct assignments that are seen in the cluster metadata.
   i.e. the broker is the authority on log directory placement in case
   of mismatch.
   If we want to keep this feature and have the controller send log
   directory reassignments, we'll need a way to distinguish between
   mismatch due to offline movement and mismatch due to controller
   triggered reassignment.

To keep the delay low, instead of sending AssignReplicasToDirectories
within the lock, the RPC can be queued elsewhere when the future
replica first catches up. ReplicaAlterLogDirsThread can keep
going and not switch the replicas yet.
Once the broker receives confirmation of the metadata change
it can then briefly block appends to the old replica and make the switch.
In the unlikely event that the source log directory fails between the moment
AssignReplicasToDirectories is acknowledged by the controller and
the moment the broker is able to make the switch, the broker
can issue AssignReplicasToDirectories assigning that replica back to the
now-offline log directory, to let the controller know that the replica
is actually offline.
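
To sketch the ordering (the class and method names below are placeholders,
not actual Kafka code; this only illustrates the sequence under the
assumptions above, not an implementation):

    // Placeholder sketch; none of these names are real Kafka APIs.
    class FutureReplicaPromotion {
        interface ControllerClient {
            // Queue an AssignReplicasToDirectories change; the callback fires once the
            // broker sees the committed metadata change.
            void queueAssignment(String partition, String targetDirUuid, Runnable onMetadataConfirmed);
        }
        interface Log {
            void blockAppends();
            void unblockAppends();
        }

        void onFutureReplicaCaughtUp(String partition, String targetDirUuid,
                                     ControllerClient controller, Log currentLog,
                                     Runnable swapReplicas) {
            // 1. Don't swap yet, and don't send the RPC while holding the replica lock:
            //    queue the assignment change and let ReplicaAlterLogDirsThread keep copying.
            controller.queueAssignment(partition, targetDirUuid, () -> {
                // 2. Once the metadata change is confirmed, briefly block appends to the
                //    old replica and perform the swap (which would also ensure the future
                //    log has fully caught up).
                currentLog.blockAppends();
                try {
                    swapReplicas.run();
                } finally {
                    currentLog.unblockAppends();
                }
            });
        }
    }
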
What do you think?


12. Indeed, the metadata.log.dir, if explicitly set to a separate
directory, should not be included in the directory UUID list sent
to the controller in broker registration and heartbeat requests.
I have updated the KIP to make this explicit.
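
For example (illustrative paths only; a sketch of a broker configuration
under the assumption above), a broker configured like this would report
only the UUIDs for /data/d1 and /data/d2:

    # illustrative broker configuration
    log.dirs=/data/d1,/data/d2
    # metadata log on its own volume: its UUID is excluded from
    # broker registration and heartbeat requests
    metadata.log.dir=/metadata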


13. Thank you for making this suggestion.
Let's address the different scenarios you enumerated:

 a) When enabling JBOD for an existing KRaft cluster

 In this scenario, the broker finds a single log directory configured
 in `log.dirs`, with an already existing `meta.properties`, which is
 simply missing `directory.id` and `directory.ids`.

 It is safe to have the broker automatically generate the log dir
 UUID and update the `meta.properties` file. This removes any need
 to have extra steps in upgrading and enabling of this feature,
 so it is quite useful.
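
 For example, with illustrative values (the UUIDs below are made up, and
 I'm keeping version=1 here; whether the version field should change is a
 separate question), the existing file would go from something like:

     version=1
     cluster.id=e0Vb0Ik2TXuWdHRgrXkqSQ
     node.id=1

 to:

     version=1
     cluster.id=e0Vb0Ik2TXuWdHRgrXkqSQ
     node.id=1
     directory.id=b9XtEPHlQYeEe04Kw0RkSg
     directory.ids=b9XtEPHlQYeEe04Kw0RkSg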


 b) Adding a log dir to an existing JBOD KRaft cluster

 In this scenario, the broker finds a shorter list of `directory.ids`
 in `meta.properties` than what is configured in `log.dirs`.

 KIP-631 introduced the requirement to run the command kafka-storage.sh
 to format storage directories.

 Currently, if the broker in KRaft mode cannot find `meta.properties`
 in each log directory it will fail at startup. KIP-785 proposes
 removing the need to run the format storage command, but it is
 still open. If new log directories are being added the storage
 command must be run anyway. So I don't think there will be any
 benefit in this case.


 c) Removing a log dir from an existing JBOD KRaft cluster

 In this scenario the broker finds a larger list of `directory.ids`
 in `meta.properties` than what is configured in `log.dirs`.

 The removal of log directories requires an explicit update to
 the `log.dirs` configuration, so it is also safe to have the
 broker automatically update `directory.ids` in `meta.properties`
 to remove the extra UUIDs. It's also useful to drop the requirement
 to run the storage command after removal of log directories from
 configuration, as it reduces operational burden.


 d) Upgrading a JBOD ZK cluster to KRaft

 In ZooKeeper mode, brokers write `meta.properties` with `version=0`,
 but in KRaft mode brokers require `version=1`.
 According to the current discussion on KIP-866, the broker will
 automatically upgrade meta.properties from `version=0` to `version=1`.
 Assuming the meta.properties gets translated into `version=1` as part
 of KIP-866, then the broker should find `meta.properties` files that are
 simply missing the two new fields: `directory.id` and `directory.ids`.
 So it should be OK for the broker to generate UUIDs for each log
 directory and update the `meta.properties` file.


I have updated the KIP to remove the need to run the storage command
in these scenarios - apart from b) where it should already be a
requirement AFAICT - and to describe how the broker will automatically
update `directory.id` and `directory.ids` in `meta.properties` if
required.


14. Thank you for pointing this out. Yes, the controller should
simply delegate the log directory choice back to the broker,
by assigning `Uuid.ZERO` to all the replicas that were assigned
to removed log directories. I have updated the KIP under
"Controller" > "Broker registration".


15. Yes. Same as in "Metadata caching", replicas are considered
offline if the replica references a log directory which is not in the
list of online log directories for the broker ID hosting the replica.
This means the controller will not assign leadership to those replicas.
This is described under "Controller" > "Handling log directory failures".
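
As a tiny illustration of that rule (placeholder names in plain Java, not
controller code; treating Uuid.ZERO as "placement unknown" rather than
offline is my assumption here):

    import java.util.Map;
    import java.util.Set;
    import java.util.UUID;

    // Placeholder sketch: a replica is offline if the log directory it references
    // is not among the online directories registered by its host broker.
    class OfflineReplicaCheck {
        static final UUID ZERO = new UUID(0L, 0L); // stands in for Uuid.ZERO

        static boolean isReplicaOffline(UUID assignedLogDir,
                                        int hostBrokerId,
                                        Map<Integer, Set<UUID>> onlineDirsByBroker) {
            if (ZERO.equals(assignedLogDir)) {
                return false; // placement not yet known, so not considered offline
            }
            Set<UUID> online = onlineDirsByBroker.get(hostBrokerId);
            return online == null || !online.contains(assignedLogDir);
        }
    }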


16. We'll need to bump metadata.version to gate use of the records
and RPC changes. I've added mention of this under the
"Compatibility, Deprecation, and Migration Plan" section.



Please let me know of any further feedback.

Best,

--
Igor

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Tom,

Thank you for having another look.


20. That is a good point.

Thinking about your suggestion:
How would this look in a non-JBOD KRaft cluster upgrading to JBOD mode?

Upgrading to a version that includes the JBOD support patch would automatically
update meta.properties to include the new fields. A subsequent downgrade
might be desirable for the operator for whatever reason. The same problem
applies, but here there is no future event at which version=2 could
later be set.

Alternatively, we could either:
 - Simply not change the version at all since older broker versions will ignore
   the new fields, and newer versions will add them when missing.
 - Introduce a new, separate file that would live alongside meta.properties.

WDYT?


Best,

--
Igor


Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Jun,

Thank you for your comments and questions.

30. Thank you for pointing this out. The isNew flag is not available
in KRaft mode. The broker can rely on the metadata records instead:
if, and only if, the assigned logdir is Uuid.ZERO, then the replica can
be considered new.

Being able to determine whether a replica "isNew" is important to prevent
the remaining logdirs from filling up when some logdirs become offline,
by re-creating replicas that already exist in the offline logdirs.
So the broker will refuse to create logs that are not new if there are
any offline logdirs.
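
As a minimal sketch of that rule (plain Java with placeholder names, not
broker code; the zero UUID stands in for Uuid.ZERO):

    import java.util.UUID;

    class LogCreationCheck {
        static final UUID ZERO = new UUID(0L, 0L); // stands in for Uuid.ZERO

        // A replica is treated as new only if the metadata assigns it to the zero UUID.
        // A non-new replica is not created while any log dir is offline, because it may
        // already exist on one of the offline directories.
        static boolean shouldCreateLog(UUID assignedDir, boolean anyOfflineLogDir) {
            boolean isNew = ZERO.equals(assignedDir);
            return isNew || !anyOfflineLogDir;
        }
    }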

If a logdir is removed from configuration, the controller will detect
this change upon broker registration and reset all partitions assigned
to the removed logdirs to Uuid.ZERO. In this case, it is OK for the
broker to assume that the partitions are new because they do not exist in
any _configured_ online or offline logdir, and the intended behavior is
to re-create them in one of the online logdirs anyway.

I have updated the KIP to make it clear that broker decisions are based
on the metadata, and not on this flag.


31. I don't think I understand the question.
Why do we need to assign the same UUID?

A logdir may be replaced with a disk by replacing its configured path
with the new disk mount path under the `log.dirs` property.
While the broker was offline, the operator might have copied the contents
of the old logdir to the disk, or not.

If the contents were copied over, then so was the logdir's meta.properties,
along with the UUID, in which case no change is necessary. The broker will
load all configured logdir paths, all existing meta.properties, and verify
that the full set of UUIDs is still congruent across all meta.properties
files. Neither broker nor controller will know that something has changed,
and neither of them needs to. All partition assignments are still correct.
The mapping of UUID to logdir is determined by the meta.properties
under that same logdir.

If the contents were not copied, then this is assumed to be a new
and empty logdir. It should get a different UUID. When the broker loads
all meta.properties it will notice that one is missing for the new disk and
create it, generating a new UUID. It will also update the full set of UUIDs
listed in any other meta.properties files. In the broker registration
request the controller will notice a new UUID being registered, but also
notice a UUID missing.
Any topic partitions assigned to the now-missing logdir UUID will be
updated to Uuid.ZERO, so that the broker can place them in the
most suitable logdir - which is likely to be the new and empty one.
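
To illustrate the second case with simplified, made-up UUIDs: suppose the
broker had logdirs /data/d1 (UUID 1) and /data/d2 (UUID 2), and /data/d2
is replaced with a fresh, empty disk. After startup the files would look
roughly like this (non-relevant values omitted):

    # /data/d1/meta.properties
    directory.id=1
    directory.ids=1,4

    # /data/d2/meta.properties, newly created with a freshly generated UUID
    directory.id=4
    directory.ids=1,4

UUID 2 no longer appears in the registration, so the controller resets any
partitions still assigned to it to Uuid.ZERO and the broker re-places them.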


32. You are correct, the HeartBeat request should convey the failure
and the broker shouldn't need to send an AssignReplicasToDirs request.

The bit preceding that quote is important:
  "If the partition is assigned to an online log directory"
In this case the metadata indicates that a non-new replica is assigned
to an online logdir, but this replica cannot actually be found in any
online logdir.
So we want to tell the controller that the metadata is wrong, and that
the replica is actually offline.

This is a defensive design option.

In a scenario where, for some reason, the broker can see that the metadata
is incorrect about the logdir assignment of a replica that existed in the
failed logdir, it is better to correct and recover than to allow the
problem to persist.

Ignoring the error could mean that the partition stays offline. If the
controller is only told about the UUID of the failed logdir, it won't
be able to determine that a leadership and ISR update is required for
any replica with an incorrect logdir assignment.

An alternative – when facing this unlikely failure scenario – would be
for the broker to error and exit, which would be more disruptive.


33. Correct. I should've made that clear. Updated.


34. No. It shouldn't be a large request, and it should only happen rarely.
This relates to point 32.
When a logdir fails, that failure is communicated to the controller by
indicating the logdir UUID in the heartbeat request. The controller
can determine that _the partitions assigned to that logdir UUID_
are now offline. But if there are any partitions that were in that logdir
and do not have that same logdir UUID assigned to them in the cluster
metadata, then the broker needs to signal that these are also offline, as
the controller will not be able to determine that without the assignment.

We expect each broker to proactively instruct the controller to keep the
metadata correct about the logdir assignment for each replica, so
situations where the metadata is wrong should be rare, and when they
happen only a small number of replicas should be affected. Hence this
should be both a small and rare request.
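
A minimal sketch of that check (plain Java with placeholder, simplified
types; not broker code) could look like this:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.UUID;

    // Replicas that lived in the failed logdir but whose cluster-metadata assignment
    // points elsewhere still need an explicit correction, so the controller also
    // treats them as offline.
    class FallbackCheck {
        static Set<String> replicasNeedingCorrection(UUID failedDir,
                                                     Map<String, UUID> localPlacement,
                                                     Map<String, UUID> metadataPlacement) {
            Set<String> result = new HashSet<>();
            for (Map.Entry<String, UUID> e : localPlacement.entrySet()) {
                boolean livedInFailedDir = failedDir.equals(e.getValue());
                boolean metadataAgrees = failedDir.equals(metadataPlacement.get(e.getKey()));
                if (livedInFailedDir && !metadataAgrees) {
                    result.add(e.getKey()); // metadata is stale for this replica
                }
            }
            return result;
        }
    }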


35. Hmm, I could not find the string "AlterReplicaDirRequest"
in the source:
  https://github.com/apache/kafka/search?q=AlterReplicaDirRequest

I'm referring to this API key:
  clients/src/main/java/org/apache/kafka/common/protocol/ApiKeys.java#L78


36. The risk is that if the broker is unfenced while the controller still
has an incorrect view of the logdir assignment, it may assign leadership
to the broker for some partition which is incorrectly assigned in metadata.
If that happens, when a logdir fails, the heartbeat request
indicating the failed logdir UUID will not cause the controller to
take action and reassign leadership, and we may end up with an unavailable
partition.

The controller will assume that partition leadership is being performed
correctly and will not take any action, as long as it thinks the broker
is alive, and that the partition is assigned to an online logdir.
It could be interesting to find a more general solution to this issue,
as that would eliminate a wider range of failures in Kafka. But I don't
currently have any suggestions there.

The requests sent while still fenced aim to correct the logdir assignment
for all of the partitions in the broker. One of the reasons that the
assignment may be incorrect is that an operator might have relocated some
partitions to a different logdir while the broker was offline.
This is a currently supported feature - albeit probably not widely known.

Why is it important that there should be no other requests while the
broker is still fenced?


37. I had originally proposed that if there is a single logdir configured
the controller could assume that all the existing replicas are assigned
to the only logdir indicated in the broker registration request, provided
there isn't a previous registration that indicates any logdir UUIDs.
This would avoid the broker sending AssignReplicasToDirs to populate
the initial assignments.

If the broker is registering with a single logdir, but the previous broker
registration indicates some logdir UUID then the controller cannot make this
simplification, as the logdir could be a new one, or a previous second
logdir might have been removed from configuration and the current assignment
is unclear.

We could maybe say that whenever there is a single logdir, the broker
will not bother with the assignment at all. The downside is that
there might be more work to do later (more partition assignments to correct)
when a second logdir is configured, which I think may be more disruptive.
It is preferable to spread out the effort to maintain a correct
assignment in the metadata.

Tom Bentley raised this in point 4, and since it's not a strictly
necessary optimisation I updated the KIP to remove it back then.
Do you think we should keep the optimisation?


38. Correct. I've updated the KIP.


39. I think I forgot to update this after I changed the proposal to say
the meta.properties are automatically updated. I have updated this
section to clarify that the broker will automatically update the
file if possible.

A new logdir can be added while there are other, offline logdirs, as
long as the set of UUIDs in `directory.ids` is expanded to include the
new one. So the number of UUIDs in `directory.ids` and the number of
paths in `log.dirs` should always match.

It is important that the broker be able to distinguish between UUIDs
for logdirs that are offline, vs UUIDs for logdirs that were removed
from configuration.

If the broker starts up configured with two logdirs, each containing a
meta.properties file that lists three different UUIDs under
`directory.ids`, but only one of the configured logdirs is accessible
(online), then it is not possible for the broker to automatically update
the file, as it won't be able to distinguish between the UUID for the
offline logdir and the UUID for the removed logdir. In this case the
broker should fail to start. The operator can either bring the offline
logdir back up, restore the log.dirs configuration, or manually update
meta.properties.
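
To illustrate with simplified, made-up UUIDs: with `log.dirs=/x,/y` and
`/y` currently inaccessible, the broker might only be able to read:

    # /x/meta.properties (non-relevant values omitted)
    directory.id=1
    directory.ids=1,2,3

It cannot tell which of UUIDs 2 and 3 belongs to `/y` and which belonged
to a logdir that was removed from `log.dirs`, so it fails at startup.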


40. Indeed. What I meant to say here is that the controller should not
accept broker registration requests that do not indicate any online
logdir UUIDs. We don't expect the broker would send these anyway.

During the upgrade from non-JBOD we could allow brokers to register
without specifying any logdir UUID (online or offline). But thinking
about this again now, I don't think it will be necessary — this idea
was from before the metadata.version feature flag change was introduced.
BrokerRegistrationRequest should only include logdir UUIDs after
all servers are upgraded, and by then all logdirs will have a UUID assigned.

I've updated the KIP to clarify that BrokerRegistrationRequest must always
include some online logdir UUID.


Best,

--
Igor



Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <i...@soarez.me>.
Hi all,

I’ve had to step away from work for personal reasons for a couple of months – until mid April 2023. I don’t think I’ll be able to continue to address feedback or update this KIP before then.

--
Igor

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Jun,

Thank you for sharing your questions, please find my answers below.


41. There can only be user partitions on `metadata.log.dir` if that log
dir is also listed in `log.dirs`.
`LogManager` does not specifically load contents from `metadata.log.dir`.

The broker will communicate UUIDs to the controller for all log dirs
configured in `log.dirs`. If the metadata directory happens to be one
of those, it may also contain user partitions, so the controller will
know about it. If it is a completely separate log dir, it cannot hold
user partitions, so there's no need to include it.


42. I'm not sure what exactly you mean by "decommission the
disk", so please let me know if I'm missing your point here.

A disk can be removed from `log.dirs` and removed from the system in a
single broker restart:

  1. Shutdown the broker
  2. Unmount, remove the disk
  3. Update `log.dirs` config
  4. Start the broker

Upon restart, the broker will update `directory.ids` in the
`meta.properties` for the remaining configured log dirs.

Log dir identity cannot be inferred from the path, because the same
storage device can be remounted under a different path, so the way we
identify storage directories is by looking at their contents – the
`directory.id` field in their `meta.properties`.
But this also means that a log dir cannot be identified if it is not
available, which in turn means that the broker can only regenerate
`directory.ids` if all log directories listed under `log.dirs` happen
to be available.

Consider the following example, where `log.dirs=/a,/b,/c`, and
the following `meta.properties` (non-relevant values omitted):

    # /a/meta.properties
    directory.id=1
    directory.ids=1,2,3

    # /b/meta.properties
    directory.id=2
    directory.ids=1,2,3

    # /c/meta.properties
    directory.id=3
    directory.ids=1,2,3

If `log.dirs` is updated to remove `/c`, the broker will be able
to determine the new value for `directory.ids=1,2` by loading
`/a/meta.properties` and `/b/meta.properties`.
But if either `/a` or `/b` happens to be unavailable, e.g. due to some
temporary disk failure, we cannot determine `directory.ids`. For example,
if `/b` is unavailable, the broker can't tell whether `directory.ids`
should be `1,2`, `1,3`, or even `1,4`.

In a scenario where an operator wishes to remove a log dir from
configuration and some other log dir is also offline, the operator will have
a few options:

  a) Bring the offline log dir back online before restarting the broker.

  b) Edit `meta.properties` to remove the UUID for the deconfigured logdir
  from `directory.ids` in the remaining available log dirs. This will remove
  the need for the broker to regenerate `directory.ids` as the entry count
  for `directory.ids` and `log.dirs` will be equal.

  c) Also remove the offline log dir from `log.dirs`.


43. If the log dir had already failed before startup then, indeed, the broker
will not know that. But in that case, there's no risk of a race or failure.

What I meant here relates rather to log dir failures at runtime.
I've updated this bit in the KIP to clarify.

When executing the log directory failure handler, the broker knows
which directory failed, which partitions resided there, and it can
check if any of those newly failed partitions refer to a different
log dir in the cluster metadata.

The assignment should be correct for all of them, as the broker
will be proactive in notifying the controller of any changes in
log dir assignment. But in case of some race condition, the broker
should nudge the controller to deal with the incorrectly
assigned partitions.


44. Tom Bentley and I have discussed this previously in this thread,
in emails dated Jan 10, 13, 23 and Feb 3.

When upgrading a JBOD enabled ZK cluster, we could piggyback on the ZK to
KRaft upgrade (as per KIP-866) and delay bumping `meta.properties` until the
ZK->KRaft upgrade is finalized. After then, we do not support downgrading.

But I'm not convinced we should do this, since there's another upgrade
scenario – when the change proposed in this KIP is applied to a KRaft
cluster that does not yet support JBOD.
In this scenario there is no multi-step process with a step that is
considered final, and I'm not sure we'd want to introduce an
additional step – making the upgrade process more complex – just to
address this issue either.

I think the best approach is to keep using `version=1` in `meta.properties`.
The new properties introduced in this KIP will safely be ignored by previous
versions, in either ZK or KRaft mode, and we avoid creating conflicts with
unexpected declared versions. The presence of the extra fields is also
innocuous in the case of a second upgrade following a downgrade.

I've updated the KIP to reflect this. Let me know what you think.


Best,

--
Igor



Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Igor,

Thanks for the reply. A few more replies and comments.

31. Thanks for the explanation. This looks good to me then.

35. Yes, you are right.

36. Yes, this seems fine since the KRaft controller allows broker requests
before it's being unfenced.

37. Yes, it's probably simpler without the optimization for the case with a
single log dir.

41. "If log directory that holds the cluster metadata topic is configured
separately to a different path — using metadata.log.dir — then the
respective UUID for this log directory is excluded from both online and
offline sets, as the broker cannot run if this particular log directory is
unavailable." Hmm, there could be user partitions on metadata.log.dir. So
the controller needs to know the disk UUID for that dir, right? Otherwise,
the controller will keep setting those replicas' disk ids to zero.

42. "If an entry is removed from log.dirs  the broker can also
automatically update directory.ids as long as no log directories are
offline when the broker comes back up." This makes the removal of a disk
less convenient than before since one can't just remove the disk from
log.dirs and decommission the disk in a single step.

43. "If the broker knows that the partition already exists in a different
log directory that is now offline" The broker may not know that since
currently we remove the offline partitions from LogManager.

44. "When configured for the migration and while still in ZK mode, brokers
will: update meta.properties to generate and include directory.id  and
directory.ids;" I guess there is no downgrade after this point? In KIP-866,
the meta.properties is only changed after the upgrade is finalized.

Thanks,

Jun

On Mon, Feb 6, 2023 at 10:15 AM Igor Soarez <so...@apple.com.invalid>
wrote:

> Hi David,
>
> Thank you for your suggestions and for having a look at this KIP.
>
> 1. Yes, that should be OK. I have updated the section
> "Migrating a cluster in ZK mode running with JBOD" to reflect this.
>
> 2. I've updated the motivation section to state that.
>
> Best,
>
> --
> Igor
>
>
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi David,

Thank you for your suggestions and for having a look at this KIP.

1. Yes, that should be OK. I have updated the section
"Migrating a cluster in ZK mode running with JBOD" to reflect this.

2. I've updated the motivation section to state that.

Best,

--
Igor



Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by David Arthur <da...@confluent.io.INVALID>.
Igor, thanks for the KIP!

1. For the ZK migration part, I wonder if we could avoid monitoring the
directory failure ZNode while in dual-write mode. So far in the migration
design, we have avoided reading anything from ZK after the initial metadata
migration. We have modified the ZK brokers to use the
BrokerLifecycleManager while in migration mode as a means to register with
the KRaft controller. Could we use the proposed new fields in
BrokerRegistrationRequest and BrokerHeartbeatRequest RPCs to inform the
KRaft controller of failed directories?

2. We should probably explicitly state that this KIP will subsume KIP-589

Thanks!
David

On Wed, Jan 25, 2023 at 6:14 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Igor,
>
> Thanks for the reply. A few more comments/questions.
>
> 30. In ZK mode, LeaderAndIsr has a field isNew that indicates whether a
> replica is newly created or not. How do we convey the same thing with
> metadata records in KRaft mode?
>
> 31. If one replaces a log dir with a disk, I guess we need to assign the
> same UUID to the log dir? I am not sure if we could do this since we don't
> store the mapping from log dir name to its UUID.
>
> 32. "If the partition is not new and does not exist, and there are any
> offline log directories, the broker uses AssignReplicasToDirs to change the
> metadata assignment to an offline log directory." Hmm, should the broker
> send the HeartBeat request to propagate the failed log dir instead?
> Actually, does it need to send it at all since the same information should
> already be sent when the offline dir was first detected?
>
> 33. "If the partition already exists in another log directory and is a
> future replica in the log directory indicated by the metadata, the broker
> will replace the current replica with the future replica." It seems that
> the broker needs to make sure that the future replica catches up with the
> current replica first?
>
> 34. "As a fallback safety mechanism in case of any synchronisation issues,
> when a log directory fails, the broker will check log directory assignments
> in the metadata and if there are any missing replicas, the broker will use
> AlterReplicaLogDirs assigning the replicas to the offline log directory, to
> convey to the controller that these replicas are offline and need
> leadership and ISR updates." Hmm, the motivation says that we want to avoid
> large requests introduced in KIP-589. By sending AlterReplicaLogDirs on log
> dir fail, it seems that we are issuing the same large request every time a
> log dir fails?
>
> 35. "When replicas are moved between directories, using the existing
> AlterReplicaLogDirs RPC, the receiving broker will start moving the
> replicas using AlterReplicaLogDirs threads as usual."  AlterReplicaLogDirs
> is new request. The existing one is AlterReplicaDirRequest.
>
> 36. "If the broker is configured with multiple log directories it remains
> FENCED until it can verify that all partitions are assigned to the correct
> log directories in the cluster metadata." Hmm, this could be a bit weird.
> A broker probably shouldn't send any request other than HeartBeat to the
> controller until it's unfenced. But it may need to send the
> AssignReplicasToDirs request to complete the log dir assignment. Is this
> required?
>
> 37. "Existing replicas without a log directory are either: Assumed to live
> in a broker that isn’t yet configured with multiple log directories, and so
> live in a single log directory, even if the UUID for that directory isn’t
> yet known by the controller. It is not possible to trigger a log directory
> failure from a broker that has a single log directory, as the broker would
> simply shut down if there are no remaining online log directories." So, if
> there is a single log dir, the broker doesn't send AssignReplicasToDirs
> requests to the controller?
>
> 38. In the section under "If the partition is assigned to an online log
> directory", it seems that we need to add another case. If the partition
> exists and a future replica also exists in another dir, should we start the
> process to replicate the current replica to the future replica. This is the
> existing behavior.
>
> 39. "If there are offline log directories, the broker will fail at startup
> if there is a mismatch between the number of entries in log.dirs  and
> directory.ids." This seems a bit restricting since it means that one can't
> add new log dir when an existing log dir is down?
>
> 40. "In a future release, broker registration without online log directory
> UUIDs will be disallowed." Hmm, currently, the broker already fails after
> log recovery if no log dir is online, right?
>
> Thanks,
>
> Jun
>
>
> On Mon, Jan 23, 2023 at 3:59 AM Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi Igor,
> >
> > Thanks for your replies on points 21-25. Those all LGTM. See below for a
> > further thought on the meta.properties upgrade.
> >
> > Kind regards,
> >
> > Tom
> >
> >
> > On Fri, 13 Jan 2023 at 18:09, Igor Soarez <so...@apple.com.invalid>
> > wrote:
> >
> > > Hi Tom,
> > >
> > > Thank you for having another look.
> > >
> > > 20. Upon a downgrade to a Kafka version that runs the current
> > > "version == 1" assertion, then yes — a downgrade would not be possible
> > > without first updating (manually) the meta.properties files back
> > > to the previous version.
> > >
> > > We could prevent this issue if we consider the new fields optional and
> > > not bump the version, but this takes away from the value of having
> > > a version in the first place. I'm not sure this would be a good
> > trade-off.
> > >
> > > IIUC in KIP-866 this issue was addressed by delaying the writing of
> > > meta.properties under the new version until after the full transition
> to
> > > KRaft mode is finished and relaxing the expectations around versioning.
> > > We could take a similar approach by relaxing the current
> > > expectation that the version must strictly be 1.
> > >
> > > Currently, MetaProperties#parse() checks that the version is 1 but only
> > > requires that the cluster.id and node.id properties are present in the
> > > file.
> > > We can relax the requirement, ensuring only that the version is at
> least
> > 1,
> > > and land that change before this KIP is implemented – at least one
> > release
> > > beforehand. Then there will be some previous version for which a
> > downgrade
> > > is possible, even after the version has been bumped to 2 and the two
> new
> > > properties are introduced. WDYT?
> > >
> >
> > Doing that would delay JBOD support and mean existing users of JBOD must
> > upgrade via that specific version.
> >
> > I think I favour a different way of relaxing the expectation around
> > versioning:
> > 1. In the new broker, update the meta.properties with the new keys but
> > keeping version=1 (AFAICS this won't cause problems if we roll back to
> old
> > broker binaries at any point prior to the completion of KRaft upgrade,
> > because the code ignores unknown keys).
> > 2. When the KRaft upgrade is complete (i.e. once a broker has written the
> > final ZkMigrationRecord), update meta.properties to version=2, where the
> > new keys are required.
> >
> > That would mean that during the upgrade the additional validation of
> > meta.properties is missing. I think that's acceptable because the extra
> > protection provided isn't something supported by ZK-based JBOD today, so
> > there is no loss of functionality.
> >
> > Wdty?
> >
> >
> > > 21. I think in this case we can reuse LOG_DIR_NOT_FOUND (code 57) and
> > > perhaps
> > > reword its description slightly to be a bit more generic.
> > > I've updated the KIP to mention this.
> > >
> > > 22. This check should be enforced by the broker, as the controller
> cannot
> > > know whether the current replica to logdir assignment is correct.
> > > Currently, the controller does not unfence a broker that does not
> "want"
> > > to be unfenced. The broker should not want to be unfenced while there
> > > are still mismatching (wrong or missing) replica to logdir assignments.
> > > I updated the KIP to clarify this.
> > >
> > > So a reason wouldn't be included in the heartbeat response but your
> point
> > > about diagnosing is interesting. So I've updated the KIP to propose a
> > > new broker metric reporting the number of mismatching replica
> assignments
> > > - i.e. replica assignments that are either missing or incorrect in the
> > > cluster metadata, as observed by the broker — until this metric is
> zero,
> > > the broker would not want to be unfenced. Let me know what you think.
> > >
> > > 23. I don't see why not. We should be able to extend the system tests
> > > proposed in KIP-866 to cover this.
> > >
> > > 24. Correct. Updated to remove the typo.
> > >
> > > 25. Good point. Updated to align the names.
> > >
> > >
> > > Thank you,
> > >
> > > --
> > > Igor
> > >
> > >
> > >
> >
>


-- 
-David

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Igor,

Thanks for the reply. A few more comments/questions.

30. In ZK mode, LeaderAndIsr has a field isNew that indicates whether a
replica is newly created or not. How do we convey the same thing with
metadata records in KRaft mode?

31. If one replaces a log dir with a disk, I guess we need to assign the
same UUID to the log dir? I am not sure if we could do this since we don't
store the mapping from log dir name to its UUID.

32. "If the partition is not new and does not exist, and there are any
offline log directories, the broker uses AssignReplicasToDirs to change the
metadata assignment to an offline log directory." Hmm, should the broker
send the HeartBeat request to propagate the failed log dir instead?
Actually, does it need to send it at all since the same information should
already be sent when the offline dir was first detected?

33. "If the partition already exists in another log directory and is a
future replica in the log directory indicated by the metadata, the broker
will replace the current replica with the future replica." It seems that
the broker needs to make sure that the future replica catches up with the
current replica first?

34. "As a fallback safety mechanism in case of any synchronisation issues,
when a log directory fails, the broker will check log directory assignments
in the metadata and if there are any missing replicas, the broker will use
AlterReplicaLogDirs assigning the replicas to the offline log directory, to
convey to the controller that these replicas are offline and need
leadership and ISR updates." Hmm, the motivation says that we want to avoid
large requests introduced in KIP-589. By sending AlterReplicaLogDirs on log
dir fail, it seems that we are issuing the same large request every time a
log dir fails?

35. "When replicas are moved between directories, using the existing
AlterReplicaLogDirs RPC, the receiving broker will start moving the
replicas using AlterReplicaLogDirs threads as usual."  AlterReplicaLogDirs
is new request. The existing one is AlterReplicaDirRequest.

36. "If the broker is configured with multiple log directories it remains
FENCED until it can verify that all partitions are assigned to the correct
log directories in the cluster metadata." Hmm, this could be a bit weird.
A broker probably shouldn't send any request other than HeartBeat to the
controller until it's unfenced. But it may need to send the
AssignReplicasToDirs request to complete the log dir assignment. Is this
required?

37. "Existing replicas without a log directory are either: Assumed to live
in a broker that isn’t yet configured with multiple log directories, and so
live in a single log directory, even if the UUID for that directory isn’t
yet known by the controller. It is not possible to trigger a log directory
failure from a broker that has a single log directory, as the broker would
simply shut down if there are no remaining online log directories." So, if
there is a single log dir, the broker doesn't send AssignReplicasToDirs
requests to the controller?

38. In the section under "If the partition is assigned to an online log
directory", it seems that we need to add another case. If the partition
exists and a future replica also exists in another dir, should we start the
process to replicate the current replica to the future replica. This is the
existing behavior.

39. "If there are offline log directories, the broker will fail at startup
if there is a mismatch between the number of entries in log.dirs  and
directory.ids." This seems a bit restricting since it means that one can't
add new log dir when an existing log dir is down?

40. "In a future release, broker registration without online log directory
UUIDs will be disallowed." Hmm, currently, the broker already fails after
log recovery if no log dir is online, right?

Thanks,

Jun


On Mon, Jan 23, 2023 at 3:59 AM Tom Bentley <tb...@redhat.com> wrote:

> Hi Igor,
>
> Thanks for your replies on points 21-25. Those all LGTM. See below for a
> further thought on the meta.properties upgrade.
>
> Kind regards,
>
> Tom
>
>
> On Fri, 13 Jan 2023 at 18:09, Igor Soarez <so...@apple.com.invalid>
> wrote:
>
> > Hi Tom,
> >
> > Thank you for having another look.
> >
> > 20. Upon a downgrade to a Kafka version that runs the current
> > "version == 1" assertion, then yes — a downgrade would not be possible
> > without first updating (manually) the meta.properties files back
> > to the previous version.
> >
> > We could prevent this issue if we consider the new fields optional and
> > not bump the version, but this takes away from the value of having
> > a version in the first place. I'm not sure this would be a good
> trade-off.
> >
> > IIUC in KIP-866 this issue was addressed by delaying the writing of
> > meta.properties under the new version until after the full transition to
> > KRaft mode is finished and relaxing the expectations around versioning.
> > We could take a similar approach by relaxing the current
> > expectation that the version must strictly be 1.
> >
> > Currently, MetaProperties#parse() checks that the version is 1 but only
> > requires that the cluster.id and node.id properties are present in the
> > file.
> > We can relax the requirement, ensuring only that the version is at least
> 1,
> > and land that change before this KIP is implemented – at least one
> release
> > beforehand. Then there will be some previous version for which a
> downgrade
> > is possible, even after the version has been bumped to 2 and the two new
> > properties are introduced. WDYT?
> >
>
> Doing that would delay JBOD support and mean existing users of JBOD must
> upgrade via that specific version.
>
> I think I favour a different way of relaxing the expectation around
> versioning:
> 1. In the new broker, update the meta.properties with the new keys but
> keeping version=1 (AFAICS this won't cause problems if we roll back to old
> broker binaries at any point prior to the completion of KRaft upgrade,
> because the code ignores unknown keys).
> 2. When the KRaft upgrade is complete (i.e. once a broker has written the
> final ZkMigrationRecord), update meta.properties to version=2, where the
> new keys are required.
>
> That would mean that during the upgrade the additional validation of
> meta.properties is missing. I think that's acceptable because the extra
> protection provided isn't something supported by ZK-based JBOD today, so
> there is no loss of functionality.
>
> Wdty?
>
>
> > 21. I think in this case we can reuse LOG_DIR_NOT_FOUND (code 57) and
> > perhaps
> > reword its description slightly to be a bit more generic.
> > I've updated the KIP to mention this.
> >
> > 22. This check should be enforced by the broker, as the controller cannot
> > know whether the current replica to logdir assignment is correct.
> > Currently, the controller does not unfence a broker that does not "want"
> > to be unfenced. The broker should not want to be unfenced while there
> > are still mismatching (wrong or missing) replica to logdir assignments.
> > I updated the KIP to clarify this.
> >
> > So a reason wouldn't be included in the heartbeat response but your point
> > about diagnosing is interesting. So I've updated the KIP to propose a
> > new broker metric reporting the number of mismatching replica assignments
> > - i.e. replica assignments that are either missing or incorrect in the
> > cluster metadata, as observed by the broker — until this metric is zero,
> > the broker would not want to be unfenced. Let me know what you think.
> >
> > 23. I don't see why not. We should be able to extend the system tests
> > proposed in KIP-866 to cover this.
> >
> > 24. Correct. Updated to remove the typo.
> >
> > 25. Good point. Updated to align the names.
> >
> >
> > Thank you,
> >
> > --
> > Igor
> >
> >
> >
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Tom Bentley <tb...@redhat.com>.
Hi Igor,

Thanks for your replies on points 21-25. Those all LGTM. See below for a
further thought on the meta.properties upgrade.

Kind regards,

Tom


On Fri, 13 Jan 2023 at 18:09, Igor Soarez <so...@apple.com.invalid> wrote:

> Hi Tom,
>
> Thank you for having another look.
>
> 20. Upon a downgrade to a Kafka version that runs the current
> "version == 1" assertion, then yes — a downgrade would not be possible
> without first updating (manually) the meta.properties files back
> to the previous version.
>
> We could prevent this issue if we consider the new fields optional and
> not bump the version, but this takes away from the value of having
> a version in the first place. I'm not sure this would be a good trade-off.
>
> IIUC in KIP-866 this issue was addressed by delaying the writing of
> meta.properties under the new version until after the full transition to
> KRaft mode is finished and relaxing the expectations around versioning.
> We could take a similar approach by relaxing the current
> expectation that the version must strictly be 1.
>
> Currently, MetaProperties#parse() checks that the version is 1 but only
> requires that the cluster.id and node.id properties are present in the
> file.
> We can relax the requirement, ensuring only that the version is at least 1,
> and land that change before this KIP is implemented – at least one release
> beforehand. Then there will be some previous version for which a downgrade
> is possible, even after the version has been bumped to 2 and the two new
> properties are introduced. WDYT?
>

Doing that would delay JBOD support and mean existing users of JBOD must
upgrade via that specific version.

I think I favour a different way of relaxing the expectation around
versioning:
1. In the new broker, update the meta.properties with the new keys but
keeping version=1 (AFAICS this won't cause problems if we roll back to old
broker binaries at any point prior to the completion of KRaft upgrade,
because the code ignores unknown keys).
2. When the KRaft upgrade is complete (i.e. once a broker has written the
final ZkMigrationRecord), update meta.properties to version=2, where the
new keys are required.

That would mean that during the upgrade the additional validation of
meta.properties is missing. I think that's acceptable because the extra
protection provided isn't something supported by ZK-based JBOD today, so
there is no loss of functionality.

Wdty?


> 21. I think in this case we can reuse LOG_DIR_NOT_FOUND (code 57) and
> perhaps
> reword its description slightly to be a bit more generic.
> I've updated the KIP to mention this.
>
> 22. This check should be enforced by the broker, as the controller cannot
> know whether the current replica to logdir assignment is correct.
> Currently, the controller does not unfence a broker that does not "want"
> to be unfenced. The broker should not want to be unfenced while there
> are still mismatching (wrong or missing) replica to logdir assignments.
> I updated the KIP to clarify this.
>
> So a reason wouldn't be included in the heartbeat response but your point
> about diagnosing is interesting. So I've updated the KIP to propose a
> new broker metric reporting the number of mismatching replica assignments
> - i.e. replica assignments that are either missing or incorrect in the
> cluster metadata, as observed by the broker — until this metric is zero,
> the broker would not want to be unfenced. Let me know what you think.
>
> 23. I don't see why not. We should be able to extend the system tests
> proposed in KIP-866 to cover this.
>
> 24. Correct. Updated to remove the typo.
>
> 25. Good point. Updated to align the names.
>
>
> Thank you,
>
> --
> Igor
>
>
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Tom,

Thank you for having another look.

20. Upon a downgrade to a Kafka version that runs the current
"version == 1" assertion, then yes — a downgrade would not be possible
without first updating (manually) the meta.properties files back
to the previous version.

We could prevent this issue if we consider the new fields optional and
not bump the version, but this takes away from the value of having
a version in the first place. I'm not sure this would be a good trade-off.

IIUC in KIP-866 this issue was addressed by delaying the writing of
meta.properties under the new version until after the full transition to
KRaft mode is finished and relaxing the expectations around versioning.
We could take a similar approach by relaxing the current
expectation that the version must strictly be 1.

Currently, MetaProperties#parse() checks that the version is 1 but only
requires that the cluster.id and node.id properties are present in the file.
We can relax the requirement, ensuring only that the version is at least 1,
and land that change before this KIP is implemented – at least one release
beforehand. Then there will be some previous version for which a downgrade
is possible, even after the version has been bumped to 2 and the two new
properties are introduced. WDYT?

21. I think in this case we can reuse LOG_DIR_NOT_FOUND (code 57) and perhaps
reword its description slightly to be a bit more generic.
I've updated the KIP to mention this.

22. This check should be enforced by the broker, as the controller cannot
know whether the current replica to logdir assignment is correct.
Currently, the controller does not unfence a broker that does not "want"
to be unfenced. The broker should not want to be unfenced while there
are still mismatching (wrong or missing) replica to logdir assignments.
I updated the KIP to clarify this.

So a reason wouldn't be included in the heartbeat response but your point
about diagnosing is interesting. So I've updated the KIP to propose a
new broker metric reporting the number of mismatching replica assignments
- i.e. replica assignments that are either missing or incorrect in the
cluster metadata, as observed by the broker — until this metric is zero,
the broker would not want to be unfenced. Let me know what you think.

23. I don't see why not. We should be able to extend the system tests
proposed in KIP-866 to cover this.

24. Correct. Updated to remove the typo.

25. Good point. Updated to align the names.


Thank you,

--
Igor



Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Tom Bentley <tb...@redhat.com>.
Hi Igor,

20. The description of the changes to meta.properties says "If there any
meta.properties file is missing directory.id a new UUID is generated, and
assigned to that log directory by updating the file", and the
upgrade/migration section says "As the upgraded brokers come up, the
existing meta.properties  files in each broker are updated with a generated
directory.id  and directory.ids ." Currently MetaProperties#parse() checks
that the version is 1, so would the described behaviour prevent downgrade
of a broker to an older version of the software?

21. "If the indicated log directory UUID is not a registered log directory
then the call fails with an error" can you specify which error (is it a new
error code)?

22. "If multiple log directories are registered the broker will remain
fenced until the controller learns of all the partition to log directory
placements in that broker - i.e. no remaining replicas assigned to Uuid.ZERO ."
Is an error code used in the BrokerHeartbeatResponse to indicate this
state? (Or is the only way to diagnose the reason for a broker remaining
fenced for this reason to look at the controller logs?)

23. Will there be a system test to cover the upgrade of a ZK+JBOD cluster
to KRaft+JBOD cluster?

24. In the rejected alternatives: "However the broker is in a better
position to make a choice of log directory than the broker". I think that
should be "...than the controller", right?

25. I wonder about the inconsistency of the RPC names: We have the existing
AlterReplicaLogDirs (and log.dirs broker config), but the new RPC is
AssignReplicasToDirectories.

Many thanks!

Tom

On Tue, 3 Jan 2023 at 18:05, Igor Soarez <so...@apple.com.invalid> wrote:

> Hi Jun,
>
> Thank you for having another look.
>
> 11. That is correct. I have updated the KIP in an attempt to make this
> clearer.
> I think the goal should be to try to minimize the chance that a log
> directory
> may happen while the metadata is incorrect about the log directory
> assignment,
> but also have a fallback safety mechanism to indicate to the controller
> that
> some replica was missed in case of a bad race.
>
> 13. Ok, I think I have misunderstood this. Thank you for correcting me.
> In this case the broker can update the existing meta.properties and create
> new meta.properties in the new log directories.
> This also means that the update-directories subcommand in kafka-storage.sh
> is not necessary.
> I have updated the KIP to reflect this.
>
> Please have another look.
>
>
> Thank you,
>
> --
> Igor
>
>
> > On 22 Dec 2022, at 00:25, Jun Rao <ju...@confluent.io.INVALID> wrote:
> >
> > Hi, Igor,
> >
> > Thanks for the reply.
> >
> > 11. Yes, your proposal could work. Once the broker receives confirmation
> of
> > the metadata change, I guess it needs to briefly block appends to the old
> > replica, make sure the future log fully catches up and then make the
> switch?
> >
> > 13 (b). The kafka-storage.sh is only required in KIP-631 for a brand new
> > KRaft cluster. If the cluster already exists and one just wants to add a
> > log dir, it seems inconvenient to have to run the kafka-storage.sh tool
> > again.
> >
> > Thanks,
> >
> > Jun
>
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Jun,

Thank you for having another look.

11. That is correct. I have updated the KIP in an attempt to make this clearer.
I think the goal should be to try to minimize the chance that a log directory
failure happens while the metadata is incorrect about the log directory
assignment, but also to have a fallback safety mechanism to indicate to the
controller that some replica was missed in case of a bad race.

13. Ok, I think I have misunderstood this. Thank you for correcting me.
In this case the broker can update the existing meta.properties and create
new meta.properties in the new log directories.
This also means that the update-directories subcommand in kafka-storage.sh
is not necessary.
I have updated the KIP to reflect this.

Please have another look.


Thank you,

--
Igor


> On 22 Dec 2022, at 00:25, Jun Rao <ju...@confluent.io.INVALID> wrote:
> 
> Hi, Igor,
> 
> Thanks for the reply.
> 
> 11. Yes, your proposal could work. Once the broker receives confirmation of
> the metadata change, I guess it needs to briefly block appends to the old
> replica, make sure the future log fully catches up and then make the switch?
> 
> 13 (b). The kafka-storage.sh is only required in KIP-631 for a brand new
> KRaft cluster. If the cluster already exists and one just wants to add a
> log dir, it seems inconvenient to have to run the kafka-storage.sh tool
> again.
> 
> Thanks,
> 
> Jun

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Igor,

Thanks for the reply.

11. Yes, your proposal could work. Once the broker receives confirmation of
the metadata change, I guess it needs to briefly block appends to the old
replica, make sure the future log fully catches up and then make the switch?

13 (b). The kafka-storage.sh is only required in KIP-631 for a brand new
KRaft cluster. If the cluster already exists and one just wants to add a
log dir, it seems inconvenient to have to run the kafka-storage.sh tool
again.

Thanks,

Jun


On Thu, Dec 1, 2022 at 10:19 AM Igor Soarez <so...@apple.com.invalid>
wrote:

>
> Hi Jun,
>
> Thank you for reviewing the KIP. Please find my replies to
> your comments below.
>
> 10. Thanks for pointing out this typo; it has been corrected.
>
>
> 11. I agree that the additional delay in switching to the
> future replica is undesirable, however I see a couple of
> issues if we forward the request to the controller
> as you describe:
>
>    a) If the controller persists the change to the log directory
>    assignment before the future replica has caught up and there
>    is a failure in the original log directory then if the broker
>    is a leader for the partition there will be no failover
>    and the partition will become unavailable. It is not safe to
>    call AssignReplicasToDirectories before the replica exists
>    in the designated log directory.
>
>    b) An existing behavior we'd like to maintain if possible is
>    the ability to move partitions between log directories when a
>    broker is offline, as it can be very useful to manage storage
>    space. ZK brokers will load and accept logs in the new
>    location after startup.
>    Maintaining this behavior requires that the broker be able to
>    override/correct assignments that are seen in the cluster metadata.
>    i.e. the broker is the authority on log directory placement in case
>    of mismatch.
>    If we want to keep this feature and have the controller send log
>    directory reassignments, we'll need a way to distinguish between
>    mismatch due to offline movement and mismatch due to controller
>    triggered reassignment.
>
> To keep the delay low, instead of sending AlterReplicaLogDirs
> within the lock the RPC can be queued elsewhere when the future
> replica first catches up. ReplicaAlterLogDirsThread can keep
> going and not switch the replicas yet.
> Once the broker receives confirmation of the metadata change
> it can then briefly block appends to the old replica and make the switch.
> In the unlikely event that source log directory fails between the moment
> AssignReplicasToDirectories is acknowledged by the controller and
> before the broker is able to make the switch, then the broker
> can issue AssignReplicasToDirectories of that replica back to the offline
> log directory to let the controller know that the replica is actually
> offline.
> What do you think?
>
>
> 12. Indeed, the metadata.log.dir, if explicitly defined to a separate
> directory should not be included in the directory UUID list sent
> to the controller in broker registration and heartbeat requests.
> I have updated the KIP to make this explicit.
>
>
> 13. Thank you for making this suggestion.
> Let's address the different scenarios you enumerated:
>
>  a) When enabling JBOD for an existing KRaft cluster
>
>  In this scenario, the broker finds a single log directory configured
>  in `log.dirs`, with an already existing `meta.properties`, which is
>  simply missing `directory.id` and `directory.ids`.
>
>  It is safe to have the broker automatically generate the log dir
>  UUID and update the `meta.properties` file. This removes any need
>  to have extra steps in upgrading and enabling of this feature,
>  so it is quite useful.
>
>
>  b) Adding a log dir to an existing JBOD KRaft cluster
>
>  In this scenario, the broker finds a shorter list of `directory.ids`
>  in `meta.properties` than what is configured in `log.dirs`.
>
>  KIP-631 introduced the requirement to run the command kafka-storage.sh
>  to format storage directories.
>
>  Currently, if the broker in KRaft mode cannot find `meta.properties`
>  in each log directory it will fail at startup. KIP-785 proposes
>  removing the need to run the format storage command, but it is
>  still open. If new log directories are being added the storage
>  command must be run anyway. So I don't think there will be any
>  benefit in this case.
>
>
>  c) Removing a log dir from an existing JBOD KRaft cluster
>
>  In this scenario the broker finds a larger list of `directory.ids`
>  in `meta.properties` than what is configured in `log.dirs`.
>
>  The removal of log directories requires an explicit update to
>  the `log.dirs` configuration, so it is also safe to have the
>  broker automatically update `directory.ids` in `meta.properties`
>  to remove the extra UUIDs. It's also useful to drop the requirement
>  to run the storage command after removal of log directories from
>  configuration, as it reduces operational burden.
>
>
>  c) Upgrading a JBOD ZK cluster to KRaft
>
>  In ZooKeeper brokers write `meta.properties` with `version=0`,
>  but in KRaft mode brokers require `version=1`.
>  According to the current discussion on KIP-866, the broker will
>  automatically upgrade meta.properties from `version=0` to `version=1`.
>  Assuming the meta.properties gets translated into `version=1` as part
>  of KIP-866, then the broker should find `meta.properties` which are
>  simply missing the two new fields: `directory.id` and `directory.ids`.
>  So it should be OK for the broker to generate UUIDs for each log
>  directory and update the `meta.properties` file.
>
>
> I have updated the KIP to remove the need to run the storage command
> in these scenarios - apart from b) where it should already be a
> requirement AFAICT - and describe how the broker will automatically
> update `directory.id` and `directory.ids` in `meta.properties` if
> required.
>
>
> 14. Thank you for pointing this out. Yes, the controller should
> simply delegate the log directory choice back to the broker,
> by assigning `Uuid.ZERO` to all the replicas that were assigned
> to removed log directories. I have updated the KIP under
> "Controller" > "Broker registration".
>
>
> 15. Yes. Same as in "Metadata caching", replicas are considered
> offline if the replica references a log directory which is not in the
> list of online log directories for the broker ID hosting the replica.
> This means the controller will not assign leadership those replicas.
> This is described under "Controller" > "Handling log directory failures".
>
>
> 16. We'll need to bump metadata.version to gate use of the records
> and RPC changes. I've added mention of this under the
> "Compatibility, Deprecation, and Migration Plan" section.
>
>
>
> Please let me know of any further feedback.
>
> Best,
>
> --
> Igor