Posted to dev@kafka.apache.org by Igor Soarez <so...@apple.com.INVALID> on 2023/04/17 11:31:26 UTC

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Hi Jun,

Thank you for sharing your questions. Please find my answers below.


41. There can only be user partitions on `metadata.log.dir` if that log
dir is also listed in `log.dirs`.
`LogManager` does not specifically load contents from `metadata.log.dir`.

The broker will communicate UUIDs to the controller for all log dirs
configured in `log.dirs`. If the metadata directory happens to be one
of those, it may also contain user partitions, so the controller will
know about it. If it is a completely separate log dir, it cannot hold
user partitions, so there's no need to include it.
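
As a rough illustration of the rule above, here is a minimal Python sketch.
The function name and the plain string comparison of paths are my own
assumptions for the example, not the actual broker code:

```python
def reported_log_dirs(log_dirs, metadata_log_dir):
    """Return (dirs reported to the controller, whether the metadata dir
    may hold user partitions).

    Hypothetical helper: only the entries of `log.dirs` are reported.
    """
    reported = list(log_dirs)
    # The metadata dir can hold user partitions only if it is also
    # listed in `log.dirs` (paths compared as plain strings here).
    metadata_may_hold_user_partitions = metadata_log_dir in reported
    return reported, metadata_may_hold_user_partitions


# Metadata dir is separate: it is not reported.
separate = reported_log_dirs(["/a", "/b"], "/meta")
# Metadata dir is also a log dir: it is reported like any other.
shared = reported_log_dirs(["/a", "/b"], "/a")
```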


42. I'm not sure what exactly you mean by "decommission the
disk", so please let me know if I'm missing your point here.

A disk can be removed from `log.dirs` and removed from the system in a
single broker restart:

  1. Shut down the broker
  2. Unmount, remove the disk
  3. Update `log.dirs` config
  4. Start the broker

Upon restart, the broker will update `directory.ids` in the
`meta.properties` for the remaining configured log dirs.

Log dir identity cannot be inferred from the path, because the same
storage device can be remounted under a different path. Instead, we
identify a log dir by its contents: the `directory.id` field in its
`meta.properties`.
But this also means that a log dir cannot be identified while it is
unavailable, so the broker can only generate `directory.ids` if all
log directories listed under `log.dirs` are available.

Consider the following example, where `log.dirs=/a,/b,/c`, and
the following `meta.properties` (irrelevant values omitted):

    # /a/meta.properties
    directory.id=1
    directory.ids=1,2,3

    # /b/meta.properties
    directory.id=2
    directory.ids=1,2,3

    # /c/meta.properties
    directory.id=3
    directory.ids=1,2,3

If `log.dirs` is updated to remove `/c`, the broker will be able
to determine the new value `directory.ids=1,2` by loading
`/a/meta.properties` and `/b/meta.properties`.
But if either `/a` or `/b` happens to be unavailable, e.g. due to some
temporary disk failure, we cannot determine `directory.ids`. For example,
if `/b` is unavailable, the broker can't tell if `directory.ids` should be
`1,2`, `1,3`, or even `1,4`.
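
The example can be sketched in a few lines of Python. This is only an
illustration: `compute_directory_ids` and the `read_directory_id` callback
are hypothetical names, and returning `None` for "cannot be determined" is
my simplification.

```python
def compute_directory_ids(log_dirs, read_directory_id):
    """Derive `directory.ids` from the `directory.id` of each configured dir.

    `read_directory_id(path)` is a hypothetical callback returning the
    `directory.id` stored under `path`, or None if the dir is unavailable.
    Returns None when any configured dir cannot be read, since its
    identity is then unknown.
    """
    ids = []
    for path in log_dirs:
        dir_id = read_directory_id(path)
        if dir_id is None:
            return None  # one unreadable dir makes the whole list unknowable
        ids.append(dir_id)
    return ids


# `/c` removed from `log.dirs`, both remaining dirs readable.
all_available = {"/a": "1", "/b": "2", "/c": "3"}
print(compute_directory_ids(["/a", "/b"], all_available.get))  # ['1', '2']

# Same config, but `/b` is unavailable: the result is undetermined.
b_offline = {"/a": "1"}
print(compute_directory_ids(["/a", "/b"], b_offline.get))  # None
```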

In a scenario where an operator wishes to remove a log dir from
configuration and some other log dir is also offline, the operator will have
a few options:

  a) Bring the offline log dir back online before restarting the broker.

  b) Edit `meta.properties` to remove the UUID for the deconfigured log dir
  from `directory.ids` in the remaining available log dirs. This will remove
  the need for the broker to regenerate `directory.ids`, because the entry
  counts for `directory.ids` and `log.dirs` will be equal.

  c) Also remove the offline log dir from `log.dirs`.
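
To make the three options concrete, here is a small Python sketch of the
decision as I describe it above. The function and the outcome labels are
purely illustrative assumptions, and the sketch does not say what the broker
does in the "cannot-regenerate" case.

```python
def startup_outcome(log_dirs, stored_directory_ids, offline_dirs):
    """Classify what happens to `directory.ids` at startup (illustrative).

    log_dirs: paths configured in `log.dirs`
    stored_directory_ids: `directory.ids` found in an available dir
    offline_dirs: set of paths that are currently unavailable
    """
    # Equal entry counts: no regeneration is needed.
    if len(stored_directory_ids) == len(log_dirs):
        return "keep"
    # Regeneration needs every configured dir to be readable.
    if any(d in offline_dirs for d in log_dirs):
        return "cannot-regenerate"
    return "regenerate"


# Starting point: `/c` removed from `log.dirs` while `/b` is offline.
stuck = startup_outcome(["/a", "/b"], ["1", "2", "3"], {"/b"})
# a) `/b` is brought back online first.
option_a = startup_outcome(["/a", "/b"], ["1", "2", "3"], set())
# b) `directory.ids` is edited by hand to drop the removed dir.
option_b = startup_outcome(["/a", "/b"], ["1", "2"], {"/b"})
# c) `/b` is removed from `log.dirs` as well.
option_c = startup_outcome(["/a"], ["1", "2", "3"], {"/b"})
```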


43. If the log dir had already failed at startup, indeed, the broker
will not know that. But in that case, there's no risk of a race or failure.

What I meant here relates rather to log dir failures at runtime.
I've updated this bit in the KIP to clarify.

When executing the log directory failure handler, the broker knows
which directory failed, which partitions resided there, and it can
check if any of those newly failed partitions refer to a different
log dir in the cluster metadata.

The assignment should be correct for all of them, as the broker
will be proactive in notifying the controller of any changes in
log dir assignment. But in case of some race condition, the broker
should nudge the controller to deal with the incorrectly
assigned partitions.
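
The check in the failure handler could look roughly like this in Python.
The names and the dict-based view of the cluster metadata are hypothetical,
chosen only to illustrate the comparison:

```python
def misassigned_partitions(failed_dir_id, partitions_in_failed_dir,
                           metadata_assignment):
    """Return partitions that were in the failed dir but which the cluster
    metadata assigns to a different log dir (or to none at all).

    metadata_assignment: hypothetical mapping of partition -> log dir UUID
    as currently recorded in the cluster metadata.
    """
    return [
        p for p in partitions_in_failed_dir
        if metadata_assignment.get(p) != failed_dir_id
    ]


# Dir "uuid-b" failed; the metadata still places foo-1 in "uuid-c".
stale = misassigned_partitions(
    "uuid-b",
    ["foo-0", "foo-1"],
    {"foo-0": "uuid-b", "foo-1": "uuid-c"},
)
print(stale)  # ['foo-1']
```

Any partition returned here is one the broker would need to nudge the
controller about.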


44. Tom Bentley and I have discussed this previously in this thread,
in emails dated Jan 10, 13, 23 and Feb 3.

When upgrading a JBOD-enabled ZK cluster, we could piggyback on the ZK to
KRaft upgrade (as per KIP-866) and delay bumping `meta.properties` until the
ZK->KRaft upgrade is finalized. After that, we do not support downgrading.

But I'm not convinced we should do this, since there's another upgrade
scenario: applying the change proposed in this KIP to a KRaft
cluster that does not yet support JBOD.
In this scenario the upgrade has no sequence of steps in which one
is considered final, and I'm not sure we'd want to introduce an
additional step, making the upgrade process more complex, just to
address this issue.

I think the best approach is to keep using `version=1` in `meta.properties`.
The new properties introduced in this KIP will safely be ignored by previous
versions, in either ZK or KRaft mode, and we avoid creating conflicts with
unexpected declared versions. The presence of the extra fields is also
innocuous in the case of a second upgrade following a downgrade.

I've updated the KIP to reflect this. Let me know what you think.


Best,

--
Igor