Posted to dev@kafka.apache.org by David Arthur <da...@confluent.io.INVALID> on 2023/02/01 14:51:58 UTC

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Igor, thanks for the KIP!

1. For the ZK migration part, I wonder if we could avoid monitoring the
directory failure ZNode while in dual-write mode. So far in the migration
design, we have avoided reading anything from ZK after the initial metadata
migration. We have modified the ZK brokers to use the
BrokerLifecycleManager while in migration mode as a means to register with
the KRaft controller. Could we use the proposed new fields in
BrokerRegistrationRequest and BrokerHeartbeatRequest RPCs to inform the
KRaft controller of failed directories?

2. We should probably explicitly state that this KIP will subsume KIP-589.

Thanks!
David

On Wed, Jan 25, 2023 at 6:14 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Igor,
>
> Thanks for the reply. A few more comments/questions.
>
> 30. In ZK mode, LeaderAndIsr has a field isNew that indicates whether a
> replica is newly created or not. How do we convey the same thing with
> metadata records in KRaft mode?
>
> 31. If one replaces the disk backing a log dir, I guess we need to assign the
> same UUID to the log dir? I am not sure if we could do this since we don't
> store the mapping from log dir name to its UUID.
>
> 32. "If the partition is not new and does not exist, and there are any
> offline log directories, the broker uses AssignReplicasToDirs to change the
> metadata assignment to an offline log directory." Hmm, should the broker
> send the HeartBeat request to propagate the failed log dir instead?
> Actually, does it need to send it at all since the same information should
> already be sent when the offline dir was first detected?
>
> 33. "If the partition already exists in another log directory and is a
> future replica in the log directory indicated by the metadata, the broker
> will replace the current replica with the future replica." It seems that
> the broker needs to make sure that the future replica catches up with the
> current replica first?
>
> 34. "As a fallback safety mechanism in case of any synchronisation issues,
> when a log directory fails, the broker will check log directory assignments
> in the metadata and if there are any missing replicas, the broker will use
> AlterReplicaLogDirs assigning the replicas to the offline log directory, to
> convey to the controller that these replicas are offline and need
> leadership and ISR updates." Hmm, the motivation says that we want to avoid
> large requests introduced in KIP-589. By sending AlterReplicaLogDirs on log
> dir fail, it seems that we are issuing the same large request every time a
> log dir fails?
>
> 35. "When replicas are moved between directories, using the existing
> AlterReplicaLogDirs RPC, the receiving broker will start moving the
> replicas using AlterReplicaLogDirs threads as usual."  AlterReplicaLogDirs
> is a new request. The existing one is AlterReplicaDirRequest.
>
> 36. "If the broker is configured with multiple log directories it remains
> FENCED until it can verify that all partitions are assigned to the correct
> log directories in the cluster metadata." Hmm, this could be a bit weird.
> A broker probably shouldn't send any request other than HeartBeat to the
> controller until it's unfenced. But it may need to send the
> AssignReplicasToDirs request to complete the log dir assignment. Is this
> required?
>
> 37. "Existing replicas without a log directory are either: Assumed to live
> in a broker that isn’t yet configured with multiple log directories, and so
> live in a single log directory, even if the UUID for that directory isn’t
> yet known by the controller. It is not possible to trigger a log directory
> failure from a broker that has a single log directory, as the broker would
> simply shut down if there are no remaining online log directories." So, if
> there is a single log dir, the broker doesn't send AssignReplicasToDirs
> requests to the controller?
>
> 38. In the section under "If the partition is assigned to an online log
> directory", it seems that we need to add another case. If the partition
> exists and a future replica also exists in another dir, should we start the
> process to replicate the current replica to the future replica? This is the
> existing behavior.
>
> 39. "If there are offline log directories, the broker will fail at startup
> if there is a mismatch between the number of entries in log.dirs  and
> directory.ids." This seems a bit restricting since it means that one can't
> add new log dir when an existing log dir is down?
>
> 40. "In a future release, broker registration without online log directory
> UUIDs will be disallowed." Hmm, currently, the broker already fails after
> log recovery if no log dir is online, right?
>
> Thanks,
>
> Jun
>
>
> On Mon, Jan 23, 2023 at 3:59 AM Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi Igor,
> >
> > Thanks for your replies on points 21-25. Those all LGTM. See below for a
> > further thought on the meta.properties upgrade.
> >
> > Kind regards,
> >
> > Tom
> >
> >
> > On Fri, 13 Jan 2023 at 18:09, Igor Soarez <so...@apple.com.invalid>
> > wrote:
> >
> > > Hi Tom,
> > >
> > > Thank you for having another look.
> > >
> > > 20. Yes, downgrading to a Kafka version that runs the current
> > > "version == 1" assertion would not be possible without first
> > > manually updating the meta.properties files back to the previous version.
> > >
> > > We could prevent this issue if we consider the new fields optional and
> > > not bump the version, but this takes away from the value of having
> > > a version in the first place. I'm not sure this would be a good trade-off.
> > >
> > > IIUC in KIP-866 this issue was addressed by delaying the writing of
> > > meta.properties under the new version until after the full transition to
> > > KRaft mode is finished and relaxing the expectations around versioning.
> > > We could take a similar approach by relaxing the current
> > > expectation that the version must strictly be 1.
> > >
> > > Currently, MetaProperties#parse() checks that the version is 1 but only
> > > requires that the cluster.id and node.id properties are present in the
> > > file.
> > > We can relax the requirement, ensuring only that the version is at least 1,
> > > and land that change before this KIP is implemented – at least one release
> > > beforehand. Then there will be some previous version for which a downgrade
> > > is possible, even after the version has been bumped to 2 and the two new
> > > properties are introduced. WDYT?
> > >
> >
> > Doing that would delay JBOD support and mean existing users of JBOD must
> > upgrade via that specific version.
> >
> > I think I favour a different way of relaxing the expectation around
> > versioning:
> > 1. In the new broker, update the meta.properties with the new keys but keep
> > version=1 (AFAICS this won't cause problems if we roll back to old broker
> > binaries at any point prior to the completion of KRaft upgrade, because the
> > code ignores unknown keys).
> > 2. When the KRaft upgrade is complete (i.e. once a broker has written the
> > final ZkMigrationRecord), update meta.properties to version=2, where the
> > new keys are required.
> >
> > That would mean that during the upgrade the additional validation of
> > meta.properties is missing. I think that's acceptable because the extra
> > protection provided isn't something supported by ZK-based JBOD today, so
> > there is no loss of functionality.
> >
> > Wdyt?
> >
> >
> > > 21. I think in this case we can reuse LOG_DIR_NOT_FOUND (code 57) and
> > > perhaps
> > > reword its description slightly to be a bit more generic.
> > > I've updated the KIP to mention this.
> > >
> > > 22. This check should be enforced by the broker, as the controller cannot
> > > know whether the current replica to logdir assignment is correct.
> > > Currently, the controller does not unfence a broker that does not "want"
> > > to be unfenced. The broker should not want to be unfenced while there
> > > are still mismatching (wrong or missing) replica to logdir assignments.
> > > I updated the KIP to clarify this.
> > >
> > > So a reason wouldn't be included in the heartbeat response, but your point
> > > about diagnosing is interesting. So I've updated the KIP to propose a
> > > new broker metric reporting the number of mismatching replica assignments,
> > > i.e. replica assignments that are either missing or incorrect in the
> > > cluster metadata, as observed by the broker. Until this metric is zero,
> > > the broker would not want to be unfenced. Let me know what you think.
> > >
> > > 23. I don't see why not. We should be able to extend the system tests
> > > proposed in KIP-866 to cover this.
> > >
> > > 24. Correct. Updated to remove the typo.
> > >
> > > 25. Good point. Updated to align the names.
> > >
> > >
> > > Thank you,
> > >
> > > --
> > > Igor
> > >
> > >
> > >
> >
>


-- 
-David

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <i...@soarez.me>.
Hi all,

I’ve had to step away from work for personal reasons for a couple of months – until mid April 2023. I don’t think I’ll be able to continue to address feedback or update this KIP before then.

--
Igor

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Jun,

Thank you for sharing your questions, please find my answers below.


41. There can only be user partitions on `metadata.log.dir` if that log
dir is also listed in `log.dirs`.
`LogManager` does not specifically load contents from `metadata.log.dir`.

The broker will communicate UUIDs to the controller for all log dirs
configured in `log.dirs`. If the metadata directory happens to be one
of those, it may also contain user partitions, so the controller will
know about it. If it is a completely separate log dir, it cannot hold
user partitions, so there's no need to include it.
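
To illustrate (paths are hypothetical), in the first case the metadata
directory's UUID is reported to the controller along with the other log
dirs; in the second case it is not:

    # metadata log shares a data directory listed in log.dirs:
    # its UUID is reported and it may also hold user partitions
    log.dirs=/data/d1,/data/d2
    metadata.log.dir=/data/d1

    # metadata log has a dedicated directory outside log.dirs:
    # its UUID is not reported and it cannot hold user partitions
    log.dirs=/data/d1,/data/d2
    metadata.log.dir=/metadata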


42. I'm not sure exactly what you mean by "decommission the disk",
so please let me know if I'm missing your point here.

A disk can be removed from `log.dirs` and removed from the system in a
single broker restart:

  1. Shut down the broker
  2. Unmount and remove the disk
  3. Update `log.dirs` config
  4. Start the broker

Upon restart, the broker will update `directory.ids` in the
`meta.properties` for the remaining configured log dirs.

Log dir identity cannot be inferred from the path, because the same
storage device can be remounted under a different path, so the way we
identify storage directories is by looking at their contents – the
`directory.id` field in their `meta.properties`.
But this also means that a log dir cannot be identified if it is not
available, and so the broker can only generate `directory.ids` if all
log directories listed under `log.dirs` happen to be available.

Consider the following example, where `log.dirs=/a,/b,/c`, and
the following `meta.properties` (non-relevant values omitted):

    # /a/meta.properties
    directory.id=1
    directory.ids=1,2,3

    # /b/meta.properties
    directory.id=2
    directory.ids=1,2,3

    # /c/meta.properties
    directory.id=3
    directory.ids=1,2,3

If `log.dirs` is updated to remove `/c`, the broker will be able
to determine the new value for `directory.ids=1,2` by loading
`/a/meta.properties` and `/b/meta.properties`.
But if either `/a` or `/b` happens to be unavailable, e.g. due to some
temporary disk failure, we cannot determine `directory.ids`. For example,
if `/b` is unavailable, the broker can't tell whether `directory.ids` should be
`1,2`, `1,3`, or even `1,4`.
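
To make the constraint concrete, here is a rough sketch of the regeneration
logic in Java. This is not the actual broker code; the helper
`readDirectoryId` is hypothetical and stands in for parsing a log dir's
`meta.properties`:

    import java.util.*;

    final class DirectoryIdsSketch {
        // directory.ids can only be regenerated if every configured log dir
        // can be read, because an offline dir cannot be identified.
        static Optional<Set<UUID>> regenerateDirectoryIds(List<String> logDirs) {
            Set<UUID> ids = new HashSet<>();
            for (String dir : logDirs) {
                Optional<UUID> id = readDirectoryId(dir);
                if (id.isEmpty()) {
                    return Optional.empty(); // an offline dir: the full set is unknown
                }
                ids.add(id.get());
            }
            return Optional.of(ids); // written back as directory.ids in each meta.properties
        }

        // Hypothetical helper: parse <dir>/meta.properties and return its
        // directory.id, or empty if the directory is unreadable.
        static Optional<UUID> readDirectoryId(String dir) {
            return Optional.empty();
        }
    }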

In a scenario where an operator wishes to remove a log dir from
configuration and some other log dir is also offline, the operator will have
a few options:

  a) Bring the offline log dir back online before restarting the broker.

  b) Edit `meta.properties` to remove the UUID for the deconfigured log dir
  from `directory.ids` in the remaining available log dirs, as shown in the
  example after this list. This removes the need for the broker to regenerate
  `directory.ids`, as the entry counts for `directory.ids` and `log.dirs`
  will be equal.

  c) Also remove the offline log dir from `log.dirs`.
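
Continuing the example above for option (b): if `/c` is removed from
`log.dirs` while `/b` is offline, the operator would hand-edit the
remaining available log dir's `meta.properties` (values illustrative):

    # log.dirs=/a,/b  (/b is temporarily offline)

    # /a/meta.properties, edited manually
    directory.id=1
    directory.ids=1,2

With two entries in `directory.ids` matching the two entries in `log.dirs`,
the broker no longer needs to regenerate `directory.ids` at startup.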


43. If the log dir had already failed at startup then, indeed, the broker
will not know that. But in that case, there's no risk of a race or failure.

What I meant here relates instead to log dir failures at runtime.
I've updated this bit in the KIP to clarify.

When executing the log directory failure handler, the broker knows
which directory failed, which partitions resided there, and it can
check if any of those newly failed partitions refer to a different
log dir in the cluster metadata.

The assignment should be correct for all of them, as the broker
will be proactive in notifying the controller of any changes in
log dir assignment. But in case of some race condition, the broker
should nudge the controller to deal with the incorrectly
assigned partitions.
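
A rough sketch of that check, in the same illustrative Java style as above
(not the actual handler; the parameters are hypothetical stand-ins for the
broker's local view and the cluster metadata):

    import java.util.*;

    final class DirFailureCheckSketch {
        // On a runtime log dir failure, find partitions that resided in the
        // failed dir but whose cluster metadata assignment points elsewhere
        // (or is missing), so the broker can nudge the controller by
        // re-assigning them to the failed dir's UUID.
        static List<String> mismatchedPartitions(
                UUID failedDirId,
                Set<String> partitionsInFailedDir,
                Map<String, UUID> metadataAssignment) {
            List<String> mismatched = new ArrayList<>();
            for (String partition : partitionsInFailedDir) {
                UUID assigned = metadataAssignment.get(partition);
                if (!failedDirId.equals(assigned)) {
                    mismatched.add(partition);
                }
            }
            return mismatched;
        }
    }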


44. Tom Bentley and I have discussed this previously in this thread,
in emails dated Jan 10, 13, 23 and Feb 3.

When upgrading a JBOD-enabled ZK cluster, we could piggyback on the ZK to
KRaft upgrade (as per KIP-866) and delay bumping `meta.properties` until the
ZK->KRaft upgrade is finalized. After that point, we do not support downgrading.

But I'm not convinced we should do this, since there's another upgrade
scenario – when the change proposed in this KIP is applied to a KRaft
cluster that does not yet support JBOD.
In that scenario there is no multi-step process with a step that is
considered final, and I'm not sure we'd want to introduce an
additional step – making the upgrade process more complex – just to
address this issue either.

I think the best approach is to keep using `version=1` in `meta.properties`.
The new properties introduced in this KIP will safely be ignored by previous
versions, in either ZK or KRaft mode, and we avoid creating conflicts with
unexpected declared versions. The presence of the extra fields is also
innocuous in the case of a second upgrade following a downgrade.
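
For illustration, a `meta.properties` file following this approach would
look something like the following (values illustrative, other properties
omitted); older broker versions simply ignore the two unknown keys:

    # meta.properties
    version=1
    directory.id=1
    directory.ids=1,2,3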

I've updated the KIP to reflect this. Let me know what you think.


Best,

--
Igor



Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Igor,

Thanks for the reply. A few more replies and comments.

31. Thanks for the explanation. This looks good to me then.

35. Yes, you are right.

36. Yes, this seems fine since the KRaft controller allows broker requests
before it's unfenced.

37. Yes, it's probably simpler without the optimization for the case with a
single log dir.

41. "If log directory that holds the cluster metadata topic is configured
separately to a different path — using metadata.log.dir — then the
respective UUID for this log directory is excluded from both online and
offline sets, as the broker cannot run if this particular log directory is
unavailable." Hmm, there could be user partitions on metadata.log.dir. So
the controller needs to know the disk UUID for that dir, right? Otherwise,
the controller will keep setting those replicas' disk IDs to zero.

42. "If an entry is removed from log.dirs  the broker can also
automatically update directory.ids as long as no log directories are
offline when the broker comes back up." This makes the removal of a disk
less convenient than before since one can't just remove the disk from
log.dirs and decommission the disk in a single step.

43. "If the broker knows that the partition already exists in a different
log directory that is now offline" The broker may not know that since
currently we remove the offline partitions from LogManager.

44. "When configured for the migration and while still in ZK mode, brokers
will: update meta.properties to generate and include directory.id  and
directory.ids;" I guess there is no downgrade after this point? In KIP-866,
the meta.properties is only changed after the upgrade is finalized.

Thanks,

Jun

On Mon, Feb 6, 2023 at 10:15 AM Igor Soarez <so...@apple.com.invalid>
wrote:

> Hi David,
>
> Thank you for your suggestions and for having a look at this KIP.
>
> 1. Yes, that should be OK. I have updated the section
> "Migrating a cluster in ZK mode running with JBOD" to reflect this.
>
> 2. I've updated the motivation section to state that.
>
> Best,
>
> --
> Igor
>
>
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi David,

Thank you for your suggestions and for having a look at this KIP.

1. Yes, that should be OK. I have updated the section
"Migrating a cluster in ZK mode running with JBOD" to reflect this.

2. I've updated the motivation section to state that.

Best,

--
Igor