Posted to dev@kafka.apache.org by deng ziming <de...@gmail.com> on 2022/09/01 09:04:59 UTC

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Hi Igor,

I think this KIP can solve the current problems. I have a few questions relating to the migration section.

Since we have bumped the broker RPC version and the metadata record version, there will be some compatibility problems between brokers/controllers of different versions. In ZK mode we use the IBP as a flag to help solve this; in KRaft mode we use a feature flag (metadata.version) to decide whether to use the new RPCs/metadata records.

Assuming that we are upgrading from 3.3 to 3.4: at first the finalized metadata.version is 3.3, so brokers will use version 1 of `BrokerRegistrationRequest`, which contains no `OfflineLogDirectories`. Once the finalized metadata.version becomes 3.4, brokers will not send another `BrokerRegistrationRequest` unless we restart them, so the controllers cannot learn the `OfflineLogDirectories` of each broker. We should therefore reconsider Jason's suggestion to use `BrokerHeartbeatRequest` to communicate `OfflineLogDirectories`.

Of course this problem can be solved through a double roll (restarting every broker twice when upgrading), but we should try to avoid it since we now have the feature flag.

One solution is to include `OfflineLogDirectories` in both the `BrokerRegistrationRequest` and `BrokerHeartbeatRequest` requests: when upgrading from 3.3 to 3.4 the `BrokerRegistrationRequest.OfflineLogDirectories` field would be empty, whereas when upgrading from 3.4 to 3.5 it would not be. And maybe we can also remove the `LogDirectoriesOfflineRequest` you proposed in this KIP.
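
To make the dual-field idea concrete, here is a rough sketch of how the controller could accept the offline directory list from either request during the upgrade window (the class and method names below are made up for illustration and are not from the KIP):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.UUID;

    // Illustrative only: the controller keeps the latest known offline log
    // directories per broker, regardless of which request carried them.
    final class OfflineLogDirTracker {

        private final Map<Integer, Set<UUID>> offlineDirsByBroker = new HashMap<>();

        // A new-enough BrokerRegistrationRequest may carry the field...
        void onRegistration(int brokerId, Set<UUID> offlineLogDirectories) {
            offlineDirsByBroker.put(brokerId, new HashSet<>(offlineLogDirectories));
        }

        // ...but during an upgrade the registration may predate the version bump,
        // so the heartbeat is the first place the information can appear.
        void onHeartbeat(int brokerId, Set<UUID> offlineLogDirectories) {
            offlineDirsByBroker
                .computeIfAbsent(brokerId, id -> new HashSet<>())
                .addAll(offlineLogDirectories);
        }

        Set<UUID> offlineDirs(int brokerId) {
            return offlineDirsByBroker.getOrDefault(brokerId, Set.of());
        }
    }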

--
Best,
Ziming


> On Aug 18, 2022, at 11:24 PM, Igor Soarez <i...@soarez.me> wrote:
> 
> Hi Ziming,
> 
> I'm sorry it took me a while to reply.
> 
> Thank you for having a look at this KIP and providing feedback.
> 
>> 1. We have a version field in meta.properties, currently it’s 1, and we can
>> set it to 2 in this KIP, and we can give an example of server.properties and
>> it’s corresponding meta.properties generated by the storage command tool.
> 
>> 2. When using storage format tool we should specify cluster-id, it will be an
>> arduous work if we should manually specify directory-ids for all log dirs,
>> I think you can make it more clear about this change that the directory-ids
>> are generated automatically.
> 
> Thank you for these suggestions. I've updated the KIP to:
> 
> * include an example
> * clarify that the version field would be changed to 2
> * clarify that the log directory UUIDs are automatically generated
> 
>> 3. When the controller places a replica, it will select a broker as well as a
>> log directory; the latter is currently accomplished on the broker side, so
>> will this be a big change?
> 
> I think there can be benefits, as Jason described previously, if we change how
> log directories are assigned as follow-up work.
> 
> From a codebase surface area perspective, it is definitely a big change
> because there are many models, types and interfaces that assume replicas are
> identified solely by a broker id.
> That will have to change to include the directory UUID, many lines of code will
> be affected.
> 
> But in terms of behavior it shouldn't be a big change at all. Brokers currently
> select the log directory with the least logs in it. This isn't a very nice
> policy, as logs can have wildly different sizes, and log directories can have
> different capacities. But it is a policy that the controller can keep.
> 
> If we decide to extend the selection policy and keep it in the broker,
> the broker will continue to be able to override the controller's selection
> of log directory, using the `AssignReplicasToDirectories` RPC.
> 
>> 4. When handling log directory failures we will update Leader and ISR using
>> the existing replica state machine, what is this state machine referring to,
>> do you mean the AlterPartition RPC?
> 
> No, I mean that it will use the same rules and mechanism
> (`PartitionChangeBuilder`) that is used when a broker is fenced, shutdown or
> unregistered.
> 
> I think maybe the term "replica state machine" isn't the most appropriate one.
> I've updated the KIP to rephrase this section.
> 
> Thanks,
> 
> --
> Igor


Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Alexandre,

Thank you for having a look at this KIP, and thank you for pointing this out.

I like the idea of expanding the health status of a log directory beyond
just online/offline status.

This KIP currently proposes a single logdir state transition, from
online to offline, conveyed in a list of logdir UUIDs sent in the new
field `LogDirsOfflined` as part of the broker heartbeat request.

It's nice that the request schema itself doesn't allow a heartbeat
request to convey a state transition for a logdir from offline to online,
as that transition is not (currently) valid: brokers need to be
restarted for logdirs to be allowed to come back online.

We could make changes now to accommodate further logdir states in the
future, by instead conveying state and logdir pairs in the heartbeat
request, for any logdir which had a state change.
But right now, that would look a bit strange since there's only one
state we'd allow to be represented in the request – offline.
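
For illustration only (the names below are made up and not part of the
KIP), such a generalized field would carry pairs along these lines,
rather than a plain list of UUIDs:

    import java.util.List;
    import java.util.UUID;

    // Illustrative only: a generalized per-logdir state change entry. The
    // current proposal would only ever need the OFFLINE case.
    enum LogDirState { OFFLINE }

    record LogDirStateChange(UUID logDirId, LogDirState newState) { }

    record HeartbeatLogDirChanges(List<LogDirStateChange> changes) { }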

Since creating new request versions/schemas is relatively easy now, and
since this logdir QoS feature would merit a KIP anyway, I'm a bit more
inclined to keep things simple for now.

Best,

--
Igor


Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Alexandre Dupriez <al...@gmail.com>.
Hi, Igor,

Thanks for the excellent, thorough and very comprehensive KIP.

This is not directly in the scope of the KIP, but related to it: I have
the following question about potential future work on disk
degradation.

Today, what is characterised as a disk failure in Kafka is an I/O
exception surfaced by the JDK libraries. There are other types of
(more or less) soft failures where a disk (or the system behind its
abstraction) remains available, but experiences degradation, typically
in the form of elevated I/O latency. Currently, Kafka is not made
aware of the “health” of a disk. It may be useful to let Kafka know
about the QoS of its disks so that it can take actions which could
improve availability, e.g. via leader movements.

The KIP builds upon the existing concepts of online and offline states
for log directories, and the propagation of a disk failure via the
broker heartbeat and registration relies on the offline(d) directories
list. I wonder if it could make sense to extend the definition of
state of a log directory beyond online/offline to be able to refer to
disk degradation. In which case, the new fields added to the broker
heartbeat and registration requests may be the place where this
alternative state can also be conveyed. Perhaps the changes to the
RPCs could be designed to accommodate this new type of semantic in the
future.
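
To make the idea concrete (purely illustrative, not a proposal for
specific names), the per-directory state could eventually become
something like:

    // Purely illustrative: a log directory health state that the new RPC
    // fields could be designed to accommodate later on.
    enum LogDirHealth {
        ONLINE,    // fully healthy
        DEGRADED,  // still available, but e.g. elevated I/O latency;
                   // a candidate for leader movement
        OFFLINE    // I/O failure, as today
    }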

What do you think?

Thanks,
Alexandre

On Wed, Apr 26, 2023 at 14:05, Igor Soarez <so...@apple.com.invalid> wrote:
>
> Thank you for another review Ziming, much appreciated!
>
> 1. and 2. You are correct, it would be a big and perhaps strange difference.
> Since our last exchange of emails, the proposal has changed and now it
> does follow your suggestion to bump metadata.version.
> The KIP mentions it under "Compatibility, Deprecation, and Migration Plan".
>
> 3. I tried to describe this under "Controller", under the heading
> "Handling replica assignments", but perhaps it could be improved.
> Let me know what you think.
>
> Best,
>
> --
> Igor
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <so...@apple.com.INVALID>.
Thank you for another review Ziming, much appreciated!

1. and 2. You are correct, it would be a big and perhaps strange difference.
Since our last exchange of emails, the proposal has changed and now it
does follow your suggestion to bump metadata.version.
The KIP mentions it under "Compatibility, Deprecation, and Migration Plan".

3. I tried to describe this under "Controller", under the heading
"Handling replica assignments", but perhaps it could be improved.
Let me know what you think.

Best,

--
Igor


Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by ziming deng <de...@gmail.com>.
Thank you for the continued work. I have some small questions related to the implementation details:

1. We have decided to add a new metadata.version, and you have said that "We should also avoid gating on metadata.version to include the `OnlineLogDirectories` field in the broker registration, the existing versioning in the message schemas should suffice to guarantee compatibility". I think this is a big difference from what we have done in past designs, so it's better to add this point to the "Compatibility, Deprecation, and Migration Plan" section.
2. Related to (1), an important thing about the migration is that we should upgrade controllers before upgrading brokers, since a broker will only send the `BrokerRegistrationRequest` once, it should use the new format, and the controller should recognize that format. This is difficult in co-resident mode, so you should point out that we only support distributed mode, similar to KIP-866.
3. What will happen on receiving an `AssignReplicasToDirsRequest`? This should be described in the "Controller" section, for example, generating a `PartitionChangeRecord` or returning an error code (a rough sketch of the kind of handling I mean is below).
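
Just to illustrate point 3 (all names and error choices below are made up for the example; they are not from the KIP or the existing controller code):

    import java.util.UUID;

    // Illustrative only: validate the requested assignment against controller
    // state, then either generate a metadata record or return an error.
    final class AssignReplicasToDirsHandler {

        // Returns an error name for this partition, or null if a record was generated.
        String handle(int brokerId, UUID dirId, String topic, int partition) {
            if (!brokerIsRegistered(brokerId)) {
                return "STALE_BROKER_EPOCH";     // illustrative choice of error
            }
            if (!dirBelongsToBroker(brokerId, dirId)) {
                return "LOG_DIR_NOT_FOUND";      // illustrative choice of error
            }
            generatePartitionChangeRecord(topic, partition, brokerId, dirId);
            return null;
        }

        // Stand-ins for real controller state and record generation.
        private boolean brokerIsRegistered(int brokerId) { return true; }
        private boolean dirBelongsToBroker(int brokerId, UUID dirId) { return true; }
        private void generatePartitionChangeRecord(String topic, int partition,
                                                   int brokerId, UUID dirId) { }
    }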

Apart from these points, this KIP LGTM.
--
Best,
Ziming


> On Sep 28, 2022, at 04:59, Igor Soarez <i...@soarez.me> wrote:
> 
> Hi Ziming,
> 
> Thank you for having another look at this KIP, and please accept my apologies for my delay in replying.
> 
> The migration introduces JBOD support, so before the migration there should only be one log directory configured per broker. This assumption simplifies how the controller learns about the initial log directory placement for existing partitions. The first broker registration referencing log directories must declare a single log directory, which must be online, as brokers cannot start without any online log dirs. If there are no previous log dirs registered for this broker, the controller assigns all existing partitions on the broker to this single directory.
> 
> Since we assume a single log dir per broker as a migration starting point and brokers cannot start without any online log directories, the `OfflineLogDirectories` field simply cannot be used in the first registration after the upgrade. However, the issue you described applies to `OnlineLogDirectories`. If use of this field is gated by metadata.version then we do require a double roll for brokers to make their first registration with a log directory.
> 
> As you describe, we could move the list of online log directories from the broker registration request to the heartbeat request, but I believe we shouldn't do this because:
>  a) The intention is to address a transient problem — the migration into the feature, which should happen only once — by accepting a long term commitment as to how the online log dirs are signaled.
>  b) While the log directory status can change at any point from online to offline, the list of existing log directories is static for the broker's lifetime, so it makes little sense to send this information over and over again to the controller.
> 
> We should also avoid gating on metadata.version to include the `OnlineLogDirectories` field in the broker registration. If the controllers are upgraded first, the existing versioning in the message schemas should suffice to guarantee compatibility. After the controllers are upgraded, the brokers can then be upgraded and each of them can register their single online log directory from the first instantiation.
> 
> Please, let me know if this makes sense and any other thoughts you might have.
> 
> Best,
> 
> --
> Igor
> 
> On Thu, Sep 1, 2022, at 2:04 AM, deng ziming wrote:
>> Hi Igor,
>> 
>> I think this KIP can solve the current problems. I have a few questions
>> relating to the migration section.
>> 
>> Since we have bumped the broker RPC version and the metadata record version,
>> there will be some compatibility problems between brokers/controllers of
>> different versions. In ZK mode we use the IBP as a flag to help solve this;
>> in KRaft mode we use a feature flag (metadata.version) to decide whether to
>> use the new RPCs/metadata records.
>> 
>> Assuming that we are upgrading from 3.3 to 3.4: at first the finalized
>> metadata.version is 3.3, so brokers will use version 1 of
>> `BrokerRegistrationRequest`, which contains no `OfflineLogDirectories`. Once
>> the finalized metadata.version becomes 3.4, brokers will not send another
>> `BrokerRegistrationRequest` unless we restart them, so the controllers cannot
>> learn the `OfflineLogDirectories` of each broker. We should therefore
>> reconsider Jason's suggestion to use `BrokerHeartbeatRequest` to communicate
>> `OfflineLogDirectories`.
>> 
>> Of course this problem can be solved through a double roll (restarting every
>> broker twice when upgrading), but we should try to avoid it since we now have
>> the feature flag.
>> 
>> One solution is to include `OfflineLogDirectories` in both the
>> `BrokerRegistrationRequest` and `BrokerHeartbeatRequest` requests: when
>> upgrading from 3.3 to 3.4 the `BrokerRegistrationRequest.OfflineLogDirectories`
>> field would be empty, whereas when upgrading from 3.4 to 3.5 it would not be.
>> And maybe we can also remove the `LogDirectoriesOfflineRequest` you proposed
>> in this KIP.
>> 
>> --
>> Best,
>> Ziming
>> 
>> 
>>> On Aug 18, 2022, at 11:24 PM, Igor Soarez <i...@soarez.me> wrote:
>>> 
>>> Hi Ziming,
>>> 
>>> I'm sorry it took me a while to reply.
>>> 
>>> Thank you for having a look at this KIP and providing feedback.
>>> 
>>>> 1. We have a version field in meta.properties, currently it’s 1, and we can
>>>> set it to 2 in this KIP, and we can give an example of server.properties and
>>>> it’s corresponding meta.properties generated by the storage command tool.
>>> 
>>>> 2. When using storage format tool we should specify cluster-id, it will be an
>>>> arduous work if we should manually specify directory-ids for all log dirs,
>>>> I think you can make it more clear about this change that the directory-ids
>>>> are generated automatically.
>>> 
>>> Thank you for these suggestions. I've updated the KIP to:
>>> 
>>> * include an example
>>> * clarify that the version field would be changed to 2
>>> * clarify that the log directory UUIDs are automatically generated
>>> 
>>>> 3. When the controller places a replica, it will select a broker as well as
>>>> a log directory; the latter is currently accomplished on the broker side, so
>>>> will this be a big change?
>>> 
>>> I think there can be benefits, as Jason described previously, if we change how
>>> log directories are assigned as follow-up work.
>>> 
>>> From a codebase surface area perspective, it is definitely a big change
>>> because there are many models, types and interfaces that assume replicas are
>>> identified solely by a broker id.
>>> That will have to change to include the directory UUID, many lines of code will
>>> be affected.
>>> 
>>> But in terms of behavior it shouldn't be a big change at all. Brokers currently
>>> select the log directory with the least logs in it. This isn't a very nice
>>> policy, as logs can have wildly different sizes, and log directories can have
>>> different capacities. But it is a policy that the controller can keep.
>>> 
>>> If we decide to extend the selection policy and keep it in the broker,
>>> the broker will continue to be able to override the controller's selection
>>> of log directory, using the `AssignReplicasToDirectories` RPC.
>>> 
>>>> 4. When handling log directory failures we will update Leader and ISR using
>>>> the existing replica state machine, what is this state machine referring to,
>>>> do you mean the AlterPartition RPC?
>>> 
>>> No, I mean that it will use the same rules and mechanism
>>> (`PartitionChangeBuilder`) that is used when a broker is fenced, shutdown or
>>> unregistered.
>>> 
>>> I think maybe the term "replica state machine" isn't the most appropriate one.
>>> I've updated the KIP to rephrase this section.
>>> 
>>> Thanks,
>>> 
>>> --
>>> Igor


Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Posted by Igor Soarez <i...@soarez.me>.
Hi Ziming,

Thank you for having another look at this KIP, and please accept my apologies for my delay in replying.

The migration introduces JBOD support, so before the migration there should only be one log directory configured per broker. This assumption simplifies how the controller learns about the initial log directory placement for existing partitions. The first broker registration referencing log directories must declare a single log directory, which must be online, as brokers cannot start without any online log dirs. If there are no previous log dirs registered for this broker, the controller assigns all existing partitions on the broker to this single directory.
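
To sketch that rule (the names below are illustrative only and not from the actual implementation):

    import java.util.List;
    import java.util.UUID;

    // Illustrative only: on the first registration that references log
    // directories, exactly one (online) directory is expected, and every
    // partition already hosted on the broker is assigned to it.
    final class MigrationDirAssignment {

        record TopicPartition(String topic, int partition) { }

        void onFirstRegistrationWithDirs(int brokerId,
                                         List<UUID> onlineLogDirs,
                                         List<TopicPartition> partitionsOnBroker) {
            if (onlineLogDirs.size() != 1) {
                throw new IllegalStateException(
                    "The migration expects exactly one online log directory");
            }
            UUID onlyDir = onlineLogDirs.get(0);
            for (TopicPartition tp : partitionsOnBroker) {
                assignReplicaToDir(tp, brokerId, onlyDir); // e.g. via a metadata record
            }
        }

        private void assignReplicaToDir(TopicPartition tp, int brokerId, UUID dir) { }
    }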

Since we assume a single log dir per broker as a migration starting point and brokers cannot start without any online log directories, the `OfflineLogDirectories` field simply cannot be used in the first registration after the upgrade. However, the issue you described applies to `OnlineLogDirectories`. If use of this field is gated by metadata.version then we do require a double roll for brokers to make their first registration with a log directory.

As you describe, we could move the list of online log directories from the broker registration request to the heartbeat request, but I believe we shouldn't do this because:
  a) The intention is to address a transient problem — the migration into the feature, which should happen only once — by accepting a long term commitment as to how the online log dirs are signaled.
  b) While the log directory status can change at any point from online to offline, the list of existing log directories is static for the broker's lifetime, so it makes little sense to send this information over and over again to the controller.

We should also avoid gating on metadata.version to include the `OnlineLogDirectories` field in the broker registration. If the controllers are upgraded first, the existing versioning in the message schemas should suffice to guarantee compatibility. After the controllers are upgraded, the brokers can then be upgraded and each of them can register their single online log directory from the first instantiation.

Please, let me know if this makes sense and any other thoughts you might have.

Best,

--
Igor

On Thu, Sep 1, 2022, at 2:04 AM, deng ziming wrote:
> Hi Igor,
>
> I think this KIP can solve the current problems. I have a few questions
> relating to the migration section.
>
> Since we have bumped the broker RPC version and the metadata record version,
> there will be some compatibility problems between brokers/controllers of
> different versions. In ZK mode we use the IBP as a flag to help solve this;
> in KRaft mode we use a feature flag (metadata.version) to decide whether to
> use the new RPCs/metadata records.
>
> Assuming that we are upgrading from 3.3 to 3.4: at first the finalized
> metadata.version is 3.3, so brokers will use version 1 of
> `BrokerRegistrationRequest`, which contains no `OfflineLogDirectories`. Once
> the finalized metadata.version becomes 3.4, brokers will not send another
> `BrokerRegistrationRequest` unless we restart them, so the controllers cannot
> learn the `OfflineLogDirectories` of each broker. We should therefore
> reconsider Jason's suggestion to use `BrokerHeartbeatRequest` to communicate
> `OfflineLogDirectories`.
>
> Of course this problem can be solved through a double roll (restarting every
> broker twice when upgrading), but we should try to avoid it since we now have
> the feature flag.
>
> One solution is to include `OfflineLogDirectories` in both the
> `BrokerRegistrationRequest` and `BrokerHeartbeatRequest` requests: when
> upgrading from 3.3 to 3.4 the `BrokerRegistrationRequest.OfflineLogDirectories`
> field would be empty, whereas when upgrading from 3.4 to 3.5 it would not be.
> And maybe we can also remove the `LogDirectoriesOfflineRequest` you proposed
> in this KIP.
>
> --
> Best,
> Ziming
>
>
>> On Aug 18, 2022, at 11:24 PM, Igor Soarez <i...@soarez.me> wrote:
>> 
>> Hi Ziming,
>> 
>> I'm sorry it took me a while to reply.
>> 
>> Thank you for having a look at this KIP and providing feedback.
>> 
>>> 1. We have a version field in meta.properties, currently it’s 1, and we can
>>> set it to 2 in this KIP, and we can give an example of server.properties and
>>> it’s corresponding meta.properties generated by the storage command tool.
>> 
>>> 2. When using storage format tool we should specify cluster-id, it will be an
>>> arduous work if we should manually specify directory-ids for all log dirs,
>>> I think you can make it more clear about this change that the directory-ids
>>> are generated automatically.
>> 
>> Thank you for these suggestions. I've updated the KIP to:
>> 
>> * include an example
>> * clarify that the version field would be changed to 2
>> * clarify that the log directory UUIDs are automatically generated
>> 
>>> 3. When the controller places a replica, it will select a broker as well as a
>>> log directory; the latter is currently accomplished on the broker side, so
>>> will this be a big change?
>> 
>> I think there can be benefits, as Jason described previously, if we change how
>> log directories are assigned as follow-up work.
>> 
>> From a codebase surface area perspective, it is definitely a big change
>> because there are many models, types and interfaces that assume replicas are
>> identified solely by a broker id.
>> That will have to change to include the directory UUID, many lines of code will
>> be affected.
>> 
>> But in terms of behavior it shouldn't be a big change at all. Brokers currently
>> select the log directory with the least logs in it. This isn't a very nice
>> policy, as logs can have wildly different sizes, and log directories can have
>> different capacities. But it is a policy that the controller can keep.
>> 
>> If we decide to extend the selection policy and keep it in the broker,
>> the broker will continue to be able to override the controller's selection
>> of log directory, using the `AssignReplicasToDirectories` RPC.
>> 
>>> 4. When handling log directory failures we will update Leader and ISR using
>>> the existing replica state machine, what is this state machine referring to,
>>> do you mean the AlterPartition RPC?
>> 
>> No, I mean that it will use the same rules and mechanism
>> (`PartitionChangeBuilder`) that is used when a broker is fenced, shutdown or
>> unregistered.
>> 
>> I think maybe the term "replica state machine" isn't the most appropriate one.
>> I've updated the KIP to rephrase this section.
>> 
>> Thanks,
>> 
>> --
>> Igor