Posted to dev@kafka.apache.org by Colin McCabe <cm...@apache.org> on 2021/12/02 22:52:35 UTC

Re: [DISCUSS] KIP-785 Automatic storage formatting

Hi Igor,

It is common for databases, filesystems, and other similar programs to require a formatting step before they are used. For example, postgres requires you to run initdb. Linux requires you to run mkfs before using a filesystem. Windows requires you to run "format c:/", or something equivalent. Ceph requires you to run the ceph-deploy tool or a similar tool. It's really not a high operational burden because it only has to be done once when the system is initialized.

With a clearly defined initialization step, you can clearly distinguish disk problems from simply the first startup of a cluster. This is actually quite important to the correctness of the system. For example, if I start up two out of three Raft nodes and their disks erroneously show up as blank, I could elect a leader with an empty log. In that case, I've silently lost all the metadata in the system.
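
To make the failure mode above concrete, here is a minimal Python sketch of a simplified Raft-style vote. It is not KRaft's actual election code: `Node`, `grants_vote`, and `wins_election` are hypothetical names, and the only rule modeled is that a voter grants its vote when the candidate's log is at least as up to date as its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    last_epoch: int   # epoch of the last log entry; 0 for an empty log
    last_offset: int  # offset of the last log entry; 0 for an empty log


def grants_vote(voter: Node, candidate: Node) -> bool:
    """A voter only backs a candidate whose log is at least as up to date."""
    return (candidate.last_epoch, candidate.last_offset) >= (
        voter.last_epoch,
        voter.last_offset,
    )


def wins_election(candidate: Node, online: list[Node], cluster_size: int) -> bool:
    """The candidate votes for itself and needs a strict majority of the cluster."""
    votes = sum(
        1 for v in online if v.name == candidate.name or grants_vote(v, candidate)
    )
    return votes > cluster_size // 2


# Node "a" holds the real metadata log but is offline.
a = Node("a", last_epoch=5, last_offset=100)
# Nodes "b" and "c" come up with disks that wrongly appear blank.
b = Node("b", last_epoch=0, last_offset=0)
c = Node("c", last_epoch=0, last_offset=0)

blank_quorum_elects_b = wins_election(b, online=[b, c], cluster_size=3)
a_would_vote_for_b = grants_vote(a, b)
```

Under these assumptions `blank_quorum_elects_b` is true while `a_would_vote_for_b` is false: the two blank nodes form a majority on their own, and the one node that still has the log never gets a say.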

In general, there is a bootstrapping problem where brokers may not be able to connect to the controller quorum without first having some local metadata. For example, if you are managing users using SCRAM, the SCRAM principal for the broker needs to exist before the connection can be made. We call this "bootstrapping" because it requires you to "lift yourself up by your own bootstraps." You need the metadata to fetch the metadata. The explicit initialization step breaks the cycle and allows the cluster to be successfully created.

I agree that in testing, it is nice not to have to run a separate command. To facilitate this, we could have a bash script that allows developers to start up a single-node cluster without running kafka-storage.sh themselves. That might be helpful. I suppose a Docker image is another way to do it, which might also help people test.
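
As a rough illustration of what such a developer helper might wrap, the Python sketch below only builds the two command lines: the existing `kafka-storage.sh format` step followed by `kafka-server-start.sh`. The function name `single_node_commands`, the install location, the config path, and the placeholder cluster ID are all assumptions for illustration, not part of any proposal in this thread.

```python
from pathlib import Path


def single_node_commands(kafka_home: str, config: str, cluster_id: str) -> list[list[str]]:
    """Return the argv lists to format storage and then start one KRaft node."""
    bin_dir = Path(kafka_home) / "bin"
    return [
        # Explicit formatting step for the configured log directories.
        [str(bin_dir / "kafka-storage.sh"), "format", "-t", cluster_id, "-c", config],
        # Start the server against the same configuration file.
        [str(bin_dir / "kafka-server-start.sh"), config],
    ]


commands = single_node_commands(
    kafka_home="/opt/kafka",                   # assumed install location
    config="config/kraft/server.properties",   # assumed config path
    cluster_id="REPLACE_WITH_GENERATED_UUID",  # e.g. from `kafka-storage.sh random-uuid`
)
```

Each entry could then be handed to `subprocess.run(cmd, check=True)` in order, so the developer runs one helper instead of two separate commands.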

best,
Colin


On Mon, Nov 29, 2021, at 12:20, Igor Soarez wrote:
> Hi all,
>
> Bumping this thread as it’s been a while.
>
> Looking forward to any kind of feedback, please take a look.
>
> I created a short PR with a possible implementation - 
> https://github.com/apache/kafka/pull/11549
>
> --
> Igor
>
>
>
>> On 18 Oct 2021, at 15:11, Igor Soarez <so...@apple.com.INVALID> wrote:
>> 
>> Hi all,
>> 
>> I'd like to propose that we simplify the operation of KRaft servers a bit by removing the requirement to run kafka-storage.sh for new storage directories.
>> 
>> Please take a look at the KIP and provide your feedback:
>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-785%3A+Automatic+storage+formatting
>> 
>> --
>> Igor
>>

Re: [DISCUSS] KIP-785 Automatic storage formatting

Posted by Colin McCabe <cm...@apache.org>.
Hi Igor,

Disk failures have always required manual intervention in Kafka. For example, if you are using a RAID array and a disk goes bad, you will need to remove the disk, insert a new one, and start the RAID rebuild process. Kafka can't help with this process since we don't operate at this hardware level.

In the case where we are using JBOD, you would need to remove the bad disk, add a new good disk, reformat, and then restart the broker. The reformatting stage is the quickest stage and doesn't add a lot of overhead.

It is generally a very bad idea to re-add a bad disk to a node. The behavior of a failing disk is usually pathological. In addition to not persisting the data, you often get very slow operations, kernel errors, and unpredictable behavior. Certainly automatically re-adding a potentially bad disk doesn't do anyone any favors.

best,
Colin

On Thu, Dec 9, 2021, at 14:35, Igor Soarez wrote:
> Hi Colin,
>
> Thank you for your kind and thoughtful reply.
>
> Thank you also for clarifying why it is important to distinguish 
> between disk problems and first boot for a log directory. I completely 
> agree that losing all metadata is a very serious issue and we should 
> strive to make that as unlikely to happen as possible.
>
> Currently, the storage format step is simply ensuring each log 
> directory exists and creating a meta.properties file with clusterId and 
> nodeId in each configured log directory. The nodeId is already a 
> configuration property and clusterId is being proposed in this KIP as a 
> new one. The bootstrapping information generated by the format step can 
> optionally be made redundant. So if I understand correctly, in the 
> scenario you describe of when disks "erroneously show up as blank", 
> when the KafkaRaftServer starts, we are relying on the existence of 
> this file to prevent disaster and halt the system until there is manual 
> intervention.
>
> Currently, all the log directories must be formatted - not just the 
> metadata directory - that is, all log directories must contain 
> `meta.properties`. This is validated in 
> BrokerMetadataCheckpoint.getBrokerMetadataAndOfflineDirs. The 
> validation that this file exists *in every log directory* is only done 
> when the “controller” role is in effect, and that includes “broker, 
> controller”. This means we currently require the external storage 
> format step to run when a non-metadata disk is replaced, which just 
> seems unnecessary.
>
> Many of the ways disks fail do not enable this scenario where data is 
> lost. The disk might be unmounted, become read-only, or otherwise 
> generate IO failures. In any of these cases, an automatic step to 
> format the log directory would also fail and prevent an amnesiac 
> metadata quorum. To risk data loss in the scenario you describe we need 
> the disk to be available and usable but also blank. I can think of 
> use-cases here where this isn't a concern, such as a) a platform where 
> disks are slow to be repaired and replaced or b) if the controller 
> group is large enough to make simultaneous disk failure in a quorum 
> highly unlikely. In such cases, a non-default option to disable this 
> meta.properties pre-existence guard can have a net positive value.
>
> I am aware of the similar initialization steps for other systems. I’m 
> however having some difficulty envisioning always requiring manual 
> intervention upon disk failure in general as a desirable solution in 
> Kafka. Not having an automated way to deal with unformatted log 
> directories means that an operator then needs to intervene and run this 
> command before the instance is operational again. Unless it's actually 
> protecting the user, Kafka shouldn't be any more difficult to use than 
> necessary.
>
> Please, let me know your thoughts on this.
>
> Best,
>
> --
> Igor

Re: [DISCUSS] KIP-785 Automatic storage formatting

Posted by Igor Soarez <so...@apple.com.INVALID>.
Hi Colin,

Thank you for your kind and thoughtful reply.

Thank you also for clarifying why it is important to distinguish between disk problems and first boot for a log directory. I completely agree that losing all metadata is a very serious issue and we should strive to make that as unlikely to happen as possible.

Currently, the storage format step is simply ensuring each log directory exists and creating a meta.properties file with clusterId and nodeId in each configured log directory. The nodeId is already a configuration property and clusterId is being proposed in this KIP as a new one. The bootstrapping information generated by the format step can optionally be made redundant. So if I understand correctly, in the scenario you describe of when disks "erroneously show up as blank", when the KafkaRaftServer starts, we are relying on the existence of this file to prevent disaster and halt the system until there is manual intervention.
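
The format step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind kafka-storage.sh: the function names are hypothetical, and the `cluster.id` and `node.id` key names are assumed for the example.

```python
from pathlib import Path

META_FILE = "meta.properties"


def format_log_dirs(log_dirs: list[str], cluster_id: str, node_id: int) -> None:
    """Create each log directory if needed and write its meta.properties."""
    for log_dir in log_dirs:
        path = Path(log_dir)
        path.mkdir(parents=True, exist_ok=True)
        # Key names are assumed for illustration.
        (path / META_FILE).write_text(f"cluster.id={cluster_id}\nnode.id={node_id}\n")


def unformatted_dirs(log_dirs: list[str]) -> list[str]:
    """Return the log directories that lack a meta.properties file."""
    return [d for d in log_dirs if not (Path(d) / META_FILE).is_file()]
```

A server that halts whenever `unformatted_dirs(...)` is non-empty is relying on exactly the file-existence check discussed here.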

Currently, all the log directories must be formatted - not just the metadata directory - that is, all log directories must contain `meta.properties`. This is validated in BrokerMetadataCheckpoint.getBrokerMetadataAndOfflineDirs. The validation that this file exists *in every log directory* is only done when the “controller” role is in effect, and that includes “broker, controller”. This means we currently require the external storage format step to run when a non-metadata disk is replaced, which just seems unnecessary.

Many of the ways disks fail do not enable this scenario where data is lost. The disk might be unmounted, become read-only, or otherwise generate IO failures. In any of these cases, an automatic step to format the log directory would also fail and prevent an amnesiac metadata quorum. To risk data loss in the scenario you describe we need the disk to be available and usable but also blank. I can think of use-cases here where this isn't a concern, such as a) a platform where disks are slow to be repaired and replaced or b) if the controller group is large enough to make simultaneous disk failure in a quorum highly unlikely. In such cases, a non-default option to disable this meta.properties pre-existence guard can have a net positive value.
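
A small Python sketch of the guard and the opt-in behaviour discussed above, under stated assumptions: `check_log_dir`, `UnformattedLogDirError`, and the `auto_format` flag are hypothetical names, and the file contents are passed in by the caller. It shows that a directory which cannot be written fails during automatic formatting, so only a usable-but-blank directory gets through.

```python
from pathlib import Path

META_FILE = "meta.properties"


class UnformattedLogDirError(RuntimeError):
    """Raised when a log directory has no meta.properties and the guard is on."""


def check_log_dir(log_dir: str, contents: str, auto_format: bool = False) -> str:
    """Return "formatted" or "auto-formatted"; raise if the directory is unusable."""
    path = Path(log_dir)
    meta = path / META_FILE
    if meta.is_file():
        return "formatted"
    if not auto_format:
        # Default guard: a blank directory halts startup for manual inspection.
        raise UnformattedLogDirError(f"{log_dir} has no {META_FILE}")
    # A read-only filesystem or other I/O fault surfaces here as OSError,
    # so automatic formatting fails instead of silently proceeding.
    path.mkdir(parents=True, exist_ok=True)
    meta.write_text(contents)
    return "auto-formatted"
```

With `auto_format=False` this behaves like the current requirement; turning it on is the non-default trade-off, accepting the blank-disk risk in exchange for not needing an operator.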

I am aware of the similar initialization steps for other systems. I’m however having some difficulty envisioning always requiring manual intervention upon disk failure in general as a desirable solution in Kafka. Not having an automated way to deal with unformatted log directories means that an operator then needs to intervene and run this command before the instance is operational again. Unless it's actually protecting the user, Kafka shouldn't be any more difficult to use than necessary.

Please, let me know your thoughts on this.

Best,

--
Igor

