Posted to dev@kafka.apache.org by Luke Chen <sh...@gmail.com> on 2024/04/03 06:59:03 UTC

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Hi Christo,

1. I agree with Doguscan that in KRaft mode, the controller won't send RPCs
to the brokers (except in the migration path).
So, I think we could adopt an approach similar to the one we used for
`AlterReplicaLogDirs` (KIP-858
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft#KIP858:HandleJBODbrokerdiskfailureinKRaft-Intra-brokerreplicamovement>):
let the brokers notify the controller of any update, instead of the
controller notifying the brokers. Once the controller has received
completion requests from all brokers, it enters the "Disabled" state. WDYT?
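
To make this concrete, a minimal sketch of the controller-side bookkeeping
it would imply is below. All names (DisablementTracker, markBrokerComplete,
and so on) are hypothetical illustrations, not proposed API:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the controller tracks which brokers have reported
// that they finished stopping tiered-storage work for a topic, and flips
// the topic to DISABLED once every expected broker has reported in.
public class DisablementTracker {
    private final Map<String, Set<Integer>> pendingBrokers = new ConcurrentHashMap<>();

    // Called when disablement starts: remember the brokers that must confirm.
    public void startDisabling(String topic, Set<Integer> replicaBrokers) {
        pendingBrokers.put(topic, new HashSet<>(replicaBrokers));
    }

    // Called for each broker-initiated "I am done" notification, analogous
    // to the broker -> controller flow KIP-858 uses. Returns true when the
    // topic can transition to the "Disabled" state.
    public synchronized boolean markBrokerComplete(String topic, int brokerId) {
        Set<Integer> pending = pendingBrokers.get(topic);
        if (pending == null) {
            return false; // unknown topic or disablement already completed
        }
        pending.remove(brokerId);
        if (pending.isEmpty()) {
            pendingBrokers.remove(topic);
            return true; // all brokers confirmed: record the DISABLED state
        }
        return false;
    }
}
```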

2. Why should we wait for all brokers to respond before moving to the
"Disabled" state in KRaft mode?
Currently, only the leader node does the remote log upload/fetch tasks, so
does that mean the controller only needs to make sure the leader completes
the stopPartition?
If a leadership change is triggered during the leader's stopPartition
process, then the new leader should receive and apply the configRecord
update before the leadership change record, based on the KRaft design. That
means there is no window in which a follower node becomes the leader and
starts doing unexpected upload/fetch tasks, right?
I agree that in ZK mode we should make sure all brokers have completed the
stopPartitions before moving to the "Disabled" state, because the ZK node
watcher works in a separate thread. But I am not sure about KRaft mode.
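
The ordering argument could be sketched like this (illustrative only;
record and method names are simplified stand-ins for the real
metadata-replay path):

```java
// Illustrative sketch of ordered metadata replay on a broker. Because the
// KRaft metadata log is applied strictly in order, the config change that
// disables tiered storage is processed before any later leadership-change
// record that makes this broker the leader.
public class TieredStateReplayer {
    private volatile boolean remoteStorageEnabled = true;

    // Applied first: the config change disabling tiered storage.
    public void onConfigRecord(boolean remoteStorageEnable) {
        this.remoteStorageEnabled = remoteStorageEnable;
    }

    // Applied later: this broker becomes leader for a partition. A new
    // leader therefore never schedules remote tasks for a disabled topic.
    public void onBecomeLeader(String topicPartition) {
        if (remoteStorageEnabled) {
            startRemoteUploadAndFetchTasks(topicPartition);
        }
    }

    private void startRemoteUploadAndFetchTasks(String tp) { /* schedule RLM tasks */ }
}
```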

Thanks.
Luke


On Fri, Mar 29, 2024 at 4:14 PM Christo Lolov <ch...@gmail.com>
wrote:

> Heya everyone!
>
> re: Doguscan
>
> I believe the answer to 101 needs a bit more discussion. As far as I know,
> tiered storage today has methods to update the metadata of a segment to say
> "hey, I would like this deleted", but the actual deletion is left to plugin
> implementations (or any background cleaners). In other words, there is no
> "immediate" deletion. In this KIP, we would like to continue doing the same
> if the retention policy is set to delete. So I believe the answer is
> actually that a) we will update the metadata of the segments to mark them
> as deleted and b) we will advance the log start offset. Any deletion of
> actual files will still be delegated to plugin implementations. I believe
> this is further supported by "*remote.log.disable.policy=delete:* Logs that
> are archived in the remote storage will not be part of the contiguous
> "active" log and will be deleted asynchronously as part of the disablement
> process"
>
> Following from the above, I believe for 102 it is fine to allow setting of
> remote.log.disable.policy on a disabled topic in much the same way we allow
> other remote-related configurations to be set on a topic (e.g.
> local.retention.*) - it just won't have an effect. Granted, I do believe we
> should restrict the policy being changed while a disablement is ongoing.
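>
> The guard for the ongoing-disablement case could be as simple as the
> following sketch (hypothetical names; the real check would live in the
> existing config validation path):
>
> ```java
> // Hypothetical validation: allow remote.log.disable.policy updates on
> // ENABLED or fully DISABLED topics, but reject them mid-disablement so a
> // half-finished deletion cannot switch to "retain" (or vice versa).
> public class DisablePolicyValidator {
>     enum TieredState { ENABLED, DISABLING, DISABLED }
>
>     public void validatePolicyChange(TieredState state) {
>         if (state == TieredState.DISABLING) {
>             throw new IllegalStateException(
>                 "remote.log.disable.policy cannot be changed while a " +
>                 "disablement is in progress");
>         }
>         // ENABLED or DISABLED: accepted, though it has no effect until
>         // the next disablement is triggered.
>     }
> }
> ```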
>
> re: Satish and Kamal
>
> 104, 1 and 2 are fair asks, I will work with Doguscan to update the KIP
> with the information!
>
> Best,
> Christo
>
> On Thu, 28 Mar 2024 at 10:31, Doğuşcan Namal <na...@gmail.com>
> wrote:
>
> > Hi Satish, I will try to answer as much as I can and the others could
> > chime in with further details.
> >
> > *101. For remote.log.disable.policy=delete: Does it delete the remote log
> > data immediately and the data in remote storage will not be taken into
> > account by any replica? That means log-start-offset is moved to the
> > earlier local-log-start-offset.*
> > *Exactly. RemoteLogData will be deleted immediately.*
> > *So before the deletion starts we move the LogStartOffset to
> > LocalLogStartOffset to ensure that no RemoteLog will be accessed after
> > that point.*
> >
> >
> > *102. Can we update the remote.log.disable.policy after tiered storage
> > is disabled on a topic?*
> >
> > *This is a good point. I think we should not allow modifying this
> > configuration, because changing the policy from Deletion to Retain when
> > there is an ongoing Deletion will result in undefined behaviour where we
> > retain half of the remote log and delete the other half.*
> >
> > *103. Do we plan to add any metrics related to this feature?*
> >
> >
> >
> > *Any recommendations?*
> > *We may emit a gauge showing the enablement state of a topic but we could
> > gather that info from the logs as well.*
> > *The total duration for remote topic deletion could be added as well but
> > this is more of a metric for the RemotePartitionRemover itself.*
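> >
> > For illustration, one possible shape of such a gauge, assuming the
> > common metrics library is used (the metric and group names here are
> > made up for the sketch):
> >
> > ```java
> > import java.util.concurrent.atomic.AtomicInteger;
> >
> > import org.apache.kafka.common.metrics.Measurable;
> > import org.apache.kafka.common.metrics.Metrics;
> >
> > // Illustrative per-topic gauge for the tiered storage enablement state.
> > public class TieredStateMetrics {
> >     // 0 = DISABLED, 1 = DISABLING, 2 = ENABLED
> >     private final AtomicInteger state = new AtomicInteger(2);
> >
> >     public TieredStateMetrics(Metrics metrics, String topic) {
> >         metrics.addMetric(
> >             metrics.metricName("remote-storage-state", "topic-metrics",
> >                 "Tiered storage enablement state for topic " + topic),
> >             (Measurable) (config, now) -> state.get());
> >     }
> >
> >     public void onStateChange(int newState) {
> >         state.set(newState);
> >     }
> > }
> > ```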
> >
> >
> >
> > *104. Please add configuration details about the copier thread pool, the
> > expiration thread pool and the migration of the existing
> > RemoteLogManagerScheduledThreadPool.*
> >
> > *Will add the details.*
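> >
> > For illustration, the new configurations could take a shape like this
> > (the property names are placeholders until the KIP pins them down):
> >
> > ```java
> > import java.util.Properties;
> >
> > // Hypothetical broker configuration splitting the existing
> > // RemoteLogManagerScheduledThreadPool into dedicated copier and
> > // expiration pools; property names and defaults are placeholders.
> > public class ThreadPoolConfigSketch {
> >     public static Properties exampleBrokerProps() {
> >         Properties props = new Properties();
> >         props.put("remote.log.manager.copier.thread.pool.size", "10");
> >         props.put("remote.log.manager.expiration.thread.pool.size", "10");
> >         return props;
> >     }
> > }
> > ```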
> >
> > 105. How is the behaviour with topic or partition deletion request
> > handled when tiered storage disablement request is still being
> > processed on a topic?
> >
> > *If the disablement policy is Delete then a successive topic deletion
> > request is going to be a NOOP because RemoteLogs are already being
> > deleted.*
> > *If the disablement policy is Retain, then we only moved the
> > LogStartOffset and didn't touch RemoteLogs anyway, so the delete topic
> > request will result in the initiation of RemoteLog deletion.*
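> >
> > In code form, the interplay described above could look roughly like
> > this (helper names are hypothetical):
> >
> > ```java
> > // Sketch of how a topic-deletion request interacts with an in-flight
> > // disablement, per the answers above. Helper names are hypothetical.
> > public class TopicDeleteDuringDisablement {
> >     enum Policy { DELETE, RETAIN }
> >
> >     public void onTopicDelete(String topic, Policy disablementPolicy) {
> >         switch (disablementPolicy) {
> >             case DELETE:
> >                 // RemoteLogs are already being deleted by the ongoing
> >                 // disablement, so the remote part of deletion is a NOOP.
> >                 break;
> >             case RETAIN:
> >                 // Disablement only moved the LogStartOffset; remote data
> >                 // is still there, so topic deletion must kick off
> >                 // RemoteLog deletion.
> >                 initiateRemoteLogDeletion(topic);
> >                 break;
> >         }
> >         deleteLocalTopicData(topic);
> >     }
> >
> >     private void initiateRemoteLogDeletion(String topic) { /* async cleanup */ }
> >     private void deleteLocalTopicData(String topic) { /* local deletion */ }
> > }
> > ```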
> >
> >
> > On Tue, 26 Mar 2024 at 18:21, Kamal Chandraprakash <
> > kamal.chandraprakash@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for the KIP! Overall the KIP looks good and covers most of the
> > > items.
> > >
> > > 1. Could you explain how the brokers will handle the DisableRemoteTopic
> > > API request?
> > >
> > > 2. Who will initiate the controller interaction sequence? Does the
> > > controller listen for topic config updates and initiate the
> > > disablement?
> > >
> > > --
> > > Kamal
> > >
> > > On Tue, Mar 26, 2024 at 4:40 PM Satish Duggana <
> satish.duggana@gmail.com
> > >
> > > wrote:
> > >
> > > > Thanks Mehari, Divij, Christo et al. for the KIP.
> > > >
> > > > I had an initial review of the KIP and left the below comments.
> > > >
> > > > 101. For remote.log.disable.policy=delete:
> > > > Does it delete the remote log data immediately and the data in remote
> > > > storage will not be taken into account by any replica? That means
> > > > log-start-offset is moved to the earlier local-log-start-offset.
> > > >
> > > > 102. Can we update the remote.log.disable.policy after tiered storage
> > > > is disabled on a topic?
> > > >
> > > > 103. Do we plan to add any metrics related to this feature?
> > > >
> > > > 104. Please add configuration details about copier thread pool,
> > > > expiration thread pool and the migration of the existing
> > > > RemoteLogManagerScheduledThreadPool.
> > > >
> > > > 105. How is the behaviour with topic or partition deletion request
> > > > handled when tiered storage disablement request is still being
> > > > processed on a topic?
> > > >
> > > > ~Satish.
> > > >
> > > > On Mon, 25 Mar 2024 at 13:34, Doğuşcan Namal <
> namal.doguscan@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > Hi Christo and Luke,
> > > > >
> > > > > I think the KRaft section of the KIP requires slight improvement.
> > > > > The metadata propagation in KRaft is handled by the Raft layer
> > > > > instead of sending Controller -> Broker RPCs. In fact, KIP-631
> > > > > deprecated these RPCs.
> > > > >
> > > > > I will come up with some recommendations on how we could improve
> > > > > that one, but until then, @Luke please feel free to review the KIP.
> > > > >
> > > > > @Satish, if we want this to make it to Kafka 3.8 I believe we need
> > > > > to aim to get the KIP approved in the following weeks, otherwise it
> > > > > will slip and we cannot support it in Zookeeper mode.
> > > > >
> > > > > I also would like to better understand the community's stand on
> > > > > adding a new feature for Zookeeper since it is marked as deprecated
> > > > > already.
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, 18 Mar 2024 at 13:42, Christo Lolov <
> christololov@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> Heya,
> > > > >>
> > > > >> I do have some time to put into this, but to be honest I am still
> > > > >> after reviews of the KIP itself :)
> > > > >>
> > > > >> After the latest changes it ought to be detailing both a Zookeeper
> > > > >> approach and a KRaft approach.
> > > > >>
> > > > >> Do you have any thoughts on how it could be improved, or should I
> > > > >> start a voting thread?
> > > > >>
> > > > >> Best,
> > > > >> Christo
> > > > >>
> > > > >> On Thu, 14 Mar 2024 at 06:12, Luke Chen <sh...@gmail.com>
> wrote:
> > > > >>
> > > > >> > Hi Christo,
> > > > >> >
> > > > >> > Any update with this KIP?
> > > > >> > If you don't have time to complete it, I can collaborate with
> > > > >> > you to work on it.
> > > > >> >
> > > > >> > Thanks.
> > > > >> > Luke
> > > > >> >
> > > > >> > On Wed, Jan 17, 2024 at 11:38 PM Satish Duggana <
> > > > satish.duggana@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi Christo,
> > > > >> > > Thanks for volunteering to contribute to the KIP discussion. I
> > > > >> > > suggest considering this KIP for both ZK and KRaft as it will
> > > > >> > > be helpful for this feature to be available in 3.8.0 running
> > > > >> > > with ZK clusters.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Satish.
> > > > >> > >
> > > > >> > > On Wed, 17 Jan 2024 at 19:04, Christo Lolov <
> > > christololov@gmail.com
> > > > >
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > Hello!
> > > > >> > > >
> > > > >> > > > I volunteer to get this KIP moving forward and implemented
> > > > >> > > > in Apache Kafka 3.8.
> > > > >> > > >
> > > > >> > > > I have caught up with Mehari offline and we have agreed that,
> > > > >> > > > given Apache Kafka 4.0 being around the corner, we would like
> > > > >> > > > to propose this feature only for KRaft clusters.
> > > > >> > > >
> > > > >> > > > Any and all reviews and comments are welcome!
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Christo
> > > > >> > > >
> > > > >> > > > On Tue, 9 Jan 2024 at 09:44, Doğuşcan Namal <
> > > > namal.doguscan@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi everyone, any progress on the status of this KIP?
> > > > >> > > > > Overall looks good to me but I wonder whether we still need
> > > > >> > > > > to support it for Zookeeper mode given that it will be
> > > > >> > > > > deprecated in the next 3 months.
> > > > >> > > > >
> > > > >> > > > > On 2023/07/21 20:16:46 "Beyene, Mehari" wrote:
> > > > >> > > > > > Hi everyone,
> > > > >> > > > > > I would like to start a discussion on KIP-950: Tiered
> > > > >> > > > > > Storage Disablement (
> > > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-950%3A++Tiered+Storage+Disablement
> > > > >> > > > > > ).
> > > > >> > > > > >
> > > > >> > > > > > This KIP proposes adding the ability to disable and
> > > > >> > > > > > re-enable tiered storage on a topic.
> > > > >> > > > > >
> > > > >> > > > > > Thanks,
> > > > >> > > > > > Mehari
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > >
> > > > >> >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Satish Duggana <sa...@gmail.com>.
Hi Christo, Doguscan,
Can you please address the comments in this mail thread and update the
KIP accordingly?

Thanks,
Satish.


Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Kamal Chandraprakash <ka...@gmail.com>.
Christo,

IMO, we can converge the implementation for both the policies
and remove the remote-log-segments in an async fashion.

The only difference between the "delete" and "retain" policies is
that, for "delete", we move the log-start-offset to match the
local-log-start-offset, and then the expiration threads will clean up
those remote-log segments due to the log-start-offset breach.
The proposal is that we use the same remote-log-expiration threads
to delete the remote-log-segments instead of the request-handler thread,
as the time taken to delete the remote log segments is non-deterministic.

For both the policies, we have to stop the copier threads and close
the RLM task once all the segments have expired from the remote
storage.

https://github.com/apache/kafka/pull/13947/files#r1281675818
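
A rough sketch of the converged flow described above (names are
hypothetical; the real logic would live in the copier/expiration task
machinery):

```java
// Sketch of converging both disablement policies onto the async
// expiration path. Names are hypothetical illustrations.
public class ConvergedDisablement {
    enum Policy { DELETE, RETAIN }

    public void disable(String topicPartition, Policy policy) {
        stopCopierTasks(topicPartition); // no more uploads under either policy
        if (policy == Policy.DELETE) {
            // Advancing the log-start-offset makes every remote segment a
            // retention breach, so the existing expiration threads delete
            // them asynchronously instead of the request-handler thread.
            advanceLogStartOffset(topicPartition, localLogStartOffset(topicPartition));
        }
        // Under RETAIN, expiration tasks keep running unchanged and remote
        // segments age out via the normal retention time/size checks.
    }

    public void onAllRemoteSegmentsExpired(String topicPartition) {
        closeRlmTask(topicPartition); // final step for both policies
    }

    private void stopCopierTasks(String tp) { }
    private long localLogStartOffset(String tp) { return 0L; }
    private void advanceLogStartOffset(String tp, long offset) { }
    private void closeRlmTask(String tp) { }
}
```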

--
Kamal

On Thu, May 9, 2024 at 6:51 PM Christo Lolov <ch...@gmail.com> wrote:

> Heya!
>
> re: Luke
>
> 1. I am not certain I follow the question. From DISABLED you can only go to
> ENABLED regardless of whether your cluster is backed by Zookeeper or KRaft.
> Am I misunderstanding your point?
>
> 2. Apologies, this was a leftover from previous versions. I have updated
> the Zookeeper section. The steps ought to be: controller receives change,
> commits necessary data to Zookeeper, enqueues disablement and starts
> sending StopReplicas request to brokers; brokers receive StopReplicas and
> propagate them all the way to RemoteLogManager#stopPartitions which takes
> care of the rest.
>
> 3. Correct, it should say DISABLED - this should now be corrected.
>
> 4. I was thinking that if there is a mismatch we will just fail accepting
> the request for disablement. This should be the same in both Zookeeper and
> KRaft. Or am I misunderstanding your question?
>
> 5. Yeah. I am now doing a second pass on all diagrams and will update them
> by the end of the day!
>
> 6. I think my current train of thought is that there will be unlimited
> retries until all brokers respond, similar to how deletion of a topic
> works today in ZK. In the meantime the state will continue to be
> DISABLING. Do you have a better suggestion?
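>
> For what it's worth, a minimal sketch of that retry loop (hypothetical
> names, mirroring how ZK topic deletion retries StopReplica today):
>
> ```java
> import java.util.Set;
>
> // Hypothetical controller-side retry loop for ZK mode: keep the topic in
> // DISABLING and re-send StopReplica to brokers that have not yet
> // acknowledged, mirroring how ZK topic deletion retries today.
> public class ZkDisablementRetrier {
>     public void onStopReplicaResponses(String topic, Set<Integer> pending,
>                                        Set<Integer> acked) {
>         pending.removeAll(acked);
>         if (pending.isEmpty()) {
>             transitionToDisabled(topic);
>         } else {
>             resendStopReplica(topic, pending); // unlimited retries
>         }
>     }
>
>     private void transitionToDisabled(String topic) { }
>     private void resendStopReplica(String topic, Set<Integer> brokers) { }
> }
> ```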
>
> re: Kamal
>
> Yep, I will update all diagrams
>
> I am not certain I follow the reasoning for making retain and delete the
> same. Deletion when the policy is retain happens asynchronously due to
> expiration. I think that deletion when the policy is delete ought to (at
> least for the initial implementation) happen synchronously. Should people
> run into timeout problems we can always then have a follow-up KIP where we
> make it asynchronous.
>
> Best,
> Christo
>
> On Tue, 7 May 2024 at 10:04, Kamal Chandraprakash <
> kamal.chandraprakash@gmail.com> wrote:
>
> > Hi Christo,
> >
> > Thanks for the update!
> >
> > For both the policies "retain" and "delete", can we maintain the same
> > approach to delete the segments async?
> >
> > > If the disablement policy is set to delete, the Log start offset (LSO)
> > > is updated to match the Local Log Start Offset and the remote log is
> > > deleted by calling the RemoteStorageManager#deleteLogSegmentData().
> >
> > In the KIP, it's mentioned that when the disable policy is set to
> > "delete", the remote-log-segments will be deleted synchronously. The
> > stopPartition call might time out when the number of remote log segments
> > to delete is huge. We can further extend the same approach to the topic
> > deletion requests.
> >
> > Also, could you please update the state diagram with the transitions? It
> > is not clear when to transition from the
> > DISABLING to the DISABLED state.
> >
> > --
> > Kamal
> >
> > On Mon, May 6, 2024 at 6:55 PM Kamal Chandraprakash <
> > kamal.chandraprakash@gmail.com> wrote:
> >
> > > Ignore the above message. Got the answers after reading the state
> > > transition section.
> > >
> > > > If the disablement policy is delete, tasks scheduled for the
> > > topic-partitions in the RemoteDataExpirationThreadPool will also be
> > > canceled.
> > >
> > > We are deleting the segments synchronously. Should we delete them
> > > asynchronously? The same approach can be extended to topic deletion
> > > requests.
> > >
> > > > 6. In ZK mode, what will the controller do if the "stopReplicas"
> > > responses not received from all brokers? Reverting the changes?
> > >
> > > Since we are deleting the segments synchronously, this case is bound
> > > to happen when the number of remote log segments to delete is huge.
> > >
> > >
> > > On Mon, May 6, 2024, 18:12 Kamal Chandraprakash <
> > > kamal.chandraprakash@gmail.com> wrote:
> > >
> > >> Hi Christo,
> > >>
> > >> Thanks for the update!
> > >>
> > >> 1. In the ZK mode, how will the transition from DISABLING to DISABLED
> > >> state happen?
> > >> For the "retain" policy, until we delete all the remote-log segments,
> > the
> > >> state will be
> > >> DISABLING and the deletion can happen only when they breach either the
> > >> retention
> > >> time (or) size.
> > >>
> > >> How does the controller monitor that all the remote log segments are
> > >> deleted for all the partitions of the topic before transitioning the
> > >> state to DISABLED?
> > >>
> > >> 2. In KRaft, we have only the ENABLED -> DISABLED transition. How are we
> > >> supporting the case
> > >> "retain" -> "enable"?
> > >>
> > >> If the remote storage is degraded, we want to avoid uploading the
> > >> segments temporarily
> > >> and resume once the remote storage is healthy. Is this case
> > >> supported?
> > >>
> > >>
> > >>
> > >> On Fri, May 3, 2024 at 12:12 PM Luke Chen <sh...@gmail.com> wrote:
> > >>
> > >>> Also, I think using the `stopReplicas` request is a good idea because
> > >>> it won't cause any problems while migrating to KRaft mode.
> > >>> The stopReplicas request is one of the requests that the KRaft
> > >>> controller will send to ZK brokers during migration.
> > >>>
> > >>> Thanks.
> > >>> Luke
> > >>>
> > >>> On Fri, May 3, 2024 at 11:48 AM Luke Chen <sh...@gmail.com> wrote:
> > >>>
> > >>>> Hi Christo,
> > >>>>
> > >>>> Thanks for the update.
> > >>>>
> > >>>> Questions:
> > >>>> 1. For this
> > >>>> "The possible state transition from DISABLED state is to the
> ENABLED."
> > >>>> I think it only applies for KRaft mode. In ZK mode, the possible
> state
> > >>>> is "DISABLING", right?
> > >>>>
> > >>>> 2. For this:
> > >>>> "If the cluster is using Zookeeper as the control plane, enabling
> > >>>> remote storage for a topic triggers the controller to send this
> > information
> > >>>> to Zookeeper. Each broker listens for changes in Zookeeper, and
> when a
> > >>>> change is detected, the broker triggers
> > >>>> RemoteLogManager#onLeadershipChange()."
> > >>>>
> > >>>> I think the way ZK brokers know about the leadership change is by
> > >>>> getting the LeaderAndIsrRequest from the controller, not by
> > >>>> listening for changes in ZK.
> > >>>>
> > >>>> 3. In the KRaft handler steps, you said:
> > >>>> "The controller also updates the Topic metadata to increment the
> > >>>> tiered_epoch and update the tiered_state to DISABLING state."
> > >>>>
> > >>>> Should it be "DISABLED" state since it's KRaft mode?
> > >>>>
> > >>>> 4. I was thinking about how we handle a tiered_epoch mismatch error.
> > >>>> For ZK, I think the controller won't write any data into the ZK
> > >>>> znode; for KRaft, neither the configRecord nor the
> > >>>> updateTopicMetadata records will be written.
> > >>>> Is that right? Because the current workflow makes me think there
> > >>>> will be partial data updated in ZK/KRaft on a tiered_epoch error.
> > >>>>
> > >>>> 5. Since we changed to use the stopReplicas (V5) request now, the
> > >>>> diagram for the ZK workflow might also need to be updated.
> > >>>>
> > >>>> 6. In ZK mode, what will the controller do if the "stopReplicas"
> > >>>> responses are not received from all brokers? Revert the changes?
> > >>>> This won't happen in KRaft mode because it's the broker's
> > >>>> responsibility to fetch metadata updates from the controller.
> > >>>>
> > >>>>
> > >>>> Thank you.
> > >>>> Luke
> > >>>>
> > >>>>
> > >>>> On Fri, Apr 19, 2024 at 10:23 PM Christo Lolov <
> > christololov@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Heya all!
> > >>>>>
> > >>>>> I have updated KIP-950. A list of what I have updated is:
> > >>>>>
> > >>>>> * Explicitly state that Zookeeper-backed clusters will have
> > >>>>> ENABLED -> DISABLING -> DISABLED while KRaft-backed clusters will
> > >>>>> only have ENABLED -> DISABLED
> > >>>>> * Added two configurations for the new thread pools and explained
> > >>>>> where values will be picked up mid Kafka version upgrade
> > >>>>> * Explained how leftover remote partitions will be scheduled for
> > >>>>> deletion
> > >>>>> * Updated the API to use StopReplica V5 rather than a whole new
> > >>>>> controller-to-broker API
> > >>>>> * Explained that the disablement procedure will be triggered by the
> > >>>>> controller listening for an (Incremental)AlterConfig change
> > >>>>> * Explained that we will first move the log start offset and then
> > >>>>> issue a deletion
> > >>>>> * Went into more detail that changing remote.log.disable.policy
> > >>>>> after disablement won't do anything, and that if a customer would
> > >>>>> like additional data deleted they would have to use already
> > >>>>> existing methods
> > >>>>>
> > >>>>> Let me know if there are any new comments or I have missed
> something!
> > >>>>>
> > >>>>> Best,
> > >>>>> Christo
> > >>>>>
> > >>>>> On Mon, 15 Apr 2024 at 12:40, Christo Lolov <
> christololov@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Heya Doguscan,
> > >>>>>>
> > >>>>>> I believe that the state of the world after this KIP will be the
> > >>>>>> following:
> > >>>>>>
> > >>>>>> For Zookeeper-backed clusters there will be 3 states: ENABLED,
> > >>>>>> DISABLING and DISABLED. We want this because Zookeeper-backed
> > >>>>>> clusters will await a confirmation from the brokers that they have
> > >>>>>> indeed stopped tiered-related operations on the topic.
> > >>>>>>
> > >>>>>> For KRaft-backed clusters there will be only 2 states: ENABLED and
> > >>>>>> DISABLED. KRaft takes a fire-and-forget approach for topic
> > >>>>>> deletion. I believe the same approach ought to be taken for tiered
> > >>>>>> topics. The mechanism which will ensure that leftover state in
> > >>>>>> remote due to failures is cleaned up is, to me, the retention
> > >>>>>> mechanism. In today's code, a leader deletes all segments it finds
> > >>>>>> in remote with offsets below the log start offset. I believe this
> > >>>>>> will be good enough for cleaning up leftover state in remote due
> > >>>>>> to failures.
> > >>>>>>
> > >>>>>> I know that quite a few changes have been discussed, so I will aim
> > >>>>>> to put them on paper in the upcoming days and let everyone know!
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Christo
> > >>>>>>
> > >>>>>> On Tue, 9 Apr 2024 at 14:49, Doğuşcan Namal <
> > namal.doguscan@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> +1 let's not introduce a new API and mark it immediately as
> > >>>>>>> deprecated :)
> > >>>>>>>
> > >>>>>>> On your second comment Luke, one thing we need to clarify is when
> > >>>>>>> do we consider remote storage to be DISABLED for a topic?
> > >>>>>>> Particularly, what is the state when the remote storage is being
> > >>>>>>> deleted in case of disablement.policy=delete? Is it DISABLING or
> > >>>>>>> DISABLED?
> > >>>>>>>
> > >>>>>>> If we move directly to the DISABLED state,
> > >>>>>>>
> > >>>>>>> a) in case of failures, the leaders should continue remote
> > >>>>>>> storage deletion even if the topic is moved to the DISABLED
> > >>>>>>> state, otherwise we risk having stray data on remote storage.
> > >>>>>>> b) on each restart, we should initiate the remote storage
> > >>>>>>> deletion because although we replayed a record with a DISABLED
> > >>>>>>> state, we cannot be sure if the remote data is deleted or not.
> > >>>>>>>
> > >>>>>>> We could either consider keeping the remote topic in DISABLING
> > >>>>>>> state until all of the remote storage data is deleted, or we need
> > >>>>>>> an additional mechanism to handle the remote stray data.
> > >>>>>>>
> > >>>>>>> The existing topic deletion, for instance, handles stray logs on
> > >>>>>>> disk by detecting them on KafkaBroker startup and deleting them
> > >>>>>>> before the ReplicaManager is started.
> > >>>>>>> Maybe we need a similar mechanism here as well if we don't want a
> > >>>>>>> DISABLING state. Otherwise, we need a callback from brokers to
> > >>>>>>> validate that the remote storage data is deleted before we can
> > >>>>>>> move to the DISABLED state.
> > >>>>>>>
> > >>>>>>> Thanks.
> > >>>>>>>
> > >>>>>>> On Tue, 9 Apr 2024 at 12:45, Luke Chen <sh...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Christo,
> > >>>>>>>>
> > >>>>>>>> > I would then opt for moving the information from
> > >>>>>>>> > DisableRemoteTopic into the StopReplicas API, which will then
> > >>>>>>>> > disappear in the KRaft world as it is already scheduled for
> > >>>>>>>> > deprecation. What do you think?
> > >>>>>>>>
> > >>>>>>>> Sounds good to me.
> > >>>>>>>>
> > >>>>>>>> Thanks.
> > >>>>>>>> Luke
> > >>>>>>>>
> > >>>>>>>> On Tue, Apr 9, 2024 at 6:46 PM Christo Lolov <
> > >>>>>>>> christololov@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> > Heya Luke!
> > >>>>>>>> >
> > >>>>>>>> > I thought a bit more about it and I reached the same
> > >>>>>>>> > conclusion as you for 2 as a follow-up from 1. In other words,
> > >>>>>>>> > in KRaft world I don't think the controller needs to wait for
> > >>>>>>>> > acknowledgements from the brokers. All we care about is that
> > >>>>>>>> > the leader (who is responsible for archiving/deleting data in
> > >>>>>>>> > tiered storage) knows about the change and applies it
> > >>>>>>>> > properly. If there is a leadership change halfway through the
> > >>>>>>>> > operation then the new leader still needs to apply the message
> > >>>>>>>> > from the state topic and we know that a disable-message will
> > >>>>>>>> > be applied before a reenablement-message. I will change the
> > >>>>>>>> > KIP later today/tomorrow morning to reflect this reasoning.
> > >>>>>>>> >
> > >>>>>>>> > However, with this I believe that introducing a new API just
> > >>>>>>>> > for Zookeeper-based clusters (i.e. DisableRemoteTopic) becomes
> > >>>>>>>> > a bit of an overkill. I would then opt for moving the
> > >>>>>>>> > information from DisableRemoteTopic into the StopReplicas API,
> > >>>>>>>> > which will then disappear in the KRaft world as it is already
> > >>>>>>>> > scheduled for deprecation. What do you think?
> > >>>>>>>> >
> > >>>>>>>> > Best,
> > >>>>>>>> > Christo
> > >>>>>>>> >

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Luke Chen <sh...@gmail.com>.
Hi Christo,

Thanks for the explanation.
I think it would be good if you could add that into the KIP.

Otherwise, LGTM.

Thank you.
Luke

On Mon, May 13, 2024 at 11:55 PM Christo Lolov <ch...@gmail.com>
wrote:

> Heya!
>
> re Kamal - Okay, I believe I understand what you mean and I agree. I have
> made the following change
>
> ```
>
> During tiered storage disablement, when RemoteLogManager#stopPartition() is
> called:
>
>    - Tasks scheduled for the topic-partitions in the
>    RemoteStorageCopierThreadPool will be canceled.
>    - If the disablement policy is retain, scheduled tasks for the
>    topic-partitions in the RemoteDataExpirationThreadPool will remain
>    unchanged.
>    - If the disablement policy is delete, we will first advance the log
>    start offset and let the tasks scheduled for the topic-partitions in
>    the RemoteDataExpirationThreadPool delete all remote segments below
>    the log start offset and then unregister themselves.
>
> ```
>
> re Luke - I checked once again. As far as I understand, when a broker goes
> down, all replicas it hosts go to the OfflineReplica state in the state
> machine the controller maintains. The moment the broker comes back up
> again, the state machine resends StopReplica based on
> ```
>
> * OfflineReplica -> ReplicaDeletionStarted
> * --send StopReplicaRequest to the replica (with deletion)
>
> ```
> from ReplicaStateMachine.scala. So I was wrong and you are right, we do not
> appear to be sending constant requests today. I believe it is safe for us
> to follow a similar approach, i.e. if a replica comes online again we resend
> the StopReplica.
>
> If you don't notice any more problems I will aim to start a VOTE tomorrow
> so we can get at least part of this KIP in 3.8.
>
> Best,
> Christo
>
> On Fri, 10 May 2024 at 11:11, Luke Chen <sh...@gmail.com> wrote:
>
> > Hi Christo,
> >
> > > 1. I am not certain I follow the question. From DISABLED you can only
> go
> > to
> > ENABLED regardless of whether your cluster is backed by Zookeeper or
> KRaft.
> > Am I misunderstanding your point?
> >
> > Yes, you're right.
> >
> > > 4. I was thinking that if there is a mismatch we will just fail
> accepting
> > the request for disablement. This should be the same in both Zookeeper
> and
> > KRaft. Or am I misunderstanding your question?
> >
> > OK, sounds good.
> >
> > > 6. I think my current train of thought is that there will be unlimited
> > retries until all brokers respond in a similar way to how deletion of a
> > topic works today in ZK. In the meantime the state will continue to be
> > DISABLING. Do you have a better suggestion?
> >
> > I don't think infinite retries is a good idea since if a broker is down
> > forever, this request will never complete.
> > You mentioned the existing topic deletion uses a similar pattern; how
> > does it handle this issue?
> >
> > Thanks.
> > Luke
> >
> > On Thu, May 9, 2024 at 9:21 PM Christo Lolov <ch...@gmail.com>
> > wrote:
> >
> > > Heya!
> > >
> > > re: Luke
> > >
> > > 1. I am not certain I follow the question. From DISABLED you can only
> go
> > to
> > > ENABLED regardless of whether your cluster is backed by Zookeeper or
> > KRaft.
> > > Am I misunderstanding your point?
> > >
> > > 2. Apologies, this was a leftover from previous versions. I have
> updated
> > > the Zookeeper section. The steps ought to be: controller receives
> change,
> > > commits necessary data to Zookeeper, enqueues disablement and starts
> > > sending StopReplicas request to brokers; brokers receive StopReplicas
> and
> > > propagate them all the way to RemoteLogManager#stopPartitions which
> takes
> > > care of the rest.
> > >
> > > 3. Correct, it should say DISABLED - this should now be corrected.
> > >
> > > 4. I was thinking that if there is a mismatch we will just fail
> accepting
> > > the request for disablement. This should be the same in both Zookeeper
> > and
> > > KRaft. Or am I misunderstanding your question?
> > >
> > > 5. Yeah. I am now doing a second pass on all diagrams and will update
> > them
> > > by the end of the day!
> > >
> > > 6. I think my current train of thought is that there will be unlimited
> > > retries until all brokers respond in a similar way to how deletion of a
> > > topic works today in ZK. In the meantime the state will continue to be
> > > DISABLING. Do you have a better suggestion?
> > >
> > > re: Kamal
> > >
> > > Yep, I will update all diagrams
> > >
> > > I am not certain I follow the reasoning for making retain and delete
> the
> > > same. Deletion when the policy is retain happens asynchronously due to
> > > expiration. I think that deletion when the policy is delete ought to
> (at
> > > least for the initial implementation) happen synchronously. Should
> people
> > > run into timeout problems we can always then have a follow-up KIP where
> > we
> > > make it asynchronous.
> > >
> > > Best,
> > > Christo
> > >
> > > On Tue, 7 May 2024 at 10:04, Kamal Chandraprakash <
> > > kamal.chandraprakash@gmail.com> wrote:
> > >
> > > > Hi Christo,
> > > >
> > > > Thanks for the update!
> > > >
> > > > For both the policies "retain" and "delete", can we maintain the same
> > > > approach to delete the segments async?
> > > >
> > > > > If the disablement policy is set to delete, the Log start offset
> > (LSO)
> > > is
> > > > updated to match the Local Log Start Offset and the remote log is
> > deleted
> > > > by calling the RemoteStorageManager#deleteLogSegmentData().
> > > >
> > > > In the KIP, it's mentioned that when the disable policy is set to
> > > > "delete", the remote-log-segments will be deleted synchronously. The
> > > > stopPartition call might time out when the number of remote log
> > > > segments to delete is huge. We can further extend the same approach
> > > > to topic deletion requests.
> > > >
> > > > Also, could you please update the state diagram with the transitions?
> > > > It is not clear when to transition from the DISABLING to the DISABLED
> > > > state.
> > > >
> > > > --
> > > > Kamal
> > > >
> > > > On Mon, May 6, 2024 at 6:55 PM Kamal Chandraprakash <
> > > > kamal.chandraprakash@gmail.com> wrote:
> > > >
> > > > > Ignore the above message. Got the answers after reading the state
> > > > > transition section.
> > > > >
> > > > > > If the disablement policy is delete, tasks scheduled for the
> > > > > topic-partitions in the RemoteDataExpirationThreadPool will also be
> > > > > canceled.
> > > > >
> > > > > We are deleting the segments synchronously. Should we delete them
> > > > > asynchronously? The same approach can be extended to topic deletion
> > > > > requests.
> > > > >
> > > > > > 6. In ZK mode, what will the controller do if the "stopReplicas"
> > > > > responses are not received from all brokers? Reverting the changes?
> > > > >
> > > > > Since we are deleting the segments synchronously, this case is bound
> > > > > to happen when the number of remote log segments to delete is huge.
> > > > >
> > > > >
> > > > > On Mon, May 6, 2024, 18:12 Kamal Chandraprakash <
> > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > >
> > > > >> Hi Christo,
> > > > >>
> > > > >> Thanks for the update!
> > > > >>
> > > > >> 1. In the ZK mode, how will the transition from DISABLING to
> > DISABLED
> > > > >> state happen?
> > > > >> For the "retain" policy, until we delete all the remote-log
> > segments,
> > > > the
> > > > >> state will be
> > > > >> DISABLING and the deletion can happen only when they breach either
> > the
> > > > >> retention
> > > > >> time (or) size.
> > > > >>
> > > > >> How does the controller monitor that all the remote log segments
> are
> > > > >> deleted for all
> > > > >> the partitions of the topic before transitioning the state to
> > > DISABLED?
> > > > >>
> > > > >> 2. In KRaft, we have only the ENABLED and DISABLED states. How are
> > > > >> we supporting the "retain" -> "enable" case?
> > > > >>
> > > > >> If the remote storage is degraded, we want to avoid uploading the
> > > > >> segments temporarily and resume once the remote storage is healthy.
> > > > >> Is that case supported?
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, May 3, 2024 at 12:12 PM Luke Chen <sh...@gmail.com>
> > wrote:
> > > > >>
> > > > >>> Also, I think using the `stopReplicas` request is a good idea
> > > > >>> because it won't cause any problems while migrating to KRaft mode.
> > > > >>> The stopReplicas request is one of the requests that the KRaft
> > > > >>> controller will send to ZK brokers during migration.
> > > > >>>
> > > > >>> Thanks.
> > > > >>> Luke
> > > > >>>
> > > > >>> On Fri, May 3, 2024 at 11:48 AM Luke Chen <sh...@gmail.com>
> > wrote:
> > > > >>>
> > > > >>>> Hi Christo,
> > > > >>>>
> > > > >>>> Thanks for the update.
> > > > >>>>
> > > > >>>> Questions:
> > > > >>>> 1. For this
> > > > >>>> "The possible state transition from DISABLED state is to the
> > > ENABLED."
> > > > >>>> I think it only applies for KRaft mode. In ZK mode, the possible
> > > state
> > > > >>>> is "DISABLING", right?
> > > > >>>>
> > > > >>>> 2. For this:
> > > > >>>> "If the cluster is using Zookeeper as the control plane,
> enabling
> > > > >>>> remote storage for a topic triggers the controller to send this
> > > > information
> > > > >>>> to Zookeeper. Each broker listens for changes in Zookeeper, and
> > > when a
> > > > >>>> change is detected, the broker triggers
> > > > >>>> RemoteLogManager#onLeadershipChange()."
> > > > >>>>
> > > > >>>> I think the way ZK brokers know about the leadership change is by
> > > > >>>> getting the LeaderAndIsrRequest from the controller, not by
> > > > >>>> listening for changes in ZK.
> > > > >>>>
> > > > >>>> 3. In the KRaft handler steps, you said:
> > > > >>>> "The controller also updates the Topic metadata to increment the
> > > > >>>> tiered_epoch and update the tiered_stateto DISABLING state."
> > > > >>>>
> > > > >>>> Should it be "DISABLED" state since it's KRaft mode?
> > > > >>>>
> > > > >>>> 4. I was thinking about how we handle a tiered_epoch mismatch
> > > > >>>> error.
> > > > >>>> For ZK, I think the controller won't write any data into ZK
> Znode,
> > > > >>>> For KRaft, either configRecord or updateTopicMetadata records
> > won't
> > > be
> > > > >>>> written.
> > > > >>>> Is that right? Because the current workflow makes me think there
> > > > >>>> will be partial data updated in ZK/KRaft when a tiered_epoch
> > > > >>>> error occurs.
> > > > >>>>
> > > > >>>> 5. Since we changed to use the stopReplicas (V5) request now, the
> > > > >>>> diagram for the ZK workflow might also need to be updated.
> > > > >>>>
> > > > >>>> 6. In ZK mode, what will the controller do if the "stopReplicas"
> > > > >>>> responses are not received from all brokers? Reverting the changes?
> > > > >>>> This won't happen in KRaft mode because it's the broker's
> > > > >>>> responsibility to fetch metadata updates from the controller.
> > > > >>>>
> > > > >>>>
> > > > >>>> Thank you.
> > > > >>>> Luke
> > > > >>>>
> > > > >>>>
> > > > >>>> On Fri, Apr 19, 2024 at 10:23 PM Christo Lolov <
> > > > christololov@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Heya all!
> > > > >>>>>
> > > > >>>>> I have updated KIP-950. A list of what I have updated is:
> > > > >>>>>
> > > > >>>>> * Explicitly state that Zookeeper-backed clusters will have
> > ENABLED
> > > > ->
> > > > >>>>> DISABLING -> DISABLED while KRaft-backed clusters will only
> have
> > > > ENABLED ->
> > > > >>>>> DISABLED
> > > > >>>>> * Added two configurations for the new thread pools and explained
> > > > >>>>> where values will be picked up mid Kafka version upgrade
> > > > >>>>> * Explained how leftover remote partitions will be scheduled
> for
> > > > >>>>> deletion
> > > > >>>>> * Updated the API to use StopReplica V5 rather than a whole new
> > > > >>>>> controller-to-broker API
> > > > >>>>> * Explained that the disablement procedure will be triggered by
> > the
> > > > >>>>> controller listening for an (Incremental)AlterConfig change
> > > > >>>>> * Explained that we will first move log start offset and then
> > > issue a
> > > > >>>>> deletion
> > > > >>>>> * Went into more details that changing
> remote.log.disable.policy
> > > > after
> > > > >>>>> disablement won't do anything and that if a customer would like
> > > > additional
> > > > >>>>> data deleted they would have to use already existing methods
> > > > >>>>>
> > > > >>>>> Let me know if there are any new comments or I have missed
> > > something!
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>> Christo
> > > > >>>>>
> > > > >>>>> On Mon, 15 Apr 2024 at 12:40, Christo Lolov <
> > > christololov@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Heya Doguscan,
> > > > >>>>>>
> > > > >>>>>> I believe that the state of the world after this KIP will be
> the
> > > > >>>>>> following:
> > > > >>>>>>
> > > > >>>>>> For Zookeeper-backed clusters there will be 3 states: ENABLED,
> > > > >>>>>> DISABLING and DISABLED. We want this because Zookeeper-backed
> > > > clusters will
> > > > >>>>>> await a confirmation from the brokers that they have indeed
> > > stopped
> > > > >>>>>> tiered-related operations on the topic.
> > > > >>>>>>
> > > > >>>>>> For KRaft-backed clusters there will be only 2 states: ENABLED
> > and
> > > > >>>>>> DISABLED. KRaft takes a fire-and-forget approach for topic
> > > > deletion. I
> > > > >>>>>> believe the same approach ought to be taken for tiered topics.
> > The
> > > > >>>>>> mechanism which will ensure that leftover state in remote due
> to
> > > > failures
> > > > >>>>>> is cleaned up to me is the retention mechanism. In today's
> > code, a
> > > > leader
> > > > >>>>>> deletes all segments it finds in remote with offsets below the
> > log
> > > > start
> > > > >>>>>> offset. I believe this will be good enough for cleaning up
> > > leftover
> > > > state
> > > > >>>>>> in remote due to failures.
> > > > >>>>>>
> > > > >>>>>> I know that quite a few changes have been discussed so I will
> > aim
> > > to
> > > > >>>>>> put them on paper in the upcoming days and let everyone know!
> > > > >>>>>>
> > > > >>>>>> Best,
> > > > >>>>>> Christo
> > > > >>>>>>
> > > > >>>>>> On Tue, 9 Apr 2024 at 14:49, Doğuşcan Namal <
> > > > namal.doguscan@gmail.com>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> +1 let's not introduce a new api and mark it immediately as
> > > > >>>>>>> deprecated :)
> > > > >>>>>>>
> > > > >>>>>>> On your second comment Luke, one thing we need to clarify is
> > when
> > > > do
> > > > >>>>>>> we consider remote storage to be DISABLED for a topic?
> > > > >>>>>>> Particularly, what is the state when the remote storage is
> > being
> > > > >>>>>>> deleted in case of disablement.policy=delete? Is it DISABLING
> > or
> > > > DISABLED?
> > > > >>>>>>>
> > > > >>>>>>> If we move directly to the DISABLED state,
> > > > >>>>>>>
> > > > >>>>>>> a) in case of failures, the leaders should continue remote
> > > storage
> > > > >>>>>>> deletion even if the topic is moved to the DISABLED state,
> > > > otherwise we
> > > > >>>>>>> risk having stray data on remote storage.
> > > > >>>>>>> b) on each restart, we should initiate the remote storage
> > > deletion
> > > > >>>>>>> because although we replayed a record with a DISABLED state,
> we
> > > > can not be
> > > > >>>>>>> sure if the remote data is deleted or not.
> > > > >>>>>>>
> > > > >>>>>>> We could either consider keeping the remote topic in
> DISABLING
> > > > state
> > > > >>>>>>> until all of the remote storage data is deleted, or we need
> an
> > > > additional
> > > > >>>>>>> mechanism to handle the remote stray data.
> > > > >>>>>>>
> > > > >>>>>>> The existing topic deletion, for instance, handles stray logs
> > > > >>>>>>> on disk by detecting them on KafkaBroker startup and deleting
> > > > >>>>>>> them before the ReplicaManager is started.
> > > > >>>>>>> Maybe we need a similar mechanism here as well if we don't
> > want a
> > > > >>>>>>> DISABLING state. Otherwise, we need a callback from Brokers
> to
> > > > validate
> > > > >>>>>>> that remote storage data is deleted and now we could move to
> > the
> > > > DISABLED
> > > > >>>>>>> state.
> > > > >>>>>>>
> > > > >>>>>>> Thanks.
> > > > >>>>>>>
> > > > >>>>>>> On Tue, 9 Apr 2024 at 12:45, Luke Chen <sh...@gmail.com>
> > > wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi Christo,
> > > > >>>>>>>>
> > > > >>>>>>>> > I would then opt for moving information from
> > > DisableRemoteTopic
> > > > >>>>>>>> within the StopReplicas API which will then disappear in
> KRaft
> > > > >>>>>>>> world as it
> > > > >>>>>>>> is already scheduled for deprecation. What do you think?
> > > > >>>>>>>>
> > > > >>>>>>>> Sounds good to me.
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks.
> > > > >>>>>>>> Luke
> > > > >>>>>>>>
> > > > >>>>>>>> On Tue, Apr 9, 2024 at 6:46 PM Christo Lolov <
> > > > >>>>>>>> christololov@gmail.com> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> > Heya Luke!
> > > > >>>>>>>> >
> > > > >>>>>>>> > I thought a bit more about it and I reached the same
> > > conclusion
> > > > >>>>>>>> as you for
> > > > >>>>>>>> > 2 as a follow-up from 1. In other words, in KRaft world I
> > > don't
> > > > >>>>>>>> think the
> > > > >>>>>>>> > controller needs to wait for acknowledgements for the
> > brokers.
> > > > >>>>>>>> All we care
> > > > >>>>>>>> > about is that the leader (who is responsible for
> > > > >>>>>>>> archiving/deleting data in
> > > > >>>>>>>> > tiered storage) knows about the change and applies it
> > > properly.
> > > > >>>>>>>> If there is
> > > > >>>>>>>> > a leadership change halfway through the operation then the
> > new
> > > > >>>>>>>> leader still
> > > > >>>>>>>> > needs to apply the message from the state topic and we
> know
> > > > that a
> > > > >>>>>>>> > disable-message will be applied before a
> > > reenablement-message. I
> > > > >>>>>>>> will
> > > > >>>>>>>> > change the KIP later today/tomorrow morning to reflect
> this
> > > > >>>>>>>> reasoning.
> > > > >>>>>>>> >
> > > > >>>>>>>> > However, with this I believe that introducing a new API
> just
> > > for
> > > > >>>>>>>> > Zookeeper-based clusters (i.e. DisableRemoteTopic)
> becomes a
> > > bit
> > > > >>>>>>>> of an
> > > > >>>>>>>> > overkill. I would then opt for moving information from
> > > > >>>>>>>> DisableRemoteTopic
> > > > >>>>>>>> > within the StopReplicas API which will then disappear in
> > KRaft
> > > > >>>>>>>> world as it
> > > > >>>>>>>> > is already scheduled for deprecation. What do you think?
> > > > >>>>>>>> >
> > > > >>>>>>>> > Best,
> > > > >>>>>>>> > Christo
> > > > >>>>>>>> >
> > > > >>>>>>>> > On Wed, 3 Apr 2024 at 07:59, Luke Chen <showuon@gmail.com
> >
> > > > wrote:
> > > > >>>>>>>> >
> > > > >>>>>>>> > > Hi Christo,
> > > > >>>>>>>> > >
> > > > >>>>>>>> > > 1. I agree with Doguscan that in KRaft mode, the
> > controller
> > > > >>>>>>>> won't send
> > > > >>>>>>>> > RPCs
> > > > >>>>>>>> > > to the brokers (except in the migration path).
> > > > >>>>>>>> > > So, I think we could adopt the similar way we did to
> > > > >>>>>>>> > `AlterReplicaLogDirs`
> > > > >>>>>>>> > > (
> > > > >>>>>>>> > > KIP-858
> > > > >>>>>>>> > > <
> > > > >>>>>>>> > >
> > > > >>>>>>>> >
> > > > >>>>>>>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft#KIP858:HandleJBODbrokerdiskfailureinKRaft-Intra-brokerreplicamovement
> > > > >>>>>>>> > > >)
> > > > >>>>>>>> > > that let the broker notify controller any update,
> instead
> > of
> > > > >>>>>>>> controller
> > > > >>>>>>>> > to
> > > > >>>>>>>> > > broker. And once the controller receives all the
> complete
> > > > >>>>>>>> requests from
> > > > >>>>>>>> > > brokers, it'll enter "Disabled" state. WDYT?
> > > > >>>>>>>> > >
> > > > >>>>>>>> > > 2. Why should we wait until all brokers to respond
> before
> > > > >>>>>>>> moving to
> > > > >>>>>>>> > > "Disabled" state in "KRaft mode"?
> > > > >>>>>>>> > > Currently, only the leader node does the remote log
> > > > >>>>>>>> upload/fetch tasks,
> > > > >>>>>>>> > so
> > > > >>>>>>>> > > does that mean the controller only need to make sure the
> > > > leader
> > > > >>>>>>>> completes
> > > > >>>>>>>> > > the stopPartition?
> > > > >>>>>>>> > > If during the leader node stopPartition process
> triggered
> > > > >>>>>>>> leadership
> > > > >>>>>>>> > > change, then the new leader should receive and apply the
> > > > >>>>>>>> configRecord
> > > > >>>>>>>> > > update before the leadership change record based on the
> > > KRaft
> > > > >>>>>>>> design,
> > > > >>>>>>>> > which
> > > > >>>>>>>> > > means there will be no gap that the follower node
> becomes
> > > the
> > > > >>>>>>>> leader and
> > > > >>>>>>>> > > starting doing unexpected upload/fetch tasks, right?
> > > > >>>>>>>> > > I agree we should make sure in ZK mode, all brokers are
> > > > >>>>>>>> completed the
> > > > >>>>>>>> > > stopPartitions before moving to "Disabled" state because
> > ZK
> > > > >>>>>>>> node watcher
> > > > >>>>>>>> > is
> > > > >>>>>>>> > > working in a separate thread. But not sure about KRaft
> > mode.
> > > > >>>>>>>> > >
> > > > >>>>>>>> > > Thanks.
> > > > >>>>>>>> > > Luke
> > > > >>>>>>>> > >
> > > > >>>>>>>> > >
> > > > >>>>>>>> > > On Fri, Mar 29, 2024 at 4:14 PM Christo Lolov <
> > > > >>>>>>>> christololov@gmail.com>
> > > > >>>>>>>> > > wrote:
> > > > >>>>>>>> > >
> > > > >>>>>>>> > > > Heya everyone!
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > re: Doguscan
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > I believe the answer to 101 needs a bit more
> discussion.
> > > As
> > > > >>>>>>>> far as I
> > > > >>>>>>>> > > know,
> > > > >>>>>>>> > > > tiered storage today has methods to update a metadata
> > of a
> > > > >>>>>>>> segment to
> > > > >>>>>>>> > say
> > > > >>>>>>>> > > > "hey, I would like this deleted", but actual deletion
> is
> > > > left
> > > > >>>>>>>> to plugin
> > > > >>>>>>>> > > > implementations (or any background cleaners). In other
> > > > words,
> > > > >>>>>>>> there is
> > > > >>>>>>>> > no
> > > > >>>>>>>> > > > "immediate" deletion. In this KIP, we would like to
> > > continue
> > > > >>>>>>>> doing the
> > > > >>>>>>>> > > same
> > > > >>>>>>>> > > > if the retention policy is set to delete. So I believe
> > the
> > > > >>>>>>>> answer is
> > > > >>>>>>>> > > > actually that a) we will update the metadata of the
> > > segments
> > > > >>>>>>>> to mark
> > > > >>>>>>>> > them
> > > > >>>>>>>> > > > as deleted and b) we will advance the log start
> offset.
> > > Any
> > > > >>>>>>>> deletion of
> > > > >>>>>>>> > > > actual files will still be delegated to plugin
> > > > >>>>>>>> implementations. I
> > > > >>>>>>>> > believe
> > > > >>>>>>>> > > > this is further supported by
> > > > >>>>>>>> "*remote.log.disable.policy=delete:* Logs
> > > > >>>>>>>> > > that
> > > > >>>>>>>> > > > are archived in the remote storage will not be part of
> > the
> > > > >>>>>>>> contiguous
> > > > >>>>>>>> > > > "active" log and will be deleted asynchronously as
> part
> > of
> > > > the
> > > > >>>>>>>> > > disablement
> > > > >>>>>>>> > > > process"
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > Following from the above, I believe for 102 it is fine
> > to
> > > > >>>>>>>> allow setting
> > > > >>>>>>>> > > of
> > > > >>>>>>>> > > > remote.log.disable.policy on a disabled topic in much
> > the
> > > > >>>>>>>> same way we
> > > > >>>>>>>> > > allow
> > > > >>>>>>>> > > > other remote-related configurations to be set on a
> topic
> > > > (i.e.
> > > > >>>>>>>> > > > local.retention.*) - it just won't have an effect.
> > > Granted,
> > > > I
> > > > >>>>>>>> do
> > > > >>>>>>>> > believe
> > > > >>>>>>>> > > we
> > > > >>>>>>>> > > > should restrict the policy being changed while a
> > > disablement
> > > > >>>>>>>> is
> > > > >>>>>>>> > ongoing.
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > re: Satish and Kamal
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > 104, 1 and 2 are fair asks, I will work with Doguscan
> to
> > > > >>>>>>>> update the KIP
> > > > >>>>>>>> > > > with the information!
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > Best,
> > > > >>>>>>>> > > > Christo
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > On Thu, 28 Mar 2024 at 10:31, Doğuşcan Namal <
> > > > >>>>>>>> namal.doguscan@gmail.com
> > > > >>>>>>>> > >
> > > > >>>>>>>> > > > wrote:
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > > > > Hi Satish, I will try to answer as much as I can and
> > the
> > > > >>>>>>>> others could
> > > > >>>>>>>> > > > chime
> > > > >>>>>>>> > > > > in with further details.
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *101. For remote.log.disable.policy=delete: Does it
> > > delete
> > > > >>>>>>>> the remote
> > > > >>>>>>>> > > log
> > > > >>>>>>>> > > > > data immediately and the data in remote storage will
> > not
> > > > be
> > > > >>>>>>>> taken
> > > > >>>>>>>> > into
> > > > >>>>>>>> > > > > account by any replica? That means log-start-offset
> is
> > > > >>>>>>>> moved to the
> > > > >>>>>>>> > > > earlier
> > > > >>>>>>>> > > > > local-log-start-offset.*
> > > > >>>>>>>> > > > > *Exactly. RemoteLogData will be deleted
> immediately. *
> > > > >>>>>>>> > > > > *So before the deletion starts we move LogStart
> offset
> > > to
> > > > >>>>>>>> > > > > LocalLogStartOffset to ensure that no RemoteLog will
> > be
> > > > >>>>>>>> accessed
> > > > >>>>>>>> > after
> > > > >>>>>>>> > > > that
> > > > >>>>>>>> > > > > point.*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > * 102. Can we update the remote.log.disable.policy
> > after
> > > > >>>>>>>> tiered
> > > > >>>>>>>> > storage
> > > > >>>>>>>> > > > is
> > > > >>>>>>>> > > > > disabled on a topic?*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *This is a good point. I think we should not allow
> > > > >>>>>>>> modifying this
> > > > >>>>>>>> > > > > configuration*
> > > > >>>>>>>> > > > > *because changing the policy from Deletion to Retain
> > > when
> > > > >>>>>>>> there is an
> > > > >>>>>>>> > > > > ongoing Deletion will result in an undefined
> behaviour
> > > and
> > > > >>>>>>>> where we
> > > > >>>>>>>> > > > retain
> > > > >>>>>>>> > > > > half of the remote log and delete the other half.*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > * 103. Do we plan to add any metrics related to this
> > > > >>>>>>>> feature?*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *Any recommendations?*
> > > > >>>>>>>> > > > > *We may emit a gauge showing the enablement state
> of a
> > > > >>>>>>>> topic but we
> > > > >>>>>>>> > > could
> > > > >>>>>>>> > > > > gather that info from the logs as well.*
> > > > >>>>>>>> > > > > *The total duration for remote topic deletion could
> be
> > > > >>>>>>>> added as well
> > > > >>>>>>>> > > but
> > > > >>>>>>>> > > > > this is more of a metric for the
> > RemotePartitionRemover
> > > > >>>>>>>> itself.*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *104. Please add configuration details about copier
> > > thread
> > > > >>>>>>>> pool,
> > > > >>>>>>>> > > > expiration
> > > > >>>>>>>> > > > > thread pool and the migration of the existing
> > > > >>>>>>>> > > > > RemoteLogManagerScheduledThreadPool.*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *Will add the details.*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > 105. How is the behaviour with topic or partition
> > > deletion
> > > > >>>>>>>> request
> > > > >>>>>>>> > > > > handled when tiered storage disablement request is
> > still
> > > > >>>>>>>> being
> > > > >>>>>>>> > > > > processed on a topic?
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *If the disablement policy is Delete then a
> successive
> > > > >>>>>>>> topic deletion
> > > > >>>>>>>> > > > > request is going to be a NOOP because RemoteLogs is
> > > > already
> > > > >>>>>>>> being
> > > > >>>>>>>> > > > deleted.*
> > > > >>>>>>>> > > > > *If the disablement policy is Retain, then we only
> > moved
> > > > the
> > > > >>>>>>>> > > > LogStartOffset
> > > > >>>>>>>> > > > > and didn't touch RemoteLogs anyway, so the delete
> > topic
> > > > >>>>>>>> request will
> > > > >>>>>>>> > > > > result*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > *in the initiation of RemoteLog deletion.*
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > On Tue, 26 Mar 2024 at 18:21, Kamal Chandraprakash <
> > > > >>>>>>>> > > > > kamal.chandraprakash@gmail.com> wrote:
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > > > > Hi,
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > Thanks for the KIP! Overall the KIP looks good and
> > > > >>>>>>>> covered most of
> > > > >>>>>>>> > > the
> > > > >>>>>>>> > > > > > items.
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > 1. Could you explain how the brokers will handle
> the
> > > > >>>>>>>> > > DisableRemoteTopic
> > > > >>>>>>>> > > > > API
> > > > >>>>>>>> > > > > > request?
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > 2. Who will initiate the controller interaction
> > > > sequence?
> > > > >>>>>>>> Does the
> > > > >>>>>>>> > > > > > controller listens for
> > > > >>>>>>>> > > > > > topic config updates and initiate the disablement?
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > --
> > > > >>>>>>>> > > > > > Kamal
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > On Tue, Mar 26, 2024 at 4:40 PM Satish Duggana <
> > > > >>>>>>>> > > > satish.duggana@gmail.com
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > wrote:
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > > Thanks Mehari, Divij, Christo etal for the KIP.
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > I had an initial review of the KIP and left the
> > > below
> > > > >>>>>>>> comments.
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > 101. For remote.log.disable.policy=delete:
> > > > >>>>>>>> > > > > > > Does it delete the remote log data immediately
> and
> > > the
> > > > >>>>>>>> data in
> > > > >>>>>>>> > > remote
> > > > >>>>>>>> > > > > > > storage will not be taken into account by any
> > > replica?
> > > > >>>>>>>> That means
> > > > >>>>>>>> > > > > > > log-start-offset is moved to the earlier
> > > > >>>>>>>> local-log-start-offset.
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > 102. Can we update the remote.log.disable.policy
> > > after
> > > > >>>>>>>> tiered
> > > > >>>>>>>> > > storage
> > > > >>>>>>>> > > > > > > is disabled on a topic?
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > 103. Do we plan to add any metrics related to
> this
> > > > >>>>>>>> feature?
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > 104. Please add configuration details about
> copier
> > > > >>>>>>>> thread pool,
> > > > >>>>>>>> > > > > > > expiration thread pool and the migration of the
> > > > existing
> > > > >>>>>>>> > > > > > > RemoteLogManagerScheduledThreadPool.
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > 105. How is the behaviour with topic or
> partition
> > > > >>>>>>>> deletion
> > > > >>>>>>>> > request
> > > > >>>>>>>> > > > > > > handled when tiered storage disablement request
> is
> > > > >>>>>>>> still being
> > > > >>>>>>>> > > > > > > processed on a topic?
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > ~Satish.
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > > > On Mon, 25 Mar 2024 at 13:34, Doğuşcan Namal <
> > > > >>>>>>>> > > > namal.doguscan@gmail.com
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > > > > wrote:
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > Hi Christo and Luke,
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > I think the KRaft section of the KIP requires
> > > slight
> > > > >>>>>>>> > improvement.
> > > > >>>>>>>> > > > The
> > > > >>>>>>>> > > > > > > metadata propagation in KRaft is handled by the
> > RAFT
> > > > >>>>>>>> layer
> > > > >>>>>>>> > instead
> > > > >>>>>>>> > > of
> > > > >>>>>>>> > > > > > > sending Controller -> Broker RPCs. In fact,
> > KIP-631
> > > > >>>>>>>> deprecated
> > > > >>>>>>>> > > these
> > > > >>>>>>>> > > > > > RPCs.
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > I will come up with some recommendations on
> how
> > we
> > > > >>>>>>>> could
> > > > >>>>>>>> > improve
> > > > >>>>>>>> > > > that
> > > > >>>>>>>> > > > > > > one but until then, @Luke please feel free to
> > review
> > > > >>>>>>>> the KIP.
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > @Satish, if we want this to make it to Kafka
> > 3.8 I
> > > > >>>>>>>> believe we
> > > > >>>>>>>> > > need
> > > > >>>>>>>> > > > to
> > > > >>>>>>>> > > > > > > aim to get the KIP approved in the following
> weeks
> > > > >>>>>>>> otherwise it
> > > > >>>>>>>> > > will
> > > > >>>>>>>> > > > > slip
> > > > >>>>>>>> > > > > > > and we can not support it in Zookeeper mode.
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > I also would like to better understand what is
> > the
> > > > >>>>>>>> community's
> > > > >>>>>>>> > > > stand
> > > > >>>>>>>> > > > > > for
> > > > >>>>>>>> > > > > > > adding a new feature for Zookeeper since it is
> > > marked
> > > > as
> > > > >>>>>>>> > deprecated
> > > > >>>>>>>> > > > > > already.
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > Thanks.
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > > On Mon, 18 Mar 2024 at 13:42, Christo Lolov <
> > > > >>>>>>>> > > > christololov@gmail.com>
> > > > >>>>>>>> > > > > > > wrote:
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> Heya,
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> I do have some time to put into this, but to
> be
> > > > >>>>>>>> honest I am
> > > > >>>>>>>> > > still
> > > > >>>>>>>> > > > > > after
> > > > >>>>>>>> > > > > > > >> reviews of the KIP itself :)
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> After the latest changes it ought to be
> > detailing
> > > > >>>>>>>> both a
> > > > >>>>>>>> > > Zookeeper
> > > > >>>>>>>> > > > > > > approach
> > > > >>>>>>>> > > > > > > >> and a KRaft approach.
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> Do you have any thoughts on how it could be
> > > > improved
> > > > >>>>>>>> or
> > > > >>>>>>>> > should I
> > > > >>>>>>>> > > > > > start a
> > > > >>>>>>>> > > > > > > >> voting thread?
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> Best,
> > > > >>>>>>>> > > > > > > >> Christo
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> On Thu, 14 Mar 2024 at 06:12, Luke Chen <
> > > > >>>>>>>> showuon@gmail.com>
> > > > >>>>>>>> > > > wrote:
> > > > >>>>>>>> > > > > > > >>
> > > > >>>>>>>> > > > > > > >> > Hi Christo,
> > > > >>>>>>>> > > > > > > >> >
> > > > >>>>>>>> > > > > > > >> > Any update with this KIP?
> > > > >>>>>>>> > > > > > > >> > If you don't have time to complete it, I
> can
> > > > >>>>>>>> collaborate
> > > > >>>>>>>> > with
> > > > >>>>>>>> > > > you
> > > > >>>>>>>> > > > > to
> > > > >>>>>>>> > > > > > > work
> > > > >>>>>>>> > > > > > > >> > on it.
> > > > >>>>>>>> > > > > > > >> >
> > > > >>>>>>>> > > > > > > >> > Thanks.
> > > > >>>>>>>> > > > > > > >> > Luke
> > > > >>>>>>>> > > > > > > >> >
> > > > >>>>>>>> > > > > > > >> > On Wed, Jan 17, 2024 at 11:38 PM Satish
> > > Duggana <
> > > > >>>>>>>> > > > > > > satish.duggana@gmail.com>
> > > > >>>>>>>> > > > > > > >> > wrote:
> > > > >>>>>>>> > > > > > > >> >
> > > > >>>>>>>> > > > > > > >> > > Hi Christo,
> > > > >>>>>>>> > > > > > > >> > > Thanks for volunteering to contribute to
> > the
> > > > KIP
> > > > >>>>>>>> > > discussion. I
> > > > >>>>>>>> > > > > > > suggest
> > > > >>>>>>>> > > > > > > >> > > considering this KIP for both ZK and
> KRaft
> > as
> > > > it
> > > > >>>>>>>> will be
> > > > >>>>>>>> > > > helpful
> > > > >>>>>>>> > > > > > for
> > > > >>>>>>>> > > > > > > >> > > this feature to be available in 3.8.0
> > running
> > > > >>>>>>>> with ZK
> > > > >>>>>>>> > > > clusters.
> > > > >>>>>>>> > > > > > > >> > >
> > > > >>>>>>>> > > > > > > >> > > Thanks,
> > > > >>>>>>>> > > > > > > >> > > Satish.
> > > > >>>>>>>> > > > > > > >> > >
> > > > >>>>>>>> > > > > > > >> > > On Wed, 17 Jan 2024 at 19:04, Christo
> > Lolov <
> > > > >>>>>>>> > > > > > christololov@gmail.com
> > > > >>>>>>>> > > > > > > >
> > > > >>>>>>>> > > > > > > >> > > wrote:
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > Hello!
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > I volunteer to get this KIP moving
> > forward
> > > > and
> > > > >>>>>>>> > implemented
> > > > >>>>>>>> > > > in
> > > > >>>>>>>> > > > > > > Apache
> > > > >>>>>>>> > > > > > > >> > > Kafka
> > > > >>>>>>>> > > > > > > >> > > > 3.8.
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > I have caught up with Mehari offline
> and
> > we
> > > > >>>>>>>> have agreed
> > > > >>>>>>>> > > that
> > > > >>>>>>>> > > > > > given
> > > > >>>>>>>> > > > > > > >> > Apache
> > > > >>>>>>>> > > > > > > >> > > > Kafka 4.0 being around the corner we
> > would
> > > > >>>>>>>> like to
> > > > >>>>>>>> > propose
> > > > >>>>>>>> > > > > this
> > > > >>>>>>>> > > > > > > feature
> > > > >>>>>>>> > > > > > > >> > > > only for KRaft clusters.
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > Any and all reviews and comments are
> > > welcome!
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > Best,
> > > > >>>>>>>> > > > > > > >> > > > Christo
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > On Tue, 9 Jan 2024 at 09:44, Doğuşcan
> > > Namal <
> > > > >>>>>>>> > > > > > > namal.doguscan@gmail.com>
> > > > >>>>>>>> > > > > > > >> > > > wrote:
> > > > >>>>>>>> > > > > > > >> > > >
> > > > >>>>>>>> > > > > > > >> > > > > Hi everyone, any progress on the
> status
> > > of
> > > > >>>>>>>> this KIP?
> > > > >>>>>>>> > > > Overall
> > > > >>>>>>>> > > > > > > looks
> > > > >>>>>>>> > > > > > > >> > > good to
> > > > >>>>>>>> > > > > > > >> > > > > me but I wonder whether we still need
> > to
> > > > >>>>>>>> support it
> > > > >>>>>>>> > for
> > > > >>>>>>>> > > > > > > Zookeeper
> > > > >>>>>>>> > > > > > > >> > mode
> > > > >>>>>>>> > > > > > > >> > > > > given that it will be deprecated in
> the
> > > > next
> > > > >>>>>>>> 3 months.
> > > > >>>>>>>> > > > > > > >> > > > >
> > > > >>>>>>>> > > > > > > >> > > > > On 2023/07/21 20:16:46 "Beyene,
> Mehari"
> > > > >>>>>>>> wrote:
> > > > >>>>>>>> > > > > > > >> > > > > > Hi everyone,
> > > > >>>>>>>> > > > > > > >> > > > > > I would like to start a discussion
> on
> > > > >>>>>>>> KIP-950:
> > > > >>>>>>>> > Tiered
> > > > >>>>>>>> > > > > > Storage
> > > > >>>>>>>> > > > > > > >> > > Disablement
> > > > >>>>>>>> > > > > > > >> > > > > (
> > > > >>>>>>>> > > > > > > >> > > > >
> > > > >>>>>>>> > > > > > > >> > > > >
> > > > >>>>>>>> > > > > > > >> > >
> > > > >>>>>>>> > > > > > > >> >
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > >
> > > > >>>>>>>> >
> > > > >>>>>>>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-950%3A++Tiered+Storage+Disablement
> > > > >>>>>>>> > > > > > > >> > > > > ).
> > > > >>>>>>>> > > > > > > >> > > > > >
> > > > >>>>>>>> > > > > > > >> > > > > > This KIP proposes adding the
> ability
> > to
> > > > >>>>>>>> disable and
> > > > >>>>>>>> > > > > > re-enable
> > > > >>>>>>>> > > > > > > >> > tiered
> > > > >>>>>>>> > > > > > > >> > > > > storage on a topic.
> > > > >>>>>>>> > > > > > > >> > > > > >
> > > > >>>>>>>> > > > > > > >> > > > > > Thanks,
> > > > >>>>>>>> > > > > > > >> > > > > > Mehari
> > > > >>>>>>>> > > > > > > >> > > > > >
> > > > >>>>>>>> > > > > > > >> > > > >
> > > > >>>>>>>> > > > > > > >> > >
> > > > >>>>>>>> > > > > > > >> >
> > > > >>>>>>>> > > > > > >
> > > > >>>>>>>> > > > > >
> > > > >>>>>>>> > > > >
> > > > >>>>>>>> > > >
> > > > >>>>>>>> > >
> > > > >>>>>>>> >
> > > > >>>>>>>>
> > > > >>>>>>>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Christo Lolov <ch...@gmail.com>.
Heya!

re Kamal - Okay, I believe I understand what you mean and I agree. I have
made the following change

```

During tiered storage disablement, when RemoteLogManager#stopPartition() is
called:

   - Tasks scheduled for the topic-partitions in the
   RemoteStorageCopierThreadPool will be canceled.
   - If the disablement policy is retain, scheduled tasks for the
   topic-partitions in the RemoteDataExpirationThreadPool will remain
   unchanged.
   - If the disablement policy is delete, we will first advance the log
   start offset and let the tasks scheduled for the topic-partitions in
   the RemoteDataExpirationThreadPool delete all remote segments below
   the log start offset and then unregister themselves.

```
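
To make the three bullets above concrete, here is a minimal Java sketch of a
stopPartition() shaped this way. The thread-pool names and the retain/delete
policies come straight from the KIP; every type, field, and method below is
invented purely for illustration and is not the actual implementation:

```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

// Illustrative sketch only: the pool names and the retain/delete policies come
// from the KIP, everything else here is invented to make the flow concrete.
public class StopPartitionSketch {

    enum DisablementPolicy { RETAIN, DELETE }

    record TopicPartition(String topic, int partition) {}

    // Stand-in for RemoteStorageCopierThreadPool / RemoteDataExpirationThreadPool.
    static class TaskPool {
        private final Map<TopicPartition, Future<?>> tasks = new ConcurrentHashMap<>();

        void cancel(TopicPartition tp) {
            Future<?> task = tasks.remove(tp);
            if (task != null) task.cancel(false);
        }
    }

    // Hypothetical view of the log, just enough for this sketch.
    interface LogView {
        long localLogStartOffset(TopicPartition tp);
        void advanceLogStartOffset(TopicPartition tp, long newOffset);
    }

    private final TaskPool copierPool = new TaskPool();     // RemoteStorageCopierThreadPool
    private final TaskPool expirationPool = new TaskPool(); // RemoteDataExpirationThreadPool

    void stopPartition(TopicPartition tp, DisablementPolicy policy, LogView log) {
        // Copy tasks are always cancelled: no new segments reach remote storage.
        copierPool.cancel(tp);

        if (policy == DisablementPolicy.RETAIN) {
            // Retain: expiration tasks are left untouched, so remote data keeps
            // expiring through the normal retention path.
            return;
        }

        // Delete: first advance the log start offset; the still-registered
        // expiration task then deletes every remote segment below the new log
        // start offset and unregisters itself once it is done.
        log.advanceLogStartOffset(tp, log.localLogStartOffset(tp));
    }
}
```

The important design point is the ordering in the delete branch: the log start
offset moves first, so no fetch can be served from remote data that is about
to disappear.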

re Luke - I checked once again. As far as I understand, when a broker goes
down, all replicas it hosts go to the OfflineReplica state in the state
machine the controller maintains. The moment the broker comes back up
again, the state machine resends StopReplica based on
```

* OfflineReplica -> ReplicaDeletionStarted
* --send StopReplicaRequest to the replica (with deletion)

```
from ReplicaStateMachine.scala. So I was wrong and you are right, we do not
appear to be sending constant requests today. I believe it is safe for us
to follow a similar approach, i.e. if a replica comes online again we resend
the StopReplica.
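
And as a rough sketch of the resend behaviour just described (the
OfflineReplica / ReplicaDeletionStarted names mirror ReplicaStateMachine.scala,
while the controller class, its bookkeeping, and the RPC helper are all
hypothetical):

```
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical controller-side fragment: brokers that were offline while a
// disablement was in flight get the StopReplica re-sent when they return,
// mirroring the OfflineReplica -> ReplicaDeletionStarted transition above.
public class DisablementResendSketch {

    record Replica(String topic, int partition, int brokerId) {}

    // Replicas we still owe a StopReplica for, keyed by broker id.
    private final Map<Integer, Set<Replica>> pendingStopReplica = new ConcurrentHashMap<>();

    void onBrokerOnline(int brokerId) {
        // Re-send StopReplica for everything this broker has not yet acknowledged.
        for (Replica replica : pendingStopReplica.getOrDefault(brokerId, Set.of())) {
            sendStopReplica(brokerId, replica);
        }
    }

    void onStopReplicaAck(int brokerId, Replica replica) {
        // Once every broker has acknowledged, a ZK-mode topic can move from
        // DISABLING to DISABLED.
        Set<Replica> pending = pendingStopReplica.get(brokerId);
        if (pending != null) pending.remove(replica);
    }

    private void sendStopReplica(int brokerId, Replica replica) {
        // Placeholder for the controller-to-broker StopReplica (v5) request.
    }
}
```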

If you don't notice any more problems I will aim to start a VOTE tomorrow
so we can get at least part of this KIP in 3.8.

Best,
Christo

On Fri, 10 May 2024 at 11:11, Luke Chen <sh...@gmail.com> wrote:

> Hi Christo,
>
> > 1. I am not certain I follow the question. From DISABLED you can only go
> to
> ENABLED regardless of whether your cluster is backed by Zookeeper or KRaft.
> Am I misunderstanding your point?
>
> Yes, you're right.
>
> > 4. I was thinking that if there is a mismatch we will just fail accepting
> the request for disablement. This should be the same in both Zookeeper and
> KRaft. Or am I misunderstanding your question?
>
> OK, sounds good.
>
> > 6. I think my current train of thought is that there will be unlimited
> retries until all brokers respond in a similar way to how deletion of a
> topic works today in ZK. In the meantime the state will continue to be
> DISABLING. Do you have a better suggestion?
>
> I don't think infinite retries is a good idea since if a broker is down
> forever, this request will never complete.
> You mentioned the existing topic deletion uses a similar pattern; how does
> it handle this issue?
>
> Thanks.
> Luke
>
> On Thu, May 9, 2024 at 9:21 PM Christo Lolov <ch...@gmail.com>
> wrote:
>
> > Heya!
> >
> > re: Luke
> >
> > 1. I am not certain I follow the question. From DISABLED you can only go
> to
> > ENABLED regardless of whether your cluster is backed by Zookeeper or
> KRaft.
> > Am I misunderstanding your point?
> >
> > 2. Apologies, this was a leftover from previous versions. I have updated
> > the Zookeeper section. The steps ought to be: controller receives change,
> > commits necessary data to Zookeeper, enqueues disablement and starts
> > sending StopReplicas request to brokers; brokers receive StopReplicas and
> > propagate them all the way to RemoteLogManager#stopPartitions which takes
> > care of the rest.
> >
> > 3. Correct, it should say DISABLED - this should now be corrected.
> >
> > 4. I was thinking that if there is a mismatch we will just fail accepting
> > the request for disablement. This should be the same in both Zookeeper
> and
> > KRaft. Or am I misunderstanding your question?
> >
> > 5. Yeah. I am now doing a second pass on all diagrams and will update
> them
> > by the end of the day!
> >
> > 6. I think my current train of thought is that there will be unlimited
> > retries until all brokers respond in a similar way to how deletion of a
> > topic works today in ZK. In the meantime the state will continue to be
> > DISABLING. Do you have a better suggestion?
> >
> > re: Kamal
> >
> > Yep, I will update all diagrams
> >
> > I am not certain I follow the reasoning for making retain and delete the
> > same. Deletion when the policy is retain happens asynchronously due to
> > expiration. I think that deletion when the policy is delete ought to (at
> > least for the initial implementation) happen synchronously. Should people
> > run into timeout problems we can always then have a follow-up KIP where
> we
> > make it asynchronous.
> >
> > Best,
> > Christo
> >
> > On Tue, 7 May 2024 at 10:04, Kamal Chandraprakash <
> > kamal.chandraprakash@gmail.com> wrote:
> >
> > > Hi Christo,
> > >
> > > Thanks for the update!
> > >
> > > For both the policies "retain" and "delete", can we maintain the same
> > > approach to delete the segments async?
> > >
> > > > If the disablement policy is set to delete, the Log start offset
> (LSO)
> > is
> > > updated to match the Local Log Start Offset and the remote log is
> deleted
> > > by calling the RemoteStorageManager#deleteLogSegmentData().
> > >
> > > In the KIP, it's mentioned that when the disable policy is set to
> > > "delete", the remote-log-segments will be deleted synchronously. The
> > > stopPartition call might time out when the number of remote log
> > > segments to delete is huge. We can further extend the same approach
> > > to topic deletion requests.
> > >
> > > Also, could you please update the state diagram with the transitions?
> > > It is not clear when to transition from the DISABLING to the DISABLED
> > > state.
> > >
> > > --
> > > Kamal
> > >
> > > On Mon, May 6, 2024 at 6:55 PM Kamal Chandraprakash <
> > > kamal.chandraprakash@gmail.com> wrote:
> > >
> > > > Ignore the above message. Got the answers after reading the state
> > > > transition section.
> > > >
> > > > > If the disablement policy is delete, tasks scheduled for the
> > > > topic-partitions in the RemoteDataExpirationThreadPool will also be
> > > > canceled.
> > > >
> > > > We are deleting the segments synchronously. Should we delete them
> > > > asynchronously? The same approach can be extended to topic deletion
> > > > requests.
> > > >
> > > > > 6. In ZK mode, what will the controller do if the "stopReplicas"
> > > > responses are not received from all brokers? Reverting the changes?
> > > >
> > > > Since we are deleting the segments synchronously, this case is bound
> > > > to happen when the number of remote log segments to delete is huge.
> > > >
> > > >
> > > > On Mon, May 6, 2024, 18:12 Kamal Chandraprakash <
> > > > kamal.chandraprakash@gmail.com> wrote:
> > > >
> > > >> Hi Christo,
> > > >>
> > > >> Thanks for the update!
> > > >>
> > > >> 1. In the ZK mode, how will the transition from DISABLING to
> DISABLED
> > > >> state happen?
> > > >> For the "retain" policy, until we delete all the remote-log
> segments,
> > > the
> > > >> state will be
> > > >> DISABLING and the deletion can happen only when they breach either
> the
> > > >> retention
> > > >> time (or) size.
> > > >>
> > > >> How does the controller monitor that all the remote log segments are
> > > >> deleted for all
> > > >> the partitions of the topic before transitioning the state to
> > DISABLED?
> > > >>
> > > >> 2. In KRaft, we have only the ENABLED and DISABLED states. How are we
> > > >> supporting the "retain" -> "enable" case?
> > > >>
> > > >> If the remote storage is degraded, we want to avoid uploading the
> > > >> segments temporarily and resume once the remote storage is healthy.
> > > >> Is that case supported?
> > > >>
> > > >>
> > > >>
> > > >> On Fri, May 3, 2024 at 12:12 PM Luke Chen <sh...@gmail.com>
> wrote:
> > > >>
> > > >>> Also, I think using the `stopReplicas` request is a good idea
> > > >>> because it won't cause any problems while migrating to KRaft mode.
> > > >>> The stopReplicas request is one of the requests that the KRaft
> > > >>> controller will send to ZK brokers during migration.
> > > >>>
> > > >>> Thanks.
> > > >>> Luke
> > > >>>
> > > >>> On Fri, May 3, 2024 at 11:48 AM Luke Chen <sh...@gmail.com>
> wrote:
> > > >>>
> > > >>>> Hi Christo,
> > > >>>>
> > > >>>> Thanks for the update.
> > > >>>>
> > > >>>> Questions:
> > > >>>> 1. For this
> > > >>>> "The possible state transition from DISABLED state is to the
> > ENABLED."
> > > >>>> I think it only applies for KRaft mode. In ZK mode, the possible
> > state
> > > >>>> is "DISABLING", right?
> > > >>>>
> > > >>>> 2. For this:
> > > >>>> "If the cluster is using Zookeeper as the control plane, enabling
> > > >>>> remote storage for a topic triggers the controller to send this
> > > information
> > > >>>> to Zookeeper. Each broker listens for changes in Zookeeper, and
> > when a
> > > >>>> change is detected, the broker triggers
> > > >>>> RemoteLogManager#onLeadershipChange()."
> > > >>>>
> > > >>>> I think the way ZK brokers know about the leadership change is by
> > > >>>> getting the LeaderAndIsrRequest from the controller, not by
> > > >>>> listening for changes in ZK.
> > > >>>>
> > > >>>> 3. In the KRaft handler steps, you said:
> > > >>>> "The controller also updates the Topic metadata to increment the
> > > >>>> tiered_epoch and update the tiered_state to DISABLING state."
> > > >>>>
> > > >>>> Should it be "DISABLED" state since it's KRaft mode?
> > > >>>>
> > > >>>> 4. I was thinking about how we handle a tiered_epoch mismatch error.
> > > >>>> For ZK, I think the controller won't write any data into ZK Znode,
> > > >>>> For KRaft, either configRecord or updateTopicMetadata records
> won't
> > be
> > > >>>> written.
> > > >>>> Is that right? Because the current workflow makes me think there
> > > >>>> will be partial data updated in ZK/KRaft when a tiered_epoch error
> > > >>>> occurs.
> > > >>>>
> > > >>>> 5. Since we changed to use the stopReplicas (V5) request now, the
> > > >>>> diagram for the ZK workflow might also need to be updated.
> > > >>>>
> > > >>>> 6. In ZK mode, what will the controller do if the "stopReplicas"
> > > >>>> responses are not received from all brokers? Reverting the changes?
> > > >>>> This won't happen in KRaft mode because it's the broker's
> > > >>>> responsibility to fetch metadata updates from the controller.
> > > >>>>
> > > >>>>
> > > >>>> Thank you.
> > > >>>> Luke
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Apr 19, 2024 at 10:23 PM Christo Lolov <
> > > christololov@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Heya all!
> > > >>>>>
> > > >>>>> I have updated KIP-950. A list of what I have updated is:
> > > >>>>>
> > > >>>>> * Explicitly state that Zookeeper-backed clusters will have
> ENABLED
> > > ->
> > > >>>>> DISABLING -> DISABLED while KRaft-backed clusters will only have
> > > ENABLED ->
> > > >>>>> DISABLED
> > > >>>>> * Added two configurations for the new thread pools and explained
> > > >>>>> where values will be picked up mid Kafka version upgrade
> > > >>>>> * Explained how leftover remote partitions will be scheduled for
> > > >>>>> deletion
> > > >>>>> * Updated the API to use StopReplica V5 rather than a whole new
> > > >>>>> controller-to-broker API
> > > >>>>> * Explained that the disablement procedure will be triggered by
> the
> > > >>>>> controller listening for an (Incremental)AlterConfig change
> > > >>>>> * Explained that we will first move log start offset and then
> > issue a
> > > >>>>> deletion
> > > >>>>> * Went into more details that changing remote.log.disable.policy
> > > after
> > > >>>>> disablement won't do anything and that if a customer would like
> > > additional
> > > >>>>> data deleted they would have to use already existing methods
> > > >>>>>
> > > >>>>> Let me know if there are any new comments or I have missed
> > something!
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Christo
> > > >>>>>
> > > >>>>> On Mon, 15 Apr 2024 at 12:40, Christo Lolov <
> > christololov@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Heya Doguscan,
> > > >>>>>>
> > > >>>>>> I believe that the state of the world after this KIP will be the
> > > >>>>>> following:
> > > >>>>>>
> > > >>>>>> For Zookeeper-backed clusters there will be 3 states: ENABLED,
> > > >>>>>> DISABLING and DISABLED. We want this because Zookeeper-backed
> > > clusters will
> > > >>>>>> await a confirmation from the brokers that they have indeed
> > stopped
> > > >>>>>> tiered-related operations on the topic.
> > > >>>>>>
> > > >>>>>> For KRaft-backed clusters there will be only 2 states: ENABLED
> and
> > > >>>>>> DISABLED. KRaft takes a fire-and-forget approach for topic
> > > deletion. I
> > > >>>>>> believe the same approach ought to be taken for tiered topics.
> The
> > > >>>>>> mechanism which will ensure that leftover state in remote due to
> > > failures
> > > >>>>>> is cleaned up to me is the retention mechanism. In today's
> code, a
> > > leader
> > > >>>>>> deletes all segments it finds in remote with offsets below the
> log
> > > start
> > > >>>>>> offset. I believe this will be good enough for cleaning up
> > leftover
> > > state
> > > >>>>>> in remote due to failures.
> > > >>>>>>
> > > >>>>>> I know that quite a few changes have been discussed so I will
> aim
> > to
> > > >>>>>> put them on paper in the upcoming days and let everyone know!
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Christo
> > > >>>>>>
> > > >>>>>> On Tue, 9 Apr 2024 at 14:49, Doğuşcan Namal <
> > > namal.doguscan@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> +1 let's not introduce a new api and mark it immediately as
> > > >>>>>>> deprecated :)
> > > >>>>>>>
> > > >>>>>>> On your second comment Luke, one thing we need to clarify is
> when
> > > do
> > > >>>>>>> we consider remote storage to be DISABLED for a topic?
> > > >>>>>>> Particularly, what is the state when the remote storage is
> being
> > > >>>>>>> deleted in case of disablement.policy=delete? Is it DISABLING
> or
> > > DISABLED?
> > > >>>>>>>
> > > >>>>>>> If we move directly to the DISABLED state,
> > > >>>>>>>
> > > >>>>>>> a) in case of failures, the leaders should continue remote
> > storage
> > > >>>>>>> deletion even if the topic is moved to the DISABLED state,
> > > otherwise we
> > > >>>>>>> risk having stray data on remote storage.
> > > >>>>>>> b) on each restart, we should initiate the remote storage
> > deletion
> > > >>>>>>> because although we replayed a record with a DISABLED state, we
> > > can not be
> > > >>>>>>> sure if the remote data is deleted or not.
> > > >>>>>>>
> > > >>>>>>> We could either consider keeping the remote topic in DISABLING
> > > state
> > > >>>>>>> until all of the remote storage data is deleted, or we need an
> > > additional
> > > >>>>>>> mechanism to handle the remote stray data.
> > > >>>>>>>
> > > >>>>>>> The existing topic deletion, for instance, handles stray logs
> > > >>>>>>> on disk by detecting them on KafkaBroker startup and deleting
> > > >>>>>>> them before the ReplicaManager is started.
> > > >>>>>>> Maybe we need a similar mechanism here as well if we don't
> want a
> > > >>>>>>> DISABLING state. Otherwise, we need a callback from Brokers to
> > > validate
> > > >>>>>>> that remote storage data is deleted and now we could move to
> the
> > > DISABLED
> > > >>>>>>> state.
> > > >>>>>>>
> > > >>>>>>> Thanks.
> > > >>>>>>>
> > > >>>>>>> On Tue, 9 Apr 2024 at 12:45, Luke Chen <sh...@gmail.com>
> > wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Christo,
> > > >>>>>>>>
> > > >>>>>>>> > I would then opt for moving information from
> > DisableRemoteTopic
> > > >>>>>>>> within the StopReplicas API which will then disappear in KRaft
> > > >>>>>>>> world as it
> > > >>>>>>>> is already scheduled for deprecation. What do you think?
> > > >>>>>>>>
> > > >>>>>>>> Sounds good to me.
> > > >>>>>>>>
> > > >>>>>>>> Thanks.
> > > >>>>>>>> Luke
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Apr 9, 2024 at 6:46 PM Christo Lolov <
> > > >>>>>>>> christololov@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>> > Heya Luke!
> > > >>>>>>>> >
> > > >>>>>>>> > I thought a bit more about it and I reached the same
> > conclusion
> > > >>>>>>>> as you for
> > > >>>>>>>> > 2 as a follow-up from 1. In other words, in KRaft world I
> > don't
> > > >>>>>>>> think the
> > > >>>>>>>> > controller needs to wait for acknowledgements for the
> brokers.
> > > >>>>>>>> All we care
> > > >>>>>>>> > about is that the leader (who is responsible for
> > > >>>>>>>> archiving/deleting data in
> > > >>>>>>>> > tiered storage) knows about the change and applies it
> > properly.
> > > >>>>>>>> If there is
> > > >>>>>>>> > a leadership change halfway through the operation then the
> new
> > > >>>>>>>> leader still
> > > >>>>>>>> > needs to apply the message from the state topic and we know
> > > that a
> > > >>>>>>>> > disable-message will be applied before a
> > reenablement-message. I
> > > >>>>>>>> will
> > > >>>>>>>> > change the KIP later today/tomorrow morning to reflect this
> > > >>>>>>>> reasoning.
> > > >>>>>>>> >
> > > >>>>>>>> > However, with this I believe that introducing a new API just
> > for
> > > >>>>>>>> > Zookeeper-based clusters (i.e. DisableRemoteTopic) becomes a
> > bit
> > > >>>>>>>> of an
> > > >>>>>>>> > overkill. I would then opt for moving information from
> > > >>>>>>>> DisableRemoteTopic
> > > >>>>>>>> > within the StopReplicas API which will then disappear in
> KRaft
> > > >>>>>>>> world as it
> > > >>>>>>>> > is already scheduled for deprecation. What do you think?
> > > >>>>>>>> >
> > > >>>>>>>> > Best,
> > > >>>>>>>> > Christo
> > > >>>>>>>> >
> > > >>>>>>>> > On Wed, 3 Apr 2024 at 07:59, Luke Chen <sh...@gmail.com>
> > > wrote:
> > > >>>>>>>> >
> > > >>>>>>>> > > Hi Christo,
> > > >>>>>>>> > >
> > > >>>>>>>> > > 1. I agree with Doguscan that in KRaft mode, the
> controller
> > > >>>>>>>> won't send
> > > >>>>>>>> > RPCs
> > > >>>>>>>> > > to the brokers (except in the migration path).
> > > >>>>>>>> > > So, I think we could adopt the similar way we did to
> > > >>>>>>>> > `AlterReplicaLogDirs`
> > > >>>>>>>> > > (
> > > >>>>>>>> > > KIP-858
> > > >>>>>>>> > > <
> > > >>>>>>>> > >
> > > >>>>>>>> >
> > > >>>>>>>>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft#KIP858:HandleJBODbrokerdiskfailureinKRaft-Intra-brokerreplicamovement
> > > >>>>>>>> > > >)
> > > >>>>>>>> > > that let the broker notify controller any update, instead
> of
> > > >>>>>>>> controller
> > > >>>>>>>> > to
> > > >>>>>>>> > > broker. And once the controller receives all the complete
> > > >>>>>>>> requests from
> > > >>>>>>>> > > brokers, it'll enter "Disabled" state. WDYT?
> > > >>>>>>>> > >
> > > >>>>>>>> > > 2. Why should we wait until all brokers to respond before
> > > >>>>>>>> moving to
> > > >>>>>>>> > > "Disabled" state in "KRaft mode"?
> > > >>>>>>>> > > Currently, only the leader node does the remote log
> > > >>>>>>>> upload/fetch tasks,
> > > >>>>>>>> > so
> > > >>>>>>>> > > does that mean the controller only need to make sure the
> > > leader
> > > >>>>>>>> completes
> > > >>>>>>>> > > the stopPartition?
> > > >>>>>>>> > > If during the leader node stopPartition process triggered
> > > >>>>>>>> leadership
> > > >>>>>>>> > > change, then the new leader should receive and apply the
> > > >>>>>>>> configRecord
> > > >>>>>>>> > > update before the leadership change record based on the
> > KRaft
> > > >>>>>>>> design,
> > > >>>>>>>> > which
> > > >>>>>>>> > > means there will be no gap that the follower node becomes
> > the
> > > >>>>>>>> leader and
> > > >>>>>>>> > > starting doing unexpected upload/fetch tasks, right?
> > > >>>>>>>> > > I agree we should make sure in ZK mode, all brokers are
> > > >>>>>>>> completed the
> > > >>>>>>>> > > stopPartitions before moving to "Disabled" state because
> ZK
> > > >>>>>>>> node watcher
> > > >>>>>>>> > is
> > > >>>>>>>> > > working in a separate thread. But not sure about KRaft
> mode.
> > > >>>>>>>> > >
> > > >>>>>>>> > > Thanks.
> > > >>>>>>>> > > Luke
> > > >>>>>>>> > >
> > > >>>>>>>> > >
> > > >>>>>>>> > > On Fri, Mar 29, 2024 at 4:14 PM Christo Lolov <
> > > >>>>>>>> christololov@gmail.com>
> > > >>>>>>>> > > wrote:
> > > >>>>>>>> > >
> > > >>>>>>>> > > > Heya everyone!
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > re: Doguscan
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > I believe the answer to 101 needs a bit more discussion.
> > As
> > > >>>>>>>> far as I
> > > >>>>>>>> > > know,
> > > >>>>>>>> > > > tiered storage today has methods to update a metadata
> of a
> > > >>>>>>>> segment to
> > > >>>>>>>> > say
> > > >>>>>>>> > > > "hey, I would like this deleted", but actual deletion is
> > > left
> > > >>>>>>>> to plugin
> > > >>>>>>>> > > > implementations (or any background cleaners). In other
> > > words,
> > > >>>>>>>> there is
> > > >>>>>>>> > no
> > > >>>>>>>> > > > "immediate" deletion. In this KIP, we would like to
> > continue
> > > >>>>>>>> doing the
> > > >>>>>>>> > > same
> > > >>>>>>>> > > > if the retention policy is set to delete. So I believe
> the
> > > >>>>>>>> answer is
> > > >>>>>>>> > > > actually that a) we will update the metadata of the
> > segments
> > > >>>>>>>> to mark
> > > >>>>>>>> > them
> > > >>>>>>>> > > > as deleted and b) we will advance the log start offset.
> > Any
> > > >>>>>>>> deletion of
> > > >>>>>>>> > > > actual files will still be delegated to plugin
> > > >>>>>>>> implementations. I
> > > >>>>>>>> > believe
> > > >>>>>>>> > > > this is further supported by
> > > >>>>>>>> "*remote.log.disable.policy=delete:* Logs
> > > >>>>>>>> > > that
> > > >>>>>>>> > > > are archived in the remote storage will not be part of
> the
> > > >>>>>>>> contiguous
> > > >>>>>>>> > > > "active" log and will be deleted asynchronously as part
> of
> > > the
> > > >>>>>>>> > > disablement
> > > >>>>>>>> > > > process"
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > Following from the above, I believe for 102 it is fine
> to
> > > >>>>>>>> allow setting
> > > >>>>>>>> > > of
> > > >>>>>>>> > > > remote.log.disable.policy on a disabled topic in much
> the
> > > >>>>>>>> same way we
> > > >>>>>>>> > > allow
> > > >>>>>>>> > > > other remote-related configurations to be set on a topic
> > > (i.e.
> > > >>>>>>>> > > > local.retention.*) - it just won't have an effect.
> > Granted,
> > > I
> > > >>>>>>>> do
> > > >>>>>>>> > believe
> > > >>>>>>>> > > we
> > > >>>>>>>> > > > should restrict the policy being changed while a
> > disablement
> > > >>>>>>>> is
> > > >>>>>>>> > ongoing.
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > re: Satish and Kamal
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > 104, 1 and 2 are fair asks, I will work with Doguscan to
> > > >>>>>>>> update the KIP
> > > >>>>>>>> > > > with the information!
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > Best,
> > > >>>>>>>> > > > Christo
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > On Thu, 28 Mar 2024 at 10:31, Doğuşcan Namal <
> > > >>>>>>>> namal.doguscan@gmail.com
> > > >>>>>>>> > >
> > > >>>>>>>> > > > wrote:
> > > >>>>>>>> > > >
> > > >>>>>>>> > > > > Hi Satish, I will try to answer as much as I can and
> the
> > > >>>>>>>> others could
> > > >>>>>>>> > > > chime
> > > >>>>>>>> > > > > in with further details.
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *101. For remote.log.disable.policy=delete: Does it
> > delete
> > > >>>>>>>> the remote
> > > >>>>>>>> > > log
> > > >>>>>>>> > > > > data immediately and the data in remote storage will
> not
> > > be
> > > >>>>>>>> taken
> > > >>>>>>>> > into
> > > >>>>>>>> > > > > account by any replica? That means log-start-offset is
> > > >>>>>>>> moved to the
> > > >>>>>>>> > > > earlier
> > > >>>>>>>> > > > > local-log-start-offset.*
> > > >>>>>>>> > > > > *Exactly. RemoteLogData will be deleted immediately. *
> > > >>>>>>>> > > > > *So before the deletion starts we move LogStart offset
> > to
> > > >>>>>>>> > > > > LocalLogStartOffset to ensure that no RemoteLog will
> be
> > > >>>>>>>> accessed
> > > >>>>>>>> > after
> > > >>>>>>>> > > > that
> > > >>>>>>>> > > > > point.*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > * 102. Can we update the remote.log.disable.policy
> after
> > > >>>>>>>> tiered
> > > >>>>>>>> > storage
> > > >>>>>>>> > > > is
> > > >>>>>>>> > > > > disabled on a topic?*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *This is a good point. I think we should not allow
> > > >>>>>>>> modifying this
> > > >>>>>>>> > > > > configuration*
> > > >>>>>>>> > > > > *because changing the policy from Deletion to Retain
> > when
> > > >>>>>>>> there is an
> > > >>>>>>>> > > > > ongoing Deletion will result in an undefined behaviour
> > and
> > > >>>>>>>> where we
> > > >>>>>>>> > > > retain
> > > >>>>>>>> > > > > half of the remote log and delete the other half.*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > * 103. Do we plan to add any metrics related to this
> > > >>>>>>>> feature?*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *Any recommendations?*
> > > >>>>>>>> > > > > *We may emit a gauge showing the enablement state of a
> > > >>>>>>>> topic but we
> > > >>>>>>>> > > could
> > > >>>>>>>> > > > > gather that info from the logs as well.*
> > > >>>>>>>> > > > > *The total duration for remote topic deletion could be
> > > >>>>>>>> added as well
> > > >>>>>>>> > > but
> > > >>>>>>>> > > > > this is more of a metric for the
> RemotePartitionRemover
> > > >>>>>>>> itself.*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *104. Please add configuration details about copier
> > thread
> > > >>>>>>>> pool,
> > > >>>>>>>> > > > expiration
> > > >>>>>>>> > > > > thread pool and the migration of the existing
> > > >>>>>>>> > > > > RemoteLogManagerScheduledThreadPool.*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *Will add the details.*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > 105. How is the behaviour with topic or partition
> > deletion
> > > >>>>>>>> request
> > > >>>>>>>> > > > > handled when tiered storage disablement request is
> still
> > > >>>>>>>> being
> > > >>>>>>>> > > > > processed on a topic?
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *If the disablement policy is Delete then a successive
> > > >>>>>>>> topic deletion
> > > >>>>>>>> > > > > request is going to be a NOOP because RemoteLogs is
> > > already
> > > >>>>>>>> being
> > > >>>>>>>> > > > deleted.*
> > > >>>>>>>> > > > > *If the disablement policy is Retain, then we only
> moved
> > > the
> > > >>>>>>>> > > > LogStartOffset
> > > >>>>>>>> > > > > and didn't touch RemoteLogs anyway, so the delete
> topic
> > > >>>>>>>> request will
> > > >>>>>>>> > > > > result*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > *in the initiation of RemoteLog deletion.*
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > On Tue, 26 Mar 2024 at 18:21, Kamal Chandraprakash <
> > > >>>>>>>> > > > > kamal.chandraprakash@gmail.com> wrote:
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > > > > Hi,
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > Thanks for the KIP! Overall the KIP looks good and
> > > >>>>>>>> covered most of
> > > >>>>>>>> > > the
> > > >>>>>>>> > > > > > items.
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > 1. Could you explain how the brokers will handle the
> > > >>>>>>>> > > DisableRemoteTopic
> > > >>>>>>>> > > > > API
> > > >>>>>>>> > > > > > request?
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > 2. Who will initiate the controller interaction
> > > sequence?
> > > >>>>>>>> Does the
> > > >>>>>>>> > > > > > controller listens for
> > > >>>>>>>> > > > > > topic config updates and initiate the disablement?
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > --
> > > >>>>>>>> > > > > > Kamal
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > On Tue, Mar 26, 2024 at 4:40 PM Satish Duggana <
> > > >>>>>>>> > > > satish.duggana@gmail.com
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > wrote:
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > > Thanks Mehari, Divij, Christo etal for the KIP.
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > I had an initial review of the KIP and left the
> > below
> > > >>>>>>>> comments.
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > 101. For remote.log.disable.policy=delete:
> > > >>>>>>>> > > > > > > Does it delete the remote log data immediately and
> > the
> > > >>>>>>>> data in
> > > >>>>>>>> > > remote
> > > >>>>>>>> > > > > > > storage will not be taken into account by any
> > replica?
> > > >>>>>>>> That means
> > > >>>>>>>> > > > > > > log-start-offset is moved to the earlier
> > > >>>>>>>> local-log-start-offset.
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > 102. Can we update the remote.log.disable.policy
> > after
> > > >>>>>>>> tiered
> > > >>>>>>>> > > storage
> > > >>>>>>>> > > > > > > is disabled on a topic?
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > 103. Do we plan to add any metrics related to this
> > > >>>>>>>> feature?
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > 104. Please add configuration details about copier
> > > >>>>>>>> thread pool,
> > > >>>>>>>> > > > > > > expiration thread pool and the migration of the
> > > existing
> > > >>>>>>>> > > > > > > RemoteLogManagerScheduledThreadPool.
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > 105. How is the behaviour with topic or partition
> > > >>>>>>>> deletion
> > > >>>>>>>> > request
> > > >>>>>>>> > > > > > > handled when tiered storage disablement request is
> > > >>>>>>>> still being
> > > >>>>>>>> > > > > > > processed on a topic?
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > ~Satish.
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > > > On Mon, 25 Mar 2024 at 13:34, Doğuşcan Namal <
> > > >>>>>>>> > > > namal.doguscan@gmail.com
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > > > > wrote:
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > Hi Christo and Luke,
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > I think the KRaft section of the KIP requires
> > slight
> > > >>>>>>>> > improvement.
> > > >>>>>>>> > > > The
> > > >>>>>>>> > > > > > > metadata propagation in KRaft is handled by the
> RAFT
> > > >>>>>>>> layer
> > > >>>>>>>> > instead
> > > >>>>>>>> > > of
> > > >>>>>>>> > > > > > > sending Controller -> Broker RPCs. In fact,
> KIP-631
> > > >>>>>>>> deprecated
> > > >>>>>>>> > > these
> > > >>>>>>>> > > > > > RPCs.
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > I will come up with some recommendations on how
> we
> > > >>>>>>>> could
> > > >>>>>>>> > improve
> > > >>>>>>>> > > > that
> > > >>>>>>>> > > > > > > one but until then, @Luke please feel free to
> review
> > > >>>>>>>> the KIP.
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > @Satish, if we want this to make it to Kafka
> 3.8 I
> > > >>>>>>>> believe we
> > > >>>>>>>> > > need
> > > >>>>>>>> > > > to
> > > >>>>>>>> > > > > > > aim to get the KIP approved in the following weeks
> > > >>>>>>>> otherwise it
> > > >>>>>>>> > > will
> > > >>>>>>>> > > > > slip
> > > >>>>>>>> > > > > > > and we can not support it in Zookeeper mode.
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > I also would like to better understand what is
> the
> > > >>>>>>>> community's
> > > >>>>>>>> > > > stand
> > > >>>>>>>> > > > > > for
> > > >>>>>>>> > > > > > > adding a new feature for Zookeeper since it is
> > marked
> > > as
> > > >>>>>>>> > deprecated
> > > >>>>>>>> > > > > > already.
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > Thanks.
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > > On Mon, 18 Mar 2024 at 13:42, Christo Lolov <
> > > >>>>>>>> > > > christololov@gmail.com>
> > > >>>>>>>> > > > > > > wrote:
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> Heya,
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> I do have some time to put into this, but to be
> > > >>>>>>>> honest I am
> > > >>>>>>>> > > still
> > > >>>>>>>> > > > > > after
> > > >>>>>>>> > > > > > > >> reviews of the KIP itself :)
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> After the latest changes it ought to be
> detailing
> > > >>>>>>>> both a
> > > >>>>>>>> > > Zookeeper
> > > >>>>>>>> > > > > > > approach
> > > >>>>>>>> > > > > > > >> and a KRaft approach.
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> Do you have any thoughts on how it could be
> > > improved
> > > >>>>>>>> or
> > > >>>>>>>> > should I
> > > >>>>>>>> > > > > > start a
> > > >>>>>>>> > > > > > > >> voting thread?
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> Best,
> > > >>>>>>>> > > > > > > >> Christo
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> On Thu, 14 Mar 2024 at 06:12, Luke Chen <
> > > >>>>>>>> showuon@gmail.com>
> > > >>>>>>>> > > > wrote:
> > > >>>>>>>> > > > > > > >>
> > > >>>>>>>> > > > > > > >> > Hi Christo,
> > > >>>>>>>> > > > > > > >> >
> > > >>>>>>>> > > > > > > >> > Any update with this KIP?
> > > >>>>>>>> > > > > > > >> > If you don't have time to complete it, I can
> > > >>>>>>>> collaborate
> > > >>>>>>>> > with
> > > >>>>>>>> > > > you
> > > >>>>>>>> > > > > to
> > > >>>>>>>> > > > > > > work
> > > >>>>>>>> > > > > > > >> > on it.
> > > >>>>>>>> > > > > > > >> >
> > > >>>>>>>> > > > > > > >> > Thanks.
> > > >>>>>>>> > > > > > > >> > Luke
> > > >>>>>>>> > > > > > > >> >
> > > >>>>>>>> > > > > > > >> > On Wed, Jan 17, 2024 at 11:38 PM Satish
> > Duggana <
> > > >>>>>>>> > > > > > > satish.duggana@gmail.com>
> > > >>>>>>>> > > > > > > >> > wrote:
> > > >>>>>>>> > > > > > > >> >
> > > >>>>>>>> > > > > > > >> > > Hi Christo,
> > > >>>>>>>> > > > > > > >> > > Thanks for volunteering to contribute to
> the
> > > KIP
> > > >>>>>>>> > > discussion. I
> > > >>>>>>>> > > > > > > suggest
> > > >>>>>>>> > > > > > > >> > > considering this KIP for both ZK and KRaft
> as
> > > it
> > > >>>>>>>> will be
> > > >>>>>>>> > > > helpful
> > > >>>>>>>> > > > > > for
> > > >>>>>>>> > > > > > > >> > > this feature to be available in 3.8.0
> running
> > > >>>>>>>> with ZK
> > > >>>>>>>> > > > clusters.
> > > >>>>>>>> > > > > > > >> > >
> > > >>>>>>>> > > > > > > >> > > Thanks,
> > > >>>>>>>> > > > > > > >> > > Satish.
> > > >>>>>>>> > > > > > > >> > >
> > > >>>>>>>> > > > > > > >> > > On Wed, 17 Jan 2024 at 19:04, Christo
> Lolov <
> > > >>>>>>>> > > > > > christololov@gmail.com
> > > >>>>>>>> > > > > > > >
> > > >>>>>>>> > > > > > > >> > > wrote:
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > Hello!
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > I volunteer to get this KIP moving
> forward
> > > and
> > > >>>>>>>> > implemented
> > > >>>>>>>> > > > in
> > > >>>>>>>> > > > > > > Apache
> > > >>>>>>>> > > > > > > >> > > Kafka
> > > >>>>>>>> > > > > > > >> > > > 3.8.
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > I have caught up with Mehari offline and
> we
> > > >>>>>>>> have agreed
> > > >>>>>>>> > > that
> > > >>>>>>>> > > > > > given
> > > >>>>>>>> > > > > > > >> > Apache
> > > >>>>>>>> > > > > > > >> > > > Kafka 4.0 being around the corner we
> would
> > > >>>>>>>> like to
> > > >>>>>>>> > propose
> > > >>>>>>>> > > > > this
> > > >>>>>>>> > > > > > > feature
> > > >>>>>>>> > > > > > > >> > > > only for KRaft clusters.
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > Any and all reviews and comments are
> > welcome!
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > Best,
> > > >>>>>>>> > > > > > > >> > > > Christo
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > On Tue, 9 Jan 2024 at 09:44, Doğuşcan
> > Namal <
> > > >>>>>>>> > > > > > > namal.doguscan@gmail.com>
> > > >>>>>>>> > > > > > > >> > > > wrote:
> > > >>>>>>>> > > > > > > >> > > >
> > > >>>>>>>> > > > > > > >> > > > > Hi everyone, any progress on the status
> > of
> > > >>>>>>>> this KIP?
> > > >>>>>>>> > > > Overall
> > > >>>>>>>> > > > > > > looks
> > > >>>>>>>> > > > > > > >> > > good to
> > > >>>>>>>> > > > > > > >> > > > > me but I wonder whether we still need
> to
> > > >>>>>>>> support it
> > > >>>>>>>> > for
> > > >>>>>>>> > > > > > > Zookeeper
> > > >>>>>>>> > > > > > > >> > mode
> > > >>>>>>>> > > > > > > >> > > > > given that it will be deprecated in the
> > > next
> > > >>>>>>>> 3 months.
> > > >>>>>>>> > > > > > > >> > > > >
> > > >>>>>>>> > > > > > > >> > > > > On 2023/07/21 20:16:46 "Beyene, Mehari"
> > > >>>>>>>> wrote:
> > > >>>>>>>> > > > > > > >> > > > > > Hi everyone,
> > > >>>>>>>> > > > > > > >> > > > > > I would like to start a discussion on
> > > >>>>>>>> KIP-950:
> > > >>>>>>>> > Tiered
> > > >>>>>>>> > > > > > Storage
> > > >>>>>>>> > > > > > > >> > > Disablement
> > > >>>>>>>> > > > > > > >> > > > > (
> > > >>>>>>>> > > > > > > >> > > > >
> > > >>>>>>>> > > > > > > >> > > > >
> > > >>>>>>>> > > > > > > >> > >
> > > >>>>>>>> > > > > > > >> >
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > >
> > > >>>>>>>> > >
> > > >>>>>>>> >
> > > >>>>>>>>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-950%3A++Tiered+Storage+Disablement
> > > >>>>>>>> > > > > > > >> > > > > ).
> > > >>>>>>>> > > > > > > >> > > > > >
> > > >>>>>>>> > > > > > > >> > > > > > This KIP proposes adding the ability
> to
> > > >>>>>>>> disable and
> > > >>>>>>>> > > > > > re-enable
> > > >>>>>>>> > > > > > > >> > tiered
> > > >>>>>>>> > > > > > > >> > > > > storage on a topic.
> > > >>>>>>>> > > > > > > >> > > > > >
> > > >>>>>>>> > > > > > > >> > > > > > Thanks,
> > > >>>>>>>> > > > > > > >> > > > > > Mehari
> > > >>>>>>>> > > > > > > >> > > > > >
> > > >>>>>>>> > > > > > > >> > > > >
> > > >>>>>>>> > > > > > > >> > >
> > > >>>>>>>> > > > > > > >> >
> > > >>>>>>>> > > > > > >
> > > >>>>>>>> > > > > >
> > > >>>>>>>> > > > >
> > > >>>>>>>> > > >
> > > >>>>>>>> > >
> > > >>>>>>>> >
> > > >>>>>>>>
> > > >>>>>>>
> > >
> >
>

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Luke Chen <sh...@gmail.com>.
Hi Christo,

> 1. I am not certain I follow the question. From DISABLED you can only
> go to ENABLED regardless of whether your cluster is backed by Zookeeper
> or KRaft. Am I misunderstanding your point?

Yes, you're right.

> 4. I was thinking that if there is a mismatch we will just fail
> accepting the request for disablement. This should be the same in both
> Zookeeper and KRaft. Or am I misunderstanding your question?

OK, sounds good.

> 6. My current train of thought is that there will be unlimited retries
> until all brokers respond, similar to how deletion of a topic works
> today in ZK. In the meantime the state will remain DISABLING. Do you
> have a better suggestion?

I don't think infinite retries are a good idea, since if a broker is down
forever this request will never complete.
You mentioned the existing topic deletion uses a similar pattern; how does
it handle this issue?

Thanks.
Luke

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Christo Lolov <ch...@gmail.com>.
Heya!

re: Luke

1. I am not certain I follow the question. From DISABLED you can only go to
ENABLED regardless of whether your cluster is backed by Zookeeper or KRaft.
Am I misunderstanding your point?
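
To make the allowed transitions concrete, something along these lines (a
rough Java sketch of my own; TieredState and isValid are illustrative
names, not types from the KIP):

    enum TieredState { ENABLED, DISABLING, DISABLED }

    final class TieredStateTransitions {
        // ZK mode:    ENABLED -> DISABLING -> DISABLED -> ENABLED
        // KRaft mode: ENABLED -> DISABLED -> ENABLED (no intermediate state)
        static boolean isValid(TieredState from, TieredState to, boolean zkMode) {
            switch (from) {
                case ENABLED:   return to == (zkMode ? TieredState.DISABLING : TieredState.DISABLED);
                case DISABLING: return zkMode && to == TieredState.DISABLED;
                case DISABLED:  return to == TieredState.ENABLED;
                default:        return false;
            }
        }
    }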

2. Apologies, this was a leftover from previous versions. I have updated
the Zookeeper section. The steps ought to be: the controller receives the
change, commits the necessary data to Zookeeper, enqueues the disablement
and starts sending StopReplicas requests to the brokers; the brokers
receive StopReplicas and propagate it all the way to
RemoteLogManager#stopPartitions, which takes care of the rest.
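
As a sketch of the broker-side half of that flow
(RemoteLogManager#stopPartitions is the real entry point in Kafka, but
the handler and types below are simplified stand-ins, not the actual
code path):

    import java.util.Set;

    // Simplified stand-in for the real RemoteLogManager; only the method
    // that matters for disablement is modelled.
    interface RemoteLogManagerLike {
        // Cancels archiving/expiration tasks for the given partitions and,
        // when deleteRemoteLog is true, schedules the remote data for deletion.
        void stopPartitions(Set<String> topicPartitions, boolean deleteRemoteLog);
    }

    final class StopReplicasDisablementHandler {
        private final RemoteLogManagerLike remoteLogManager;

        StopReplicasDisablementHandler(RemoteLogManagerLike remoteLogManager) {
            this.remoteLogManager = remoteLogManager;
        }

        // Called when a StopReplicas (v5) request arrives from the controller;
        // the response back to the controller doubles as this broker's
        // completion signal for the disablement.
        void onStopReplicas(Set<String> partitions, boolean deletePolicy) {
            remoteLogManager.stopPartitions(partitions, deletePolicy);
        }
    }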

3. Correct, it should say DISABLED - this should now be corrected.

4. I was thinking that if there is a mismatch we will simply reject the
request for disablement. This should be the same in both Zookeeper and
KRaft. Or am I misunderstanding your question?
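
Roughly, the fencing would look like this (a hypothetical sketch; the
class and method names are made up, only tiered_epoch itself comes from
the KIP):

    // Hypothetical fencing check: a disablement request carrying a stale
    // tiered_epoch is rejected outright, and nothing is written to ZK or KRaft.
    final class TieredEpochFence {
        private int currentTieredEpoch;

        synchronized void validateAndBump(int requestEpoch) {
            if (requestEpoch != currentTieredEpoch) {
                throw new IllegalStateException("Stale tiered_epoch " + requestEpoch
                        + ", current is " + currentTieredEpoch + "; rejecting request");
            }
            currentTieredEpoch++; // bump only once the request has been accepted
        }
    }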

5. Yeah. I am now doing a second pass on all diagrams and will update them
by the end of the day!

6. I think my current train of thought is that there will be unlimited
retries until all brokers respond, similar to how deletion of a topic
works today in ZK. In the meantime the state will remain DISABLING. Do
you have a better suggestion?
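
In pseudo-Java, the controller-side loop might look as follows (purely
illustrative; the acknowledgement mechanism and backoff are assumptions
on my part):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.function.Predicate;

    final class DisablementRetryLoop {
        // Keep resending StopReplicas to brokers that have not acknowledged;
        // the topic stays in DISABLING until the pending set drains.
        static void awaitAllBrokers(Set<Integer> brokers,
                                    Predicate<Integer> sendStopReplicasAndCheckAck)
                throws InterruptedException {
            Set<Integer> pending = new HashSet<>(brokers);
            while (!pending.isEmpty()) {
                pending.removeIf(sendStopReplicasAndCheckAck); // drop brokers that acked
                if (!pending.isEmpty()) {
                    Thread.sleep(1_000L); // back off before retrying the remainder
                }
            }
            // All brokers acknowledged: safe to move DISABLING -> DISABLED.
        }
    }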

re: Kamal

Yep, I will update all diagrams

I am not certain I follow the reasoning for making retain and delete the
same. Deletion when the policy is retain happens asynchronously due to
expiration. I think that deletion when the policy is delete ought to (at
least for the initial implementation) happen synchronously. Should people
run into timeout problems, we can always have a follow-up KIP where we
make it asynchronous.
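
For the delete policy, the synchronous path I have in mind is roughly the
following (RemoteStorageManager#deleteLogSegmentData is the real plugin
API; everything else in the sketch is a simplification of mine):

    import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
    import org.apache.kafka.server.log.remote.storage.RemoteStorageException;
    import org.apache.kafka.server.log.remote.storage.RemoteStorageManager;

    final class SynchronousDeleteDisablement {
        // Step 1: advance the log start offset so no replica serves remote
        // data any more; step 2: delete each segment, blocking the caller.
        static void disable(RemoteStorageManager rsm,
                            Iterable<RemoteLogSegmentMetadata> remoteSegments,
                            Runnable advanceLogStartOffsetToLocal)
                throws RemoteStorageException {
            advanceLogStartOffsetToLocal.run();
            for (RemoteLogSegmentMetadata segment : remoteSegments) {
                rsm.deleteLogSegmentData(segment); // synchronous, per segment
            }
        }
    }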

Best,
Christo

On Tue, 7 May 2024 at 10:04, Kamal Chandraprakash <
kamal.chandraprakash@gmail.com> wrote:

> Hi Christo,
>
> Thanks for the update!
>
> For both the policies "retain" and "delete", can we maintain the same
> approach and delete the segments asynchronously?
>
> > If the disablement policy is set to delete, the Log start offset (LSO) is
> updated to match the Local Log Start Offset and the remote log is deleted
> by calling the RemoteStorageManager#deleteLogSegmentData().
>
> In the KIP, it's mentioned that when the disable policy is set to
> "delete", the remote-log-segments will be deleted synchronously. The
> stopPartition call might time out when the number of remote log segments
> to delete is huge. We can further extend the same approach to the topic
> deletion requests.
>
> Also, could you please update the state diagram with the transitions? It
> is not clear when to transition from the DISABLING to the DISABLED state.
>
> --
> Kamal
>
> On Mon, May 6, 2024 at 6:55 PM Kamal Chandraprakash <
> kamal.chandraprakash@gmail.com> wrote:
>
> > Ignore the above message. Got the answers after reading the state
> > transition section.
> >
> > > If the disablement policy is delete, tasks scheduled for the
> > topic-partitions in the RemoteDataExpirationThreadPool will also be
> > canceled.
> >
> > We are deleting the segments synchronously. Should we delete them
> > asynchronously? The same approach can be extended to topic deletion
> > requests.
> >
> > > 6. In ZK mode, what will the controller do if the "stopReplicas"
> > > responses are not received from all brokers? Revert the changes?
> >
> > Since we are deleting the segments synchronously, this case is bound to
> > happen when the number of remote log segments to delete is huge.
> >
> >
> > On Mon, May 6, 2024, 18:12 Kamal Chandraprakash <
> > kamal.chandraprakash@gmail.com> wrote:
> >
> >> Hi Christo,
> >>
> >> Thanks for the update!
> >>
> >> 1. In ZK mode, how will the transition from DISABLING to DISABLED
> >> state happen?
> >> For the "retain" policy, until we delete all the remote-log segments,
> >> the state will be DISABLING, and the deletion can happen only when the
> >> segments breach either the retention time or size.
> >>
> >> How does the controller monitor that all the remote log segments are
> >> deleted for all
> >> the partitions of the topic before transitioning the state to DISABLED?
> >>
> >> 2. In KRaft, we have only the ENABLED -> DISABLED states. How are we
> >> supporting the case "retain" -> "enable"?
> >>
> >> If the remote storage is degraded, we want to avoid uploading the
> >> segments temporarily and resume once the remote storage is healthy
> >> again. Is that case supported?
> >>
> >>
> >>
> >> On Fri, May 3, 2024 at 12:12 PM Luke Chen <sh...@gmail.com> wrote:
> >>
> >>> Also, I think using `stopReplicas` request is a good idea because it
> >>> won't cause any problems while migrating to KRaft mode.
> >>> The stopReplicas request is one of the requests that the KRaft
> >>> controller will send to ZK brokers during migration.
> >>>
> >>> Thanks.
> >>> Luke
> >>>
> >>> On Fri, May 3, 2024 at 11:48 AM Luke Chen <sh...@gmail.com> wrote:
> >>>
> >>>> Hi Christo,
> >>>>
> >>>> Thanks for the update.
> >>>>
> >>>> Questions:
> >>>> 1. For this
> >>>> "The possible state transition from DISABLED state is to the ENABLED."
> >>>> I think it only applies for KRaft mode. In ZK mode, the possible state
> >>>> is "DISABLING", right?
> >>>>
> >>>> 2. For this:
> >>>> "If the cluster is using Zookeeper as the control plane, enabling
> >>>> remote storage for a topic triggers the controller to send this
> >>>> information to Zookeeper. Each broker listens for changes in
> >>>> Zookeeper, and when a change is detected, the broker triggers
> >>>> RemoteLogManager#onLeadershipChange()."
> >>>>
> >>>> I think the way ZK brokers know about a leadership change is by
> >>>> getting the LeaderAndIsrRequest from the controller, not by listening
> >>>> for changes in ZK.
> >>>>
> >>>> 3. In the KRaft handler steps, you said:
> >>>> "The controller also updates the Topic metadata to increment the
> >>>> tiered_epoch and update the tiered_stateto DISABLING state."
> >>>>
> >>>> Should it be "DISABLED" state since it's KRaft mode?
> >>>>
> >>>> 4. I was thinking about how we handle a tiered_epoch mismatch error.
> >>>> For ZK, I think the controller won't write any data into the ZK
> >>>> znode; for KRaft, neither the configRecord nor the
> >>>> updateTopicMetadata records will be written.
> >>>> Is that right? Because the current workflow makes me think there
> >>>> could be partial data updated in ZK/KRaft on a tiered_epoch error.
> >>>>
> >>>> 5. Since we changed to use the stopReplicas (V5) request now, the
> >>>> diagram for the ZK workflow might also need to be updated.
> >>>>
> >>>> 6. In ZK mode, what will the controller do if the "stopReplicas"
> >>>> responses are not received from all brokers? Revert the changes?
> >>>> This won't happen in KRaft mode because it's the broker's
> >>>> responsibility to fetch metadata updates from the controller.
> >>>>
> >>>>
> >>>> Thank you.
> >>>> Luke

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Kamal Chandraprakash <ka...@gmail.com>.
Hi Christo,

Thanks for the update!

For both the policies "retain" and "delete", can we maintain the same
approach and delete the segments asynchronously?

> If the disablement policy is set to delete, the Log start offset (LSO) is
updated to match the Local Log Start Offset and the remote log is deleted
by calling the RemoteStorageManager#deleteLogSegmentData().

In the KIP, it's mentioned that when the disable policy is set to "delete",
the remote-log-segments will be deleted synchronously. The stopPartition
call might time out when the number of remote log segments to delete is
huge. We can further extend the same approach to the topic deletion
requests.
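
Something like the following sketch would let stopPartition return
immediately (the pool is meant to echo the KIP's
RemoteDataExpirationThreadPool, but the wiring here is only an
illustration, not the KIP's actual design):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class AsyncRemoteDeletion {
        // Pool size is a placeholder; the KIP would make this configurable.
        private final ExecutorService deletionPool = Executors.newFixedThreadPool(2);

        // stopPartition can return immediately: the heavy per-segment deletion
        // runs in the background, so the RPC cannot time out on a huge backlog.
        void scheduleRemoteDeletion(Runnable deleteAllRemoteSegments) {
            deletionPool.submit(deleteAllRemoteSegments);
        }
    }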

Also, could you please update the state diagram with the transitions? It
is not clear when to transition from the DISABLING to the DISABLED state.

--
Kamal


Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Luke Chen <sh...@gmail.com>.
Also, I think using the `stopReplicas` request is a good idea because it won't
cause any problems while migrating to KRaft mode.
The stopReplicas request is one of the requests that the KRaft controller
sends to ZK brokers during migration.

Thanks.
Luke


Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Luke Chen <sh...@gmail.com>.
Hi Christo,

Thanks for the update.

Questions:
1. For this
"The possible state transition from DISABLED state is to the ENABLED."
I think this only applies to KRaft mode. In ZK mode, the possible state is
"DISABLING", right?

2. For this:
"If the cluster is using Zookeeper as the control plane, enabling remote
storage for a topic triggers the controller to send this information to
Zookeeper. Each broker listens for changes in Zookeeper, and when a change
is detected, the broker triggers RemoteLogManager#onLeadershipChange()."

I think the way ZK brokers learn about a leadership change is by receiving the
LeaderAndIsrRequest from the controller, not by listening for changes in ZK.

3. In the KRaft handler steps, you said:
"The controller also updates the Topic metadata to increment the
tiered_epoch and update the tiered_state to DISABLING state."

Should it be "DISABLED" state since it's KRaft mode?

4. I was thinking about how we handle a tiered_epoch mismatch error.
For ZK, I think the controller won't write any data into the ZK znode;
for KRaft, neither the configRecord nor the updateTopicMetadata records will be
written.
Is that right? I ask because the current workflow makes me think there could be
partially updated data in ZK/KRaft when a tiered_epoch error occurs. (See the
fail-fast sketch after this message.)

5. Since we changed to use the stopReplicas (V5) request now, the diagram for the
ZK workflow might also need to be updated.

6. In ZK mode, what will the controller do if the "stopReplicas" responses are
not received from all brokers? Revert the changes?
This won't happen in KRaft mode because it's the broker's responsibility to
fetch metadata updates from the controller.


Thank you.
Luke
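
(To illustrate the concern in question 4: if the controller validates the
tiered_epoch before generating any records, a mismatch produces no partial
state at all. A minimal Java sketch of such a fail-fast check, with all names
hypothetical; the KIP does not prescribe this code.)

import java.util.Optional;

// Hypothetical fail-fast check: on a tiered_epoch mismatch the controller
// returns an error and emits neither the ConfigRecord nor the topic
// metadata update, so no partially applied state can be written.
final class TieredEpochGuard {
    private TieredEpochGuard() {}

    static Optional<String> validate(int requestedEpoch, int currentEpoch) {
        if (requestedEpoch != currentEpoch) {
            return Optional.of("FENCED_TIERED_EPOCH: expected " + currentEpoch
                + " but the request carried " + requestedEpoch);
        }
        return Optional.empty(); // safe to generate both records together
    }
}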



Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Christo Lolov <ch...@gmail.com>.
Heya all!

I have updated KIP-950. A list of what I have updated is:

* Explicitly state that Zookeeper-backed clusters will have ENABLED ->
DISABLING -> DISABLED while KRaft-backed clusters will only have ENABLED ->
DISABLED
* Added two configurations for the new thread pools and explained where
values will be picked up in the middle of a Kafka version upgrade
* Explained how leftover remote partitions will be scheduled for deletion
* Updated the API to use StopReplica V5 rather than a whole new
controller-to-broker API
* Explained that the disablement procedure will be triggered by the
controller listening for an (Incremental)AlterConfig change (an example of
such a change is sketched after this message)
* Explained that we will first move the log start offset and then issue a
deletion
* Went into more detail on the fact that changing remote.log.disable.policy
after disablement won't do anything, and that a customer who would like
additional data deleted would have to use already-existing methods

Let me know if there are any new comments or I have missed something!

Best,
Christo
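
(For reference, "triggered by an (Incremental)AlterConfig change" means the
disablement starts from an ordinary topic config update, for example via the
Java AdminClient as sketched below. The topic name and bootstrap address are
placeholders; remote.storage.enable is an existing topic config, while
remote.log.disable.policy is the new config proposed by this KIP, so the call
only succeeds against a broker that implements the KIP.)

import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public final class DisableTieringExample {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(
                Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-tiered-topic");
            Collection<AlterConfigOp> ops = List.of(
                // Turn tiering off for the topic...
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "false"),
                                  AlterConfigOp.OpType.SET),
                // ...and choose what happens to already-tiered data (KIP-950).
                new AlterConfigOp(new ConfigEntry("remote.log.disable.policy", "delete"),
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}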

On Mon, 15 Apr 2024 at 12:40, Christo Lolov <ch...@gmail.com> wrote:

> Heya Doguscan,
>
> I believe that the state of the world after this KIP will be the following:
>
> For Zookeeper-backed clusters there will be 3 states: ENABLED, DISABLING
> and DISABLED. We want this because Zookeeper-backed clusters will await
> confirmation from the brokers that they have indeed stopped tiering-related
> operations on the topic.
>
> For KRaft-backed clusters there will be only 2 states: ENABLED and
> DISABLED. KRaft takes a fire-and-forget approach for topic deletion. I
> believe the same approach ought to be taken for tiered topics. To my mind, the
> mechanism that will ensure leftover remote state caused by failures gets
> cleaned up is the retention mechanism. In today's code, a leader
> deletes all segments it finds in remote storage with offsets below the log start
> offset. I believe this will be good enough for cleaning up leftover state
> in remote storage caused by failures.
>
> I know that quite a few changes have been discussed so I will aim to put
> them on paper in the upcoming days and let everyone know!
>
> Best,
> Christo
>
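
(A minimal sketch of the retention-driven cleanup described above, written
against the existing KIP-405 plugin interfaces; the metadata state transitions
and error handling of the real RemoteLogManager are elided, so treat this as
an illustration of the idea rather than the actual code.)

import java.util.Iterator;
import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteStorageManager;

final class LeftoverRemoteSegmentSweep {
    // Delete every remote segment lying entirely below the log start offset.
    // Once disablement has advanced the log start offset, the same sweep also
    // removes whatever a failed disablement left behind in remote storage.
    static void sweep(RemoteLogMetadataManager rlmm,
                      RemoteStorageManager rsm,
                      TopicIdPartition partition,
                      long logStartOffset) throws Exception {
        Iterator<RemoteLogSegmentMetadata> segments =
            rlmm.listRemoteLogSegments(partition);
        while (segments.hasNext()) {
            RemoteLogSegmentMetadata segment = segments.next();
            if (segment.endOffset() < logStartOffset) {
                rsm.deleteLogSegmentData(segment);
            }
        }
    }
}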
> On Tue, 9 Apr 2024 at 14:49, Doğuşcan Namal <na...@gmail.com>
> wrote:
>
>> +1 let's not introduce a new api and mark it immediately as deprecated :)
>>
>> On your second comment, Luke: one thing we need to clarify is when we
>> consider remote storage to be DISABLED for a topic?
>> Particularly, what is the state when the remote storage is being deleted
>> in case of disablement.policy=delete? Is it DISABLING or DISABLED?
>>
>> If we move directly to the DISABLED state,
>>
>> a) in case of failures, the leaders should continue remote storage
>> deletion even if the topic is moved to the DISABLED state, otherwise we
>> risk having stray data on remote storage.
>> b) on each restart, we should initiate the remote storage deletion
>> because although we replayed a record with a DISABLED state, we cannot be
>> sure whether the remote data has been deleted or not.
>>
>> We could either consider keeping the remote topic in DISABLING state
>> until all of the remote storage data is deleted, or we need an additional
>> mechanism to handle the remote stray data.
>>
>> The existing topic deletion, for instance, handles stray logs on disk by
>> detecting them on KafkaBroker startup and deleting them before the
>> ReplicaManager is started.
>> Maybe we need a similar mechanism here as well if we don't want a
>> DISABLING state. Otherwise, we need a callback from Brokers to validate
>> that the remote storage data is deleted before we can move to the DISABLED
>> state.
>>
>> Thanks.
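
(A rough illustration of what such a startup check could look like if
transplanted to remote storage; every name below is hypothetical and nothing
like this exists in the codebase today.)

import java.util.Set;

// Hypothetical startup hook: if a topic reached DISABLED with policy=delete
// but remote metadata still lists segments for it, resume the deletion before
// the broker starts serving. Names and types are illustrative only.
final class StrayRemoteDataCheck {
    private final Set<String> disabledDeleteTopics; // DISABLED + policy=delete
    private final RemoteSegmentIndex remoteIndex;   // hypothetical metadata view

    StrayRemoteDataCheck(Set<String> disabledDeleteTopics, RemoteSegmentIndex remoteIndex) {
        this.disabledDeleteTopics = disabledDeleteTopics;
        this.remoteIndex = remoteIndex;
    }

    void runBeforeReplicaManagerStartup() {
        for (String topic : disabledDeleteTopics) {
            if (remoteIndex.hasSegments(topic)) {
                remoteIndex.scheduleDeletion(topic); // re-initiate interrupted cleanup
            }
        }
    }

    interface RemoteSegmentIndex {
        boolean hasSegments(String topic);
        void scheduleDeletion(String topic);
    }
}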
>>
>> On Tue, 9 Apr 2024 at 12:45, Luke Chen <sh...@gmail.com> wrote:
>>
>>> Hi Christo,
>>>
>>> > I would then opt for moving information from DisableRemoteTopic
>>> within the StopReplicas API which will then disappear in KRaft world as
>>> it
>>> is already scheduled for deprecation. What do you think?
>>>
>>> Sounds good to me.
>>>
>>> Thanks.
>>> Luke
>>>
>>> On Tue, Apr 9, 2024 at 6:46 PM Christo Lolov <ch...@gmail.com>
>>> wrote:
>>>
>>> > Heya Luke!
>>> >
>>> > I thought a bit more about it and I reached the same conclusion as you
>>> for
>>> > 2 as a follow-up from 1. In other words, in KRaft world I don't think
>>> the
>>> > controller needs to wait for acknowledgements for the brokers. All we
>>> care
>>> > about is that the leader (who is responsible for archiving/deleting
>>> data in
>>> > tiered storage) knows about the change and applies it properly. If
>>> there is
>>> > a leadership change halfway through the operation then the new leader
>>> still
>>> > needs to apply the message from the state topic and we know that a
>>> > disable-message will be applied before a reenablement-message. I will
>>> > change the KIP later today/tomorrow morning to reflect this reasoning.
>>> >
>>> > However, with this I believe that introducing a new API just for
>>> > Zookeeper-based clusters (i.e. DisableRemoteTopic) becomes a bit of an
>>> > overkill. I would then opt for moving information from
>>> DisableRemoteTopic
>>> > within the StopReplicas API which will then disappear in KRaft world
>>> as it
>>> > is already scheduled for deprecation. What do you think?
>>> >
>>> > Best,
>>> > Christo
>>> >
>>> > On Wed, 3 Apr 2024 at 07:59, Luke Chen <sh...@gmail.com> wrote:
>>> >
>>> > > Hi Christo,
>>> > >
>>> > > 1. I agree with Doguscan that in KRaft mode, the controller won't
>>> send
>>> > RPCs
>>> > > to the brokers (except in the migration path).
>>> > > So, I think we could adopt an approach similar to the one we used for
>>> > `AlterReplicaLogDirs`
>>> > > (
>>> > > KIP-858
>>> > > <
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft#KIP858:HandleJBODbrokerdiskfailureinKRaft-Intra-brokerreplicamovement
>>> > > >)
>>> > > that lets the broker notify the controller of any update, instead of
>>> controller
>>> > to
>>> > > broker. And once the controller receives all the complete requests
>>> from
>>> > > brokers, it'll enter "Disabled" state. WDYT?
>>> > >
>>> > > 2. Why should we wait for all brokers to respond before moving to
>>> > > "Disabled" state in "KRaft mode"?
>>> > > Currently, only the leader node does the remote log upload/fetch
>>> tasks,
>>> > so
>>> > > does that mean the controller only needs to make sure the leader
>>> completes
>>> > > the stopPartition?
>>> > > If the leader node's stopPartition process triggers a leadership
>>> > > change, then the new leader should receive and apply the configRecord
>>> > > update before the leadership change record based on the KRaft design,
>>> > which
>>> > > means there will be no gap that the follower node becomes the leader
>>> and
>>> > > starting doing unexpected upload/fetch tasks, right?
>>> > > I agree we should make sure that in ZK mode, all brokers have completed the
>>> > > stopPartitions before moving to "Disabled" state because ZK node
>>> watcher
>>> > is
>>> > > working in a separate thread. But not sure about KRaft mode.
>>> > >
>>> > > Thanks.
>>> > > Luke
>>> > >
>>> > >
>>> > > On Fri, Mar 29, 2024 at 4:14 PM Christo Lolov <
>>> christololov@gmail.com>
>>> > > wrote:
>>> > >
>>> > > > Heya everyone!
>>> > > >
>>> > > > re: Doguscan
>>> > > >
>>> > > > I believe the answer to 101 needs a bit more discussion. As far as
>>> I
>>> > > know,
>>> > > > tiered storage today has methods to update the metadata of a segment
>>> to
>>> > say
>>> > > > "hey, I would like this deleted", but actual deletion is left to
>>> plugin
>>> > > > implementations (or any background cleaners). In other words,
>>> there is
>>> > no
>>> > > > "immediate" deletion. In this KIP, we would like to continue doing
>>> the
>>> > > same
>>> > > > if the retention policy is set to delete. So I believe the answer
>>> is
>>> > > > actually that a) we will update the metadata of the segments to
>>> mark
>>> > them
>>> > > > as deleted and b) we will advance the log start offset. Any
>>> deletion of
>>> > > > actual files will still be delegated to plugin implementations. I
>>> > believe
>>> > > > this is further supported by "*remote.log.disable.policy=delete:*
>>> Logs
>>> > > that
>>> > > > are archived in the remote storage will not be part of the
>>> contiguous
>>> > > > "active" log and will be deleted asynchronously as part of the
>>> > > disablement
>>> > > > process"
>>> > > >
>>> > > > Following from the above, I believe for 102 it is fine to allow
>>> setting
>>> > > of
>>> > > > remote.log.disable.policy on a disabled topic in much the same way
>>> we
>>> > > allow
>>> > > > other remote-related configurations to be set on a topic (i.e.
>>> > > > local.retention.*) - it just won't have an effect. Granted, I do
>>> > believe
>>> > > we
>>> > > > should restrict the policy being changed while a disablement is
>>> > ongoing.
>>> > > >
>>> > > > re: Satish and Kamal
>>> > > >
>>> > > > 104, 1 and 2 are fair asks, I will work with Doguscan to update
>>> the KIP
>>> > > > with the information!
>>> > > >
>>> > > > Best,
>>> > > > Christo
>>> > > >
>>> > > > On Thu, 28 Mar 2024 at 10:31, Doğuşcan Namal <
>>> namal.doguscan@gmail.com
>>> > >
>>> > > > wrote:
>>> > > >
>>> > > > > Hi Satish, I will try to answer as much as I can and the others
>>> could
>>> > > > chime
>>> > > > > in with further details.
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > *101. For remote.log.disable.policy=delete: Does it delete the
>>> remote
>>> > > log
>>> > > > > data immediately and the data in remote storage will not be taken
>>> > into
>>> > > > > account by any replica? That means log-start-offset is moved to
>>> the
>>> > > > earlier
>>> > > > > local-log-start-offset.*
>>> > > > > *Exactly. RemoteLogData will be deleted immediately. *
>>> > > > > *So before the deletion starts we move LogStart offset to
>>> > > > > LocalLogStartOffset to ensure that no RemoteLog will be accessed
>>> > after
>>> > > > that
>>> > > > > point.*
>>> > > > >
>>> > > > >
>>> > > > > * 102. Can we update the remote.log.disable.policy after tiered
>>> > storage
>>> > > > is
>>> > > > > disabled on a topic?*
>>> > > > >
>>> > > > > *This is a good point. I think we should not allow modifying this
>>> > > > > configuration*
>>> > > > > *because changing the policy from Deletion to Retain when there
>>> is an
>>> > > > > ongoing Deletion will result in an undefined behaviour and where
>>> we
>>> > > > retain
>>> > > > > half of the remote log and delete the other half.*
>>> > > > >
>>> > > > > * 103. Do we plan to add any metrics related to this feature?*
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > *Any recommendations?*
>>> > > > > *We may emit a gauge showing the enablement state of a topic but
>>> we
>>> > > could
>>> > > > > gather that info from the logs as well.*
>>> > > > > *The total duration for remote topic deletion could be added as
>>> well
>>> > > but
>>> > > > > this is more of a metric for the RemotePartitionRemover itself.*
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > *104. Please add configuration details about copier thread pool,
>>> > > > expiration
>>> > > > > thread pool and the migration of the existing
>>> > > > > RemoteLogManagerScheduledThreadPool.*
>>> > > > >
>>> > > > > *Will add the details.*
>>> > > > >
>>> > > > > 105. How is the behaviour with topic or partition deletion
>>> request
>>> > > > > handled when tiered storage disablement request is still being
>>> > > > > processed on a topic?
>>> > > > >
>>> > > > > *If the disablement policy is Delete then a successive topic
>>> deletion
>>> > > > > request is going to be a NOOP because RemoteLogs is already being
>>> > > > deleted.*
>>> > > > > *If the disablement policy is Retain, then we only moved the
>>> > > > LogStartOffset
>>> > > > > and didn't touch RemoteLogs anyway, so the delete topic request
>>> will
>>> > > > > result*
>>> > > > >
>>> > > > > *in the initiation of RemoteLog deletion.*
>>> > > > >
>>> > > > >
>>> > > > > On Tue, 26 Mar 2024 at 18:21, Kamal Chandraprakash <
>>> > > > > kamal.chandraprakash@gmail.com> wrote:
>>> > > > >
>>> > > > > > Hi,
>>> > > > > >
>>> > > > > > Thanks for the KIP! Overall the KIP looks good and covered
>>> most of
>>> > > the
>>> > > > > > items.
>>> > > > > >
>>> > > > > > 1. Could you explain how the brokers will handle the
>>> > > DisableRemoteTopic
>>> > > > > API
>>> > > > > > request?
>>> > > > > >
>>> > > > > > 2. Who will initiate the controller interaction sequence? Does
>>> the
>>> > > > > > controller listens for
>>> > > > > > topic config updates and initiate the disablement?
>>> > > > > >
>>> > > > > > --
>>> > > > > > Kamal
>>> > > > > >
>>> > > > > > On Tue, Mar 26, 2024 at 4:40 PM Satish Duggana <
>>> > > > satish.duggana@gmail.com
>>> > > > > >
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > > > Thanks Mehari, Divij, Christo etal for the KIP.
>>> > > > > > >
>>> > > > > > > I had an initial review of the KIP and left the below
>>> comments.
>>> > > > > > >
>>> > > > > > > 101. For remote.log.disable.policy=delete:
>>> > > > > > > Does it delete the remote log data immediately and the data
>>> in
>>> > > remote
>>> > > > > > > storage will not be taken into account by any replica? That
>>> means
>>> > > > > > > log-start-offset is moved to the earlier
>>> local-log-start-offset.
>>> > > > > > >
>>> > > > > > > 102. Can we update the remote.log.disable.policy after tiered
>>> > > storage
>>> > > > > > > is disabled on a topic?
>>> > > > > > >
>>> > > > > > > 103. Do we plan to add any metrics related to this feature?
>>> > > > > > >
>>> > > > > > > 104. Please add configuration details about copier thread
>>> pool,
>>> > > > > > > expiration thread pool and the migration of the existing
>>> > > > > > > RemoteLogManagerScheduledThreadPool.
>>> > > > > > >
>>> > > > > > > 105. How is the behaviour with topic or partition deletion
>>> > request
>>> > > > > > > handled when tiered storage disablement request is still
>>> being
>>> > > > > > > processed on a topic?
>>> > > > > > >
>>> > > > > > > ~Satish.
>>> > > > > > >
>>> > > > > > > On Mon, 25 Mar 2024 at 13:34, Doğuşcan Namal <
>>> > > > namal.doguscan@gmail.com
>>> > > > > >
>>> > > > > > > wrote:
>>> > > > > > > >
>>> > > > > > > > Hi Christo and Luke,
>>> > > > > > > >
>>> > > > > > > > I think the KRaft section of the KIP requires slight
>>> > improvement.
>>> > > > The
>>> > > > > > > metadata propagation in KRaft is handled by the RAFT layer
>>> > instead
>>> > > of
>>> > > > > > > sending Controller -> Broker RPCs. In fact, KIP-631
>>> deprecated
>>> > > these
>>> > > > > > RPCs.
>>> > > > > > > >
>>> > > > > > > > I will come up with some recommendations on how we could
>>> > improve
>>> > > > that
>>> > > > > > > one but until then, @Luke please feel free to review the KIP.
>>> > > > > > > >
>>> > > > > > > > @Satish, if we want this to make it to Kafka 3.8 I believe
>>> we
>>> > > need
>>> > > > to
>>> > > > > > > aim to get the KIP approved in the following weeks otherwise
>>> it
>>> > > will
>>> > > > > slip
>>> > > > > > > and we can not support it in Zookeeper mode.
>>> > > > > > > >
>>> > > > > > > > I also would like to better understand what is the
>>> community's
>>> > > > stand
>>> > > > > > for
>>> > > > > > > adding a new feature for Zookeeper since it is marked as
>>> > deprecated
>>> > > > > > already.
>>> > > > > > > >
>>> > > > > > > > Thanks.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Mon, 18 Mar 2024 at 13:42, Christo Lolov <
>>> > > > christololov@gmail.com>
>>> > > > > > > wrote:
>>> > > > > > > >>
>>> > > > > > > >> Heya,
>>> > > > > > > >>
>>> > > > > > > >> I do have some time to put into this, but to be honest I
>>> am
>>> > > still
>>> > > > > > after
>>> > > > > > > >> reviews of the KIP itself :)
>>> > > > > > > >>
>>> > > > > > > >> After the latest changes it ought to be detailing both a
>>> > > Zookeeper
>>> > > > > > > approach
>>> > > > > > > >> and a KRaft approach.
>>> > > > > > > >>
>>> > > > > > > >> Do you have any thoughts on how it could be improved or
>>> > should I
>>> > > > > > start a
>>> > > > > > > >> voting thread?
>>> > > > > > > >>
>>> > > > > > > >> Best,
>>> > > > > > > >> Christo
>>> > > > > > > >>
>>> > > > > > > >> On Thu, 14 Mar 2024 at 06:12, Luke Chen <
>>> showuon@gmail.com>
>>> > > > wrote:
>>> > > > > > > >>
>>> > > > > > > >> > Hi Christo,
>>> > > > > > > >> >
>>> > > > > > > >> > Any update with this KIP?
>>> > > > > > > >> > If you don't have time to complete it, I can collaborate
>>> > with
>>> > > > you
>>> > > > > to
>>> > > > > > > work
>>> > > > > > > >> > on it.
>>> > > > > > > >> >
>>> > > > > > > >> > Thanks.
>>> > > > > > > >> > Luke
>>> > > > > > > >> >
>>> > > > > > > >> > On Wed, Jan 17, 2024 at 11:38 PM Satish Duggana <
>>> > > > > > > satish.duggana@gmail.com>
>>> > > > > > > >> > wrote:
>>> > > > > > > >> >
>>> > > > > > > >> > > Hi Christo,
>>> > > > > > > >> > > Thanks for volunteering to contribute to the KIP
>>> > > discussion. I
>>> > > > > > > suggest
>>> > > > > > > >> > > considering this KIP for both ZK and KRaft as it will
>>> be
>>> > > > helpful
>>> > > > > > for
>>> > > > > > > >> > > this feature to be available in 3.8.0 running with ZK
>>> > > > clusters.
>>> > > > > > > >> > >
>>> > > > > > > >> > > Thanks,
>>> > > > > > > >> > > Satish.
>>> > > > > > > >> > >
>>> > > > > > > >> > > On Wed, 17 Jan 2024 at 19:04, Christo Lolov <
>>> > > > > > christololov@gmail.com
>>> > > > > > > >
>>> > > > > > > >> > > wrote:
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > Hello!
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > I volunteer to get this KIP moving forward and
>>> > implemented
>>> > > > in
>>> > > > > > > Apache
>>> > > > > > > >> > > Kafka
>>> > > > > > > >> > > > 3.8.
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > I have caught up with Mehari offline and we have
>>> agreed
>>> > > that
>>> > > > > > given
>>> > > > > > > >> > Apache
>>> > > > > > > >> > > > Kafka 4.0 being around the corner we would like to
>>> > propose
>>> > > > > this
>>> > > > > > > feature
>>> > > > > > > >> > > > only for KRaft clusters.
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > Any and all reviews and comments are welcome!
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > Best,
>>> > > > > > > >> > > > Christo
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > On Tue, 9 Jan 2024 at 09:44, Doğuşcan Namal <
>>> > > > > > > namal.doguscan@gmail.com>
>>> > > > > > > >> > > > wrote:
>>> > > > > > > >> > > >
>>> > > > > > > >> > > > > Hi everyone, any progress on the status of this
>>> KIP?
>>> > > > Overall
>>> > > > > > > looks
>>> > > > > > > >> > > good to
>>> > > > > > > >> > > > > me but I wonder whether we still need to support
>>> it
>>> > for
>>> > > > > > > Zookeeper
>>> > > > > > > >> > mode
>>> > > > > > > >> > > > > given that it will be deprecated in the next 3
>>> months.
>>> > > > > > > >> > > > >
>>> > > > > > > >> > > > > On 2023/07/21 20:16:46 "Beyene, Mehari" wrote:
>>> > > > > > > >> > > > > > Hi everyone,
>>> > > > > > > >> > > > > > I would like to start a discussion on KIP-950:
>>> > Tiered
>>> > > > > > Storage
>>> > > > > > > >> > > Disablement
>>> > > > > > > >> > > > > (
>>> > > > > > > >> > > > >
>>> > > > > > > >> > > > >
>>> > > > > > > >> > >
>>> > > > > > > >> >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-950%3A++Tiered+Storage+Disablement
>>> > > > > > > >> > > > > ).
>>> > > > > > > >> > > > > >
>>> > > > > > > >> > > > > > This KIP proposes adding the ability to disable
>>> and
>>> > > > > > re-enable
>>> > > > > > > >> > tiered
>>> > > > > > > >> > > > > storage on a topic.
>>> > > > > > > >> > > > > >
>>> > > > > > > >> > > > > > Thanks,
>>> > > > > > > >> > > > > > Mehari
>>> > > > > > > >> > > > > >
>>> > > > > > > >> > > > >
>>> > > > > > > >> > >
>>> > > > > > > >> >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Christo Lolov <ch...@gmail.com>.
Heya Doguscan,

I believe that the state of the world after this KIP will be the following:

For Zookeeper-backed clusters there will be 3 states: ENABLED, DISABLING
and DISABLED. We want this because Zookeeper-backed clusters will await a
confirmation from the brokers that they have indeed stopped tiering-related
operations on the topic.

For KRaft-backed clusters there will be only 2 states: ENABLED and
DISABLED. KRaft takes a fire-and-forget approach to topic deletion, and I
believe the same approach ought to be taken for tiered topics. To me, the
mechanism which ensures that leftover state in remote storage due to
failures is cleaned up is the retention mechanism. In today's code, a
leader deletes all segments it finds in remote storage with offsets below
the log start offset. I believe this will be good enough for cleaning up
leftover state in remote storage after failures.
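As a sketch of that cleanup path (a simplification of mine with made-up
types; the real logic lives in the remote log management layer):

    import java.util.List;

    // The leader deletes any remote segment that ends below the log start
    // offset, which also sweeps up segments left over from a failed
    // disablement. Actual file removal stays with the plugin.
    final class RemoteRetentionSweep {

        record RemoteSegment(long baseOffset, long endOffset) { }

        interface RemoteDeleter {
            void delete(RemoteSegment segment);
        }

        static void cleanUpBelowLogStartOffset(List<RemoteSegment> segments,
                                               long logStartOffset,
                                               RemoteDeleter deleter) {
            for (RemoteSegment segment : segments) {
                if (segment.endOffset() < logStartOffset) {
                    deleter.delete(segment);
                }
            }
        }
    }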

I know that quite a few changes have been discussed so I will aim to put
them on paper in the upcoming days and let everyone know!

Best,
Christo


Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Doğuşcan Namal <na...@gmail.com>.
+1, let's not introduce a new API and mark it immediately as deprecated :)

On your second comment Luke, one thing we need to clarify is when we
consider remote storage to be DISABLED for a topic.
In particular, what is the state while the remote storage is being deleted
in the case of remote.log.disable.policy=delete? Is it DISABLING or
DISABLED?
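(For concreteness, by "disablement" I mean a topic config change along the
following lines. This is a sketch assuming the config names proposed in the
KIP; remote.log.disable.policy is not a released config yet:)

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    // Sketch: disable tiered storage on a topic and request deletion of
    // the remote data via the proposed remote.log.disable.policy config.
    public class DisableTieredStorage {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                Collection<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(
                        new ConfigEntry("remote.storage.enable", "false"),
                        AlterConfigOp.OpType.SET),
                    new AlterConfigOp(
                        new ConfigEntry("remote.log.disable.policy", "delete"),
                        AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }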

If we move directly to the DISABLED state,

a) in case of failures, the leaders should continue remote storage deletion
even if the topic has moved to the DISABLED state, otherwise we risk leaving
stray data in remote storage.
b) on each restart, we should initiate remote storage deletion because,
although we replayed a record with a DISABLED state, we cannot be sure
whether the remote data has been deleted or not.

We could either keep the remote topic in the DISABLING state until all of
the remote storage data is deleted, or we need an additional mechanism to
handle the stray remote data.

The existing topic deletion, for instance, handles stray logs on disk by
detecting them on KafkaBroker startup and deleting them before the
ReplicaManager is started.
Maybe we need a similar mechanism here as well if we don't want a DISABLING
state. Otherwise, we need a callback from the brokers to validate that the
remote storage data is deleted before we can move to the DISABLED state.
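A sketch of what such a startup hook could look like for remote data (the
names below are hypothetical, apart from RemotePartitionRemover, which the
KIP already uses):

    import java.util.Set;

    // On broker startup, re-initiate remote deletion for every topic whose
    // replayed state is DISABLED with a delete policy: we cannot know
    // whether a previous deletion attempt completed, so the operation has
    // to be safe to repeat.
    final class StrayRemoteDataCheck {

        interface RemotePartitionRemover {
            void scheduleDeletion(String topic); // must be idempotent
        }

        static void onStartup(Set<String> disabledWithDeletePolicy,
                              RemotePartitionRemover remover) {
            for (String topic : disabledWithDeletePolicy) {
                remover.scheduleDeletion(topic);
            }
        }
    }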

Thanks.


Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Luke Chen <sh...@gmail.com>.
Hi Christo,

> I would then opt for moving the information from DisableRemoteTopic
> into the StopReplicas API, which will then disappear in the KRaft world
> as it is already scheduled for deprecation. What do you think?

Sounds good to me.

Thanks.
Luke

On Tue, Apr 9, 2024 at 6:46 PM Christo Lolov <ch...@gmail.com> wrote:

> Heya Luke!
>
> I thought a bit more about it and I reached the same conclusion as you for
> 2 as a follow-up from 1. In other words, in KRaft world I don't think the
> controller needs to wait for acknowledgements for the brokers. All we care
> about is that the leader (who is responsible for archiving/deleting data in
> tiered storage) knows about the change and applies it properly. If there is
> a leadership change halfway through the operation then the new leader still
> needs to apply the message from the state topic and we know that a
> disable-message will be applied before a reenablement-message. I will
> change the KIP later today/tomorrow morning to reflect this reasoning.
>
> However, with this I believe that introducing a new API just for
> Zookeeper-based clusters (i.e. DisableRemoteTopic) becomes a bit of an
> overkill. I would then opt for moving information from DisableRemoteTopic
> within the StopReplicas API which will then disappear in KRaft world as it
> is already scheduled for deprecation. What do you think?
>
> Best,
> Christo
>
> On Wed, 3 Apr 2024 at 07:59, Luke Chen <sh...@gmail.com> wrote:
>
> > Hi Christo,
> >
> > 1. I agree with Doguscan that in KRaft mode, the controller won't send
> RPCs
> > to the brokers (except in the migration path).
> > So, I think we could adopt the similar way we did to
> `AlterReplicaLogDirs`
> > (
> > KIP-858
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft#KIP858:HandleJBODbrokerdiskfailureinKRaft-Intra-brokerreplicamovement
> > >)
> > that let the broker notify controller any update, instead of controller
> to
> > broker. And once the controller receives all the complete requests from
> > brokers, it'll enter "Disabled" state. WDYT?
> >
> > 2. Why should we wait until all brokers to respond before moving to
> > "Disabled" state in "KRaft mode"?
> > Currently, only the leader node does the remote log upload/fetch tasks,
> so
> > does that mean the controller only need to make sure the leader completes
> > the stopPartition?
> > If during the leader node stopPartition process triggered leadership
> > change, then the new leader should receive and apply the configRecord
> > update before the leadership change record based on the KRaft design,
> which
> > means there will be no gap that the follower node becomes the leader and
> > starting doing unexpected upload/fetch tasks, right?
> > I agree we should make sure in ZK mode, all brokers are completed the
> > stopPartitions before moving to "Disabled" state because ZK node watcher
> is
> > working in a separate thread. But not sure about KRaft mode.
> >
> > Thanks.
> > Luke
> >
> >
> > On Fri, Mar 29, 2024 at 4:14 PM Christo Lolov <ch...@gmail.com>
> > wrote:
> >
> > > Heya everyone!
> > >
> > > re: Doguscan
> > >
> > > I believe the answer to 101 needs a bit more discussion. As far as I
> > know,
> > > tiered storage today has methods to update a metadata of a segment to
> say
> > > "hey, I would like this deleted", but actual deletion is left to plugin
> > > implementations (or any background cleaners). In other words, there is
> no
> > > "immediate" deletion. In this KIP, we would like to continue doing the
> > same
> > > if the retention policy is set to delete. So I believe the answer is
> > > actually that a) we will update the metadata of the segments to mark
> them
> > > as deleted and b) we will advance the log start offset. Any deletion of
> > > actual files will still be delegated to plugin implementations. I
> believe
> > > this is further supported by "*remote.log.disable.policy=delete:* Logs
> > that
> > > are archived in the remote storage will not be part of the contiguous
> > > "active" log and will be deleted asynchronously as part of the
> > disablement
> > > process"
> > >
> > > Following from the above, I believe for 102 it is fine to allow setting
> > of
> > > remote.log.disable.policy on a disabled topic in much the same way we
> > allow
> > > other remote-related configurations to be set on a topic (i.e.
> > > local.retention.*) - it just won't have an effect. Granted, I do
> believe
> > we
> > > should restrict the policy being changed while a disablement is
> ongoing.
> > >
> > > re: Satish and Kamal
> > >
> > > 104, 1 and 2 are fair asks, I will work with Doguscan to update the KIP
> > > with the information!
> > >
> > > Best,
> > > Christo
> > >
> > > On Thu, 28 Mar 2024 at 10:31, Doğuşcan Namal <namal.doguscan@gmail.com
> >
> > > wrote:
> > >
> > > > Hi Satish, I will try to answer as much as I can and the others could
> > > chime
> > > > in with further details.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > *101. For remote.log.disable.policy=delete: Does it delete the remote
> > log
> > > > data immediately and the data in remote storage will not be taken
> into
> > > > account by any replica? That means log-start-offset is moved to the
> > > earlier
> > > > local-log-start-offset.*
> > > > *Exactly. RemoteLogData will be deleted immediately. *
> > > > *So before the deletion starts we move LogStart offset to
> > > > LocalLogStartOffset to ensure that no RemoteLog will be accessed
> after
> > > that
> > > > point.*
> > > >
> > > >
> > > > * 102. Can we update the remote.log.disable.policy after tiered
> storage
> > > is
> > > > disabled on a topic?*
> > > >
> > > > *This is a good point. I think we should not allow modifying this
> > > > configuration*
> > > > *because changing the policy from Deletion to Retain when there is an
> > > > ongoing Deletion will result in an undefined behaviour and where we
> > > retain
> > > > half of the remote log and delete the other half.*
> > > >
> > > > * 103. Do we plan to add any metrics related to this feature?*
> > > >
> > > >
> > > >
> > > > *Any recommendations?*
> > > > *We may emit a gauge showing the enablement state of a topic but we
> > could
> > > > gather that info from the logs as well.*
> > > > *The total duration for remote topic deletion could be added as well
> > but
> > > > this is more of a metric for the RemotePartitionRemover itself.*
> > > >
> > > >
> > > >
> > > > *104. Please add configuration details about copier thread pool,
> > > expiration
> > > > thread pool and the migration of the existing
> > > > RemoteLogManagerScheduledThreadPool.*
> > > >
> > > > *Will add the details.*
> > > >
> > > > 105. How is the behaviour with topic or partition deletion request
> > > > handled when tiered storage disablement request is still being
> > > > processed on a topic?
> > > >
> > > > *If the disablement policy is Delete then a successive topic deletion
> > > > request is going to be a NOOP because RemoteLogs is already being
> > > deleted.*
> > > > *If the disablement policy is Retain, then we only moved the
> > > LogStartOffset
> > > > and didn't touch RemoteLogs anyway, so the delete topic request will
> > > > result*
> > > >
> > > > *in the initiation of RemoteLog deletion.*
> > > >
> > > >
> > > > On Tue, 26 Mar 2024 at 18:21, Kamal Chandraprakash <
> > > > kamal.chandraprakash@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for the KIP! Overall the KIP looks good and covered most of
> > the
> > > > > items.
> > > > >
> > > > > 1. Could you explain how the brokers will handle the
> > DisableRemoteTopic
> > > > API
> > > > > request?
> > > > >
> > > > > 2. Who will initiate the controller interaction sequence? Does the
> > > > > controller listens for
> > > > > topic config updates and initiate the disablement?
> > > > >
> > > > > --
> > > > > Kamal
> > > > >
> > > > > On Tue, Mar 26, 2024 at 4:40 PM Satish Duggana <
> > > satish.duggana@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks Mehari, Divij, Christo etal for the KIP.
> > > > > >
> > > > > > I had an initial review of the KIP and left the below comments.
> > > > > >
> > > > > > 101. For remote.log.disable.policy=delete:
> > > > > > Does it delete the remote log data immediately and the data in
> > remote
> > > > > > storage will not be taken into account by any replica? That means
> > > > > > log-start-offset is moved to the earlier local-log-start-offset.
> > > > > >
> > > > > > 102. Can we update the remote.log.disable.policy after tiered
> > storage
> > > > > > is disabled on a topic?
> > > > > >
> > > > > > 103. Do we plan to add any metrics related to this feature?
> > > > > >
> > > > > > 104. Please add configuration details about copier thread pool,
> > > > > > expiration thread pool and the migration of the existing
> > > > > > RemoteLogManagerScheduledThreadPool.
> > > > > >
> > > > > > 105. How is the behaviour with topic or partition deletion
> request
> > > > > > handled when tiered storage disablement request is still being
> > > > > > processed on a topic?
> > > > > >
> > > > > > ~Satish.
> > > > > >
> > > > > > On Mon, 25 Mar 2024 at 13:34, Doğuşcan Namal <
> > > namal.doguscan@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Christo and Luke,
> > > > > > >
> > > > > > > I think the KRaft section of the KIP requires slight
> improvement.
> > > The
> > > > > > metadata propagation in KRaft is handled by the RAFT layer
> instead
> > of
> > > > > > sending Controller -> Broker RPCs. In fact, KIP-631 deprecated
> > these
> > > > > RPCs.
> > > > > > >
> > > > > > > I will come up with some recommendations on how we could
> improve
> > > that
> > > > > > one but until then, @Luke please feel free to review the KIP.
> > > > > > >
> > > > > > > @Satish, if we want this to make it to Kafka 3.8, I believe we
> > > > > > > need to aim to get the KIP approved in the following weeks,
> > > > > > > otherwise it will slip and we cannot support it in Zookeeper mode.
> > > > > > >
> > > > > > > I would also like to better understand where the community stands
> > > > > > > on adding a new feature for Zookeeper since it is marked as
> > > > > > > deprecated already.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, 18 Mar 2024 at 13:42, Christo Lolov <christololov@gmail.com>
> > > > > > > wrote:
> > > > > > >>
> > > > > > >> Heya,
> > > > > > >>
> > > > > > >> I do have some time to put into this, but to be honest I am still
> > > > > > >> after reviews of the KIP itself :)
> > > > > > >>
> > > > > > >> After the latest changes it ought to be detailing both a
> > > > > > >> Zookeeper approach and a KRaft approach.
> > > > > > >>
> > > > > > >> Do you have any thoughts on how it could be improved, or should
> > > > > > >> I start a voting thread?
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Christo
> > > > > > >>
> > > > > > >> On Thu, 14 Mar 2024 at 06:12, Luke Chen <sh...@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > Hi Christo,
> > > > > > >> >
> > > > > > >> > Any update with this KIP?
> > > > > > >> > If you don't have time to complete it, I can collaborate with
> > > > > > >> > you to work on it.
> > > > > > >> >
> > > > > > >> > Thanks.
> > > > > > >> > Luke
> > > > > > >> >
> > > > > > >> > On Wed, Jan 17, 2024 at 11:38 PM Satish Duggana <satish.duggana@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > Hi Christo,
> > > > > > >> > > Thanks for volunteering to contribute to the KIP discussion.
> > > > > > >> > > I suggest considering this KIP for both ZK and KRaft, as it
> > > > > > >> > > will be helpful for this feature to be available in 3.8.0
> > > > > > >> > > running with ZK clusters.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Satish.
> > > > > > >> > >
> > > > > > >> > > On Wed, 17 Jan 2024 at 19:04, Christo Lolov <christololov@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > Hello!
> > > > > > >> > > >
> > > > > > >> > > > I volunteer to get this KIP moving forward and implemented
> > > > > > >> > > > in Apache Kafka 3.8.
> > > > > > >> > > >
> > > > > > >> > > > I have caught up with Mehari offline and we have agreed
> > > > > > >> > > > that, given Apache Kafka 4.0 being around the corner, we
> > > > > > >> > > > would like to propose this feature only for KRaft clusters.
> > > > > > >> > > >
> > > > > > >> > > > Any and all reviews and comments are welcome!
> > > > > > >> > > >
> > > > > > >> > > > Best,
> > > > > > >> > > > Christo
> > > > > > >> > > >
> > > > > > >> > > > On Tue, 9 Jan 2024 at 09:44, Doğuşcan Namal <namal.doguscan@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi everyone, any progress on the status of this KIP?
> > > > > > >> > > > > Overall it looks good to me, but I wonder whether we still
> > > > > > >> > > > > need to support it for Zookeeper mode given that it will
> > > > > > >> > > > > be deprecated in the next 3 months.
> > > > > > >> > > > >
> > > > > > >> > > > > On 2023/07/21 20:16:46 "Beyene, Mehari" wrote:
> > > > > > >> > > > > > Hi everyone,
> > > > > > >> > > > > > I would like to start a discussion on KIP-950: Tiered
> > > > > > >> > > > > > Storage Disablement (
> > > > > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-950%3A++Tiered+Storage+Disablement
> > > > > > >> > > > > > ).
> > > > > > >> > > > > >
> > > > > > >> > > > > > This KIP proposes adding the ability to disable and
> > > > > > >> > > > > > re-enable tiered storage on a topic.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thanks,
> > > > > > >> > > > > > Mehari
> > > > > > >> > > > > >
> > > > > > >> > > > >

Re: [DISCUSS] KIP-950: Tiered Storage Disablement

Posted by Christo Lolov <ch...@gmail.com>.
Heya Luke!

I thought a bit more about it and I reached the same conclusion as you for
2 as a follow-up from 1. In other words, in the KRaft world I don't think
the controller needs to wait for acknowledgements from the brokers. All we
care about is that the leader (who is responsible for archiving/deleting
data in tiered storage) knows about the change and applies it properly. If
there is a leadership change halfway through the operation, then the new
leader still needs to apply the message from the metadata log and we know
that a disable-message will be applied before a reenablement-message. I
will change the KIP later today/tomorrow morning to reflect this reasoning.
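A minimal sketch of why that ordering argument holds (all type names below
are invented stand-ins for the real metadata records): a KRaft broker
replays committed metadata records strictly in log-offset order, so a
disable config record written before a leader-change record is always
applied first.

import java.util.List;

// Invented stand-ins for the real metadata records.
interface MetadataRecord {}
record TopicConfigChange(String topic, String key, String value) implements MetadataRecord {}
record LeaderChange(String topic, int partition, int newLeaderId) implements MetadataRecord {}

final class BrokerMetadataReplay {
    private boolean remoteStorageEnabled = true;
    private final int brokerId;

    BrokerMetadataReplay(int brokerId) {
        this.brokerId = brokerId;
    }

    // Records arrive in committed log-offset order -- the property the
    // whole argument rests on.
    void replay(List<MetadataRecord> recordsInOffsetOrder) {
        for (MetadataRecord r : recordsInOffsetOrder) {
            if (r instanceof TopicConfigChange c
                    && "remote.storage.enable".equals(c.key())) {
                remoteStorageEnabled = Boolean.parseBoolean(c.value());
            } else if (r instanceof LeaderChange l && l.newLeaderId() == brokerId) {
                // The disable ConfigRecord sits at an earlier offset than
                // this leadership change, so by the time this replica is
                // promoted it already knows tiering is off and schedules
                // no remote upload/fetch tasks.
                if (remoteStorageEnabled) {
                    startRemoteLogTasks(l.topic(), l.partition());
                }
            }
        }
    }

    private void startRemoteLogTasks(String topic, int partition) {
        System.out.printf("starting remote tasks for %s-%d%n", topic, partition);
    }
}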

However, with this I believe that introducing a new API just for
Zookeeper-based clusters (i.e. DisableRemoteTopic) becomes a bit of
overkill. I would then opt for moving the information from
DisableRemoteTopic into the StopReplicas API, which will then disappear in
the KRaft world as it is already scheduled for deprecation. What do you
think?
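For illustration, a hedged sketch of what folding that information into
StopReplicas might look like on the broker side; the disableRemoteLog field
and every type below are invented, and the actual schema change would need
to be spelled out in the KIP.

import java.util.List;

// Invented stand-in for a StopReplicas partition entry with the extra flag.
final class StopReplicaPartitionState {
    final String topic;
    final int partition;
    final boolean deletePartition;   // existing StopReplicas semantics
    final boolean disableRemoteLog;  // hypothetical new field

    StopReplicaPartitionState(String topic, int partition,
                              boolean deletePartition, boolean disableRemoteLog) {
        this.topic = topic;
        this.partition = partition;
        this.deletePartition = deletePartition;
        this.disableRemoteLog = disableRemoteLog;
    }
}

final class StopReplicaHandler {
    void handle(List<StopReplicaPartitionState> partitions) {
        for (StopReplicaPartitionState p : partitions) {
            if (p.deletePartition) {
                System.out.printf("deleting local replica %s-%d%n",
                                  p.topic, p.partition);
            } else if (p.disableRemoteLog) {
                // Stop tiering work only; the local replica keeps serving
                // reads and writes while remote data is handled per policy.
                System.out.printf("stopping remote log tasks for %s-%d%n",
                                  p.topic, p.partition);
            }
        }
    }
}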

Best,
Christo
