Posted to dev@kafka.apache.org by David Arthur <da...@confluent.io> on 2020/04/07 19:43:29 UTC

[DISCUSS] KIP-589 Add API to Update Replica State in Controller

Hey everyone,

I'd like to start the discussion for KIP-589, part of the KIP-500 effort

https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller

This is a proposal to use a new RPC instead of ZooKeeper for notifying the
controller of an offline replica. Please give a read and let me know your
thoughts.

Thanks!
David

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Jose Garcia Sancio <js...@confluent.io>.
Hi David,

Thanks for the KIP.

> ReplicaStateEventResponse => ErrorCode [Topic [PartitionId]]
>    ErrorCode => Int32
>    Topic => String
>    PartitionId => Int32
> ...
> Partition-level errors:

Based on my understanding of the response, it doesn't look like the
controller has a way of encoding these partition-level errors.
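For illustration, a response shape that could carry these partition-level
errors might look like the following, in the same notation as above (this is
only a sketch of one possible revision, not text from the KIP):

```
ReplicaStateEventResponse => ErrorCode [Topic [PartitionId PartitionErrorCode]]
   ErrorCode => Int32             // top-level error for the whole request
   Topic => String
   PartitionId => Int32
   PartitionErrorCode => Int32    // per-partition error, 0 on success
```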

> INVALID_REQUEST: The update was rejected due to some internal inconsistency (e.g. invalid replicas specified in the ISR)

What do you mean by "invalid replicas specified in the ISR"? There is
no reference to ISR in the request. Related to this, can you mention
how the broker/replica will handle the errors included in the
ReplicaStateEventResponse?

> In this proposal, we now include the broker ID and epoch in the request, so the controller can safely update its internal replica state based on the request data.
> ...
> If the controller reads the ReplicaStateEvent and encounters a fatal error before handling the subsequent ControllerEvent, a new controller will eventually be elected and the state of all replicas will become known to the new controller.

I think we should mention that the controller will keep its current
implementation of marking the replicas as offline because of failure
in the LeaderAndIsr response.

-- 
-Jose

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by David Arthur <mu...@gmail.com>.
Thanks, Jason. Good feedback.

1. I was mostly referring to the fact that the ReplicaManager uses a
background thread to send the ZK notification and it really has no
visibility as to whether the ZK operation succeeded or not. We'll most
likely want to continue using a background thread for batching purposes
with the new RPC. Retries make sense as well.

2. Yes, I'll change that

3. Thanks, I neglected to mention this. Indeed I was considering
ControlledShutdown when originally thinking about this KIP. A Future Work
section is a good idea, I'll add one.
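For illustration, the background-thread batching and retry behavior from
point 1 could be sketched roughly as follows (Python; all names here are
hypothetical, not from the KIP or the Kafka codebase):

```python
import queue
import time

class ReplicaStateSender:
    """Illustrative sketch only: offline replicas are queued by the
    log-dir failure handler, drained into one batched RPC, and retried
    until the controller acknowledges the batch."""

    def __init__(self, send_rpc, retry_backoff_s=0.1):
        self._pending = queue.Queue()
        self._send_rpc = send_rpc  # callable(batch) -> True once the controller acks
        self._backoff = retry_backoff_s

    def mark_offline(self, topic_partition):
        # Called by the failure handler; never blocks the caller.
        self._pending.put(topic_partition)

    def run_once(self):
        # Drain everything queued so far into a single batch.
        batch = []
        while True:
            try:
                batch.append(self._pending.get_nowait())
            except queue.Empty:
                break
        # Retry the whole batch until the controller acknowledges it.
        while batch and not self._send_rpc(batch):
            time.sleep(self._backoff)
```

A real implementation would of course hang off the broker's existing
request machinery; the sketch only shows the queue-drain batching and
the retry-until-acknowledged loop.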

On Tue, May 19, 2020 at 2:58 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi David,
>
> This looks good. I just have a few comments:
>
> 1. I'm not sure it's totally fair to describe the current notification
> mechanism as "best-effort." At least it guarantees that the controller will
> eventually see the event. In any case, I think we might want a stronger
> contract going forward. As long as the broker remains the leader for
> partitions in offline log directories, it seems like we should retry the
> AlterReplicaState requests.
> 2. Should we consider a new name for `UNKNOWN_REPLICA_EVENT_TYPE`? Maybe
> `UNKNOWN_REPLICA_STATE`?
> 3. Mostly an observation, but there is some overlap with this API and
> ControlledShutdown. From the controller's perspective, the intent is mostly
> the same. I guess we could treat a null array in the request as an intent
> to shutdown all replicas if we wanted to try and converge these APIs. One
> of the differences is that ControlledShutdown is a synchronous API, but I
> think it would have actually been better as an asynchronous API since
> historically we have run into problems with timeouts. Anyway, this is
> outside the scope of this KIP, but might be worth mentioning as "Future
> work" somewhere.
>
> Thanks,
> Jason
>
>
> On Mon, May 18, 2020 at 10:09 AM David Arthur <mu...@gmail.com> wrote:
>
> > I've updated the KIP with the feedback from this discussion
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > .
> > I'll send out the vote thread shortly.
> >
> > Thanks again,
> > David
> >
> > On Tue, May 5, 2020 at 10:34 AM Tom Bentley <tb...@redhat.com> wrote:
> >
> > > Hi Colin,
> > >
> > > Yeah, that makes sense, thanks. I was thinking, longer term, that there
> > are
> > > other benefits to having the log dir information available to the
> > > controller. For example it would allow the possibility for CREATE_TOPIC
> > > requests to include the intended log dir for each replica. But that's
> > > obviously completely out of scope for this KIP.
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Mon, May 4, 2020 at 10:11 PM Colin McCabe <cm...@apache.org>
> wrote:
> > >
> > > > Hi Tom,
> > > >
> > > > As you said, the controller doesn't know about log directories,
> > although
> > > > individual brokers do.  So the brokers do currently have to enumerate
> > all
> > > > the partitions that need to be removed to the controllers explicitly.
> > So
> > > > this KIP isn't changing anything in that regard.
> > > >
> > > > The current flow is:
> > > > 1. ping ZK back-channel
> > > > 2. controller sends a full LeaderAndIsrRequest to the broker
> > > > 3. the broker sends a full response containing error codes for all
> > > > partitions.  Partitions on the failed storage have a nonzero error
> > code;
> > > > the others have 0.
> > > >
> > > > The new flow is:
> > > > 1. the broker sends an RPC with all the failed partitions
> > > >
> > > > So the new flow actually substantially reduces the amount of network
> > > > traffic, since previously we sent a full LeaderAndIsrRequest
> containing
> > > all
> > > > of the partitions.  Now we just send all the partitions in the failed
> > > > storage directory.  That could still be a lot, but certainly only be
> a
> > > > fraction of what a full LeaderAndIsrRequest would have.
> > > >
> > > > Sorry if I'm repeating stuff you already figured out, but I just
> wanted
> > > to
> > > > be more clear about this (I found it confusing too until David
> > explained
> > > it
> > > > to me originally...)
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> > > > > Hi David,
> > > > >
> > > > > > In the rejecting the alternative of having an RPC for log dir
> > > failures
> > > > > > you say
> > > > > >
> > > > > > I guess what I really mean here is that I wanted to avoid
> exposing
> > > the
> > > > > > notion of a log dir to the controller. I can update the
> description
> > > to
> > > > > > reflect this.
> > > > > >
> > > > >
> > > > > Ah, I think I see now. While each broker knows about its log dirs
> > this
> > > > > isn't something that's stored in zookeeper or known to the
> > controller.
> > > > >
> > > > >
> > > > > > > It's also not completely clear that the cost of having to
> > enumerate
> > > > all
> > > > > > the partitions on a log dir was weighed against the perceived
> > benefit
> > > > of a
> > > > > > more flexible RPC.
> > > > > >
> > > > > > The enumeration isn't strictly required. In the "RPC semantics"
> > > > section, I
> > > > > > mention that if no topics are present in the RPC request, then
> all
> > > > topics
> > > > > > on the broker are implied. And if a topic is given with no
> > > partitions,
> > > > all
> > > > > > partitions for that topic (on the broker) are implied. Does this
> > make
> > > > > > sense?
> > > > > >
> > > > >
> > > > > So the no-topics-present optimisation wouldn't be available to a
> > broker
> > > > > with >1 log dirs where only some of the log dirs failed. I don't
> > > suppose
> > > > > that's a problem though.
> > > > >
> > > > > Thanks again,
> > > > >
> > > > > Tom
> > > > >
> > > > >
> > > > > On Fri, May 1, 2020 at 5:48 PM David Arthur <mu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Jose/Colin/Tom, thanks for the feedback!
> > > > > >
> > > > > > > Partition level errors
> > > > > >
> > > > > > This was an oversight on my part, I meant to include these in the
> > > > response
> > > > > > RPC. I'll update that.
> > > > > >
> > > > > > > INVALID_REQUEST
> > > > > >
> > > > > > I'll update this text description, that was a copy/paste left
> over
> > > > > >
> > > > > > I think we should mention that the controller will keep its
> > > current
> > > > > > implementation of marking the replicas as offline because of
> > failure
> > > > in the
> > > > > > LeaderAndIsr response.
> > > > > >
> > > > > > Good suggestions, I'll add that.
> > > > > >
> > > > > > > Does EventType need to be an Int32?
> > > > > >
> > > > > > No, it doesn't. I'll update to Int8. Do we have an example of the
> > > enum
> > > > > > paradigm in our RPC today? I'm curious if we actually map it to a
> > > real
> > > > Java
> > > > > > enum in the AbstractRequest/Response classes.
> > > > > >
> > > > > > > AlterReplicaStates
> > > > > >
> > > > > > Sounds good to me.
> > > > > >
> > > > > > > In the rejecting the alternative of having an RPC for log dir
> > > > failures
> > > > > > you say
> > > > > >
> > > > > > I guess what I really mean here is that I wanted to avoid
> exposing
> > > the
> > > > > > notion of a log dir to the controller. I can update the
> description
> > > to
> > > > > > reflect this.
> > > > > >
> > > > > > > It's also not completely clear that the cost of having to
> > enumerate
> > > > all
> > > > > > the partitions on a log dir was weighed against the perceived
> > benefit
> > > > of a
> > > > > > more flexible RPC.
> > > > > >
> > > > > > The enumeration isn't strictly required. In the "RPC semantics"
> > > > section, I
> > > > > > mention that if no topics are present in the RPC request, then
> all
> > > > topics
> > > > > > on the broker are implied. And if a topic is given with no
> > > partitions,
> > > > all
> > > > > > partitions for that topic (on the broker) are implied. Does this
> > make
> > > > > > sense?
> > > > > >
> > > > > > Thanks again! I'll update the KIP and leave a message here once
> > it's
> > > > > > revised.
> > > > > >
> > > > > > David
> > > > > >
> > > > > > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <
> tbentley@redhat.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Thanks for the KIP!
> > > > > > >
> > > > > > > In the rejecting the alternative of having an RPC for log dir
> > > > failures
> > > > > > you
> > > > > > > say:
> > > > > > >
> > > > > > > It was also rejected to prevent "leaking" the notion of a log
> dir
> > > to
> > > > the
> > > > > > > > public API.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not quite sure I follow that argument, since we already
> have
> > > > RPCs for
> > > > > > > changing replica log dirs. So in a general sense log dirs
> already
> > > > exist
> > > > > > in
> > > > > > > the API. I suspect you were using public API to mean something
> > more
> > > > > > > specific; could you elaborate?
> > > > > > >
> > > > > > > It's also not completely clear that the cost of having to
> > enumerate
> > > > all
> > > > > > the
> > > > > > > partitions on a log dir was weighed against the perceived
> benefit
> > > of
> > > > a
> > > > > > more
> > > > > > > flexible RPC. (I'm sure it was, but it would be good to say
> so).
> > > > > > >
> > > > > > > Many thanks,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <
> > cmccabe@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for the KIP!
> > > > > > > >
> > > > > > > > I think the ReplicaStateEventResponse should have a separate
> > > error
> > > > code
> > > > > > > > for each partition.
> > > > > > > >  Currently it just has one error code for the whole
> > > > request/response,
> > > > > > if
> > > > > > > > I'm reading this right.  I think Jose made a similar point as
> > > > well.  We
> > > > > > > > should plan for scenarios where some replica states can be
> > > changed
> > > > and
> > > > > > > some
> > > > > > > > can't.
> > > > > > > >
> > > > > > > > Does EventType need to be an Int32?  For enums, we usually
> use
> > > the
> > > > > > > > smallest reasonable type, which would be Int8 here.  We can
> > > always
> > > > > > change
> > > > > > > > the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems
> > > > > > unnecessary
> > > > > > > > since INVALID_REQUEST covers this case.
> > > > > > > >
> > > > > > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a
> > > > slightly
> > > > > > > > better name for this RPC.
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > > > > > Hey everyone,
> > > > > > > > >
> > > > > > > > > I'd like to start the discussion for KIP-589, part of the
> > > KIP-500
> > > > > > > effort
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > > >
> > > > > > > > > This is a proposal to use a new RPC instead of ZooKeeper
> for
> > > > > > notifying
> > > > > > > > the
> > > > > > > > > controller of an offline replica. Please give a read and
> let
> > me
> > > > know
> > > > > > > your
> > > > > > > > > thoughts.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > David
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > David Arthur
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > David Arthur
> >
>


-- 
David Arthur

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Jason Gustafson <ja...@confluent.io>.
Hi David,

This looks good. I just have a few comments:

1. I'm not sure it's totally fair to describe the current notification
mechanism as "best-effort." At least it guarantees that the controller will
eventually see the event. In any case, I think we might want a stronger
contract going forward. As long as the broker remains the leader for
partitions in offline log directories, it seems like we should retry the
AlterReplicaState requests.
2. Should we consider a new name for `UNKNOWN_REPLICA_EVENT_TYPE`? Maybe
`UNKNOWN_REPLICA_STATE`?
3. Mostly an observation, but there is some overlap with this API and
ControlledShutdown. From the controller's perspective, the intent is mostly
the same. I guess we could treat a null array in the request as an intent
to shutdown all replicas if we wanted to try and converge these APIs. One
of the differences is that ControlledShutdown is a synchronous API, but I
think it would have actually been better as an asynchronous API since
historically we have run into problems with timeouts. Anyway, this is
outside the scope of this KIP, but might be worth mentioning as "Future
work" somewhere.

Thanks,
Jason



Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by David Arthur <mu...@gmail.com>.
I've updated the KIP with the feedback from this discussion
https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller.
I'll send out the vote thread shortly.

Thanks again,
David



-- 
David Arthur

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Tom Bentley <tb...@redhat.com>.
Hi Colin,

Yeah, that makes sense, thanks. I was thinking, longer term, that there are
other benefits to having the log dir information available to the
controller. For example it would allow the possibility for CREATE_TOPIC
requests to include the intended log dir for each replica. But that's
obviously completely out of scope for this KIP.

Kind regards,

Tom

On Mon, May 4, 2020 at 10:11 PM Colin McCabe <cm...@apache.org> wrote:

> Hi Tom,
>
> As you said, the controller doesn't know about log directories, although
> individual brokers do.  So the brokers do currently have to enumerate all
> the partitions that need to be removed to the controllers explicitly.  So
> this KIP isn't changing anything in that regard.
>
> The current flow is:
> 1. ping ZK back-channel
> 2. controller sends a full LeaderAndIsrRequest to the broker
> 3. the broker sends a full response containing error codes for all
> partitions.  Partitions on the failed storage have a nonzero error code;
> the others have 0.
>
> The new flow is:
> 1. the broker sends an RPC with all the failed partitions
>
> So the new flow actually substantially reduces the amount of network
> traffic, since previously we sent a full LeaderAndIsrRequest containing all
> of the partitions.  Now we just send all the partitions in the failed
> storage directory.  That could still be a lot, but certainly only be a
> fraction of what a full LeaderAndIsrRequest would have.
>
> Sorry if I'm repeating stuff you already figured out, but I just wanted to
> be more clear about this (I found it confusing too until David explained it
> to me originally...)
>
> best,
> Colin
>
>
> On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> > Hi David,
> >
> > > In the rejecting the alternative of having an RPC for log dir failures
> > > you say
> > >
> > > I guess what I really mean here is that I wanted to avoid exposing the
> > > notion of a log dir to the controller. I can update the description to
> > > reflect this.
> > >
> >
> > Ah, I think I see now. While each broker knows about its log dirs this
> > isn't something that's stored in zookeeper or known to the controller.
> >
> >
> > > > It's also not completely clear that the cost of having to enumerate
> all
> > > the partitions on a log dir was weighed against the perceived benefit
> of a
> > > more flexible RPC.
> > >
> > > The enumeration isn't strictly required. In the "RPC semantics"
> section, I
> > > mention that if no topics are present in the RPC request, then all
> topics
> > > on the broker are implied. And if a topic is given with no partitions,
> all
> > > partitions for that topic (on the broker) are implied. Does this make
> > > sense?
> > >
> >
> > So the no-topics-present optimisation wouldn't be available to a broker
> > with >1 log dirs where only some of the log dirs failed. I don't suppose
> > that's a problem though.
> >
> > Thanks again,
> >
> > Tom
> >
> >
> > On Fri, May 1, 2020 at 5:48 PM David Arthur <mu...@gmail.com> wrote:
> >
> > > Jose/Colin/Tom, thanks for the feedback!
> > >
> > > > Partition level errors
> > >
> > > This was an oversight on my part, I meant to include these in the
> response
> > > RPC. I'll update that.
> > >
> > > > INVALID_REQUEST
> > >
> > > I'll update this text description, that was a copy/paste left over
> > >
> > > > I think we should mention that the controller will keep it's current
> > > implementation of marking the replicas as offline because of failure
> in the
> > > LeaderAndIsr response.
> > >
> > > Good suggestions, I'll add that.
> > >
> > > > Does EventType need to be an Int32?
> > >
> > > No, it doesn't. I'll update to Int8. Do we have an example of the enum
> > > paradigm in our RPC today? I'm curious if we actually map it to a real
> Java
> > > enum in the AbstractRequest/Response classes.
> > >
> > > > AlterReplicaStates
> > >
> > > Sounds good to me.
> > >
> > > > In the rejecting the alternative of having an RPC for log dir
> failures
> > > you say
> > >
> > > I guess what I really mean here is that I wanted to avoid exposing the
> > > notion of a log dir to the controller. I can update the description to
> > > reflect this.
> > >
> > > > It's also not completely clear that the cost of having to enumerate
> all
> > > the partitions on a log dir was weighed against the perceived benefit
> of a
> > > more flexible RPC.
> > >
> > > The enumeration isn't strictly required. In the "RPC semantics"
> section, I
> > > mention that if no topics are present in the RPC request, then all
> topics
> > > on the broker are implied. And if a topic is given with no partitions,
> all
> > > partitions for that topic (on the broker) are implied. Does this make
> > > sense?
> > >
> > > Thanks again! I'll update the KIP and leave a message here once it's
> > > revised.
> > >
> > > David
> > >
> > > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tb...@redhat.com>
> wrote:
> > >
> > > > Hi David,
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > In the rejecting the alternative of having an RPC for log dir
> failures
> > > you
> > > > say:
> > > >
> > > > It was also rejected to prevent "leaking" the notion of a log dir to
> the
> > > > > public API.
> > > > >
> > > >
> > > > I'm not quite sure I follow that argument, since we already have
> RPCs for
> > > > changing replica log dirs. So in a general sense log dirs already
> exist
> > > in
> > > > the API. I suspect you were using public API to mean something more
> > > > specific; could you elaborate?
> > > >
> > > > It's also not completely clear that the cost of having to enumerate
> all
> > > the
> > > > partitions on a log dir was weighed against the perceived benefit of
> a
> > > more
> > > > flexible RPC. (I'm sure it was, but it would be good to say so).
> > > >
> > > > Many thanks,
> > > >
> > > > Tom
> > > >
> > > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cm...@apache.org>
> > > wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Thanks for the KIP!
> > > > >
> > > > > I think the ReplicaStateEventResponse should have a separate error
> code
> > > > > for each partition.
> > > > >  Currently it just has one error code for the whole
> request/response,
> > > if
> > > > > I'm reading this right.  I think Jose made a similar point as
> well.  We
> > > > > should plan for scenarios where some replica states can be changed
> and
> > > > some
> > > > > can't.
> > > > >
> > > > > Does EventType need to be an Int32?  For enums, we usually use the
> > > > > smallest reasonable type, which would be Int8 here.  We can always
> > > change
> > > > > the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems
> > > unnecessary
> > > > > since INVALID_REQUEST covers this case.
> > > > >
> > > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a
> slightly
> > > > > better name for this RPC.
> > > > >
> > > > > cheers,
> > > > > Colin
> > > > >
> > > > >
> > > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > > Hey everyone,
> > > > > >
> > > > > > I'd like to start the discussion for KIP-589, part of the KIP-500
> > > > effort
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > >
> > > > > > This is a proposal to use a new RPC instead of ZooKeeper for
> > > notifying
> > > > > the
> > > > > > controller of an offline replica. Please give a read and let me
> know
> > > > your
> > > > > > thoughts.
> > > > > >
> > > > > > Thanks!
> > > > > > David
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > David Arthur
> > >
> >
>
>

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Colin McCabe <cm...@apache.org>.
Hi Tom,

As you said, the controller doesn't know about log directories, although individual brokers do.  So the brokers do currently have to enumerate all the partitions that need to be removed to the controllers explicitly.  So this KIP isn't changing anything in that regard.

The current flow is:
1. ping ZK back-channel
2. controller sends a full LeaderAndIsrRequest to the broker
3. the broker sends a full response containing error codes for all partitions.  Partitions on the failed storage have a nonzero error code; the others have 0.

The new flow is:
1. the broker sends an RPC with all the failed partitions

So the new flow actually substantially reduces the amount of network traffic, since previously we sent a full LeaderAndIsrRequest containing all of the partitions.  Now we just send all the partitions in the failed storage directory.  That could still be a lot, but it would certainly be only a fraction of what a full LeaderAndIsrRequest would contain.
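For illustration only, the payload difference between the two flows can be sketched as below. All names here are hypothetical, not from the KIP or the Kafka codebase: the point is just that the new flow's request is proportional to the failed log dir, while the old flow's LeaderAndIsrRequest covers every partition on the broker.

```java
import java.util.*;

// Hypothetical sketch comparing what each flow has to put on the wire.
public class FlowSketch {
    // New flow: the broker reports only the partitions on the failed dir.
    static List<String> newFlowPayload(Map<String, List<String>> partitionsByDir,
                                       String failedDir) {
        return partitionsByDir.getOrDefault(failedDir, List.of());
    }

    // Old flow: a full LeaderAndIsrRequest covers all partitions on the broker.
    static List<String> oldFlowPayload(Map<String, List<String>> partitionsByDir) {
        List<String> all = new ArrayList<>();
        partitionsByDir.values().forEach(all::addAll);
        return all;
    }
}
```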

Sorry if I'm repeating stuff you already figured out, but I just wanted to be more clear about this (I found it confusing too until David explained it to me originally...)

best,
Colin


On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> Hi David,
> 
> > In the rejecting the alternative of having an RPC for log dir failures
> > you say
> >
> > I guess what I really mean here is that I wanted to avoid exposing the
> > notion of a log dir to the controller. I can update the description to
> > reflect this.
> >
> 
> Ah, I think I see now. While each broker knows about its log dirs this
> isn't something that's stored in zookeeper or known to the controller.
> 
> 
> > > It's also not completely clear that the cost of having to enumerate all
> > the partitions on a log dir was weighed against the perceived benefit of a
> > more flexible RPC.
> >
> > The enumeration isn't strictly required. In the "RPC semantics" section, I
> > mention that if no topics are present in the RPC request, then all topics
> > on the broker are implied. And if a topic is given with no partitions, all
> > partitions for that topic (on the broker) are implied. Does this make
> > sense?
> >
> 
> So the no-topics-present optimisation wouldn't be available to a broker
> with >1 log dirs where only some of the log dirs failed. I don't suppose
> that's a problem though.
> 
> Thanks again,
> 
> Tom
> 
> 
> On Fri, May 1, 2020 at 5:48 PM David Arthur <mu...@gmail.com> wrote:
> 
> > Jose/Colin/Tom, thanks for the feedback!
> >
> > > Partition level errors
> >
> > This was an oversight on my part, I meant to include these in the response
> > RPC. I'll update that.
> >
> > > INVALID_REQUEST
> >
> > I'll update this text description, that was a copy/paste left over
> >
> > > I think we should mention that the controller will keep it's current
> > implementation of marking the replicas as offline because of failure in the
> > LeaderAndIsr response.
> >
> > Good suggestions, I'll add that.
> >
> > > Does EventType need to be an Int32?
> >
> > No, it doesn't. I'll update to Int8. Do we have an example of the enum
> > paradigm in our RPC today? I'm curious if we actually map it to a real Java
> > enum in the AbstractRequest/Response classes.
> >
> > > AlterReplicaStates
> >
> > Sounds good to me.
> >
> > > In the rejecting the alternative of having an RPC for log dir failures
> > you say
> >
> > I guess what I really mean here is that I wanted to avoid exposing the
> > notion of a log dir to the controller. I can update the description to
> > reflect this.
> >
> > > It's also not completely clear that the cost of having to enumerate all
> > the partitions on a log dir was weighed against the perceived benefit of a
> > more flexible RPC.
> >
> > The enumeration isn't strictly required. In the "RPC semantics" section, I
> > mention that if no topics are present in the RPC request, then all topics
> > on the broker are implied. And if a topic is given with no partitions, all
> > partitions for that topic (on the broker) are implied. Does this make
> > sense?
> >
> > Thanks again! I'll update the KIP and leave a message here once it's
> > revised.
> >
> > David
> >
> > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tb...@redhat.com> wrote:
> >
> > > Hi David,
> > >
> > > Thanks for the KIP!
> > >
> > > In the rejecting the alternative of having an RPC for log dir failures
> > you
> > > say:
> > >
> > > It was also rejected to prevent "leaking" the notion of a log dir to the
> > > > public API.
> > > >
> > >
> > > I'm not quite sure I follow that argument, since we already have RPCs for
> > > changing replica log dirs. So in a general sense log dirs already exist
> > in
> > > the API. I suspect you were using public API to mean something more
> > > specific; could you elaborate?
> > >
> > > It's also not completely clear that the cost of having to enumerate all
> > the
> > > partitions on a log dir was weighed against the perceived benefit of a
> > more
> > > flexible RPC. (I'm sure it was, but it would be good to say so).
> > >
> > > Many thanks,
> > >
> > > Tom
> > >
> > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cm...@apache.org>
> > wrote:
> > >
> > > > Hi David,
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > I think the ReplicaStateEventResponse should have a separate error code
> > > > for each partition.
> > > >  Currently it just has one error code for the whole request/response,
> > if
> > > > I'm reading this right.  I think Jose made a similar point as well.  We
> > > > should plan for scenarios where some replica states can be changed and
> > > some
> > > > can't.
> > > >
> > > > Does EventType need to be an Int32?  For enums, we usually use the
> > > > smallest reasonable type, which would be Int8 here.  We can always
> > change
> > > > the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems
> > unnecessary
> > > > since INVALID_REQUEST covers this case.
> > > >
> > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly
> > > > better name for this RPC.
> > > >
> > > > cheers,
> > > > Colin
> > > >
> > > >
> > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > Hey everyone,
> > > > >
> > > > > I'd like to start the discussion for KIP-589, part of the KIP-500
> > > effort
> > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > >
> > > > > This is a proposal to use a new RPC instead of ZooKeeper for
> > notifying
> > > > the
> > > > > controller of an offline replica. Please give a read and let me know
> > > your
> > > > > thoughts.
> > > > >
> > > > > Thanks!
> > > > > David
> > > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > David Arthur
> >
>

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Tom Bentley <tb...@redhat.com>.
Hi David,

> In the rejecting the alternative of having an RPC for log dir failures
> you say
>
> I guess what I really mean here is that I wanted to avoid exposing the
> notion of a log dir to the controller. I can update the description to
> reflect this.
>

Ah, I think I see now. While each broker knows about its log dirs this
isn't something that's stored in zookeeper or known to the controller.


> > It's also not completely clear that the cost of having to enumerate all
> the partitions on a log dir was weighed against the perceived benefit of a
> more flexible RPC.
>
> The enumeration isn't strictly required. In the "RPC semantics" section, I
> mention that if no topics are present in the RPC request, then all topics
> on the broker are implied. And if a topic is given with no partitions, all
> partitions for that topic (on the broker) are implied. Does this make
> sense?
>

So the no-topics-present optimisation wouldn't be available to a broker
with >1 log dirs where only some of the log dirs failed. I don't suppose
that's a problem though.

Thanks again,

Tom


On Fri, May 1, 2020 at 5:48 PM David Arthur <mu...@gmail.com> wrote:

> Jose/Colin/Tom, thanks for the feedback!
>
> > Partition level errors
>
> This was an oversight on my part, I meant to include these in the response
> RPC. I'll update that.
>
> > INVALID_REQUEST
>
> I'll update this text description, that was a copy/paste left over
>
> > I think we should mention that the controller will keep it's current
> implementation of marking the replicas as offline because of failure in the
> LeaderAndIsr response.
>
> Good suggestions, I'll add that.
>
> > Does EventType need to be an Int32?
>
> No, it doesn't. I'll update to Int8. Do we have an example of the enum
> paradigm in our RPC today? I'm curious if we actually map it to a real Java
> enum in the AbstractRequest/Response classes.
>
> > AlterReplicaStates
>
> Sounds good to me.
>
> > In the rejecting the alternative of having an RPC for log dir failures
> you say
>
> I guess what I really mean here is that I wanted to avoid exposing the
> notion of a log dir to the controller. I can update the description to
> reflect this.
>
> > It's also not completely clear that the cost of having to enumerate all
> the partitions on a log dir was weighed against the perceived benefit of a
> more flexible RPC.
>
> The enumeration isn't strictly required. In the "RPC semantics" section, I
> mention that if no topics are present in the RPC request, then all topics
> on the broker are implied. And if a topic is given with no partitions, all
> partitions for that topic (on the broker) are implied. Does this make
> sense?
>
> Thanks again! I'll update the KIP and leave a message here once it's
> revised.
>
> David
>
> On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi David,
> >
> > Thanks for the KIP!
> >
> > In the rejecting the alternative of having an RPC for log dir failures
> you
> > say:
> >
> > It was also rejected to prevent "leaking" the notion of a log dir to the
> > > public API.
> > >
> >
> > I'm not quite sure I follow that argument, since we already have RPCs for
> > changing replica log dirs. So in a general sense log dirs already exist
> in
> > the API. I suspect you were using public API to mean something more
> > specific; could you elaborate?
> >
> > It's also not completely clear that the cost of having to enumerate all
> the
> > partitions on a log dir was weighed against the perceived benefit of a
> more
> > flexible RPC. (I'm sure it was, but it would be good to say so).
> >
> > Many thanks,
> >
> > Tom
> >
> > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cm...@apache.org>
> wrote:
> >
> > > Hi David,
> > >
> > > Thanks for the KIP!
> > >
> > > I think the ReplicaStateEventResponse should have a separate error code
> > > for each partition.
> > >  Currently it just has one error code for the whole request/response,
> if
> > > I'm reading this right.  I think Jose made a similar point as well.  We
> > > should plan for scenarios where some replica states can be changed and
> > some
> > > can't.
> > >
> > > Does EventType need to be an Int32?  For enums, we usually use the
> > > smallest reasonable type, which would be Int8 here.  We can always
> change
> > > the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems
> unnecessary
> > > since INVALID_REQUEST covers this case.
> > >
> > > I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly
> > > better name for this RPC.
> > >
> > > cheers,
> > > Colin
> > >
> > >
> > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > Hey everyone,
> > > >
> > > > I'd like to start the discussion for KIP-589, part of the KIP-500
> > effort
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > >
> > > > This is a proposal to use a new RPC instead of ZooKeeper for
> notifying
> > > the
> > > > controller of an offline replica. Please give a read and let me know
> > your
> > > > thoughts.
> > > >
> > > > Thanks!
> > > > David
> > > >
> > >
> > >
> >
>
>
> --
> David Arthur
>

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by David Arthur <mu...@gmail.com>.
Jose/Colin/Tom, thanks for the feedback!

> Partition level errors

This was an oversight on my part, I meant to include these in the response
RPC. I'll update that.

> INVALID_REQUEST

I'll update this text description; that was a copy/paste leftover

> I think we should mention that the controller will keep it's current
implementation of marking the replicas as offline because of failure in the
LeaderAndIsr response.

Good suggestions, I'll add that.

> Does EventType need to be an Int32?

No, it doesn't. I'll update to Int8. Do we have an example of the enum
paradigm in our RPC today? I'm curious if we actually map it to a real Java
enum in the AbstractRequest/Response classes.
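As a point of reference, Kafka does have byte-backed enums in the common package (ElectionType, from KIP-460, is one). A sketch in that pattern, with the type and constant names below being hypothetical rather than from the KIP, might look like:

```java
// Hypothetical event-type enum mapped to an Int8 wire value, modeled on
// Kafka's existing byte-backed enums such as ElectionType.
public enum ReplicaStateEventType {
    OFFLINE((byte) 1);

    public final byte value;

    ReplicaStateEventType(byte value) {
        this.value = value;
    }

    // Decode the wire value; an unrecognized byte surfaces as an error,
    // so a dedicated UNKNOWN_REPLICA_EVENT_TYPE code isn't needed.
    public static ReplicaStateEventType fromValue(byte value) {
        for (ReplicaStateEventType t : values()) {
            if (t.value == value)
                return t;
        }
        throw new IllegalArgumentException("Unknown replica state event type: " + value);
    }
}
```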

> AlterReplicaStates

Sounds good to me.

> In the rejecting the alternative of having an RPC for log dir failures
you say

I guess what I really mean here is that I wanted to avoid exposing the
notion of a log dir to the controller. I can update the description to
reflect this.

> It's also not completely clear that the cost of having to enumerate all
the partitions on a log dir was weighed against the perceived benefit of a
more flexible RPC.

The enumeration isn't strictly required. In the "RPC semantics" section, I
mention that if no topics are present in the RPC request, then all topics
on the broker are implied. And if a topic is given with no partitions, all
partitions for that topic (on the broker) are implied. Does this make sense?
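To make those semantics concrete, here is a minimal sketch of how a controller could expand the implied scope. The class and method names are illustrative only, not from the KIP: an empty topic list means every topic hosted on the broker, and a topic with an empty partition list means every partition of that topic on the broker.

```java
import java.util.*;

// Illustrative expansion of the implied scope of an alter-replica-states
// request (names hypothetical).
public class ImpliedScope {
    public static Map<String, Set<Integer>> resolve(
            Map<String, Set<Integer>> requested,        // topics/partitions named in the request
            Map<String, Set<Integer>> hostedOnBroker) { // everything the broker hosts
        if (requested.isEmpty())
            return hostedOnBroker;                      // no topics => all topics implied
        Map<String, Set<Integer>> resolved = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> e : requested.entrySet()) {
            Set<Integer> hosted =
                hostedOnBroker.getOrDefault(e.getKey(), Collections.emptySet());
            // empty partition list => all partitions of that topic implied
            resolved.put(e.getKey(), e.getValue().isEmpty() ? hosted : e.getValue());
        }
        return resolved;
    }
}
```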

Thanks again! I'll update the KIP and leave a message here once it's
revised.

David

On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tb...@redhat.com> wrote:

> Hi David,
>
> Thanks for the KIP!
>
> In the rejecting the alternative of having an RPC for log dir failures you
> say:
>
> It was also rejected to prevent "leaking" the notion of a log dir to the
> > public API.
> >
>
> I'm not quite sure I follow that argument, since we already have RPCs for
> changing replica log dirs. So in a general sense log dirs already exist in
> the API. I suspect you were using public API to mean something more
> specific; could you elaborate?
>
> It's also not completely clear that the cost of having to enumerate all the
> partitions on a log dir was weighed against the perceived benefit of a more
> flexible RPC. (I'm sure it was, but it would be good to say so).
>
> Many thanks,
>
> Tom
>
> On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cm...@apache.org> wrote:
>
> > Hi David,
> >
> > Thanks for the KIP!
> >
> > I think the ReplicaStateEventResponse should have a separate error code
> > for each partition.
> >  Currently it just has one error code for the whole request/response, if
> > I'm reading this right.  I think Jose made a similar point as well.  We
> > should plan for scenarios where some replica states can be changed and
> some
> > can't.
> >
> > Does EventType need to be an Int32?  For enums, we usually use the
> > smallest reasonable type, which would be Int8 here.  We can always change
> > the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems unnecessary
> > since INVALID_REQUEST covers this case.
> >
> > I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly
> > better name for this RPC.
> >
> > cheers,
> > Colin
> >
> >
> > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > Hey everyone,
> > >
> > > I'd like to start the discussion for KIP-589, part of the KIP-500
> effort
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > >
> > > This is a proposal to use a new RPC instead of ZooKeeper for notifying
> > the
> > > controller of an offline replica. Please give a read and let me know
> your
> > > thoughts.
> > >
> > > Thanks!
> > > David
> > >
> >
> >
>


-- 
David Arthur

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Tom Bentley <tb...@redhat.com>.
Hi David,

Thanks for the KIP!

In rejecting the alternative of having an RPC for log dir failures, you
say:

It was also rejected to prevent "leaking" the notion of a log dir to the
> public API.
>

I'm not quite sure I follow that argument, since we already have RPCs for
changing replica log dirs. So in a general sense log dirs already exist in
the API. I suspect you were using public API to mean something more
specific; could you elaborate?

It's also not completely clear that the cost of having to enumerate all the
partitions on a log dir was weighed against the perceived benefit of a more
flexible RPC. (I'm sure it was, but it would be good to say so).

Many thanks,

Tom

On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cm...@apache.org> wrote:

> Hi David,
>
> Thanks for the KIP!
>
> I think the ReplicaStateEventResponse should have a separate error code
> for each partition.
>  Currently it just has one error code for the whole request/response, if
> I'm reading this right.  I think Jose made a similar point as well.  We
> should plan for scenarios where some replica states can be changed and some
> can't.
>
> Does EventType need to be an Int32?  For enums, we usually use the
> smallest reasonable type, which would be Int8 here.  We can always change
> the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems unnecessary
> since INVALID_REQUEST covers this case.
>
> I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly
> better name for this RPC.
>
> cheers,
> Colin
>
>
> On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > Hey everyone,
> >
> > I'd like to start the discussion for KIP-589, part of the KIP-500 effort
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> >
> > This is a proposal to use a new RPC instead of ZooKeeper for notifying
> the
> > controller of an offline replica. Please give a read and let me know your
> > thoughts.
> >
> > Thanks!
> > David
> >
>
>

Re: [DISCUSS] KIP-589 Add API to Update Replica State in Controller

Posted by Colin McCabe <cm...@apache.org>.
Hi David,

Thanks for the KIP!

I think the ReplicaStateEventResponse should have a separate error code for each partition.  Currently it just has one error code for the whole request/response, if I'm reading this right.  I think Jose made a similar point as well.  We should plan for scenarios where some replica states can be changed and some can't.
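For illustration, a response shape with both a top-level error code and per-partition codes could look like the sketch below. The class and field names are hypothetical, not from the KIP; the point is only that individual partitions can succeed or fail independently within one request.

```java
import java.util.*;

// Hypothetical response carrying a top-level error plus a per-partition
// error code (topic -> partition -> error), so partial failures are
// representable.
public class AlterReplicaStatesResponseSketch {
    public final short topLevelErrorCode;
    public final Map<String, Map<Integer, Short>> partitionErrors;

    public AlterReplicaStatesResponseSketch(
            short topLevelErrorCode,
            Map<String, Map<Integer, Short>> partitionErrors) {
        this.topLevelErrorCode = topLevelErrorCode;
        this.partitionErrors = partitionErrors;
    }
}
```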

Does EventType need to be an Int32?  For enums, we usually use the smallest reasonable type, which would be Int8 here.  We can always change the schema later if needed.  UNKNOWN_REPLICA_EVENT_TYPE seems unnecessary since INVALID_REQUEST covers this case.

I'd also suggest "AlterReplicaStates[Request,Response]" as a slightly better name for this RPC.

cheers,
Colin


On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> Hey everyone,
> 
> I'd like to start the discussion for KIP-589, part of the KIP-500 effort
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> 
> This is a proposal to use a new RPC instead of ZooKeeper for notifying the
> controller of an offline replica. Please give a read and let me know your
> thoughts.
> 
> Thanks!
> David
>