Posted to dev@kafka.apache.org by Jason Gustafson <ja...@confluent.io> on 2020/08/03 18:02:44 UTC

[VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Hi All, I'd like to start a vote on this proposal:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum.
The discussion has been active for a bit more than 3 months and I think the
main points have been addressed. We have also moved some of the pieces into
follow-up proposals, such as KIP-630.

Please keep in mind that the details are bound to change as all of
the pieces start coming together. As usual, we will keep this thread
notified of such changes.

For me personally, this is super exciting since we have been thinking about
this work ever since I started working on Kafka! I am +1 of course.

Best,
Jason

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Jun Rao <ju...@confluent.io>.
Hi, Jason,

Thanks for the KIP. +1

Just to confirm. For those newly added request types, will we expose the
existing latency metrics (total, local, remote, etc) with a new tag
request=[request-type]?

Jun

On Tue, Aug 4, 2020 at 3:00 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks for the KIP Jason, +1 (binding) from me as well for sure :)
>
>
> On Tue, Aug 4, 2020 at 2:46 PM Colin McCabe <cm...@apache.org> wrote:
>
> > On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> > > Hi Colin,
> > >
> > > Thanks for the responses.
> > >
> > > > I have a few lingering questions.  I still don't like the fact that
> the
> > > > leader epoch / fetch epoch is 31 bits.  What happens when this rolls
> > over?
> > > > Can we just make this 63 bits now so that we never have to worry
> about
> > it
> > > > again?  ZK has some awful bugs surrounding 32 bit rollover, due to a
> > > > similar decision to use a 32 bit counter in their log structure.
> > Doesn't
> > > > seem like a good tradeoff.
> > >
> > > This is a bit difficult to do at the moment since the leader epoch is 4
> > > bytes in the message format. One option that I have considered is
> > toggling
> > > a batch attribute that lets us turn the producerId into an 8-byte
> leader
> > > epoch instead since we do not have a use for it in the metadata quorum.
> > We
> > > would need another solution if we ever wanted to use Raft for partition
> > > replication, but perhaps by then we can make the case for a new message
> > > format.
> > >
> >
> > Hi Jason,
> >
> > Thanks for the explanation.  I suspected that there was a technical
> > limitation like this lurking somewhere.  I think a hack like the one you
> > suggested would be OK for now.  I just really want to avoid thinking
> about
> > rollover :)
> >
> Regarding the epoch overflow, some offline discussions among Jason,
> Guozhang, Jose, and me reached the following conclusions:
>
> 1. The current default election timeout is 10 seconds, which means it would
> take hundreds of years to exhaust the epoch space even if the epoch were
> bumped once per election timeout. Even with a 1-second timeout, exhaustion
> would still take many years.
>
> 2. The most common cause of fast epoch bumps is a network partition: if a
> voter cannot connect to the quorum, it will repeatedly start elections and
> bump the epoch. To mitigate this, we have already planned a follow-up KIP
> to add the `pre-vote` feature from the Raft literature to the Kafka Raft
> implementation, which avoids rapid epoch increments at the algorithm level.
>
> 3. As you suggested, leader epoch overflow is a general problem, not just
> a Raft one. We could kick off a separate KIP to change the epoch from 4
> bytes to 8 bytes through a message format upgrade, solving the issue for
> Kafka in a holistic manner.
>
>
>
> > >
> > > > Just like in bootstrap.servers, I don't think we want to manually
> > assign
> > > > IDs per hostname.  The hosts know their own IDs, after all.  Having
> to
> > > > manually specify the IDs also opens up the possibility of
> > > > misconfigurations: what if I say the foobar server is node 2, but it's
> > > > actually node 3? This would make the logs extremely confusing.  I
> > realize
> > > > this may require a little finesse to do, but there's got to be a way
> > we can
> > > > avoid hard-coding IDs
> > >
> > > Fine. We can move this to KIP-631, but I think it would be a mistake to
> > > take IDs out of this configuration. For safety, the one thing that the
> > > configuration needs to tell us is what the IDs of the voters are.
> Without
> > > that, it's really easy for a cluster to get into a state where none of
> > > the quorum members agree on what the proper set of voters is. I think
> > > perhaps you are confused on the usage of these IDs. It is what enables
> > > validation of voter requests. Without it, a voter would have to accept
> a
> > > vote request from any ID. There is a reason that other consensus
> systems
> > > like zookeeper and etcd require ids when configured statically.
> > >
> >
> > I hadn't considered the fact that we need to validate incoming voter
> > requests.  The fact that nodes can have multiple DNS addresses does make
> > this difficult to do with just a list of hostnames.
> >
> > I guess you're right that we should keep the IDs.  But let's be careful
> to
> > validate that the node's ID really is what we think it is, and consider
> > that peer failed if it's not.
> >
> > >
> > > > Also, here's another case where we are saying "broker" when we mean
> > > > "controller."  It's really hard to break old habits.  :)
> > >
> > > I think we still have this basic disagreement on the KIP-500 vision :).
> > I'm
> > > not sure I understand why you are so eager to force users to think
> about
> > > the controller as a separate system. It's almost like Zookeeper is not
> > > going anywhere!
> > >
> >
> > Well, KIP-500 clearly does identify the controller as a separate system,
> > not as part of the broker, even if it runs in the same JVM.  :) A system
> > where all the nodes had the same role would need a fundamentally
> different
> > design, like Cassandra or something.
> >
> > I know you're joking, but just so that others understand, it's not fair
> to
> > say that "it's almost like ZK is not going anywhere."  KIP-500 clusters
> will
> > have simpler deployment and support a lot of interesting use-cases like
> > single-JVM clusters, that would not be possible with the current setup.
> >
> > At the same time, saying "broker" when you mean "controller" confuses
> > people.  For example, I had someone ask a question recently about why we
> > needed BrokerHeartbeat when Raft already specifies a mechanism for leader
> > change.  I had to explain the difference between broker nodes and
> controller
> > nodes.
> >
> > Anyway, +1 (binding).  Excited to see Raftka going forward!
> >
> > best,
> > Colin
> >
> > >
> > > -Jason
> > >
> > >
> > >
> > >
> > > On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio <
> jsancio@confluent.io>
> > > wrote:
> > >
> > > > +1.
> > > >
> > > > Thanks for the detailed KIP!
> > > >
> > > > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > > >
> > > > > Hi All, I'd like to start a vote on this proposal:
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > > .
> > > > > The discussion has been active for a bit more than 3 months and I
> > think
> > > > the
> > > > > main points have been addressed. We have also moved some of the
> > pieces
> > > > into
> > > > > follow-up proposals, such as KIP-630.
> > > > >
> > > > > Please keep in mind that the details are bound to change as all of
> > > > > the pieces start coming together. As usual, we will keep this
> thread
> > > > > notified of such changes.
> > > > >
> > > > > For me personally, this is super exciting since we have been
> thinking
> > > > about
> > > > > this work ever since I started working on Kafka! I am +1 of course.
> > > > >
> > > > > Best,
> > > > > Jason
> > > >
> > > >
> > > >
> > > > --
> > > > -Jose
> > > >
> > >
> >
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Boyang Chen <re...@gmail.com>.
Thanks for the KIP Jason, +1 (binding) from me as well for sure :)


On Tue, Aug 4, 2020 at 2:46 PM Colin McCabe <cm...@apache.org> wrote:

> On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> > Hi Colin,
> >
> > Thanks for the responses.
> >
> > > I have a few lingering questions.  I still don't like the fact that the
> > > leader epoch / fetch epoch is 31 bits.  What happens when this rolls
> over?
> > > Can we just make this 63 bits now so that we never have to worry about
> it
> > > again?  ZK has some awful bugs surrounding 32 bit rollover, due to a
> > > similar decision to use a 32 bit counter in their log structure.
> Doesn't
> > > seem like a good tradeoff.
> >
> > This is a bit difficult to do at the moment since the leader epoch is 4
> > bytes in the message format. One option that I have considered is
> toggling
> > a batch attribute that lets us turn the producerId into an 8-byte leader
> > epoch instead since we do not have a use for it in the metadata quorum.
> We
> > would need another solution if we ever wanted to use Raft for partition
> > replication, but perhaps by then we can make the case for a new message
> > format.
> >
>
> Hi Jason,
>
> Thanks for the explanation.  I suspected that there was a technical
> limitation like this lurking somewhere.  I think a hack like the one you
> suggested would be OK for now.  I just really want to avoid thinking about
> rollover :)
>
Regarding the epoch overflow, some offline discussions among Jason,
Guozhang, Jose, and me reached the following conclusions:

1. The current default election timeout is 10 seconds, which means it would
take hundreds of years to exhaust the epoch space even if the epoch were
bumped once per election timeout (see the rough arithmetic sketched after
this list). Even with a 1-second timeout, exhaustion would still take many
years.

2. The most common cause of fast epoch bumps is a network partition: if a
voter cannot connect to the quorum, it will repeatedly start elections and
bump the epoch. To mitigate this, we have already planned a follow-up KIP to
add the `pre-vote` feature from the Raft literature to the Kafka Raft
implementation, which avoids rapid epoch increments at the algorithm level.

3. As you suggested, leader epoch overflow is a general problem, not just a
Raft one. We could kick off a separate KIP to change the epoch from 4 bytes
to 8 bytes through a message format upgrade, solving the issue for Kafka in
a holistic manner.
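
To make point 1 concrete, here is a rough back-of-envelope check (a sketch
only, using the timeout values mentioned above):

    // Rough estimate of how long it takes to exhaust a 31-bit epoch space if
    // the epoch were bumped once per election timeout. Illustrative only.
    public class EpochExhaustionEstimate {
        public static void main(String[] args) {
            long epochSpace = 1L << 31;                  // ~2.1 billion epoch values
            double secondsPerYear = 365.25 * 24 * 3600;
            double yearsAt10s = epochSpace * 10.0 / secondsPerYear;
            double yearsAt1s = epochSpace * 1.0 / secondsPerYear;
            System.out.printf("~%.0f years at a 10s timeout, ~%.0f years at 1s%n",
                    yearsAt10s, yearsAt1s);
            // Prints roughly: ~680 years at a 10s timeout, ~68 years at 1s
        }
    }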



> >
> > > Just like in bootstrap.servers, I don't think we want to manually
> assign
> > > IDs per hostname.  The hosts know their own IDs, after all.  Having to
> > > manually specify the IDs also opens up the possibility of
> > > misconfigurations: what if I say the foobar server is node 2, but it's
> > > actually node 3? This would make the logs extremely confusing.  I
> realize
> > > this may require a little finesse to do, but there's got to be a way
> we can
> > > avoid hard-coding IDs
> >
> > Fine. We can move this to KIP-631, but I think it would be a mistake to
> > take IDs out of this configuration. For safety, the one thing that the
> > configuration needs to tell us is what the IDs of the voters are. Without
> > that, it's really easy for a cluster to get into a state where none of
> > the quorum members agree on what the proper set of voters is. I think
> > perhaps you are confused on the usage of these IDs. It is what enables
> > validation of voter requests. Without it, a voter would have to accept a
> > vote request from any ID. There is a reason that other consensus systems
> > like zookeeper and etcd require ids when configured statically.
> >
>
> I hadn't considered the fact that we need to validate incoming voter
> requests.  The fact that nodes can have multiple DNS addresses does make
> this difficult to do with just a list of hostnames.
>
> I guess you're right that we should keep the IDs.  But let's be careful to
> validate that the node's ID really is what we think it is, and consider
> that peer failed if it's not.
>
> >
> > > Also, here's another case where we are saying "broker" when we mean
> > > "controller."  It's really hard to break old habits.  :)
> >
> > I think we still have this basic disagreement on the KIP-500 vision :).
> I'm
> > not sure I understand why you are so eager to force users to think about
> > the controller as a separate system. It's almost like Zookeeper is not
> > going anywhere!
> >
>
> Well, KIP-500 clearly does identify the controller as a separate system,
> not as part of the broker, even if it runs in the same JVM.  :) A system
> where all the nodes had the same role would need a fundamentally different
> design, like Cassandra or something.
>
> I know you're joking, but just so that others understand, it's not fair to
> say that "it's almost like ZK is not going anywhere."  KIP-500 clusters will
> have simpler deployment and support a lot of interesting use-cases like
> single-JVM clusters, that would not be possible with the current setup.
>
> At the same time, saying "broker" when you mean "controller" confuses
> people.  For example, I had someone ask a question recently about why we
> needed BrokerHeartbeat when Raft already specifies a mechanism for leader
> change.  I had to explain the difference between broker nodes and controller
> nodes.
>
> Anyway, +1 (binding).  Excited to see Raftka going forward!
>
> best,
> Colin
>
> >
> > -Jason
> >
> >
> >
> >
> > On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio <js...@confluent.io>
> > wrote:
> >
> > > +1.
> > >
> > > Thanks for the detailed KIP!
> > >
> > > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > > >
> > > > Hi All, I'd like to start a vote on this proposal:
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > .
> > > > The discussion has been active for a bit more than 3 months and I
> think
> > > the
> > > > main points have been addressed. We have also moved some of the
> pieces
> > > into
> > > > follow-up proposals, such as KIP-630.
> > > >
> > > > Please keep in mind that the details are bound to change as all of
> > > > the pieces start coming together. As usual, we will keep this thread
> > > > notified of such changes.
> > > >
> > > > For me personally, this is super exciting since we have been thinking
> > > about
> > > > this work ever since I started working on Kafka! I am +1 of course.
> > > >
> > > > Best,
> > > > Jason
> > >
> > >
> > >
> > > --
> > > -Jose
> > >
> >
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Colin McCabe <cm...@apache.org>.
On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> Hi Colin,
> 
> Thanks for the responses.
> 
> > I have a few lingering questions.  I still don't like the fact that the
> > leader epoch / fetch epoch is 31 bits.  What happens when this rolls over?
> > Can we just make this 63 bits now so that we never have to worry about it
> > again?  ZK has some awful bugs surrounding 32 bit rollover, due to a
> > similar decision to use a 32 bit counter in their log structure.  Doesn't
> > seem like a good tradeoff.
> 
> This is a bit difficult to do at the moment since the leader epoch is 4
> bytes in the message format. One option that I have considered is toggling
> a batch attribute that lets us turn the producerId into an 8-byte leader
> epoch instead since we do not have a use for it in the metadata quorum. We
> would need another solution if we ever wanted to use Raft for partition
> replication, but perhaps by then we can make the case for a new message
> format.
> 

Hi Jason,

Thanks for the explanation.  I suspected that there was a technical limitation like this lurking somewhere.  I think a hack like the one you suggested would be OK for now.  I just really want to avoid thinking about rollover :)

>
> > Just like in bootstrap.servers, I don't think we want to manually assign
> > IDs per hostname.  The hosts know their own IDs, after all.  Having to
> > manually specify the IDs also opens up the possibility of
> > misconfigurations: what if I say the foobar server is node 2, but it's
> > actually node 3? This would make the logs extremely confusing.  I realize
> > this may require a little finesse to do, but there's got to be a way we can
> > avoid hard-coding IDs
> 
> Fine. We can move this to KIP-631, but I think it would be a mistake to
> take IDs out of this configuration. For safety, the one thing that the
> configuration needs to tell us is what the IDs of the voters are. Without
> that, it's really easy for a cluster to get into a state where none of
> the quorum members agree on what the proper set of voters is. I think
> perhaps you are confused on the usage of these IDs. It is what enables
> validation of voter requests. Without it, a voter would have to accept a
> vote request from any ID. There is a reason that other consensus systems
> like zookeeper and etcd require ids when configured statically.
>

I hadn't considered the fact that we need to validate incoming voter requests.  The fact that nodes can have multiple DNS addresses does make this difficult to do with just a list of hostnames.

I guess you're right that we should keep the IDs.  But let's be careful to validate that the node's ID really is what we think it is, and consider that peer failed if it's not.

>
> > Also, here's another case where we are saying "broker" when we mean
> > "controller."  It's really hard to break old habits.  :)
> 
> I think we still have this basic disagreement on the KIP-500 vision :). I'm
> not sure I understand why you are so eager to force users to think about
> the controller as a separate system. It's almost like Zookeeper is not
> going anywhere!
>

Well, KIP-500 clearly does identify the controller as a separate system, not as part of the broker, even if it runs in the same JVM.  :) A system where all the nodes had the same role would need a fundamentally different design, like Cassandra or something.

I know you're joking, but just so that others understand, it's not fair to say that "it's almost like ZK is not going anywhere."  KIP-500 clusters will have simpler deployment and support a lot of interesting use-cases like single-JVM clusters, which would not be possible with the current setup.

At the same time, saying "broker" when you mean "controller" confuses people.  For example, I had someone ask a question recently about why we needed BrokerHeartbeat when Raft already specifies a mechanism for leader change.  I had to explain the difference between broker nodes and controller nodes.

Anyway, +1 (binding).  Excited to see Raftka going forward!

best,
Colin

>
> -Jason
> 
> 
> 
> 
> On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio <js...@confluent.io>
> wrote:
> 
> > +1.
> >
> > Thanks for the detailed KIP!
> >
> > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> > >
> > > Hi All, I'd like to start a vote on this proposal:
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > .
> > > The discussion has been active for a bit more than 3 months and I think
> > the
> > > main points have been addressed. We have also moved some of the pieces
> > into
> > > follow-up proposals, such as KIP-630.
> > >
> > > Please keep in mind that the details are bound to change as all of
> > > the pieces start coming together. As usual, we will keep this thread
> > > notified of such changes.
> > >
> > > For me personally, this is super exciting since we have been thinking
> > about
> > > this work ever since I started working on Kafka! I am +1 of course.
> > >
> > > Best,
> > > Jason
> >
> >
> >
> > --
> > -Jose
> >
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Colin,

Thanks for the responses.

> I have a few lingering questions.  I still don't like the fact that the
leader epoch / fetch epoch is 31 bits.  What happens when this rolls over?
Can we just make this 63 bits now so that we never have to worry about it
again?  ZK has some awful bugs surrounding 32 bit rollover, due to a
similar decision to use a 32 bit counter in their log structure.  Doesn't
seem like a good tradeoff.

This is a bit difficult to do at the moment since the leader epoch is 4
bytes in the message format. One option that I have considered is toggling
a batch attribute that lets us turn the producerId into an 8-byte leader
epoch instead since we do not have a use for it in the metadata quorum. We
would need another solution if we ever wanted to use Raft for partition
replication, but perhaps by then we can make the case for a new message
format.
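
Purely as an illustration of that idea (the attribute bit and the field reuse
below are hypothetical, not part of the current message format):

    // Hypothetical sketch: an unused batch attribute bit selects whether the
    // 8-byte producerId slot is reinterpreted as a wide leader epoch.
    // The flag value and method names are made up for illustration.
    public class WideEpochSketch {
        static final int HYPOTHETICAL_WIDE_EPOCH_BIT = 1 << 7;

        static long effectiveLeaderEpoch(short attributes, long producerIdField,
                                         int partitionLeaderEpoch) {
            if ((attributes & HYPOTHETICAL_WIDE_EPOCH_BIT) != 0) {
                return producerIdField;       // reuse the 8-byte field as a 63-bit epoch
            }
            return partitionLeaderEpoch;      // existing 4-byte epoch field
        }
    }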

> Just like in bootstrap.servers, I don't think we want to manually assign
IDs per hostname.  The hosts know their own IDs, after all.  Having to
manually specify the IDs also opens up the possibility of
misconfigurations: what if I say the foobar server is node 2, but it's
actually node 3? This would make the logs extremely confusing.  I realize
this may require a little finesse to do, but there's got to be a way we can
avoid hard-coding IDs

Fine. We can move this to KIP-631, but I think it would be a mistake to
take IDs out of this configuration. For safety, the one thing that the
configuration needs to tell us is what the IDs of the voters are. Without
that, it's really easy for a cluster to get into a state where none of
the quorum members agree on what the proper set of voters is. I think
perhaps you are confused on the usage of these IDs. It is what enables
validation of voter requests. Without it, a voter would have to accept a
vote request from any ID. There is a reason that other consensus systems
like zookeeper and etcd require ids when configured statically.
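
As a rough illustration of what that validation buys us (the class and method
names below are made up for the example, not from the KIP):

    // Sketch: a voter only considers vote requests whose candidate ID is in
    // the statically configured voter set; without configured IDs there is
    // nothing to check the candidate against.
    import java.util.Set;

    public class VoteRequestValidationSketch {
        private final Set<Integer> configuredVoterIds;

        VoteRequestValidationSketch(Set<Integer> configuredVoterIds) {
            this.configuredVoterIds = configuredVoterIds;
        }

        boolean shouldConsider(int candidateId) {
            return configuredVoterIds.contains(candidateId);
        }
    }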

> Also, here's another case where we are saying "broker" when we mean
"controller."  It's really hard to break old habits.  :)

I think we still have this basic disagreement on the KIP-500 vision :). I'm
not sure I understand why you are so eager to force users to think about
the controller as a separate system. It's almost like Zookeeper is not
going anywhere!

-Jason




On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio <js...@confluent.io>
wrote:

> +1.
>
> Thanks for the detailed KIP!
>
> On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> wrote:
> >
> > Hi All, I'd like to start a vote on this proposal:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> .
> > The discussion has been active for a bit more than 3 months and I think
> the
> > main points have been addressed. We have also moved some of the pieces
> into
> > follow-up proposals, such as KIP-630.
> >
> > Please keep in mind that the details are bound to change as all of
> > the pieces start coming together. As usual, we will keep this thread
> > notified of such changes.
> >
> > For me personally, this is super exciting since we have been thinking
> about
> > this work ever since I started working on Kafka! I am +1 of course.
> >
> > Best,
> > Jason
>
>
>
> --
> -Jose
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Jose Garcia Sancio <js...@confluent.io>.
+1.

Thanks for the detailed KIP!

On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io> wrote:
>
> Hi All, I'd like to start a vote on this proposal:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum.
> The discussion has been active for a bit more than 3 months and I think the
> main points have been addressed. We have also moved some of the pieces into
> follow-up proposals, such as KIP-630.
>
> Please keep in mind that the details are bound to change as all of
> the pieces start coming together. As usual, we will keep this thread
> notified of such changes.
>
> For me personally, this is super exciting since we have been thinking about
> this work ever since I started working on Kafka! I am +1 of course.
>
> Best,
> Jason



-- 
-Jose

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Colin McCabe <co...@cmccabe.xyz>.
On Tue, Aug 11, 2020, at 11:30, Ismael Juma wrote:
> Thanks for the KIP, +1 (binding). A couple of comments:
> 
> 1. We have "quorum.voters=1@kafka-1:9092, 2@kafka-2:9092,
> 3@kafka-3:9092". Could
> this be a bit confusing given that the authority part of a url is defined
> as "authority = [userinfo@]host[:port]"?
>

Hmm... I don't think there is much chance for confusion.  The authority field in URLs is kind of obscure at this point.  People used to do stuff like http://myuser:mypassword@example.com, but I'm not sure most users these days are even aware that this feature exists.

You could also argue that the voter identity is kind of like an authority, conceptually.  To the extent that it even makes sense to reuse URL concepts for something that isn't a URL, this seems reasonable.

Separately, I would argue that we should not re-use the URI or URL classes to parse these strings.  We made the mistake of parsing filenames as URLs in Hadoop (and in HDFS) and it created lots of artificial difficulties.  I think filenames with a colon still don't work to this day, and filenames with a pound sign are treated inconsistently.  Sometimes it's best to just parse a string the way you want to parse it for your particular use-case, and not pull in a large library.
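
For example, a hand-rolled parse of the quorum.voters format from the KIP is
only a few lines (a sketch, not the actual implementation):

    // Sketch of parsing "1@kafka-1:9092, 2@kafka-2:9092, 3@kafka-3:9092"
    // without going through java.net.URI.
    import java.util.HashMap;
    import java.util.Map;

    public class QuorumVotersParserSketch {
        static Map<Integer, String> parse(String quorumVoters) {
            Map<Integer, String> idToEndpoint = new HashMap<>();
            for (String voter : quorumVoters.split(",")) {
                String[] parts = voter.trim().split("@", 2);   // "<id>" and "<host>:<port>"
                idToEndpoint.put(Integer.parseInt(parts[0]), parts[1]);
            }
            return idToEndpoint;
        }
    }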

>
> 2. With regards to the Quorum State file, do we have anything that helps us
> detect corruption?
> 

Good question, but it's probably better to have a separate KIP about adding checksums to some of our small text files, since this issue seems to exist for many of them.
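
The mechanics themselves would be simple, e.g. a CRC32 over the file contents
along the lines of this sketch (illustrative only, not tied to any particular
file format):

    // Sketch: checksum a small state file with java.util.zip.CRC32.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.CRC32;

    public class StateFileChecksumSketch {
        static long checksumOf(Path file) throws IOException {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(file));
            return crc.getValue();
        }
    }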

best,
Colin

>
> Ismael
> 
> 
> On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io> wrote:
> 
> > Hi All, I'd like to start a vote on this proposal:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > .
> > The discussion has been active for a bit more than 3 months and I think the
> > main points have been addressed. We have also moved some of the pieces into
> > follow-up proposals, such as KIP-630.
> >
> > Please keep in mind that the details are bound to change as all of
> > the pieces start coming together. As usual, we will keep this thread
> > notified of such changes.
> >
> > For me personally, this is super exciting since we have been thinking about
> > this work ever since I started working on Kafka! I am +1 of course.
> >
> > Best,
> > Jason
> >
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Jason Gustafson <ja...@confluent.io>.
Thanks everyone for the votes. I am going to close this with +5 binding
(me, Colin, Boyang, Jun, and Ismael) and none against.

@Jun Yes, I think it makes sense to expose the usual request metrics for
the new APIs.
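
For reference, the existing request metrics follow the pattern
kafka.network:type=RequestMetrics,name=<metric>,request=<request-type>, so the
new APIs would presumably show up along these lines (assuming the request
names from the KIP, e.g. Vote and BeginQuorumEpoch; the exact set exposed is
an implementation detail):

    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Vote
    kafka.network:type=RequestMetrics,name=LocalTimeMs,request=BeginQuorumEpoch
    kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=EndQuorumEpoch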

Best,
Jason



On Tue, Aug 11, 2020 at 11:30 AM Ismael Juma <is...@juma.me.uk> wrote:

> Thanks for the KIP, +1 (binding). A couple of comments:
>
> 1. We have "quorum.voters=1@kafka-1:9092, 2@kafka-2:9092,
> 3@kafka-3:9092". Could
> this be a bit confusing given that the authority part of a url is defined
> as "authority = [userinfo@]host[:port]"?
> 2. With regards to the Quorum State file, do we have anything that helps us
> detect corruption?
>
> Ismael
>
>
> On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi All, I'd like to start a vote on this proposal:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > .
> > The discussion has been active for a bit more than 3 months and I think
> the
> > main points have been addressed. We have also moved some of the pieces
> into
> > follow-up proposals, such as KIP-630.
> >
> > Please keep in mind that the details are bound to change as all of
> > the pieces start coming together. As usual, we will keep this thread
> > notified of such changes.
> >
> > For me personally, this is super exciting since we have been thinking
> about
> > this work ever since I started working on Kafka! I am +1 of course.
> >
> > Best,
> > Jason
> >
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks for the KIP, +1 (binding). A couple of comments:

1. We have "quorum.voters=1@kafka-1:9092, 2@kafka-2:9092,
3@kafka-3:9092". Could
this be a bit confusing given that the authority part of a url is defined
as "authority = [userinfo@]host[:port]"?
2. With regards to the Quorum State file, do we have anything that helps us
detect corruption?

Ismael


On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi All, I'd like to start a vote on this proposal:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> .
> The discussion has been active for a bit more than 3 months and I think the
> main points have been addressed. We have also moved some of the pieces into
> follow-up proposals, such as KIP-630.
>
> Please keep in mind that the details are bound to change as all of
> the pieces start coming together. As usual, we will keep this thread
> notified of such changes.
>
> For me personally, this is super exciting since we have been thinking about
> this work ever since I started working on Kafka! I am +1 of course.
>
> Best,
> Jason
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Colin McCabe <cm...@apache.org>.
Hi Jason,

The KIP looks great.  Thanks for all the work you've put into this.

I have a few lingering questions.  I still don't like the fact that the leader epoch / fetch epoch is 31 bits.  What happens when this rolls over?  Can we just make this 63 bits now so that we never have to worry about it again?  ZK has some awful bugs surrounding 32 bit rollover, due to a similar decision to use a 32 bit counter in their log structure.  Doesn't seem like a good tradeoff.

For fsync, perhaps we can have a per-topic configuration to enable or disable fsync on a per-topic basis.  This would avoid adding a special hack just for the metadata topic.  It would also make life better for JUnit tests, since we could disable fsync on the metadata topic in that case and get a performance boost.  To keep the scope small we could support this config only for Raft topics for now, and only generalize it later if needed.
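
For what it's worth, Kafka already has per-topic flush controls in this spirit
(flush.messages and flush.ms); a Raft-only override could look something like
the following, where the config name is purely hypothetical:

    # Existing per-topic flush knobs (real configs): flush.messages, flush.ms
    # Hypothetical Raft/metadata-topic-only override, name illustrative only:
    metadata.log.fsync.enable=false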

There are some references to setting the controller ID by using broker.id.  I think we should leave this out of this KIP and discuss it as part of KIP-631 instead.  As you mention earlier in the KIP, the controller design should be out of scope for KIP-595.  It's enough to say that we get this value statically.

> quorum.voters: This is a connection map which contains the IDs of the voters
> and their respective endpoint. We use the following format for each voter in
> the list {broker-id}@{broker-host}:{broker-port}. For example,
> `quorum.voters=1@kafka-1:9092, 2@kafka-2:9092, 3@kafka-3:9092`.

Just like in bootstrap.servers, I don't think we want to manually assign IDs per hostname.  The hosts know their own IDs, after all.  Having to manually specify the IDs also opens up the possibility of misconfigurations: what if I say the foobar server is node 2, but it's actually node 3? This would make the logs extremely confusing.  I realize this may require a little finesse to do, but there's got to be a way we can avoid hard-coding IDs.

Also, here's another case where we are saying "broker" when we mean "controller."  It's really hard to break old habits.  :)

best,
Colin


On Mon, Aug 3, 2020, at 11:19, Ben Stopford wrote:
> +1
> 
> On Mon, 3 Aug 2020 at 19:03, Jason Gustafson <ja...@confluent.io> wrote:
> 
> > Hi All, I'd like to start a vote on this proposal:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > .
> > The discussion has been active for a bit more than 3 months and I think the
> > main points have been addressed. We have also moved some of the pieces into
> > follow-up proposals, such as KIP-630.
> >
> > Please keep in mind that the details are bound to change as all of
> > the pieces start coming together. As usual, we will keep this thread
> > notified of such changes.
> >
> > For me personally, this is super exciting since we have been thinking about
> > this work ever since I started working on Kafka! I am +1 of course.
> >
> > Best,
> > Jason
> >
> 
> 
> -- 
> 
> Ben Stopford
> 
> Lead Technologist, Office of the CTO
> 
> <https://www.confluent.io>
>

Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Posted by Ben Stopford <be...@confluent.io>.
+1

On Mon, 3 Aug 2020 at 19:03, Jason Gustafson <ja...@confluent.io> wrote:

> Hi All, I'd like to start a vote on this proposal:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> .
> The discussion has been active for a bit more than 3 months and I think the
> main points have been addressed. We have also moved some of the pieces into
> follow-up proposals, such as KIP-630.
>
> Please keep in mind that the details are bound to change as all of
> the pieces start coming together. As usual, we will keep this thread
> notified of such changes.
>
> For me personally, this is super exciting since we have been thinking about
> this work ever since I started working on Kafka! I am +1 of course.
>
> Best,
> Jason
>


-- 

Ben Stopford

Lead Technologist, Office of the CTO

<https://www.confluent.io>