Posted to dev@kafka.apache.org by Jason Gustafson <ja...@confluent.io> on 2020/08/03 16:19:50 UTC

Re: [DISCUSS] KIP-595: A Raft Protocol for the Metadata Quorum

Hey Jun,

> Thanks for the response. Should we make clusterId a nullable field
consistently in all new requests?

Yes, makes sense. I updated the proposal.

-Jason

On Fri, Jul 31, 2020 at 4:17 PM Jun Rao <ju...@confluent.io> wrote:

> Hi, Jason,
>
> Thanks for the response. Should we make clusterId a nullable field
> consistently in all new requests?
>
> Jun
>
> On Wed, Jul 29, 2020 at 12:20 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Jun,
> >
> > I added a section on "Cluster Bootstrapping" which discusses clusterId
> > generation and the process through which brokers find the current leader.
> > The quick summary is that the first controller will be responsible for
> > generating the clusterId and persisting it in the metadata log. Before
> the
> > first leader has been elected, quorum APIs will skip clusterId
> validation.
> > This seems reasonable since this is primarily intended to prevent the
> > damage from misconfiguration after a cluster has been running for some
> > time. Upon startup, brokers begin by sending Fetch requests to find the
> > current leader. This will include the cluster.id from meta.properties if
> > it
> > is present. The broker will shutdown immediately if it receives
> > INVALID_CLUSTER_ID from the Fetch response.
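
A minimal Java sketch of the clusterId check described above; the class and
names here are illustrative only, not the actual broker code:

    import java.util.Optional;

    // Toy model of the startup validation: compare the local cluster.id from
    // meta.properties against the clusterId the quorum has persisted.
    public class ClusterIdStartupCheck {
        enum Error { NONE, INVALID_CLUSTER_ID }

        static Error validate(Optional<String> localClusterId, Optional<String> quorumClusterId) {
            if (localClusterId.isEmpty() || quorumClusterId.isEmpty()) {
                return Error.NONE; // no id generated yet -> validation is skipped
            }
            return localClusterId.get().equals(quorumClusterId.get())
                    ? Error.NONE
                    : Error.INVALID_CLUSTER_ID; // the broker shuts down on this error
        }

        public static void main(String[] args) {
            System.out.println(validate(Optional.of("abc"), Optional.of("xyz")));
        }
    }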
> >
> > I also added some details about our testing strategy, which you asked
> about
> > previously.
> >
> > Thanks,
> > Jason
> >
> > On Mon, Jul 27, 2020 at 10:46 PM Boyang Chen <reluctanthero104@gmail.com
> >
> > wrote:
> >
> > > On Mon, Jul 27, 2020 at 4:58 AM Unmesh Joshi <un...@gmail.com>
> > > wrote:
> > >
> > > > I just checked the etcd and ZooKeeper code, and both support the leader
> > > > stepping down to a follower to make sure there are no two leaders if the
> > > > leader has been disconnected from the majority of the followers.
> > > > For etcd this is https://github.com/etcd-io/etcd/issues/3866
> > > > For ZooKeeper it's https://issues.apache.org/jira/browse/ZOOKEEPER-1699
> > > > I was just wondering if it would be difficult to implement in the
> > > > pull-based model, but I guess not. It is possibly handled the same way
> > > > the ISR list is managed currently: if the leader of the controller quorum
> > > > loses the majority of its followers, it should step down and become a
> > > > follower. That way it tells clients in time that it was disconnected from
> > > > the quorum, rather than keeping on sending stale metadata to clients.
> > > >
> > > > Thanks,
> > > > Unmesh
> > > >
> > > >
> > > > On Mon, Jul 27, 2020 at 9:31 AM Unmesh Joshi <un...@gmail.com>
> > > > wrote:
> > > >
> > > > > >>Could you clarify on this question? Which part of the raft group
> > > > doesn't
> > > > > >>know about leader dis-connection?
> > > > > The leader of the controller quorum is partitioned from the
> > controller
> > > > > cluster, and a different leader is elected for the remaining
> > controller
> > > > > cluster.
> > > >
> > > I see your concern. For the KIP-595 implementation, since there are no
> > > regular heartbeats sent from the leader to the followers, we decided to
> > > piggy-back on the fetch timeout, so that if the leader does not receive
> > > Fetch requests from a majority of the quorum for that amount of time, it
> > > will begin a new election and start sending VoteRequest to voter nodes in
> > > the cluster to understand the latest quorum. You can find more details in
> > > this section:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote
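
A small self-contained Java sketch of the fetch-timeout rule described above;
the names and numbers are made up for illustration and this is not the KIP's
actual code:

    import java.util.Map;

    // If the leader has not seen Fetch requests from a majority of the voters
    // within the fetch timeout, it gives up on its leadership and starts a new
    // election by sending VoteRequest to the other voters.
    public class FetchTimeoutCheck {
        static boolean shouldStartNewElection(Map<Integer, Long> lastFetchTimeMs,
                                              int voterCount,
                                              long nowMs,
                                              long fetchTimeoutMs) {
            long recentlyActive = lastFetchTimeMs.values().stream()
                    .filter(t -> nowMs - t <= fetchTimeoutMs)
                    .count() + 1; // +1 counts the leader itself
            return recentlyActive < voterCount / 2 + 1; // lost contact with the majority
        }

        public static void main(String[] args) {
            // Leader plus voters 2 and 3; voter 3 has been silent for too long.
            Map<Integer, Long> lastFetch = Map.of(2, 900L, 3, 100L);
            System.out.println(shouldStartNewElection(lastFetch, 3, 1000L, 500L)); // false
        }
    }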
> > >
> > >
> > > > > I think there are two things here:
> > > > > 1. The old leader will not know that it's disconnected from the rest
> > > > > of the controller quorum cluster unless it receives BeginQuorumEpoch
> > > > > from the new leader. So it will keep on serving stale metadata to the
> > > > > clients (brokers, producers and consumers).
> > > > > 2. I assume the broker leases will be managed on the controller quorum
> > > > > leader. This partitioned leader will keep on tracking the broker leases
> > > > > it has, while the new leader of the quorum will also start managing
> > > > > broker leases. So while the quorum leader is partitioned, there will be
> > > > > two membership views of the Kafka brokers, managed on two leaders.
> > > > > Unless broker heartbeats are also replicated as part of the Raft log,
> > > > > there is no way to solve this?
> > > > > I know the LogCabin implementation does replicate client heartbeats. I
> > > > > suspect that the same issue is there in ZooKeeper, which does not
> > > > > replicate client Ping requests.
> > > > >
> > > > > Thanks,
> > > > > Unmesh
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 27, 2020 at 6:23 AM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks for the questions Unmesh!
> > > > >>
> > > > >> On Sun, Jul 26, 2020 at 6:18 AM Unmesh Joshi <
> unmeshjoshi@gmail.com
> > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > In the FetchRequest handling, how do we make sure we handle
> > > > >> > scenarios where the leader might have been disconnected from the
> > > > >> > cluster, but doesn't know it yet?
> > > > >> >
> > > > >> Could you clarify this question? Which part of the Raft group doesn't
> > > > >> know about the leader's disconnection?
> > > > >>
> > > > >>
> > > > >> > As discussed in the Raft thesis section 6.4, the linearizable
> > > > >> > semantics of read requests is implemented in LogCabin by sending
> > > > >> > heartbeats to followers and waiting until the heartbeats succeed, to
> > > > >> > make sure that the leader is still the leader.
> > > > >> > I think for the controller quorum to make sure none of the consumers
> > > > >> > get stale data, it's important to have linearizable semantics? In the
> > > > >> > pull-based model, would the leader then need to wait for heartbeats
> > > > >> > from the followers before returning each fetch request from the
> > > > >> > consumer? Or do we need to introduce some other request?
> > > > >> > (ZooKeeper does not have linearizable semantics for read requests,
> > > > >> > but as of now all the Kafka interactions are through writes and
> > > > >> > watches).
> > > > >> >
> > > > >> This is a very good question. For our v1 implementation we are not
> > > > >> aiming to guarantee linearizable reads, which would be considered a
> > > > >> follow-up effort. Note that today in Kafka there is no guarantee on
> > > > >> metadata freshness either, so no regression is introduced.
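
A toy Java sketch of the heartbeat-based linearizable read described above
(Raft thesis 6.4). It is illustrative only; KIP-595 v1 does not implement it:

    import java.util.List;
    import java.util.function.Predicate;

    // Before answering a read, the leader confirms it is still the leader by
    // collecting acknowledgements from a majority of the voters.
    public class LinearizableReadCheck {
        static boolean canServeRead(List<Integer> voters,
                                    int leaderId,
                                    Predicate<Integer> heartbeatAcked) {
            long acks = voters.stream()
                    .filter(id -> id == leaderId || heartbeatAcked.test(id))
                    .count();
            return acks >= voters.size() / 2 + 1; // still leader of a majority
        }

        public static void main(String[] args) {
            List<Integer> voters = List.of(1, 2, 3);
            System.out.println(canServeRead(voters, 1, id -> id == 2)); // true
        }
    }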
> > > > >>
> > > > >>
> > > > >> > Thanks,
> > > > >> > Unmesh
> > > > >> >
> > > > >> > On Fri, Jul 24, 2020 at 11:36 PM Jun Rao <ju...@confluent.io>
> > wrote:
> > > > >> >
> > > > >> > > Hi, Jason,
> > > > >> > >
> > > > >> > > Thanks for the reply.
> > > > >> > >
> > > > >> > > 101. Sounds good. Regarding clusterId, I am not sure storing
> it
> > in
> > > > the
> > > > >> > > metadata log is enough. For example, the vote request includes
> > > > >> clusterId.
> > > > >> > > So, no one can vote until they know the clusterId. Also, it
> > would
> > > be
> > > > >> > useful
> > > > >> > > to support the case when a voter completely loses its disk and
> > > needs
> > > > >> to
> > > > >> > > recover.
> > > > >> > >
> > > > >> > > 210. There is no longer a FindQuorum request. When a follower
> > > > >> restarts,
> > > > >> > how
> > > > >> > > does it discover the leader? Is that based on DescribeQuorum?
> It
> > > > >> would be
> > > > >> > > useful to document this.
> > > > >> > >
> > > > >> > > Jun
> > > > >> > >
> > > > >> > > On Fri, Jul 17, 2020 at 2:15 PM Jason Gustafson <
> > > jason@confluent.io
> > > > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi Jun,
> > > > >> > > >
> > > > >> > > > Thanks for the questions.
> > > > >> > > >
> > > > >> > > > 101. I am treating some of the bootstrapping problems as out
> > of
> > > > the
> > > > >> > scope
> > > > >> > > > of this KIP. I am working on a separate proposal which
> > addresses
> > > > >> > > > bootstrapping security credentials specifically. Here is a
> > rough
> > > > >> sketch
> > > > >> > > of
> > > > >> > > > how I am seeing it:
> > > > >> > > >
> > > > >> > > > 1. Dynamic broker configurations including encrypted
> passwords
> > > > will
> > > > >> be
> > > > >> > > > persisted in the metadata log and cached in the broker's
> > > > >> > > `meta.properties`
> > > > >> > > > file.
> > > > >> > > > 2. We will provide a tool which allows users to directly
> > > override
> > > > >> the
> > > > >> > > > values in `meta.properties` without requiring access to the
> > > > quorum.
> > > > >> > This
> > > > >> > > > can be used to bootstrap the credentials of the voter set
> > itself
> > > > >> before
> > > > >> > > the
> > > > >> > > > cluster has been started.
> > > > >> > > > 3. Some dynamic config changes will only be allowed when a
> > > broker
> > > > is
> > > > >> > > > online. For example, changing a truststore password
> > dynamically
> > > > >> would
> > > > >> > > > prevent that broker from being able to start if it were
> > offline
> > > > when
> > > > >> > the
> > > > >> > > > change was made.
> > > > >> > > > 4. I am still thinking a little bit about SCRAM credentials,
> > but
> > > > >> most
> > > > >> > > > likely they will be handled with an approach similar to
> > > > >> > > `meta.properties`.
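
A rough, self-contained Java sketch of the kind of offline override tool
described in point 2 above. The file name, keys, and behavior here are
placeholders for illustration, not the final format:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Properties;

    // Overrides a single key in meta.properties directly on disk, without
    // requiring access to the quorum (e.g. to bootstrap voter credentials).
    public class MetaPropertiesOverride {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get(args[0]);        // e.g. /var/kafka/meta.properties
            String key = args[1], value = args[2]; // e.g. some credential property

            Properties props = new Properties();
            if (Files.exists(file)) {
                try (InputStream in = Files.newInputStream(file)) {
                    props.load(in);
                }
            }
            props.setProperty(key, value);
            try (OutputStream out = Files.newOutputStream(file)) {
                props.store(out, "overridden offline before the cluster started");
            }
        }
    }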
> > > > >> > > >
> > > > >> > > > 101.3 As for the question about `clusterId`, I think the way
> > we
> > > > >> would
> > > > >> > do
> > > > >> > > > this is to have the first elected leader generate a UUID and
> > > write
> > > > >> it
> > > > >> > to
> > > > >> > > > the metadata log. Let me add some detail to the proposal
> about
> > > > this.
> > > > >> > > >
> > > > >> > > > A few additional answers below:
> > > > >> > > >
> > > > >> > > > 203. Yes, that is correct.
> > > > >> > > >
> > > > >> > > > 204. That is a good question. What happens in this case is
> > that
> > > > all
> > > > >> > > voters
> > > > >> > > > advance their epoch to the one designated by the candidate
> > even
> > > if
> > > > >> they
> > > > >> > > > reject its vote request. Assuming the candidate fails to be
> > > > elected,
> > > > >> > the
> > > > >> > > > election will be retried until a leader emerges.
> > > > >> > > >
> > > > >> > > > 205. I had some discussion with Colin offline about this
> > > problem.
> > > > I
> > > > >> > think
> > > > >> > > > the answer should be "yes," but it probably needs a little
> > more
> > > > >> > thought.
> > > > >> > > > Handling JBOD failures is tricky. For an observer, we can
> > > > replicate
> > > > >> the
> > > > >> > > > metadata log from scratch safely in a new log dir. But if
> the
> > > log
> > > > >> dir
> > > > >> > of
> > > > >> > > a
> > > > >> > > > voter fails, I do not think it is generally safe to start
> from
> > > an
> > > > >> empty
> > > > >> > > > state.
> > > > >> > > >
> > > > >> > > > 206. Yes, that is discussed in KIP-631 I believe.
> > > > >> > > >
> > > > >> > > > 207. Good suggestion. I will work on this.
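
A toy Java sketch of the vote handling behavior described in 204 above: a
voter advances its epoch to the candidate's epoch even when it rejects the
vote. The field and method names are made up for illustration:

    public class VoteRequestHandling {
        int currentEpoch = 5;
        int votedFor = -1;

        boolean handleVote(int candidateId, int candidateEpoch, boolean candidateLogUpToDate) {
            if (candidateEpoch > currentEpoch) {
                currentEpoch = candidateEpoch; // advance the epoch even if we reject below
                votedFor = -1;
            }
            boolean granted = candidateEpoch >= currentEpoch
                    && (votedFor == -1 || votedFor == candidateId)
                    && candidateLogUpToDate;
            if (granted) votedFor = candidateId;
            return granted;
        }

        public static void main(String[] args) {
            VoteRequestHandling voter = new VoteRequestHandling();
            // The vote is rejected because the candidate's log is behind, but the
            // epoch still advances, so a retried election is not delayed.
            System.out.println(voter.handleVote(2, 7, false)); // false
            System.out.println(voter.currentEpoch);            // 7
        }
    }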
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > > Jason
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Thu, Jul 16, 2020 at 3:44 PM Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi, Jason,
> > > > >> > > > >
> > > > >> > > > > Thanks for the updated KIP. Looks good overall. A few more
> > > > >> comments
> > > > >> > > > below.
> > > > >> > > > >
> > > > >> > > > > 101. I still don't see a section on bootstrapping related
> > > > issues.
> > > > >> It
> > > > >> > > > would
> > > > >> > > > > be useful to document if/how the following is supported.
> > > > >> > > > > 101.1 Currently, we support auto broker id generation. Is
> > this
> > > > >> > > supported
> > > > >> > > > > for bootstrap brokers?
> > > > >> > > > > 101.2 As Colin mentioned, sometimes we may need to load the
> > > > >> > > > > security credentials to the broker before it can be connected
> > > > >> > > > > to. Could you provide a bit more detail on how this will work?
> > > > >> > > > > 101.3 Currently, we use ZK to generate clusterId on a new
> > > > cluster.
> > > > >> > With
> > > > >> > > > > Raft, how does every broker generate the same clusterId
> in a
> > > > >> > > distributed
> > > > >> > > > > way?
> > > > >> > > > >
> > > > >> > > > > 200. It would be useful to document if the various special
> > > > offsets
> > > > >> > (log
> > > > >> > > > > start offset, recovery point, HWM, etc) for the Raft log
> are
> > > > >> stored
> > > > >> > in
> > > > >> > > > the
> > > > >> > > > > same existing checkpoint files or not.
> > > > >> > > > > 200.1 Since the Raft log flushes every append, does that
> > allow
> > > > us
> > > > >> to
> > > > >> > > > > recover from a recovery point within the active segment or
> > do
> > > we
> > > > >> > still
> > > > >> > > > need
> > > > >> > > > > to scan the full segment including the recovery point? The
> > > > former
> > > > >> can
> > > > >> > > be
> > > > >> > > > > tricky since multiple records can fall into the same disk
> > page
> > > > >> and a
> > > > >> > > > > subsequent flush may corrupt a page with previously
> flushed
> > > > >> records.
> > > > >> > > > >
> > > > >> > > > > 201. Configurations.
> > > > >> > > > > 201.1 How do the Raft brokers get security related configs
> > for
> > > > >> inter
> > > > >> > > > broker
> > > > >> > > > > communication? Is that based on the existing
> > > > >> > > > > inter.broker.security.protocol?
> > > > >> > > > > 201.2 We have quorum.retry.backoff.max.ms and
> > > > >> > quorum.retry.backoff.ms,
> > > > >> > > > but
> > > > >> > > > > only quorum.election.backoff.max.ms. This seems a bit
> > > > >> inconsistent.
> > > > >> > > > >
> > > > >> > > > > 202. Metrics:
> > > > >> > > > > 202.1 TotalTimeMs, InboundQueueTimeMs, HandleTimeMs,
> > > > >> > > OutboundQueueTimeMs:
> > > > >> > > > > Are those the same as existing totalTime,
> requestQueueTime,
> > > > >> > localTime,
> > > > >> > > > > responseQueueTime? Could we reuse the existing ones with
> the
> > > tag
> > > > >> > > > > request=[request-type]?
> > > > >> > > > > 202.2. Could you explain what InboundChannelSize and
> > > > >> > > OutboundChannelSize
> > > > >> > > > > are?
> > > > >> > > > > 202.3 ElectionLatencyMax/Avg: It seems that both should be
> > > > >> windowed?
> > > > >> > > > >
> > > > >> > > > > 203. Quorum State: I assume that LeaderId will be kept
> > > > >> consistently
> > > > >> > > with
> > > > >> > > > > LeaderEpoch. For example, if a follower transitions to
> > > candidate
> > > > >> and
> > > > >> > > > bumps
> > > > >> > > > > up LeaderEpoch, it will set leaderId to -1 and persist
> both
> > in
> > > > the
> > > > >> > > Quorum
> > > > >> > > > > state file. Is that correct?
> > > > >> > > > >
> > > > >> > > > > 204. I was thinking about a corner case when a Raft broker
> > is
> > > > >> > > partitioned
> > > > >> > > > > off. This broker will then be in a continuous loop of
> > bumping
> > > up
> > > > >> the
> > > > >> > > > leader
> > > > >> > > > > epoch, but failing to get enough votes. When the
> > partitioning
> > > is
> > > > >> > > removed,
> > > > >> > > > > this broker's high leader epoch will force a leader election.
> > > > >> > > > > I assume other Raft brokers can immediately advance their
> > > > >> > > > > leader epoch past the already bumped epoch such that leader
> > > > >> > > > > election won't be delayed. Is that right?
> > > > >> > > > >
> > > > >> > > > > 205. In a JBOD setting, could we use the existing tool to
> > move
> > > > the
> > > > >> > Raft
> > > > >> > > > log
> > > > >> > > > > from one disk to another?
> > > > >> > > > >
> > > > >> > > > > 206. The KIP doesn't mention the local metadata store
> > derived
> > > > from
> > > > >> > the
> > > > >> > > > Raft
> > > > >> > > > > log. Will that be covered in a separate KIP?
> > > > >> > > > >
> > > > >> > > > > 207. Since this is a critical component. Could we add a
> > > section
> > > > on
> > > > >> > the
> > > > >> > > > > testing plan for correctness?
> > > > >> > > > >
> > > > >> > > > > 208. Performance. Do we plan to do group commit (e.g.
> buffer
> > > > >> pending
> > > > >> > > > > appends during a flush and then flush all accumulated
> > pending
> > > > >> records
> > > > >> > > > > together in the next flush) for better throughput?
> > > > >> > > > >
> > > > >> > > > > 209. "the leader can actually defer fsync until it knows
> > > > >> > "quorum.size -
> > > > >> > > > 1"
> > > > >> > > > > has get to a certain entry offset." Why is that
> > "quorum.size -
> > > > 1"
> > > > >> > > instead
> > > > >> > > > > of the majority of the quorum?
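
A toy Java sketch of the group-commit idea in 208 above: appends that arrive
while a flush is in progress are buffered and covered by the next flush, so a
single fsync serves many records. Purely illustrative; not the KIP's design:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class GroupCommitSketch {
        private final BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(1024);

        void append(byte[] record) throws InterruptedException {
            pending.put(record); // appenders never wait on fsync directly
        }

        void flushOnce() throws InterruptedException {
            List<byte[]> batch = new ArrayList<>();
            batch.add(pending.take()); // wait for at least one record
            pending.drainTo(batch);    // plus everything buffered in the meantime
            fsync(batch);              // a single fsync covers the whole batch
        }

        private void fsync(List<byte[]> batch) {
            System.out.println("flushed " + batch.size() + " records in one fsync");
        }

        public static void main(String[] args) throws InterruptedException {
            GroupCommitSketch log = new GroupCommitSketch();
            for (int i = 0; i < 5; i++) log.append(new byte[8]);
            log.flushOnce(); // prints: flushed 5 records in one fsync
        }
    }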
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Jun
> > > > >> > > > >
> > > > >> > > > > On Mon, Jul 13, 2020 at 9:43 AM Jason Gustafson <
> > > > >> jason@confluent.io>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi All,
> > > > >> > > > > >
> > > > >> > > > > > Just a quick update on the proposal. We have decided to
> > move
> > > > >> quorum
> > > > >> > > > > > reassignment to a separate KIP:
> > > > >> > > > > >
> > > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-642%3A+Dynamic+quorum+reassignment
> > > > >> > > > > > .
> > > > >> > > > > > The way this ties into cluster bootstrapping is
> > complicated,
> > > > so
> > > > >> we
> > > > >> > > felt
> > > > >> > > > > we
> > > > >> > > > > > needed a bit more time for validation. That leaves the
> > core
> > > of
> > > > >> this
> > > > >> > > > > > proposal as quorum-based replication. If there are no
> > > further
> > > > >> > > comments,
> > > > >> > > > > we
> > > > >> > > > > > will plan to start a vote later this week.
> > > > >> > > > > >
> > > > >> > > > > > Thanks,
> > > > >> > > > > > Jason
> > > > >> > > > > >
> > > > >> > > > > > On Wed, Jun 24, 2020 at 10:43 AM Guozhang Wang <
> > > > >> wangguoz@gmail.com
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > @Jun Rao <ju...@gmail.com>
> > > > >> > > > > > >
> > > > >> > > > > > > Regarding your comment about log compaction. After
> some
> > > > >> > deep-diving
> > > > >> > > > > into
> > > > >> > > > > > > this we've decided to propose a new snapshot-based log
> > > > >> cleaning
> > > > >> > > > > mechanism
> > > > >> > > > > > > which would be used to replace the current compaction
> > > > >> mechanism
> > > > >> > for
> > > > >> > > > > this
> > > > >> > > > > > > meta log. A new KIP will be proposed specifically for
> > this
> > > > >> idea.
> > > > >> > > > > > >
> > > > >> > > > > > > All,
> > > > >> > > > > > >
> > > > >> > > > > > > I've updated the KIP wiki a bit, renaming one config from
> > > > >> > > > > > > "election.jitter.max.ms" to "election.backoff.max.ms" to
> > > > >> > > > > > > make its usage clearer: the configured value will be the
> > > > >> > > > > > > upper bound of the binary exponential backoff time after a
> > > > >> > > > > > > failed election, before starting a new one.
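
A toy Java sketch of the election.backoff.max.ms semantics described above:
after each failed election the wait grows exponentially but is capped by the
configured maximum, with randomness to avoid repeated collisions. The helper
names and numbers are illustrative only:

    import java.util.concurrent.ThreadLocalRandom;

    public class ElectionBackoff {
        static long backoffMs(int failedAttempts, long baseMs, long electionBackoffMaxMs) {
            long exp = baseMs * (1L << Math.min(failedAttempts, 20)); // cap the shift
            long bound = Math.min(exp, electionBackoffMaxMs);         // upper bound from config
            return ThreadLocalRandom.current().nextLong(bound + 1);   // jitter in [0, bound]
        }

        public static void main(String[] args) {
            for (int attempt = 1; attempt <= 5; attempt++) {
                System.out.println(backoffMs(attempt, 50, 1000));
            }
        }
    }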
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > Guozhang
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > On Fri, Jun 12, 2020 at 9:26 AM Boyang Chen <
> > > > >> > > > > reluctanthero104@gmail.com>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Thanks for the suggestions Guozhang.
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Thu, Jun 11, 2020 at 2:51 PM Guozhang Wang <
> > > > >> > > wangguoz@gmail.com>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hello Boyang,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks for the updated information. A few
> questions
> > > > here:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 1) Should the quorum-file also update to support
> > > > >> multi-raft?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > I'm neutral about this, as we don't know yet how
> the
> > > > >> > multi-raft
> > > > >> > > > > > modules
> > > > >> > > > > > > > would behave. If
> > > > >> > > > > > > > we have different threads operating different raft
> > > groups,
> > > > >> > > > > > consolidating
> > > > >> > > > > > > > the `checkpoint` files seems
> > > > >> > > > > > > > not reasonable. We could always add
> > `multi-quorum-file`
> > > > >> later
> > > > >> > if
> > > > >> > > > > > > possible.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 2) In the previous proposal, there's fields in the
> > > > >> > > > FetchQuorumRecords
> > > > >> > > > > > > like
> > > > >> > > > > > > > > latestDirtyOffset, is that dropped intentionally?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > I dropped the latestDirtyOffset since it is
> > associated
> > > > >> with
> > > > >> > the
> > > > >> > > > log
> > > > >> > > > > > > > compaction discussion. This is beyond this KIP scope
> > and
> > > > we
> > > > >> > could
> > > > >> > > > > > > > potentially get a separate KIP to talk about it.
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > > 3) I think we also need to elaborate a bit more
> > > details
> > > > >> > > regarding
> > > > >> > > > > > when
> > > > >> > > > > > > to
> > > > >> > > > > > > > > send metadata request and discover-brokers;
> > currently
> > > we
> > > > >> only
> > > > >> > > > > > discussed
> > > > >> > > > > > > > > during bootstrap how these requests would be
> sent. I
> > > > think
> > > > >> > the
> > > > >> > > > > > > following
> > > > >> > > > > > > > > scenarios would also need these requests
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 3.a) As long as a broker does not know the current
> > > > quorum
> > > > >> > > > > (including
> > > > >> > > > > > > the
> > > > >> > > > > > > > > leader and the voters), it should continue
> > > periodically
> > > > >> ask
> > > > >> > > other
> > > > >> > > > > > > brokers
> > > > >> > > > > > > > > via "metadata.
> > > > >> > > > > > > > > 3.b) As long as a broker does not know all the
> > current
> > > > >> quorum
> > > > >> > > > > voter's
> > > > >> > > > > > > > > connections, it should continue periodically ask
> > other
> > > > >> > brokers
> > > > >> > > > via
> > > > >> > > > > > > > > "discover-brokers".
> > > > >> > > > > > > > > 3.c) When the leader's fetch timeout elapsed, it
> > > should
> > > > >> send
> > > > >> > > > > metadata
> > > > >> > > > > > > > > request.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Make sense, will add to the KIP.
> > > > >> > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Guozhang
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Wed, Jun 10, 2020 at 5:20 PM Boyang Chen <
> > > > >> > > > > > > reluctanthero104@gmail.com>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Hey all,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > follow-up on the previous email, we made some
> more
> > > > >> updates:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > 1. The Alter/DescribeQuorum RPCs are also
> > > > re-structured
> > > > >> to
> > > > >> > > use
> > > > >> > > > > > > > > multi-raft.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > 2. We add observer status into the
> > > > >> DescribeQuorumResponse
> > > > >> > as
> > > > >> > > we
> > > > >> > > > > see
> > > > >> > > > > > > it
> > > > >> > > > > > > > > is a
> > > > >> > > > > > > > > > low hanging fruit which is very useful for user
> > > > >> debugging
> > > > >> > and
> > > > >> > > > > > > > > reassignment.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > 3. The FindQuorum RPC is replaced with
> > > DiscoverBrokers
> > > > >> RPC,
> > > > >> > > > which
> > > > >> > > > > > is
> > > > >> > > > > > > > > purely
> > > > >> > > > > > > > > > in charge of discovering broker connections in a
> > > > gossip
> > > > >> > > manner.
> > > > >> > > > > The
> > > > >> > > > > > > > > quorum
> > > > >> > > > > > > > > > leader discovery is piggy-back on the Metadata
> RPC
> > > for
> > > > >> the
> > > > >> > > > topic
> > > > >> > > > > > > > > partition
> > > > >> > > > > > > > > > leader, which in our case is the single metadata
> > > > >> partition
> > > > >> > > for
> > > > >> > > > > the
> > > > >> > > > > > > > > version
> > > > >> > > > > > > > > > one.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Let me know if you have any questions.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Boyang
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Tue, Jun 9, 2020 at 11:01 PM Boyang Chen <
> > > > >> > > > > > > > reluctanthero104@gmail.com>
> > > > >> > > > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Hey all,
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thanks for the great discussions so far. I'm
> > > posting
> > > > >> some
> > > > >> > > KIP
> > > > >> > > > > > > updates
> > > > >> > > > > > > > > > from
> > > > >> > > > > > > > > > > our working group discussion:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > 1. We will be changing the core RPCs from
> > > > single-raft
> > > > >> API
> > > > >> > > to
> > > > >> > > > > > > > > multi-raft.
> > > > >> > > > > > > > > > > This means all protocols will be "batch" in
> the
> > > > first
> > > > >> > > > version,
> > > > >> > > > > > but
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > KIP
> > > > >> > > > > > > > > > > itself only illustrates the design for a
> single
> > > > >> metadata
> > > > >> > > > topic
> > > > >> > > > > > > > > partition.
> > > > >> > > > > > > > > > > The reason is to "keep the door open" for
> future
> > > > >> > extensions
> > > > >> > > > of
> > > > >> > > > > > this
> > > > >> > > > > > > > > piece
> > > > >> > > > > > > > > > > of module such as a sharded controller or
> > general
> > > > >> quorum
> > > > >> > > > based
> > > > >> > > > > > > topic
> > > > >> > > > > > > > > > > replication, beyond the current Kafka
> > replication
> > > > >> > protocol.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > 2. We will piggy-back on the current Kafka
> Fetch
> > > API
> > > > >> > > instead
> > > > >> > > > of
> > > > >> > > > > > > > > inventing
> > > > >> > > > > > > > > > > a new FetchQuorumRecords RPC. The motivation
> is
> > > > about
> > > > >> the
> > > > >> > > > same
> > > > >> > > > > as
> > > > >> > > > > > > #1
> > > > >> > > > > > > > as
> > > > >> > > > > > > > > > > well as making the integration work easier,
> > > instead
> > > > of
> > > > >> > > > letting
> > > > >> > > > > > two
> > > > >> > > > > > > > > > similar
> > > > >> > > > > > > > > > > RPCs diverge.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > 3. In the EndQuorumEpoch protocol, instead of
> > only
> > > > >> > sending
> > > > >> > > > the
> > > > >> > > > > > > > request
> > > > >> > > > > > > > > to
> > > > >> > > > > > > > > > > the most caught-up voter, we shall broadcast
> the
> > > > >> > > information
> > > > >> > > > to
> > > > >> > > > > > all
> > > > >> > > > > > > > > > voters,
> > > > >> > > > > > > > > > > with a sorted voter list in descending order
> of
> > > > their
> > > > >> > > > > > corresponding
> > > > >> > > > > > > > > > > replicated offset. In this way, the top voter
> > will
> > > > >> > become a
> > > > >> > > > > > > candidate
> > > > >> > > > > > > > > > > immediately, while the other voters shall wait
> > for
> > > > an
> > > > >> > > > > exponential
> > > > >> > > > > > > > > > back-off
> > > > >> > > > > > > > > > > to trigger elections, which helps ensure the
> top
> > > > voter
> > > > >> > gets
> > > > >> > > > > > > elected,
> > > > >> > > > > > > > > and
> > > > >> > > > > > > > > > > the election eventually happens when the top
> > voter
> > > > is
> > > > >> not
> > > > >> > > > > > > responsive.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Please see the updated KIP and post any
> > questions
> > > or
> > > > >> > > concerns
> > > > >> > > > > on
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > > mailing thread.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Boyang
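
A toy Java sketch of the EndQuorumEpoch change described in point 3 above: the
retiring leader ranks voters by replicated offset in descending order, the top
voter can start an election immediately, and lower-ranked voters wait extra
backoff intervals, so the most caught-up voter usually wins but another voter
still takes over if it is unresponsive. Names and numbers are illustrative:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SuccessorRanking {
        static List<Integer> preferredSuccessors(Map<Integer, Long> replicatedOffsets) {
            return replicatedOffsets.entrySet().stream()
                    .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder()))
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        static long electionDelayMs(List<Integer> ranking, int myId, long backoffMs) {
            return ranking.indexOf(myId) * backoffMs; // rank 0 starts immediately
        }

        public static void main(String[] args) {
            List<Integer> ranking = preferredSuccessors(Map.of(1, 120L, 2, 200L, 3, 150L));
            System.out.println(ranking);                          // [2, 3, 1]
            System.out.println(electionDelayMs(ranking, 3, 100)); // 100
        }
    }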
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Fri, May 8, 2020 at 5:26 PM Jun Rao <
> > > > >> jun@confluent.io
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > >> Hi, Guozhang and Jason,
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >> Thanks for the reply. A couple of more
> replies.
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >> 102. Still not sure about this. How is the
> > > > tombstone
> > > > >> > issue
> > > > >> > > > > > > addressed
> > > > >> > > > > > > > > in
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> non-voter and the observer.  They can die at
> > any
> > > > >> point
> > > > >> > and
> > > > >> > > > > > restart
> > > > >> > > > > > > > at
> > > > >> > > > > > > > > an
> > > > >> > > > > > > > > > >> arbitrary later time, and the advancing of
> the
> > > > >> > firstDirty
> > > > >> > > > > offset
> > > > >> > > > > > > and
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> removal of the tombstone can happen
> > > independently.
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >> 106. I agree that it would be less confusing
> if
> > > we
> > > > >> used
> > > > >> > > > > "epoch"
> > > > >> > > > > > > > > instead
> > > > >> > > > > > > > > > of
> > > > >> > > > > > > > > > >> "leader epoch" consistently.
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >> Jun
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >> On Thu, May 7, 2020 at 4:04 PM Guozhang Wang
> <
> > > > >> > > > > > wangguoz@gmail.com>
> > > > >> > > > > > > > > > wrote:
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >> > Thanks Jun. Further replies are in-lined.
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > On Mon, May 4, 2020 at 11:58 AM Jun Rao <
> > > > >> > > jun@confluent.io
> > > > >> > > > >
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > > Hi, Guozhang,
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > Thanks for the reply. A few more replies
> > > > inlined
> > > > >> > > below.
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > On Sun, May 3, 2020 at 6:33 PM Guozhang
> > Wang
> > > <
> > > > >> > > > > > > > wangguoz@gmail.com>
> > > > >> > > > > > > > > > >> wrote:
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > > Hello Jun,
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > Thanks for your comments! I'm replying
> > > inline
> > > > >> > below:
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > On Fri, May 1, 2020 at 12:36 PM Jun
> Rao <
> > > > >> > > > > jun@confluent.io
> > > > >> > > > > > >
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > > 101. Bootstrapping related issues.
> > > > >> > > > > > > > > > >> > > > > 101.1 Currently, we support auto
> broker
> > > id
> > > > >> > > > generation.
> > > > >> > > > > > Is
> > > > >> > > > > > > > this
> > > > >> > > > > > > > > > >> > > supported
> > > > >> > > > > > > > > > >> > > > > for bootstrap brokers?
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > The vote ids would just be the broker
> > ids.
> > > > >> > > > > > > "bootstrap.servers"
> > > > >> > > > > > > > > > >> would be
> > > > >> > > > > > > > > > >> > > > similar to what client configs have
> > today,
> > > > >> where
> > > > >> > > > > > > > "quorum.voters"
> > > > >> > > > > > > > > > >> would
> > > > >> > > > > > > > > > >> > be
> > > > >> > > > > > > > > > >> > > > pre-defined config values.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > My question was on the auto generated
> > broker
> > > > id.
> > > > >> > > > > Currently,
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > broker
> > > > >> > > > > > > > > > >> > can
> > > > >> > > > > > > > > > >> > > choose to have its broker Id auto
> > generated.
> > > > The
> > > > >> > > > > generation
> > > > >> > > > > > is
> > > > >> > > > > > > > > done
> > > > >> > > > > > > > > > >> > through
> > > > >> > > > > > > > > > >> > > ZK to guarantee uniqueness. Without ZK,
> > it's
> > > > not
> > > > >> > clear
> > > > >> > > > how
> > > > >> > > > > > the
> > > > >> > > > > > > > > > broker
> > > > >> > > > > > > > > > >> id
> > > > >> > > > > > > > > > >> > is
> > > > >> > > > > > > > > > >> > > auto generated. "quorum.voters" also
> can't
> > be
> > > > set
> > > > >> > > > > statically
> > > > >> > > > > > > if
> > > > >> > > > > > > > > > broker
> > > > >> > > > > > > > > > >> > ids
> > > > >> > > > > > > > > > >> > > are auto generated.
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > Jason has explained some ideas that we've
> > > > >> discussed
> > > > >> > so
> > > > >> > > > > far,
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > >> reason we
> > > > >> > intentionally did not include them so far is
> > that
> > > > we
> > > > >> > feel
> > > > >> > > it
> > > > >> > > > > is
> > > > >> > > > > > > > > outside
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > scope of KIP-595. Under the umbrella of
> > KIP-500
> > > > we
> > > > >> > > should
> > > > >> > > > > > > > definitely
> > > > >> > > > > > > > > > >> > address them though.
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > On the high-level, our belief is that
> > "joining
> > > a
> > > > >> > quorum"
> > > > >> > > > and
> > > > >> > > > > > > > > "joining
> > > > >> > > > > > > > > > >> (or
> > > > >> > > > > > > > > > >> > more specifically, registering brokers in)
> > the
> > > > >> > cluster"
> > > > >> > > > > would
> > > > >> > > > > > be
> > > > >> > > > > > > > > > >> > de-coupled a bit, where the former should
> be
> > > > >> completed
> > > > >> > > > > before
> > > > >> > > > > > we
> > > > >> > > > > > > > do
> > > > >> > > > > > > > > > the
> > > > >> > > > > > > > > > >> > latter. More specifically, assuming the
> > quorum
> > > is
> > > > >> > > already
> > > > >> > > > up
> > > > >> > > > > > and
> > > > >> > > > > > > > > > >> running,
> > > > >> > > > > > > > > > >> > after the newly started broker found the
> > leader
> > > > of
> > > > >> the
> > > > >> > > > > quorum
> > > > >> > > > > > it
> > > > >> > > > > > > > can
> > > > >> > > > > > > > > > >> send a
> > > > >> > > > > > > > > > >> > specific RegisterBroker request including
> its
> > > > >> > listener /
> > > > >> > > > > > > protocol
> > > > >> > > > > > > > /
> > > > >> > > > > > > > > > etc,
> > > > >> > > > > > > > > > >> > and upon handling it the leader can send
> back
> > > the
> > > > >> > > uniquely
> > > > >> > > > > > > > generated
> > > > >> > > > > > > > > > >> broker
> > > > >> > > > > > > > > > >> > id to the new broker, while also executing
> > the
> > > > >> > > > > > "startNewBroker"
> > > > >> > > > > > > > > > >> callback as
> > > > >> > > > > > > > > > >> > the controller.
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > > > 102. Log compaction. One weak spot of
> > log
> > > > >> > > compaction
> > > > >> > > > > is
> > > > >> > > > > > > for
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> > > consumer
> > > > >> > > > > > > > > > >> > > > to
> > > > >> > > > > > > > > > >> > > > > deal with deletes. When a key is
> > deleted,
> > > > >> it's
> > > > >> > > > > retained
> > > > >> > > > > > > as a
> > > > >> > > > > > > > > > >> > tombstone
> > > > >> > > > > > > > > > >> > > > > first and then physically removed.
> If a
> > > > >> client
> > > > >> > > > misses
> > > > >> > > > > > the
> > > > >> > > > > > > > > > >> tombstone
> > > > >> > > > > > > > > > >> > > > > (because it's physically removed), it
> > may
> > > > >> not be
> > > > >> > > > able
> > > > >> > > > > to
> > > > >> > > > > > > > > update
> > > > >> > > > > > > > > > >> its
> > > > >> > > > > > > > > > >> > > > > metadata properly. The way we solve
> > this
> > > in
> > > > >> > Kafka
> > > > >> > > is
> > > > >> > > > > > based
> > > > >> > > > > > > > on
> > > > >> > > > > > > > > a
> > > > >> > > > > > > > > > >> > > > > configuration (
> > > > >> log.cleaner.delete.retention.ms)
> > > > >> > > and
> > > > >> > > > > we
> > > > >> > > > > > > > > expect a
> > > > >> > > > > > > > > > >> > > consumer
> > > > >> > > > > > > > > > >> > > > > having seen an old key to finish
> > reading
> > > > the
> > > > >> > > > deletion
> > > > >> > > > > > > > > tombstone
> > > > >> > > > > > > > > > >> > within
> > > > >> > > > > > > > > > >> > > > that
> > > > >> > > > > > > > > > >> > > > > time. There is no strong guarantee
> for
> > > that
> > > > >> > since
> > > > >> > > a
> > > > >> > > > > > broker
> > > > >> > > > > > > > > could
> > > > >> > > > > > > > > > >> be
> > > > >> > > > > > > > > > >> > > down
> > > > >> > > > > > > > > > >> > > > > for a long time. It would be better
> if
> > we
> > > > can
> > > > >> > > have a
> > > > >> > > > > > more
> > > > >> > > > > > > > > > reliable
> > > > >> > > > > > > > > > >> > way
> > > > >> > > > > > > > > > >> > > of
> > > > >> > > > > > > > > > >> > > > > dealing with deletes.
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > We propose to capture this in the
> > > > >> > "FirstDirtyOffset"
> > > > >> > > > > field
> > > > >> > > > > > > of
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> > quorum
> > > > >> > > > > > > > > > >> > > > record fetch response: the offset is
> the
> > > > >> maximum
> > > > >> > > > offset
> > > > >> > > > > > that
> > > > >> > > > > > > > log
> > > > >> > > > > > > > > > >> > > compaction
> > > > >> > > > > > > > > > >> > > > has reached up to. If the follower has
> > > > fetched
> > > > >> > > beyond
> > > > >> > > > > this
> > > > >> > > > > > > > > offset
> > > > >> > > > > > > > > > it
> > > > >> > > > > > > > > > >> > > means
> > > > >> > > > > > > > > > >> > > > itself is safe hence it has seen all
> > > records
> > > > >> up to
> > > > >> > > > that
> > > > >> > > > > > > > offset.
> > > > >> > > > > > > > > On
> > > > >> > > > > > > > > > >> > > getting
> > > > >> > > > > > > > > > >> > > > the response, the follower can then
> > decide
> > > if
> > > > >> its
> > > > >> > > end
> > > > >> > > > > > offset
> > > > >> > > > > > > > > > >> actually
> > > > >> > > > > > > > > > >> > > below
> > > > >> > > > > > > > > > >> > > > that dirty offset (and hence may miss
> > some
> > > > >> > > > tombstones).
> > > > >> > > > > If
> > > > >> > > > > > > > > that's
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > > case:
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > 1) Naively, it could re-bootstrap
> > metadata
> > > > log
> > > > >> > from
> > > > >> > > > the
> > > > >> > > > > > very
> > > > >> > > > > > > > > > >> beginning
> > > > >> > > > > > > > > > >> > to
> > > > >> > > > > > > > > > >> > > > catch up.
> > > > >> > > > > > > > > > >> > > > 2) During that time, it would refrain
> > > itself
> > > > >> from
> > > > >> > > > > > answering
> > > > >> > > > > > > > > > >> > > MetadataRequest
> > > > >> > > > > > > > > > >> > > > from any clients.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > I am not sure that the "FirstDirtyOffset"
> > > field
> > > > >> > fully
> > > > >> > > > > > > addresses
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> > issue.
> > > > >> > > > > > > > > > >> > > Currently, the deletion tombstone is not
> > > > removed
> > > > >> > > > > immediately
> > > > >> > > > > > > > > after a
> > > > >> > > > > > > > > > >> > round
> > > > >> > > > > > > > > > >> > > of cleaning. It's removed after a delay
> in
> > a
> > > > >> > > subsequent
> > > > >> > > > > > round
> > > > >> > > > > > > of
> > > > >> > > > > > > > > > >> > cleaning.
> > > > >> > > > > > > > > > >> > > Consider an example where a key insertion
> > is
> > > at
> > > > >> > offset
> > > > >> > > > 200
> > > > >> > > > > > > and a
> > > > >> > > > > > > > > > >> deletion
> > > > >> > > > > > > > > > >> > > tombstone of the key is at 400.
> Initially,
> > > > >> > > > > FirstDirtyOffset
> > > > >> > > > > > is
> > > > >> > > > > > > > at
> > > > >> > > > > > > > > > >> 300. A
> > > > >> > > > > > > > > > >> > > follower/observer fetches from offset 0
> > and
> > > > >> fetches
> > > > >> > > the
> > > > >> > > > > key
> > > > >> > > > > > > at
> > > > >> > > > > > > > > > offset
> > > > >> > > > > > > > > > >> > 200.
> > > > >> > > > > > > > > > >> > > A few rounds of cleaning happen.
> > > > >> FirstDirtyOffset is
> > > > >> > > at
> > > > >> > > > > 500
> > > > >> > > > > > > and
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> > > tombstone at 400 is physically removed.
> The
> > > > >> > > > > > follower/observer
> > > > >> > > > > > > > > > >> continues
> > > > >> > > > > > > > > > >> > the
> > > > >> > > > > > > > > > >> > > fetch, but misses offset 400. It catches
> > all
> > > > the
> > > > >> way
> > > > >> > > to
> > > > >> > > > > > > > > > >> FirstDirtyOffset
> > > > >> > > > > > > > > > >> > > and declares its metadata as ready.
> > However,
> > > > its
> > > > >> > > > metadata
> > > > >> > > > > > > could
> > > > >> > > > > > > > be
> > > > >> > > > > > > > > > >> stale
> > > > >> > > > > > > > > > >> > > since it actually misses the deletion of
> > the
> > > > key.
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > Yeah good question, I should have put
> more
> > > > >> details
> > > > >> > in
> > > > >> > > my
> > > > >> > > > > > > > > explanation
> > > > >> > > > > > > > > > >> :)
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > The idea is that we will adjust the log
> > > > compaction
> > > > >> for
> > > > >> > > > this
> > > > >> > > > > > raft
> > > > >> > > > > > > > > based
> > > > >> > > > > > > > > > >> > metadata log: before more details to be
> > > > explained,
> > > > >> > since
> > > > >> > > > we
> > > > >> > > > > > have
> > > > >> > > > > > > > two
> > > > >> > > > > > > > > > >> types
> > > > >> > > > > > > > > > >> > of "watermarks" here, whereas in Kafka the
> > > > >> watermark
> > > > >> > > > > indicates
> > > > >> > > > > > > > where
> > > > >> > > > > > > > > > >> every
> > > > >> > > > > > > > > > >> > replica have replicated up to and in Raft
> the
> > > > >> > watermark
> > > > >> > > > > > > indicates
> > > > >> > > > > > > > > > where
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > majority of replicas (here only indicating
> > > voters
> > > > >> of
> > > > >> > the
> > > > >> > > > > > quorum,
> > > > >> > > > > > > > not
> > > > >> > > > > > > > > > >> > counting observers) have replicated up to,
> > > let's
> > > > >> call
> > > > >> > > them
> > > > >> > > > > > Kafka
> > > > >> > > > > > > > > > >> watermark
> > > > >> > > > > > > > > > >> > and Raft watermark. For this special log,
> we
> > > > would
> > > > >> > > > maintain
> > > > >> > > > > > both
> > > > >> > > > > > > > > > >> > watermarks.
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > When log compacting on the leader, we would
> > > only
> > > > >> > compact
> > > > >> > > > up
> > > > >> > > > > to
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > Kafka
> > > > >> > > > > > > > > > >> > watermark, i.e. if there is at least one
> > voter
> > > > who
> > > > >> > have
> > > > >> > > > not
> > > > >> > > > > > > > > replicated
> > > > >> > > > > > > > > > >> an
> > > > >> > > > > > > > > > >> > entry, it would not be compacted. The
> > > > >> "dirty-offset"
> > > > >> > is
> > > > >> > > > the
> > > > >> > > > > > > offset
> > > > >> > > > > > > > > > that
> > > > >> > > > > > > > > > >> > we've compacted up to and is communicated
> to
> > > > other
> > > > >> > > voters,
> > > > >> > > > > and
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > other
> > > > >> > > > > > > > > > >> > voters would also compact up to this value
> > ---
> > > > i.e.
> > > > >> > the
> > > > >> > > > > > > difference
> > > > >> > > > > > > > > > here
> > > > >> > > > > > > > > > >> is
> > > > >> > > > > > > > > > >> > that instead of letting each replica doing
> > log
> > > > >> > > compaction
> > > > >> > > > > > > > > > independently,
> > > > >> > > > > > > > > > >> > we'll have the leader to decide upon which
> > > offset
> > > > >> to
> > > > >> > > > compact
> > > > >> > > > > > to,
> > > > >> > > > > > > > and
> > > > >> > > > > > > > > > >> > propagate this value to others to follow,
> in
> > a
> > > > more
> > > > >> > > > > > coordinated
> > > > >> > > > > > > > > > manner.
> > > > >> > > > > > > > > > >> > Also note when there are new voters joining
> > the
> > > > >> quorum
> > > > >> > > who
> > > > >> > > > > has
> > > > >> > > > > > > not
> > > > >> > > > > > > > > > >> > replicated up to the dirty-offset, of
> because
> > > of
> > > > >> other
> > > > >> > > > > issues
> > > > >> > > > > > > they
> > > > >> > > > > > > > > > >> > truncated their logs to below the
> > dirty-offset,
> > > > >> they'd
> > > > >> > > > have
> > > > >> > > > > to
> > > > >> > > > > > > > > > >> re-bootstrap
> > > > >> > > > > > > > > > >> > from the beginning, and during this period
> of
> > > > time
> > > > >> the
> > > > >> > > > > leader
> > > > >> > > > > > > > > learned
> > > > >> > > > > > > > > > >> about
> > > > >> > > > > > > > > > >> > this lagging voter would not advance the
> > > > watermark
> > > > >> > (also
> > > > >> > > > it
> > > > >> > > > > > > would
> > > > >> > > > > > > > > not
> > > > >> > > > > > > > > > >> > decrement it), and hence not compacting
> > either,
> > > > >> until
> > > > >> > > the
> > > > >> > > > > > > voter(s)
> > > > >> > > > > > > > > has
> > > > >> > > > > > > > > > >> > caught up to that dirty-offset.
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > So back to your example above, before the
> > > > bootstrap
> > > > >> > > voter
> > > > >> > > > > gets
> > > > >> > > > > > > to
> > > > >> > > > > > > > > 300
> > > > >> > > > > > > > > > no
> > > > >> > > > > > > > > > >> > log compaction would happen on the leader;
> > and
> > > > >> until
> > > > >> > > later
> > > > >> > > > > > when
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > >> voter
> > > > >> > > > > > > > > > >> > have got to beyond 400 and hence replicated
> > > that
> > > > >> > > > tombstone,
> > > > >> > > > > > the
> > > > >> > > > > > > > log
> > > > >> > > > > > > > > > >> > compaction would possibly get to that
> > tombstone
> > > > and
> > > > >> > > remove
> > > > >> > > > > it.
> > > > >> > > > > > > Say
> > > > >> > > > > > > > > > >> later it
> > > > >> > > > > > > > > > >> > the leader's log compaction reaches 500, it
> > can
> > > > >> send
> > > > >> > > this
> > > > >> > > > > back
> > > > >> > > > > > > to
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> voter
> > > > >> > > > > > > > > > >> > who can then also compact locally up to
> 500.
> > > > >> > > > > > > > > > >> >
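
A toy Java sketch of the coordinated compaction described above: the leader
only compacts up to the offset that every voter has replicated, and that value
becomes the dirty offset it propagates for the voters to compact to as well.
Illustrative only; not the final design:

    import java.util.Collection;
    import java.util.List;

    public class CoordinatedCompaction {
        static long compactUpTo(Collection<Long> voterReplicatedOffsets, long currentDirtyOffset) {
            long minReplicated = voterReplicatedOffsets.stream()
                    .mapToLong(Long::longValue)
                    .min()
                    .orElse(currentDirtyOffset);
            // The dirty offset never moves backwards for a lagging or new voter.
            return Math.max(currentDirtyOffset, minReplicated);
        }

        public static void main(String[] args) {
            // Voters are at offsets 450, 520 and 390: compaction may only reach 390.
            System.out.println(compactUpTo(List.of(450L, 520L, 390L), 300L)); // 390
        }
    }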
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > > > > 105. Quorum State: In addition to
> > > VotedId,
> > > > >> do we
> > > > >> > > > need
> > > > >> > > > > > the
> > > > >> > > > > > > > > epoch
> > > > >> > > > > > > > > > >> > > > > corresponding to VotedId? Over time,
> > the
> > > > same
> > > > >> > > broker
> > > > >> > > > > Id
> > > > >> > > > > > > > could
> > > > >> > > > > > > > > be
> > > > >> > > > > > > > > > >> > voted
> > > > >> > > > > > > > > > >> > > in
> > > > >> > > > > > > > > > >> > > > > different generations with different
> > > epoch.
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > Hmm, this is a good point. Originally I
> > > think
> > > > >> the
> > > > >> > > > > > > > "LeaderEpoch"
> > > > >> > > > > > > > > > >> field
> > > > >> > > > > > > > > > >> > in
> > > > >> > > > > > > > > > >> > > > that file is corresponding to the
> "latest
> > > > known
> > > > >> > > leader
> > > > >> > > > > > > epoch",
> > > > >> > > > > > > > > not
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > > > "current leader epoch". For example, if
> > the
> > > > >> > current
> > > > >> > > > > epoch
> > > > >> > > > > > is
> > > > >> > > > > > > > N,
> > > > >> > > > > > > > > > and
> > > > >> > > > > > > > > > >> > then
> > > > >> > > > > > > > > > >> > > a
> > > > >> > > > > > > > > > >> > > > vote-request with epoch N+1 is received
> > and
> > > > the
> > > > >> > > voter
> > > > >> > > > > > > granted
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > >> vote
> > > > >> > > > > > > > > > >> > > for
> > > > >> > > > > > > > > > >> > > > it, then it means for this voter it
> knows
> > > the
> > > > >> > > "latest
> > > > >> > > > > > epoch"
> > > > >> > > > > > > > is
> > > > >> > > > > > > > > N
> > > > >> > > > > > > > > > +
> > > > >> > > > > > > > > > >> 1
> > > > >> > > > > > > > > > >> > > > although it is unknown if that sending
> > > > >> candidate
> > > > >> > > will
> > > > >> > > > > > indeed
> > > > >> > > > > > > > > > become
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > > new
> > > > >> > > > > > > > > > >> > > > leader (which would only be notified
> via
> > > > >> > > begin-quorum
> > > > >> > > > > > > > request).
> > > > >> > > > > > > > > > >> > However,
> > > > >> > > > > > > > > > >> > > > when persisting the quorum state, we
> > would
> > > > >> encode
> > > > >> > > > > > > leader-epoch
> > > > >> > > > > > > > > to
> > > > >> > > > > > > > > > >> N+1,
> > > > >> > > > > > > > > > >> > > > while the leaderId to be the older
> > leader.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > But now thinking about this a bit
> more, I
> > > > feel
> > > > >> we
> > > > >> > > > should
> > > > >> > > > > > use
> > > > >> > > > > > > > two
> > > > >> > > > > > > > > > >> > separate
> > > > >> > > > > > > > > > >> > > > epochs, one for the "lates known" and
> one
> > > for
> > > > >> the
> > > > >> > > > > > "current"
> > > > >> > > > > > > to
> > > > >> > > > > > > > > > pair
> > > > >> > > > > > > > > > >> > with
> > > > >> > > > > > > > > > >> > > > the leaderId. I will update the wiki
> > page.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > Hmm, it's kind of weird to bump up the
> > leader
> > > > >> epoch
> > > > >> > > > before
> > > > >> > > > > > the
> > > > >> > > > > > > > new
> > > > >> > > > > > > > > > >> leader
> > > > >> > > > > > > > > > >> > > is actually elected, right.
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > > > 106. "OFFSET_OUT_OF_RANGE: Used in
> the
> > > > >> > > > > > FetchQuorumRecords
> > > > >> > > > > > > > API
> > > > >> > > > > > > > > to
> > > > >> > > > > > > > > > >> > > indicate
> > > > >> > > > > > > > > > >> > > > > that the follower has fetched from an
> > > > invalid
> > > > >> > > offset
> > > > >> > > > > and
> > > > >> > > > > > > > > should
> > > > >> > > > > > > > > > >> > > truncate
> > > > >> > > > > > > > > > >> > > > to
> > > > >> > > > > > > > > > >> > > > > the offset/epoch indicated in the
> > > > response."
> > > > >> > > > Observers
> > > > >> > > > > > > can't
> > > > >> > > > > > > > > > >> truncate
> > > > >> > > > > > > > > > >> > > > their
> > > > >> > > > > > > > > > >> > > > > logs. What should they do with
> > > > >> > > OFFSET_OUT_OF_RANGE?
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > I'm not sure if I understand your
> > question?
> > > > >> > > Observers
> > > > >> > > > > > should
> > > > >> > > > > > > > > still
> > > > >> > > > > > > > > > >> be
> > > > >> > > > > > > > > > >> > > able
> > > > >> > > > > > > > > > >> > > > to truncate their logs as well.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > Hmm, I thought only the quorum nodes have
> > > local
> > > > >> logs
> > > > >> > > and
> > > > >> > > > > > > > observers
> > > > >> > > > > > > > > > >> don't?
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > > 107. "The leader will continue sending
> > > > >> > > > BeginQuorumEpoch
> > > > >> > > > > to
> > > > >> > > > > > > > each
> > > > >> > > > > > > > > > >> known
> > > > >> > > > > > > > > > >> > > > voter
> > > > >> > > > > > > > > > >> > > > > until it has received its
> endorsement."
> > > If
> > > > a
> > > > >> > voter
> > > > >> > > > is
> > > > >> > > > > > down
> > > > >> > > > > > > > > for a
> > > > >> > > > > > > > > > >> long
> > > > >> > > > > > > > > > >> > > > time,
> > > > >> > > > > > > > > > >> > > > > sending BeginQuorumEpoch seems to add
> > > > >> > unnecessary
> > > > >> > > > > > > overhead.
> > > > >> > > > > > > > > > >> > Similarly,
> > > > >> > > > > > > > > > >> > > > if a
> > > > >> > > > > > > > > > >> > > > > follower stops sending
> > > FetchQuorumRecords,
> > > > >> does
> > > > >> > > the
> > > > >> > > > > > leader
> > > > >> > > > > > > > > keep
> > > > >> > > > > > > > > > >> > sending
> > > > >> > > > > > > > > > >> > > > > BeginQuorumEpoch?
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > Regarding BeginQuorumEpoch: that is a
> > good
> > > > >> point.
> > > > >> > > The
> > > > >> > > > > > > > > > >> > begin-quorum-epoch
> > > > >> > > > > > > > > > >> > > > request is for voters to quickly get
> the
> > > new
> > > > >> > leader
> > > > >> > > > > > > > information;
> > > > >> > > > > > > > > > >> > however
> > > > >> > > > > > > > > > >> > > > even if they do not get them they can
> > still
> > > > >> > > eventually
> > > > >> > > > > > learn
> > > > >> > > > > > > > > about
> > > > >> > > > > > > > > > >> that
> > > > >> > > > > > > > > > >> > > > from others via gossiping FindQuorum. I
> > > think
> > > > >> we
> > > > >> > can
> > > > >> > > > > > adjust
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > >> logic
> > > > >> > > > > > > > > > >> > to
> > > > >> > > > > > > > > > >> > > > e.g. exponential back-off or with a
> > limited
> > > > >> > > > num.retries.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > Regarding FetchQuorumRecords: if the
> > > follower
> > > > >> > sends
> > > > >> > > > > > > > > > >> FetchQuorumRecords
> > > > >> > > > > > > > > > >> > > > already, it means that follower already
> > > knows
> > > > >> that
> > > > >> > > the
> > > > >> > > > > > > broker
> > > > >> > > > > > > > is
> > > > >> > > > > > > > > > the
> > > > >> > > > > > > > > > >> > > > leader, and hence we can stop retrying
> > > > >> > > > BeginQuorumEpoch;
> > > > >> > > > > > > > however
> > > > >> > > > > > > > > > it
> > > > >> > > > > > > > > > >> is
> > > > >> > > > > > > > > > >> > > > possible that after a follower sends
> > > > >> > > > FetchQuorumRecords
> > > > >> > > > > > > > already,
> > > > >> > > > > > > > > > >> > suddenly
> > > > >> > > > > > > > > > >> > > > it stops sending it (possibly because it
> > > learned
> > > > >> > about
> > > > >> > > a
> > > > >> > > > > > higher
> > > > >> > > > > > > > > epoch
> > > > >> > > > > > > > > > >> > > leader),
> > > > >> > > > > > > > > > >> > > > and hence this broker may be a "zombie"
> > > > leader
> > > > >> and
> > > > >> > > we
> > > > >> > > > > > > propose
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > use
> > > > >> > > > > > > > > > >> > the
> > > > >> > > > > > > > > > >> > > > fetch.timeout to let the leader try to verify whether it has become stale.
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > >
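To make the fetch.timeout idea above concrete, here is a minimal sketch of a leader-side staleness check, assuming the leader records the last Fetch time per voter; all names are illustrative, not from the actual implementation:

    import java.util.Map;

    // Hypothetical: the leader verifies that a majority of voters have fetched
    // recently before continuing to act as leader.
    class LeaderFetchTimeoutCheck {
        static boolean shouldStepDown(long nowMs,
                                      long fetchTimeoutMs,
                                      Map<Integer, Long> lastFetchTimeMsByVoter,
                                      int voterCount) {
            long recentlyFetched = lastFetchTimeMsByVoter.values().stream()
                    .filter(lastFetchMs -> nowMs - lastFetchMs <= fetchTimeoutMs)
                    .count();
            // The leader counts itself towards the majority.
            boolean hasMajority = recentlyFetched + 1 > voterCount / 2;
            return !hasMajority; // true => this leader may be a zombie and should re-verify its leadership
        }
    }

If the check returns true, the leader would stop assuming it is still current and re-verify its leadership, matching the behavior described above.
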
> > > > >> > > > > > > > > > >> > > It just seems that we should handle these
> > two
> > > > >> cases
> > > > >> > > in a
> > > > >> > > > > > > > > consistent
> > > > >> > > > > > > > > > >> way?
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > Yes I agree, on the leader's side, the
> > > > >> > > > FetchQuorumRecords
> > > > >> > > > > > > from a
> > > > >> > > > > > > > > > >> follower
> > > > >> > > > > > > > > > >> > could mean that we no longer need to send
> > > > >> > > > BeginQuorumEpoch
> > > > >> > > > > > > > anymore
> > > > >> > > > > > > > > > ---
> > > > >> > > > > > > > > > >> and
> > > > >> > > > > > > > > > >> > it is already part of our current
> > > implementations
> > > > >> in
> > > > >> > > > > > > > > > >> >
> > > > >> > > https://github.com/confluentinc/kafka/commits/kafka-raft
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > > Thanks,
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > Jun
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > > Jun
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > > On Wed, Apr 29, 2020 at 8:56 PM
> > Guozhang
> > > > >> Wang <
> > > > >> > > > > > > > > > wangguoz@gmail.com
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > > > wrote:
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > > > > Hello Leonard,
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > Thanks for your comments, I'm replying inline below:
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > On Wed, Apr 29, 2020 at 1:58 AM
> Wang
> > > > >> (Leonard)
> > > > >> > > Ge
> > > > >> > > > <
> > > > >> > > > > > > > > > >> > wge@confluent.io>
> > > > >> > > > > > > > > > >> > > > > > wrote:
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > > Hi Kafka developers,
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > It's great to see this proposal
> and
> > > it
> > > > >> took
> > > > >> > me
> > > > >> > > > > some
> > > > >> > > > > > > time
> > > > >> > > > > > > > > to
> > > > >> > > > > > > > > > >> > finish
> > > > >> > > > > > > > > > >> > > > > > reading
> > > > >> > > > > > > > > > >> > > > > > > it.
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > And I have the following
> questions
> > > > about
> > > > >> the
> > > > >> > > > > > Proposal:
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > >    - How do we plan to test this
> > > design
> > > > >> to
> > > > >> > > > ensure
> > > > >> > > > > > its
> > > > >> > > > > > > > > > >> > correctness?
> > > > >> > > > > > > > > > >> > > Or
> > > > >> > > > > > > > > > >> > > > > > more
> > > > >> > > > > > > > > > >> > > > > > >    broadly, how do we ensure that
> > our
> > > > new
> > > > >> > > ‘pull’
> > > > >> > > > > > based
> > > > >> > > > > > > > > model
> > > > >> > > > > > > > > > >> is
> > > > >> > > > > > > > > > >> > > > > > functional
> > > > >> > > > > > > > > > >> > > > > > > and
> > > > >> > > > > > > > > > >> > > > > > >    correct given that it is
> > different
> > > > >> from
> > > > >> > the
> > > > >> > > > > > > original
> > > > >> > > > > > > > > RAFT
> > > > >> > > > > > > > > > >> > > > > > implementation
> > > > >> > > > > > > > > > >> > > > > > >    which has formal proof of
> > > > correctness?
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > We have two planned verifications
> on
> > > the
> > > > >> > > > correctness
> > > > >> > > > > > and
> > > > >> > > > > > > > > > >> liveness
> > > > >> > > > > > > > > > >> > of
> > > > >> > > > > > > > > > >> > > > the
> > > > >> > > > > > > > > > >> > > > > > design. One is via model
> verification
> > > > >> (TLA+)
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > https://github.com/guozhangwang/kafka-specification
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > Another is via the concurrent
> > > simulation
> > > > >> tests
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/confluentinc/kafka/commit/5c0c054597d2d9f458cad0cad846b0671efa2d91
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > >    - Have we considered any
> sensible
> > > > >> defaults
> > > > >> > > for
> > > > >> > > > > the
> > > > >> > > > > > > > > > >> > configuration,
> > > > >> > > > > > > > > > >> > > > i.e.
> > > > >> > > > > > > > > > >> > > > > > >    all the election timeout,
> fetch
> > > time
> > > > >> out,
> > > > >> > > > etc.?
> > > > >> > > > > > Or
> > > > >> > > > > > > we
> > > > >> > > > > > > > > > want
> > > > >> > > > > > > > > > >> to
> > > > >> > > > > > > > > > >> > > > leave
> > > > >> > > > > > > > > > >> > > > > > > this to
> > > > >> > > > > > > > > > >> > > > > > >    a later stage when we do the
> > > > >> performance
> > > > >> > > > > testing,
> > > > >> > > > > > > > etc.
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > This is a good question; the reason
> > we
> > > > did
> > > > >> not
> > > > >> > > set
> > > > >> > > > > any
> > > > >> > > > > > > > > default
> > > > >> > > > > > > > > > >> > values
> > > > >> > > > > > > > > > >> > > > for
> > > > >> > > > > > > > > > >> > > > > > the timeout configurations is that
> we
> > > > >> think it
> > > > >> > > may
> > > > >> > > > > > take
> > > > >> > > > > > > > some
> > > > >> > > > > > > > > > >> > > > benchmarking
> > > > >> > > > > > > > > > >> > > > > > experiments to get these defaults
> > > right.
> > > > >> Some
> > > > >> > > > > > high-level
> > > > >> > > > > > > > > > >> principles
> > > > >> > > > > > > > > > >> > > to
> > > > >> > > > > > > > > > >> > > > > > consider: 1) the fetch.timeout
> should
> > > be
> > > > >> > around
> > > > >> > > > the
> > > > >> > > > > > same
> > > > >> > > > > > > > > scale
> > > > >> > > > > > > > > > >> as
> > > > >> > > > > > > > > > >> > > zk
> > > > >> > > > > > > > > > >> > > > > > session timeout, which is now 18
> > > seconds
> > > > by
> > > > >> > > > default
> > > > >> > > > > --
> > > > >> > > > > > > in
> > > > >> > > > > > > > > > >> practice
> > > > >> > > > > > > > > > >> > > > we've
> > > > >> > > > > > > > > > >> > > > > > seen unstable networks having more
> > than
> > > > 10
> > > > >> > secs
> > > > >> > > of
> > > > >> > > > > > > > > > >> > > > > transient connectivity loss,
> > > > >> > > > > > > > > > >> > > > > > 2) the election.timeout, however,
> > > should
> > > > be
> > > > >> > > > smaller
> > > > >> > > > > > than
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > >> fetch
> > > > >> > > > > > > > > > >> > > > > timeout
> > > > >> > > > > > > > > > >> > > > > as is also suggested as a practical optimization in the literature:
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > >
> > > > >> https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > Some more discussions can be found
> > > here:
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > >
> > https://github.com/confluentinc/kafka/pull/301/files#r415420081
> > > > >> > > > > > > > > > >> > > > > >
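To make the two principles concrete, an illustrative configuration honoring them could look like the sketch below; the quorum.* property names follow the naming used in this discussion, and both values are assumptions rather than proposed defaults:

    import java.util.Properties;

    // Illustrative values only -- NOT proposed defaults.
    class QuorumTimeoutExample {
        static Properties exampleQuorumConfig() {
            Properties props = new Properties();
            props.setProperty("quorum.fetch.timeout.ms", "15000");   // roughly the scale of the zk session timeout
            props.setProperty("quorum.election.timeout.ms", "5000"); // deliberately smaller than the fetch timeout
            return props;
        }
    }
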
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > >    - Have we considered
> > piggybacking
> > > > >> > > > > > > `BeginQuorumEpoch`
> > > > >> > > > > > > > > with
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > `
> > > > >> > > > > > > > > > >> > > > > > >    FetchQuorumRecords`? I might
> be
> > > > >> missing
> > > > >> > > > > something
> > > > >> > > > > > > > > obvious
> > > > >> > > > > > > > > > >> but
> > > > >> > > > > > > > > > >> > I
> > > > >> > > > > > > > > > >> > > am
> > > > >> > > > > > > > > > >> > > > > > just
> > > > >> > > > > > > > > > >> > > > > > >    wondering why we don't just use
> > > the
> > > > >> > > > > `FindQuorum`
> > > > >> > > > > > > and
> > > > >> > > > > > > > > > >> > > > > > > `FetchQuorumRecords`
> > > > >> > > > > > > > > > >> > > > > > >    APIs and remove the
> > > > `BeginQuorumEpoch`
> > > > >> > API?
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > Note that Begin/EndQuorumEpoch is
> > sent
> > > > from
> > > > >> > > leader
> > > > >> > > > > ->
> > > > >> > > > > > > > other
> > > > >> > > > > > > > > > >> voter
> > > > >> > > > > > > > > > >> > > > > > followers, while FindQuorum / Fetch
> > are
> > > > >> sent
> > > > >> > > from
> > > > >> > > > > > > follower
> > > > >> > > > > > > > > to
> > > > >> > > > > > > > > > >> > leader.
> > > > >> > > > > > > > > > >> > > > > > Arguably one can eventually learn about
> > the
> > > > new
> > > > >> > > leader
> > > > >> > > > > and
> > > > >> > > > > > > > epoch
> > > > >> > > > > > > > > > via
> > > > >> > > > > > > > > > >> > > > > gossiping
> > > > >> > > > > > > > > > >> > > > > > FindQuorum, but that could in
> > practice
> > > > >> > require a
> > > > >> > > > > long
> > > > >> > > > > > > > delay.
> > > > >> > > > > > > > > > >> > Having a
> > > > >> > > > > > > > > > >> > > > > > leader -> other voters request
> helps
> > > the
> > > > >> new
> > > > >> > > > leader
> > > > >> > > > > > > epoch
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > be
> > > > >> > > > > > > > > > >> > > > > propagated
> > > > >> > > > > > > > > > >> > > > > > faster under a pull model.
> > > > >> > > > > > > > > > >> > > > > >
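As a small illustration of the leader -> voter direction (names invented for the sketch; this is not the actual implementation), a freshly elected leader can announce the new epoch to every other voter right away instead of waiting for each voter's next FindQuorum round:

    import java.util.Set;

    // Hypothetical: on winning an election, the new leader immediately notifies all voters.
    class NewLeaderNotifier {
        private final int localId;

        NewLeaderNotifier(int localId) {
            this.localId = localId;
        }

        void onElectionWon(int newEpoch, Set<Integer> voterIds) {
            for (int voterId : voterIds) {
                if (voterId != localId) {
                    sendBeginQuorumEpoch(voterId, newEpoch); // retried until the voter endorses it
                }
            }
        }

        private void sendBeginQuorumEpoch(int voterId, int epoch) {
            // Placeholder for the actual network send.
        }
    }
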
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > >    - And about the
> > > `FetchQuorumRecords`
> > > > >> > > response
> > > > >> > > > > > > schema,
> > > > >> > > > > > > > > in
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > > > > `Records`
> > > > >> > > > > > > > > > >> > > > > > >    field of the response, is it
> > just
> > > > one
> > > > >> > > record
> > > > >> > > > or
> > > > >> > > > > > all
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > >> > records
> > > > >> > > > > > > > > > >> > > > > > starting
> > > > >> > > > > > > > > > >> > > > > > >    from the FetchOffset? It
> seems a
> > > lot
> > > > >> more
> > > > >> > > > > > efficient
> > > > >> > > > > > > > if
> > > > >> > > > > > > > > we
> > > > >> > > > > > > > > > >> sent
> > > > >> > > > > > > > > > >> > > all
> > > > >> > > > > > > > > > >> > > > > the
> > > > >> > > > > > > > > > >> > > > > > >    records during the
> bootstrapping
> > > of
> > > > >> the
> > > > >> > > > > brokers.
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > Yes the fetching is batched:
> > > FetchOffset
> > > > is
> > > > >> > just
> > > > >> > > > the
> > > > >> > > > > > > > > starting
> > > > >> > > > > > > > > > >> > offset
> > > > >> > > > > > > > > > >> > > of
> > > > >> > > > > > > > > > >> > > > > the
> > > > >> > > > > > > > > > >> > > > > > batch of records.
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > >
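For illustration, a follower consuming such a batched response might look like the sketch below; the types and names are assumptions for the example rather than the KIP schema:

    import java.util.List;

    // Hypothetical follower-side handling of a batched FetchQuorumRecords response.
    class FetchResponseHandler {
        interface Record { }

        interface ReplicatedLog {
            void append(long offset, Record record);
        }

        /** Appends the batch starting at fetchOffset and returns the next offset to fetch. */
        static long handleFetchResponse(long fetchOffset, List<Record> records, ReplicatedLog log) {
            long nextOffset = fetchOffset;
            for (Record record : records) { // a contiguous batch starting at fetchOffset
                log.append(nextOffset, record);
                nextOffset++;
            }
            return nextOffset; // used as the FetchOffset of the next FetchQuorumRecords request
        }
    }
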
> > > > >> > > > > > > > > > >> > > > > > >    - Regarding the disruptive
> > broker
> > > > >> > > > issue,
> > > > >> > > > does
> > > > >> > > > > > our
> > > > >> > > > > > > > pull
> > > > >> > > > > > > > > > >> based
> > > > >> > > > > > > > > > >> > > > model
> > > > >> > > > > > > > > > >> > > > > > >    suffer from it? If so, have we
> > > > >> considered
> > > > >> > > the
> > > > >> > > > > > > > Pre-Vote
> > > > >> > > > > > > > > > >> stage?
> > > > >> > > > > > > > > > >> > If
> > > > >> > > > > > > > > > >> > > > > not,
> > > > >> > > > > > > > > > >> > > > > > > why?
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > The disruptive broker problem is described in the original Raft paper and is a result of the push model design.
> > > > >> > > > > > > > > > >> > > > > > Our analysis showed that with the pull model it is no longer an issue.
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > > Thanks a lot for putting this up,
> > > and I
> > > > >> hope
> > > > >> > > > that
> > > > >> > > > > my
> > > > >> > > > > > > > > > questions
> > > > >> > > > > > > > > > >> > can
> > > > >> > > > > > > > > > >> > > be
> > > > >> > > > > > > > > > >> > > > > of
> > > > >> > > > > > > > > > >> > > > > > > some value to make this KIP
> better.
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > Hope to hear from you soon!
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > Best wishes,
> > > > >> > > > > > > > > > >> > > > > > > Leonard
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > On Wed, Apr 29, 2020 at 1:46 AM
> > Colin
> > > > >> > McCabe <
> > > > >> > > > > > > > > > >> cmccabe@apache.org
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> > > > > wrote:
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > Hi Jason,
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > It's amazing to see this coming
> > > > >> together
> > > > >> > :)
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > I haven't had a chance to read
> in
> > > > >> detail,
> > > > >> > > but
> > > > >> > > > I
> > > > >> > > > > > read
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > >> > outline
> > > > >> > > > > > > > > > >> > > > and
> > > > >> > > > > > > > > > >> > > > > a
> > > > >> > > > > > > > > > >> > > > > > > few
> > > > >> > > > > > > > > > >> > > > > > > > things jumped out at me.
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > First, for every epoch that is
> 32
> > > > bits
> > > > >> > > rather
> > > > >> > > > > than
> > > > >> > > > > > > > 64, I
> > > > >> > > > > > > > > > >> sort
> > > > >> > > > > > > > > > >> > of
> > > > >> > > > > > > > > > >> > > > > wonder
> > > > >> > > > > > > > > > >> > > > > > > if
> > > > >> > > > > > > > > > >> > > > > > > > that's a good long-term choice.
> > I
> > > > keep
> > > > >> > > > reading
> > > > >> > > > > > > about
> > > > >> > > > > > > > > > stuff
> > > > >> > > > > > > > > > >> > like
> > > > >> > > > > > > > > > >> > > > > this:
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1277
> > > > >> > > > > > > > .
> > > > >> > > > > > > > > > >> > > Obviously,
> > > > >> > > > > > > > > > >> > > > > > that
> > > > >> > > > > > > > > > >> > > > > > > > JIRA is about zxid, which
> > > increments
> > > > >> much
> > > > >> > > > faster
> > > > >> > > > > > > than
> > > > >> > > > > > > > we
> > > > >> > > > > > > > > > >> expect
> > > > >> > > > > > > > > > >> > > > these
> > > > >> > > > > > > > > > >> > > > > > > > leader epochs to, but it would
> > > still
> > > > be
> > > > >> > good
> > > > >> > > > to
> > > > >> > > > > > see
> > > > >> > > > > > > > some
> > > > >> > > > > > > > > > >> rough
> > > > >> > > > > > > > > > >> > > > > > > calculations
> > > > >> > > > > > > > > > >> > > > > > > > about how long 32 bits (or
> > really,
> > > 31
> > > > >> > bits)
> > > > >> > > > will
> > > > >> > > > > > > last
> > > > >> > > > > > > > us
> > > > >> > > > > > > > > > in
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > > > cases
> > > > >> > > > > > > > > > >> > > > > > > where
> > > > >> > > > > > > > > > >> > > > > > > > we're using it, and what the
> > space
> > > > >> savings
> > > > >> > > > we're
> > > > >> > > > > > > > getting
> > > > >> > > > > > > > > > >> really
> > > > >> > > > > > > > > > >> > > is.
> > > > >> > > > > > > > > > >> > > > > It
> > > > >> > > > > > > > > > >> > > > > > > > seems like in most cases the
> > > tradeoff
> > > > >> may
> > > > >> > > not
> > > > >> > > > be
> > > > >> > > > > > > worth
> > > > >> > > > > > > > > it?
> > > > >> > > > > > > > > > >> > > > > > > >
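One rough calculation of the kind requested here, under a deliberately pessimistic assumption of one leader election per second (real election rates should be far lower):

    // Back-of-the-envelope check on the 32-bit epoch question; the election rate
    // below is an assumption chosen to be pessimistic, not a measured number.
    class EpochHeadroom {
        public static void main(String[] args) {
            long maxEpochs = Integer.MAX_VALUE;        // 2^31 - 1 = 2,147,483,647 usable epochs
            double electionsPerSecond = 1.0;           // pessimistic: one leader election every second
            double years = maxEpochs / electionsPerSecond / (365.25 * 24 * 3600);
            System.out.printf("At %.1f elections/sec, 31 bits of epoch last ~%.0f years%n",
                    electionsPerSecond, years);        // prints roughly 68 years
        }
    }

Even at that rate the epoch space lasts decades; the zxid counter in the linked JIRA advances far faster because it increments per transaction, which is why it overflows much sooner.
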
> > > > >> > > > > > > > > > >> > > > > > > > Another thing I've been
> thinking
> > > > about
> > > > >> is
> > > > >> > > how
> > > > >> > > > we
> > > > >> > > > > > do
> > > > >> > > > > > > > > > >> > > > bootstrapping.  I
> > > > >> > > > > > > > > > >> > > > > > > > would prefer to be in a world
> > where
> > > > >> > > > formatting a
> > > > >> > > > > > new
> > > > >> > > > > > > > > Kafka
> > > > >> > > > > > > > > > >> node
> > > > >> > > > > > > > > > >> > > > was a
> > > > >> > > > > > > > > > >> > > > > > > first
> > > > >> > > > > > > > > > >> > > > > > > > class operation explicitly
> > > initiated
> > > > by
> > > > >> > the
> > > > >> > > > > admin,
> > > > >> > > > > > > > > rather
> > > > >> > > > > > > > > > >> than
> > > > >> > > > > > > > > > >> > > > > > something
> > > > >> > > > > > > > > > >> > > > > > > > that happened implicitly when
> you
> > > > >> started
> > > > >> > up
> > > > >> > > > the
> > > > >> > > > > > > > broker
> > > > >> > > > > > > > > > and
> > > > >> > > > > > > > > > >> > > things
> > > > >> > > > > > > > > > >> > > > > > > "looked
> > > > >> > > > > > > > > > >> > > > > > > > blank."
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > The first problem is that
> things
> > > can
> > > > >> "look
> > > > >> > > > > blank"
> > > > >> > > > > > > > > > >> accidentally
> > > > >> > > > > > > > > > >> > if
> > > > >> > > > > > > > > > >> > > > the
> > > > >> > > > > > > > > > >> > > > > > > > storage system is having a bad
> > day.
> > > > >> > Clearly
> > > > >> > > > in
> > > > >> > > > > > the
> > > > >> > > > > > > > > > non-Raft
> > > > >> > > > > > > > > > >> > > world,
> > > > >> > > > > > > > > > >> > > > > > this
> > > > >> > > > > > > > > > >> > > > > > > > leads to data loss if the
> broker
> > > that
> > > > >> is
> > > > >> > > > > > (re)started
> > > > >> > > > > > > > > this
> > > > >> > > > > > > > > > >> way
> > > > >> > > > > > > > > > >> > was
> > > > >> > > > > > > > > > >> > > > the
> > > > >> > > > > > > > > > >> > > > > > > > leader for some partitions.
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > The second problem is that we
> > have
> > > a
> > > > >> bit
> > > > >> > of
> > > > >> > > a
> > > > >> > > > > > > chicken
> > > > >> > > > > > > > > and
> > > > >> > > > > > > > > > >> egg
> > > > >> > > > > > > > > > >> > > > problem
> > > > >> > > > > > > > > > >> > > > > > > with
> > > > >> > > > > > > > > > >> > > > > > > > certain configuration keys.
> For
> > > > >> example,
> > > > >> > > > maybe
> > > > >> > > > > > you
> > > > >> > > > > > > > want
> > > > >> > > > > > > > > > to
> > > > >> > > > > > > > > > >> > > > configure
> > > > >> > > > > > > > > > >> > > > > > > some
> > > > >> > > > > > > > > > >> > > > > > > > connection security settings in
> > > your
> > > > >> > > cluster,
> > > > >> > > > > but
> > > > >> > > > > > > you
> > > > >> > > > > > > > > > don't
> > > > >> > > > > > > > > > >> > want
> > > > >> > > > > > > > > > >> > > > them
> > > > >> > > > > > > > > > >> > > > > > to
> > > > >> > > > > > > > > > >> > > > > > > > ever be stored in a plaintext
> > > config
> > > > >> file.
> > > > >> > > > (For
> > > > >> > > > > > > > > example,
> > > > >> > > > > > > > > > >> SCRAM
> > > > >> > > > > > > > > > >> > > > > > > passwords,
> > > > >> > > > > > > > > > >> > > > > > > > etc.)  You could use a broker
> API
> > > to
> > > > >> set
> > > > >> > the
> > > > >> > > > > > > > > > configuration,
> > > > >> > > > > > > > > > >> but
> > > > >> > > > > > > > > > >> > > > that
> > > > >> > > > > > > > > > >> > > > > > > brings
> > > > >> > > > > > > > > > >> > > > > > > > up the chicken and egg problem.
> > > The
> > > > >> > broker
> > > > >> > > > > needs
> > > > >> > > > > > to
> > > > >> > > > > > > > be
> > > > >> > > > > > > > > > >> > > configured
> > > > >> > > > > > > > > > >> > > > to
> > > > >> > > > > > > > > > >> > > > > > > know
> > > > >> > > > > > > > > > >> > > > > > > > how to talk to you, but you
> need
> > to
> > > > >> > > configure
> > > > >> > > > it
> > > > >> > > > > > > > before
> > > > >> > > > > > > > > > you
> > > > >> > > > > > > > > > >> can
> > > > >> > > > > > > > > > >> > > > talk
> > > > >> > > > > > > > > > >> > > > > to
> > > > >> > > > > > > > > > >> > > > > > > > it.  Using an external secret
> > > manager
> > > > >> like
> > > > >> > > > Vault
> > > > >> > > > > > is
> > > > >> > > > > > > > one
> > > > >> > > > > > > > > > way
> > > > >> > > > > > > > > > >> to
> > > > >> > > > > > > > > > >> > > > solve
> > > > >> > > > > > > > > > >> > > > > > > this,
> > > > >> > > > > > > > > > >> > > > > > > > but not everyone uses an
> external
> > > > >> secret
> > > > >> > > > > manager.
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > quorum.voters seems like a
> > similar
> > > > >> > > > configuration
> > > > >> > > > > > > key.
> > > > >> > > > > > > > > In
> > > > >> > > > > > > > > > >> the
> > > > >> > > > > > > > > > >> > > > current
> > > > >> > > > > > > > > > >> > > > > > > KIP,
> > > > >> > > > > > > > > > >> > > > > > > > this is only read if there is
> no
> > > > other
> > > > >> > > > > > configuration
> > > > >> > > > > > > > > > >> specifying
> > > > >> > > > > > > > > > >> > > the
> > > > >> > > > > > > > > > >> > > > > > > quorum
> > > > >> > > > > > > > > > >> > > > > > > > voter set.  If we had a
> > kafka.mkfs
> > > > >> > command,
> > > > >> > > we
> > > > >> > > > > > > > wouldn't
> > > > >> > > > > > > > > > need
> > > > >> > > > > > > > > > >> > this
> > > > >> > > > > > > > > > >> > > > key
> > > > >> > > > > > > > > > >> > > > > > > > because we could assume that
> > there
> > > > was
> > > > >> > > always
> > > > >> > > > > > quorum
> > > > >> > > > > > > > > > >> > information
> > > > >> > > > > > > > > > >> > > > > stored
> > > > >> > > > > > > > > > >> > > > > > > > locally.
> > > > >> > > > > > > > > > >> > > > > > > >
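For reference, a static voter set of the kind quorum.voters describes might look like the sketch below; the id@host:port layout and the hosts/ids shown are assumptions for illustration:

    import java.util.Properties;

    // Illustrative only: a made-up three-voter configuration.
    class QuorumVotersExample {
        static Properties exampleVoterConfig() {
            Properties props = new Properties();
            props.setProperty("quorum.voters", "1@kafka-1:9092,2@kafka-2:9092,3@kafka-3:9092");
            return props;
        }
    }
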
> > > > >> > > > > > > > > > >> > > > > > > > best,
> > > > >> > > > > > > > > > >> > > > > > > > Colin
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > On Thu, Apr 16, 2020, at 16:44,
> > > Jason
> > > > >> > > > Gustafson
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > > > > >> > > > > > > > > Hi All,
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > > I'd like to start a
> discussion
> > on
> > > > >> > KIP-595:
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > > >> > > > > > > > > > >> > > > > > > > .
> > > > >> > > > > > > > > > >> > > > > > > > > This proposal specifies a
> Raft
> > > > >> protocol
> > > > >> > to
> > > > >> > > > > > > > ultimately
> > > > >> > > > > > > > > > >> replace
> > > > >> > > > > > > > > > >> > > > > > Zookeeper
> > > > >> > > > > > > > > > >> > > > > > > > > as
> > > > >> > > > > > > > > > >> > > > > > > > > documented in KIP-500. Please
> > > take
> > > > a
> > > > >> > look
> > > > >> > > > and
> > > > >> > > > > > > share
> > > > >> > > > > > > > > your
> > > > >> > > > > > > > > > >> > > > thoughts.
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > > A few minor notes to set the
> > > stage
> > > > a
> > > > >> > > little
> > > > >> > > > > bit:
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > > - This KIP does not specify
> the
> > > > >> > structure
> > > > >> > > of
> > > > >> > > > > the
> > > > >> > > > > > > > > > messages
> > > > >> > > > > > > > > > >> > used
> > > > >> > > > > > > > > > >> > > to
> > > > >> > > > > > > > > > >> > > > > > > > represent
> > > > >> > > > > > > > > > >> > > > > > > > > metadata in Kafka, nor does
> it
> > > > >> specify
> > > > >> > the
> > > > >> > > > > > > internal
> > > > >> > > > > > > > > API
> > > > >> > > > > > > > > > >> that
> > > > >> > > > > > > > > > >> > > will
> > > > >> > > > > > > > > > >> > > > > be
> > > > >> > > > > > > > > > >> > > > > > > used
> > > > >> > > > > > > > > > >> > > > > > > > > by the controller. Expect
> these
> > > to
> > > > >> come
> > > > >> > in
> > > > >> > > > > later
> > > > >> > > > > > > > > > >> proposals.
> > > > >> > > > > > > > > > >> > > Here
> > > > >> > > > > > > > > > >> > > > we
> > > > >> > > > > > > > > > >> > > > > > are
> > > > >> > > > > > > > > > >> > > > > > > > > primarily concerned with the
> > > > >> replication
> > > > >> > > > > > protocol
> > > > >> > > > > > > > and
> > > > >> > > > > > > > > > >> basic
> > > > >> > > > > > > > > > >> > > > > > operational
> > > > >> > > > > > > > > > >> > > > > > > > > mechanics.
> > > > >> > > > > > > > > > >> > > > > > > > > - We expect many details to
> > > change
> > > > >> as we
> > > > >> > > get
> > > > >> > > > > > > closer
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > >> > > > integration
> > > > >> > > > > > > > > > >> > > > > > with
> > > > >> > > > > > > > > > >> > > > > > > > > the controller. Any changes
> we
> > > make
> > > > >> will
> > > > >> > > be
> > > > >> > > > > made
> > > > >> > > > > > > > > either
> > > > >> > > > > > > > > > as
> > > > >> > > > > > > > > > >> > > > > amendments
> > > > >> > > > > > > > > > >> > > > > > > to
> > > > >> > > > > > > > > > >> > > > > > > > > this KIP or, in the case of
> > > larger
> > > > >> > > changes,
> > > > >> > > > as
> > > > >> > > > > > new
> > > > >> > > > > > > > > > >> proposals.
> > > > >> > > > > > > > > > >> > > > > > > > > - We have a prototype
> > > > implementation
> > > > >> > > which I
> > > > >> > > > > > will
> > > > >> > > > > > > > put
> > > > >> > > > > > > > > > >> online
> > > > >> > > > > > > > > > >> > > > within
> > > > >> > > > > > > > > > >> > > > > > the
> > > > >> > > > > > > > > > >> > > > > > > > > next week, which may help in
> > > > >> > understanding
> > > > >> > > > some
> > > > >> > > > > > > > > details.
> > > > >> > > > > > > > > > It
> > > > >> > > > > > > > > > >> > has
> > > > >> > > > > > > > > > >> > > > > > > diverged a
> > > > >> > > > > > > > > > >> > > > > > > > > little bit from our proposal,
> > so
> > > I
> > > > am
> > > > >> > > > taking a
> > > > >> > > > > > > > little
> > > > >> > > > > > > > > > >> time to
> > > > >> > > > > > > > > > >> > > > bring
> > > > >> > > > > > > > > > >> > > > > > it
> > > > >> > > > > > > > > > >> > > > > > > in
> > > > >> > > > > > > > > > >> > > > > > > > > line. I'll post an update to
> > this
> > > > >> thread
> > > > >> > > > when
> > > > >> > > > > it
> > > > >> > > > > > > is
> > > > >> > > > > > > > > > >> available
> > > > >> > > > > > > > > > >> > > for
> > > > >> > > > > > > > > > >> > > > > > > review.
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > > Finally, I want to mention
> that
> > > > this
> > > > >> > > > proposal
> > > > >> > > > > > was
> > > > >> > > > > > > > > > drafted
> > > > >> > > > > > > > > > >> by
> > > > >> > > > > > > > > > >> > > > > myself,
> > > > >> > > > > > > > > > >> > > > > > > > Boyang
> > > > >> > > > > > > > > > >> > > > > > > > > Chen, and Guozhang Wang.
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > > > Thanks,
> > > > >> > > > > > > > > > >> > > > > > > > > Jason
> > > > >> > > > > > > > > > >> > > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > >
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > > > --
> > > > >> > > > > > > > > > >> > > > > > > Leonard Ge
> > > > >> > > > > > > > > > >> > > > > > > Software Engineer Intern -
> > Confluent
> > > > >> > > > > > > > > > >> > > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > > > --
> > > > >> > > > > > > > > > >> > > > > > -- Guozhang
> > > > >> > > > > > > > > > >> > > > > >
> > > > >> > > > > > > > > > >> > > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > > > --
> > > > >> > > > > > > > > > >> > > > -- Guozhang
> > > > >> > > > > > > > > > >> > > >
> > > > >> > > > > > > > > > >> > >
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >> > --
> > > > >> > > > > > > > > > >> > -- Guozhang
> > > > >> > > > > > > > > > >> >
> > > > >> > > > > > > > > > >>
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > > -- Guozhang
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > --
> > > > >> > > > > > > -- Guozhang
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>