Posted to dev@kafka.apache.org by Dong Lin <li...@gmail.com> on 2018/06/05 02:03:00 UTC

Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Hey Jun,

It seems that we have made considerable progress on the discussion of
KIP-253 since February. Do you think we should continue the discussion
there, or can we continue the voting for this KIP? I am happy to submit the
PR and move forward the progress for this KIP.

Thanks!
Dong


On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <li...@gmail.com> wrote:

> Hey Jun,
>
> Sure, I will come up with a KIP this week. I think there is a way to allow
> partition expansion to arbitrary number without introducing new concepts
> such as read-only partition or repartition epoch.
>
> Thanks,
> Dong
>
> On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <ju...@confluent.io> wrote:
>
>> Hi, Dong,
>>
>> Thanks for the reply. The general idea that you had for adding partitions
>> is similar to what we had in mind. It would be useful to make this more
>> general, allowing adding an arbitrary number of partitions (instead of
>> just
>> doubling) and potentially removing partitions as well. The following is
>> the
>> high level idea from the discussion with Colin, Jason and Ismael.
>>
>> * To change the number of partitions from X to Y in a topic, the
>> controller
>> marks all existing X partitions as read-only and creates Y new partitions.
>> The new partitions are writable and are tagged with a higher repartition
>> epoch (RE).
>>
>> * The controller propagates the new metadata to every broker. Once the
>> leader of a partition is marked as read-only, it rejects the produce
>> requests on this partition. The producer will then refresh the metadata
>> and
>> start publishing to the new writable partitions.
>>
>> * The consumers will then be consuming messages in RE order. The consumer
>> coordinator will only assign partitions in the same RE to consumers. Only
>> after all messages in an RE are consumed, will partitions in a higher RE
>> be
>> assigned to consumers.
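>>
>> As a rough illustration of that last rule (hypothetical helper, not actual
>> Kafka coordinator code), the RE-ordered selection could look roughly like
>> this in Java:
>>
>>   import java.util.*;
>>   import java.util.function.Predicate;
>>   import org.apache.kafka.common.TopicPartition;
>>
>>   // partitionsByRe: repartition epoch (RE) -> partitions tagged with it,
>>   // sorted by RE. fullyConsumed: true once a partition is fully consumed.
>>   static List<TopicPartition> assignable(
>>       SortedMap<Integer, List<TopicPartition>> partitionsByRe,
>>       Predicate<TopicPartition> fullyConsumed) {
>>     for (Map.Entry<Integer, List<TopicPartition>> e : partitionsByRe.entrySet()) {
>>       if (!e.getValue().stream().allMatch(fullyConsumed)) {
>>         // Lowest RE that still has unconsumed messages: only its partitions
>>         // are handed out; partitions with a higher RE stay unassigned.
>>         return e.getValue();
>>       }
>>     }
>>     return Collections.emptyList();   // every RE has been fully consumed
>>   }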
>>
>> As Colin mentioned, if we do the above, we could potentially (1) use a
>> globally unique partition id, or (2) use a globally unique topic id to
>> distinguish recreated partitions due to topic deletion.
>>
>> So, perhaps we can sketch out the re-partitioning KIP a bit more and see
>> if
>> there is any overlap with KIP-232. Would you be interested in doing that?
>> If not, we can do that next week.
>>
>> Jun
>>
>>
>> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <li...@gmail.com> wrote:
>>
>> > Hey Jun,
>> >
>> > Interestingly I am also planning to sketch a KIP to allow partition
>> > expansion for keyed topics after this KIP. Since you are already doing
>> > that, I guess I will just share my high level idea here in case it is
>> > helpful.
>> >
>> > The motivation for the KIP is that we currently lose the ordering
>> > guarantee for messages with the same key if we expand partitions of a
>> > keyed topic.
>> >
>> > The solution can probably be built upon the following ideas:
>> >
>> > - Partition number of the keyed topic should always be doubled (or
>> > multiplied by a power of 2). Given that we select a partition based on
>> > hash(key) % partitionNum, this should help us ensure that a message
>> > assigned to an existing partition will not be mapped to another existing
>> > partition after partition expansion. (A quick sanity check of this
>> > property is sketched after the last item in this list.)
>> >
>> > - Producer includes in the ProduceRequest some information that helps
>> > ensure that messages produced to a partition will monotonically increase
>> > in the partitionNum of the topic. In other words, if a broker receives a
>> > ProduceRequest and notices that the producer does not know that the
>> > partition number has increased, the broker should reject this request.
>> > That "information" may be the leaderEpoch, the max partitionEpoch of the
>> > partitions of the topic, or simply the partitionNum of the topic. The
>> > benefit of this property is that we can keep the new logic for in-order
>> > message consumption entirely in how the consumer leader determines the
>> > partition -> consumer mapping.
>> >
>> > - When the consumer leader determines the partition -> consumer mapping,
>> > the leader first reads the start position for each partition using
>> > OffsetFetchRequest. If the start positions are all non-zero, then
>> > assignment can be done in its current manner. The assumption is that a
>> > message in the new partition should only be consumed after all messages
>> > with the same key produced before it have been consumed. Since some
>> > messages in the new partition have already been consumed, we should not
>> > worry about consuming messages out-of-order. The benefit of this approach
>> > is that we can avoid unnecessary overhead in the common case.
>> >
>> > - Suppose instead the consumer leader finds that the start position for
>> > some partition is 0. Say the current partition number is 18 and the
>> > partition index is 12. Then the consumer leader should ensure that
>> > messages produced to partition 12 - 18/2 = 3 before the first message of
>> > partition 12 have been consumed, before it assigns partition 12 to any
>> > consumer in the consumer group. Since we have a piece of information that
>> > is monotonically increasing per partition, the consumer leader can read
>> > the value of this information from the first message in partition 12, get
>> > the offset corresponding to this value in partition 3, assign all
>> > partitions except partition 12 (and probably other new partitions) to the
>> > existing consumers, wait for the committed offset to go beyond this
>> > offset for partition 3, and trigger rebalance again so that partition 12
>> > can be assigned to some consumer.
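>> >
>> > As a rough illustration (plain Java, hypothetical helper names, not
>> > actual Kafka code): first, a quick sanity check of the doubling property
>> > from the first item, i.e. with hash(key) % partitionNum a key can only
>> > stay on its old partition p or move to the brand-new partition p + N:
>> >
>> >   // Doubling from n to 2 * n partitions never remaps a key onto a
>> >   // different pre-existing partition.
>> >   static void checkDoublingProperty(int n) {
>> >     for (int hash = 0; hash < 1_000_000; hash++) {
>> >       int oldPartition = hash % n;
>> >       int newPartition = hash % (2 * n);
>> >       if (newPartition != oldPartition && newPartition != oldPartition + n)
>> >         throw new AssertionError("property violated for hash " + hash);
>> >     }
>> >   }
>> >
>> > And second, a sketch of the gating logic in the last item, using the
>> > 18-partition example where the new partition 12 derives from partition 3
>> > (all helper methods here are hypothetical):
>> >
>> >   void maybeGateNewPartition(int partitionNum, int newPartition) {
>> >     int parent = newPartition - partitionNum / 2;       // 12 - 18/2 = 3
>> >     if (committedStartOffset(newPartition) > 0) {
>> >       assignAsUsual();                  // common case, no gating needed
>> >       return;
>> >     }
>> >     // Read the monotonically increasing "information" from the first
>> >     // message of the new partition and map it to an offset in the parent.
>> >     long marker = markerOfFirstMessage(newPartition);
>> >     long parentBoundary = offsetForMarker(parent, marker);
>> >     assignAllExcept(newPartition);      // withhold partition 12 for now
>> >     waitUntilCommittedOffsetReaches(parent, parentBoundary);
>> >     triggerRebalance();                 // partition 12 can now be assigned
>> >   }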
>> >
>> >
>> > Thanks,
>> > Dong
>> >
>> >
>> > On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <ju...@confluent.io> wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > Thanks for the KIP. It looks good overall. We are working on a
>> separate
>> > KIP
>> > > for adding partitions while preserving the ordering guarantees. That
>> may
>> > > require another flavor of partition epoch. It's not very clear whether
>> > that
>> > > partition epoch can be merged with the partition epoch in this KIP.
>> So,
>> > > perhaps you can wait on this a bit until we post the other KIP in the
>> > next
>> > > few days.
>> > >
>> > > Jun
>> > >
>> > >
>> > >
>> > > On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <be...@gmail.com>
>> wrote:
>> > >
>> > > > +1 on the KIP.
>> > > >
>> > > > I think the KIP is mainly about adding the capability of tracking
>> the
>> > > > system state change lineage. It does not seem necessary to bundle
>> this
>> > > KIP
>> > > > with replacing the topic partition with partition epoch in
>> > produce/fetch.
>> > > > Replacing topic-partition string with partition epoch is
>> essentially a
>> > > > performance improvement on top of this KIP. That can probably be
>> done
>> > > > separately.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jiangjie (Becket) Qin
>> > > >
>> > > > On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <li...@gmail.com>
>> > wrote:
>> > > >
>> > > > > Hey Colin,
>> > > > >
>> > > > > On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe <
>> cmccabe@apache.org>
>> > > > wrote:
>> > > > >
>> > > > > > > On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin <
>> lindong28@gmail.com>
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > Hey Colin,
>> > > > > > > >
>> > > > > > > > I understand that the KIP will add overhead by introducing a
>> > > > > > > > per-partition partitionEpoch. I am open to alternative
>> > > > > > > > solutions that do not incur additional overhead. But I don't
>> > > > > > > > see a better way now.
>> > > > > > > >
>> > > > > > > > IMO the overhead in the FetchResponse may not be that much. We
>> > > > > > > > probably should discuss the percentage increase rather than the
>> > > > > > > > absolute number increase. Currently, after KIP-227, the
>> > > > > > > > per-partition header has 23 bytes. This KIP adds another 4
>> > > > > > > > bytes. Assuming the records size is 10KB, the percentage
>> > > > > > > > increase is 4 / (23 + 10000) = 0.04%. It seems negligible,
>> > > > > > > > right?
>> > > > > >
>> > > > > > Hi Dong,
>> > > > > >
>> > > > > > Thanks for the response.  I agree that the FetchRequest /
>> > > FetchResponse
>> > > > > > overhead should be OK, now that we have incremental fetch
>> requests
>> > > and
>> > > > > > responses.  However, there are a lot of cases where the
>> percentage
>> > > > > increase
>> > > > > > is much greater.  For example, if a client is doing full
>> > > > > MetadataRequests /
>> > > > > > Responses, we have some math kind of like this per partition:
>> > > > > >
>> > > > > > > UpdateMetadataRequestPartitionState => topic partition
>> > > > > > >     controller_epoch leader leader_epoch partition_epoch isr
>> > > > > > >     zk_version replicas offline_replicas
>> > > > > > > 14 bytes: topic => string (assuming about 10 byte topic names)
>> > > > > > > 4  bytes: partition => int32
>> > > > > > > 4  bytes: controller_epoch => int32
>> > > > > > > 4  bytes: leader => int32
>> > > > > > > 4  bytes: leader_epoch => int32
>> > > > > > > +4 EXTRA bytes: partition_epoch => int32        <-- NEW
>> > > > > > > 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR)
>> > > > > > > 4  bytes: zk_version => int32
>> > > > > > > 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas)
>> > > > > > > 2  bytes: offline_replicas => [int32] (assuming no offline replicas)
>> > > > > >
>> > > > > > Assuming I added that up correctly, the per-partition overhead
>> goes
>> > > > from
>> > > > > > 64 bytes per partition to 68, a 6.2% increase.
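>> > > > > >
>> > > > > > Restating that arithmetic in code form (sizes copied from the list
>> > > > > > above, illustration only):
>> > > > > >
>> > > > > >   int topic = 14, partition = 4, controllerEpoch = 4, leader = 4;
>> > > > > >   int leaderEpoch = 4, isr = 2 + 3 * 4, zkVersion = 4;
>> > > > > >   int replicas = 2 + 3 * 4, offlineReplicas = 2;
>> > > > > >   int partitionEpoch = 4;                        // the new field
>> > > > > >   int before = topic + partition + controllerEpoch + leader
>> > > > > >       + leaderEpoch + isr + zkVersion + replicas
>> > > > > >       + offlineReplicas;                         // = 64
>> > > > > >   int after = before + partitionEpoch;           // = 68
>> > > > > >   double pct = 100.0 * partitionEpoch / before;  // = 6.25, i.e. ~6.2%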
>> > > > > >
>> > > > > > We could do similar math for a lot of the other RPCs.  And you
>> will
>> > > > have
>> > > > > a
>> > > > > > similar memory and garbage collection impact on the brokers
>> since
>> > you
>> > > > > have
>> > > > > > to store all this extra state as well.
>> > > > > >
>> > > > >
>> > > > > That is correct. IMO the Metadata is only updated periodically, so
>> > > > > it is probably not a big deal if we increase it by 6%. The
>> > > > > FetchResponse and ProduceRequest are probably the only requests that
>> > > > > are bounded by the network bandwidth.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > > >
>> > > > > > > > I agree that we can probably save more space by using a
>> > > > > > > > partition ID so that we no longer need the string topic name.
>> > > > > > > > A similar idea has also been put in the Rejected Alternatives
>> > > > > > > > section in KIP-227. While this idea is promising, it seems
>> > > > > > > > orthogonal to the goal of this KIP. Given that there is
>> > > > > > > > already a lot of work to do in this KIP, maybe we can do the
>> > > > > > > > partition ID in a separate KIP?
>> > > > > >
>> > > > > > I guess my thinking is that the goal here is to replace an
>> > identifier
>> > > > > > which can be re-used (the tuple of topic name, partition ID)
>> with
>> > an
>> > > > > > identifier that cannot be re-used (the tuple of topic name,
>> > partition
>> > > > ID,
>> > > > > > partition epoch) in order to gain better semantics.  As long as
>> we
>> > > are
>> > > > > > replacing the identifier, why not replace it with an identifier
>> > that
>> > > > has
>> > > > > > important performance advantages?  The KIP freeze for the next
>> > > release
>> > > > > has
>> > > > > > already passed, so there is time to do this.
>> > > > > >
>> > > > >
>> > > > > In general it can be easier for discussion and implementation if we
>> > > > > can split a larger task into smaller and independent tasks. For
>> > > > > example, KIP-112 and KIP-113 both deal with JBOD support. KIP-31,
>> > > > > KIP-32 and KIP-33 are about timestamp support. The opinion on this
>> > > > > can be subjective though.
>> > > > >
>> > > > > IMO the change to switch from (topic, partition ID) to partitionEpoch
>> > > > > in all requests/responses requires us to go through all requests one
>> > > > > by one. It may not be hard but it can be time-consuming and tedious.
>> > > > > At a high level, the goal and the change for that will be orthogonal
>> > > > > to the changes required in this KIP. That is the main reason I think
>> > > > > we can split them into two KIPs.
>> > > > >
>> > > > >
>> > > > > > On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote:
>> > > > > > > I think it is possible to move to using partitionEpoch entirely
>> > > instead
>> > > > > of
>> > > > > > > (topic, partition) to identify a partition. Client can obtain
>> the
>> > > > > > > partitionEpoch -> (topic, partition) mapping from
>> > MetadataResponse.
>> > > > We
>> > > > > > > probably need to figure out a way to assign partitionEpoch to
>> > > > existing
>> > > > > > > partitions in the cluster. But this should be doable.
>> > > > > > >
>> > > > > > > This is a good idea. I think it will save us some space in the
>> > > > > > > request/response. The actual space saving in percentage
>> probably
>> > > > > depends
>> > > > > > on
>> > > > > > > the amount of data and the number of partitions of the same
>> > topic.
>> > > I
>> > > > > just
>> > > > > > > think we can do it in a separate KIP.
>> > > > > >
>> > > > > > Hmm.  How much extra work would be required?  It seems like we
>> are
>> > > > > already
>> > > > > > changing almost every RPC that involves topics and partitions,
>> > > already
>> > > > > > adding new per-partition state to ZooKeeper, already changing
>> how
>> > > > clients
>> > > > > > interact with partitions.  Is there some other big piece of work
>> > we'd
>> > > > > have
>> > > > > > to do to move to partition IDs that we wouldn't need for
>> partition
>> > > > > epochs?
>> > > > > > I guess we'd have to find a way to support regular
>> expression-based
>> > > > topic
>> > > > > > subscriptions.  If we split this into multiple KIPs, wouldn't we
>> > end
>> > > up
>> > > > > > changing all those RPCs and ZK state a second time?  Also, I'm
>> > curious
>> > > > if
>> > > > > > anyone has done any proof of concept GC, memory, and network
>> usage
>> > > > > > measurements on switching topic names for topic IDs.
>> > > > > >
>> > > > >
>> > > > >
>> > > > > We will need to go over all requests/responses to check how to
>> > > > > replace (topic, partition ID) with partition epoch. It requires
>> > > > > non-trivial work and could take time. As you mentioned, we may want
>> > > > > to see how much saving we can get by switching from topic names to
>> > > > > partition epochs. That itself requires time and experimentation. It
>> > > > > seems that the new idea does not roll back any change proposed in
>> > > > > this KIP. So I am not sure we can get much by putting them into the
>> > > > > same KIP.
>> > > > >
>> > > > > Anyway, if more people are interested in seeing the new idea in
>> the
>> > > same
>> > > > > KIP, I can try that.
>> > > > >
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > best,
>> > > > > > Colin
>> > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe <
>> > > cmccabe@apache.org
>> > > > >
>> > > > > > wrote:
>> > > > > > > >
>> > > > > > > >> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
>> > > > > > > >> > Hey Colin,
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <
>> > > > > cmccabe@apache.org>
>> > > > > > > >> wrote:
>> > > > > > > >> >
>> > > > > > > >> > > On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
>> > > > > > > >> > > > Hey Colin,
>> > > > > > > >> > > >
>> > > > > > > >> > > > Thanks for the comment.
>> > > > > > > >> > > >
>> > > > > > > >> > > > On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <
>> > > > > > cmccabe@apache.org>
>> > > > > > > >> > > wrote:
>> > > > > > > >> > > >
>> > > > > > > >> > > > > On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
>> > > > > > > >> > > > > > Hey Colin,
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > Thanks for reviewing the KIP.
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > If I understand you right, you may be suggesting
>> > > > > > > >> > > > > > that we can use a global metadataEpoch that is
>> > > > > > > >> > > > > > incremented every time the controller updates
>> > > > > > > >> > > > > > metadata. The problem with this solution is that,
>> > > > > > > >> > > > > > if a topic is deleted and created again, the user
>> > > > > > > >> > > > > > will not know that the offset which was stored
>> > > > > > > >> > > > > > before the topic deletion is no longer valid. This
>> > > > > > > >> > > > > > motivates the idea to include a per-partition
>> > > > > > > >> > > > > > partitionEpoch. Does this sound reasonable?
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > Hi Dong,
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > Perhaps we can store the last valid offset of each
>> > > deleted
>> > > > > > topic
>> > > > > > > >> in
>> > > > > > > >> > > > > ZooKeeper.  Then, when a topic with one of those
>> names
>> > > > gets
>> > > > > > > >> > > re-created, we
>> > > > > > > >> > > > > can start the topic at the previous end offset
>> rather
>> > > than
>> > > > > at
>> > > > > > 0.
>> > > > > > > >> This
>> > > > > > > >> > > > > preserves immutability.  It is no more burdensome
>> than
>> > > > > having
>> > > > > > to
>> > > > > > > >> > > preserve a
>> > > > > > > >> > > > > "last epoch" for the deleted partition somewhere,
>> > right?
>> > > > > > > >> > > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > > My concern with this solution is that the number of
>> > > > > > > >> > > > zookeeper nodes keeps growing over time if some users
>> > > > > > > >> > > > keep deleting and creating topics. Do you think this
>> > > > > > > >> > > > can be a problem?
>> > > > > > > >> > >
>> > > > > > > >> > > Hi Dong,
>> > > > > > > >> > >
>> > > > > > > >> > > We could expire the "partition tombstones" after an
>> hour
>> > or
>> > > > so.
>> > > > > > In
>> > > > > > > >> > > practice this would solve the issue for clients that
>> like
>> > to
>> > > > > > destroy
>> > > > > > > >> and
>> > > > > > > >> > > re-create topics all the time.  In any case, doesn't
>> the
>> > > > current
>> > > > > > > >> proposal
>> > > > > > > >> > > add per-partition znodes as well that we have to track
>> > even
>> > > > > after
>> > > > > > the
>> > > > > > > >> > > partition is deleted?  Or did I misunderstand that?
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >> > Actually the current KIP does not add per-partition
>> znodes.
>> > > > Could
>> > > > > > you
>> > > > > > > >> > double check? I can fix the KIP wiki if there is anything
>> > > > > > misleading.
>> > > > > > > >>
>> > > > > > > >> Hi Dong,
>> > > > > > > >>
>> > > > > > > >> I double-checked the KIP, and I can see that you are in
>> fact
>> > > > using a
>> > > > > > > >> global counter for initializing partition epochs.  So, you
>> are
>> > > > > > correct, it
>> > > > > > > >> doesn't add per-partition znodes for partitions that no
>> longer
>> > > > > exist.
>> > > > > > > >>
>> > > > > > > >> >
>> > > > > > > >> > If we expire the "partition tombstones" after an hour, and
>> > the
>> > > > > topic
>> > > > > > is
>> > > > > > > >> > re-created after more than an hour since the topic
>> deletion,
>> > > > then
>> > > > > > we are
>> > > > > > > >> > back to the situation where user can not tell whether the
>> > > topic
>> > > > > has
>> > > > > > been
>> > > > > > > >> > re-created or not, right?
>> > > > > > > >>
>> > > > > > > >> Yes, with an expiration period, it would not ensure
>> > > immutability--
>> > > > > you
>> > > > > > > >> could effectively reuse partition names and they would look
>> > the
>> > > > > same.
>> > > > > > > >>
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > >
>> > > > > > > >> > > It's not really clear to me what should happen when a
>> > topic
>> > > is
>> > > > > > > >> destroyed
>> > > > > > > >> > > and re-created with new data.  Should consumers
>> continue
>> > to
>> > > be
>> > > > > > able to
>> > > > > > > >> > > consume?  We don't know where they stopped consuming
>> from
>> > > the
>> > > > > > previous
>> > > > > > > >> > > incarnation of the topic, so messages may have been
>> lost.
>> > > > > > Certainly
>> > > > > > > >> > > consuming data from offset X of the new incarnation of
>> the
>> > > > topic
>> > > > > > may
>> > > > > > > >> give
>> > > > > > > >> > > something totally different from what you would have
>> > gotten
>> > > > from
>> > > > > > > >> offset X
>> > > > > > > >> > > of the previous incarnation of the topic.
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >> > With the current KIP, if a consumer consumes a topic
>> based
>> > on
>> > > > the
>> > > > > > last
>> > > > > > > >> > remembered (offset, partitionEpoch, leaderEpoch), and if
>> the
>> > > > topic
>> > > > > > > >> > re-created, the consumer will throw
>> > > > > > > >> > re-created, consume will throw
>> > InvalidPartitionEpochException
>> > > > > > because
>> > > > > > > >> the
>> > > > > > > >> > previous partitionEpoch will be different from the
>> current
>> > > > > > > >> partitionEpoch.
>> > > > > > > >> > This is described in the Proposed Changes -> Consumption
>> > after
>> > > > > topic
>> > > > > > > >> > deletion in the KIP. I can improve the KIP if there is
>> > > anything
>> > > > > not
>> > > > > > > >> clear.
>> > > > > > > >>
>> > > > > > > >> Thanks for the clarification.  It sounds like what you
>> really
>> > > want
>> > > > > is
>> > > > > > > >> immutability-- i.e., to never "really" reuse partition
>> > > > identifiers.
>> > > > > > And
>> > > > > > > >> you do this by making the partition name no longer the
>> "real"
>> > > > > > identifier.
>> > > > > > > >>
>> > > > > > > >> My big concern about this KIP is that it seems like an
>> > > > > > anti-scalability
>> > > > > > > >> feature.  Now we are adding 4 extra bytes for every
>> partition
>> > in
>> > > > the
>> > > > > > > >> FetchResponse and Request, for example.  That could be 40
>> kb
>> > per
>> > > > > > request,
>> > > > > > > >> if the user has 10,000 partitions.  And of course, the KIP
>> > also
>> > > > > makes
>> > > > > > > >> massive changes to UpdateMetadataRequest, MetadataResponse,
>> > > > > > > >> OffsetCommitRequest, OffsetFetchResponse,
>> LeaderAndIsrRequest,
>> > > > > > > >> ListOffsetResponse, etc. which will also increase their
>> size
>> > on
>> > > > the
>> > > > > > wire
>> > > > > > > >> and in memory.
>> > > > > > > >>
>> > > > > > > >> One thing that we talked a lot about in the past is
>> replacing
>> > > > > > partition
>> > > > > > > >> names with IDs.  IDs have a lot of really nice features.
>> They
>> > > > take
>> > > > > > up much
>> > > > > > > >> less space in memory than strings (especially 2-byte Java
>> > > > strings).
>> > > > > > They
>> > > > > > > >> can often be allocated on the stack rather than the heap
>> > > > (important
>> > > > > > when
>> > > > > > > >> you are dealing with hundreds of thousands of them).  They
>> can
>> > > be
>> > > > > > > >> efficiently deserialized and serialized.  If we use 64-bit
>> > ones,
>> > > > we
>> > > > > > will
>> > > > > > > >> never run out of IDs, which means that they can always be
>> > unique
>> > > > per
>> > > > > > > >> partition.
>> > > > > > > >>
>> > > > > > > >> Given that the partition name is no longer the "real"
>> > identifier
>> > > > for
>> > > > > > > >> partitions in the current KIP-232 proposal, why not just
>> move
>> > to
>> > > > > using
>> > > > > > > >> partition IDs entirely instead of strings?  You have to
>> change
>> > > all
>> > > > > the
>> > > > > > > >> messages anyway.  There isn't much point any more to
>> carrying
>> > > > around
>> > > > > > the
>> > > > > > > >> partition name in every RPC, since you really need (name,
>> > epoch)
>> > > > to
>> > > > > > > >> identify the partition.
>> > > > > > > >> Probably the metadata response and a few other messages
>> would
>> > > have
>> > > > > to
>> > > > > > > >> still carry the partition name, to allow clients to go from
>> > name
>> > > > to
>> > > > > > id.
>> > > > > > > >> But we could mostly forget about the strings.  And then
>> this
>> > > would
>> > > > > be
>> > > > > > a
>> > > > > > > >> scalability improvement rather than a scalability problem.
>> > > > > > > >>
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > > By choosing to reuse the same (topic, partition,
>> offset)
>> > > > > 3-tuple,
>> > > > > > we
>> > > > > > > >> have
>> > > > > > > >> >
>> > > > > > > >> > chosen to give up immutability.  That was a really bad
>> > > decision.
>> > > > > > And
>> > > > > > > >> now
>> > > > > > > >> > > we have to worry about time dependencies, stale cached
>> > data,
>> > > > and
>> > > > > > all
>> > > > > > > >> the
>> > > > > > > >> > > rest.  We can't completely fix this inside Kafka no
>> matter
>> > > > what
>> > > > > > we do,
>> > > > > > > >> > > because not all that cached data is inside Kafka
>> itself.
>> > > Some
>> > > > > of
>> > > > > > it
>> > > > > > > >> may be
>> > > > > > > >> > > in systems that Kafka has sent data to, such as other
>> > > daemons,
>> > > > > SQL
>> > > > > > > >> > > databases, streams, and so forth.
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >> > The current KIP will uniquely identify a message using
>> > (topic,
>> > > > > > > >> partition,
>> > > > > > > >> > offset, partitionEpoch) 4-tuple. This addresses the
>> message
>> > > > > > immutability
>> > > > > > > >> > issue that you mentioned. Is there any corner case where
>> the
>> > > > > message
>> > > > > > > >> > immutability is still not preserved with the current KIP?
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > >
>> > > > > > > >> > > I guess the idea here is that mirror maker should work
>> as
>> > > > > expected
>> > > > > > > >> when
>> > > > > > > >> > > users destroy a topic and re-create it with the same
>> name.
>> > > > > That's
>> > > > > > > >> kind of
>> > > > > > > >> > > tough, though, since in that scenario, mirror maker
>> > probably
>> > > > > > should
>> > > > > > > >> destroy
>> > > > > > > >> > > and re-create the topic on the other end, too, right?
>> > > > > Otherwise,
>> > > > > > > >> what you
>> > > > > > > >> > > end up with on the other end could be half of one
>> > > incarnation
>> > > > of
>> > > > > > the
>> > > > > > > >> topic,
>> > > > > > > >> > > and half of another.
>> > > > > > > >> > >
>> > > > > > > >> > > What mirror maker really needs is to be able to follow
>> a
>> > > > stream
>> > > > > of
>> > > > > > > >> events
>> > > > > > > >> > > about the kafka cluster itself.  We could have some
>> master
>> > > > topic
>> > > > > > > >> which is
>> > > > > > > >> > > always present and which contains data about all topic
>> > > > > deletions,
>> > > > > > > >> > > creations, etc.  Then MM can simply follow this topic
>> and
>> > do
>> > > > > what
>> > > > > > is
>> > > > > > > >> needed.
>> > > > > > > >> > >
>> > > > > > > >> > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > Then the next question may be, should we use a
>> > > > > > > >> > > > > > global metadataEpoch + per-partition partitionEpoch,
>> > > > > > > >> > > > > > instead of using per-partition partitionEpoch +
>> > > > > > > >> > > > > > per-partition leaderEpoch. The former solution
>> using
>> > > > > > > >> metadataEpoch
>> > > > > > > >> > > would
>> > > > > > > >> > > > > > not work due to the following scenario (provided
>> by
>> > > > Jun):
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > "Consider the following scenario. In metadata v1,
>> > the
>> > > > > leader
>> > > > > > > >> for a
>> > > > > > > >> > > > > > partition is at broker 1. In metadata v2, leader
>> is
>> > at
>> > > > > > broker
>> > > > > > > >> 2. In
>> > > > > > > >> > > > > > metadata v3, leader is at broker 1 again. The
>> last
>> > > > > committed
>> > > > > > > >> offset
>> > > > > > > >> > > in
>> > > > > > > >> > > > > v1,
>> > > > > > > >> > > > > > v2 and v3 are 10, 20 and 30, respectively. A
>> > consumer
>> > > is
>> > > > > > > >> started and
>> > > > > > > >> > > > > reads
>> > > > > > > >> > > > > > metadata v1 and reads messages from offset 0 to
>> 25
>> > > from
>> > > > > > broker
>> > > > > > > >> 1. My
>> > > > > > > >> > > > > > understanding is that in the current proposal,
>> the
>> > > > > metadata
>> > > > > > > >> version
>> > > > > > > >> > > > > > associated with offset 25 is v1. The consumer is
>> > then
>> > > > > > restarted
>> > > > > > > >> and
>> > > > > > > >> > > > > fetches
>> > > > > > > >> > > > > > metadata v2. The consumer tries to read from
>> broker
>> > 2,
>> > > > > > which is
>> > > > > > > >> the
>> > > > > > > >> > > old
>> > > > > > > >> > > > > > leader with the last offset at 20. In this case,
>> the
>> > > > > > consumer
>> > > > > > > >> will
>> > > > > > > >> > > still
>> > > > > > > >> > > > > > get OffsetOutOfRangeException incorrectly."
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > Regarding your comment "For the second purpose,
>> this
>> > > is
>> > > > > > "soft
>> > > > > > > >> state"
>> > > > > > > >> > > > > > anyway.  If the client thinks X is the leader
>> but Y
>> > is
>> > > > > > really
>> > > > > > > >> the
>> > > > > > > >> > > leader,
>> > > > > > > >> > > > > > the client will talk to X, and X will point out
>> its
>> > > > > mistake
>> > > > > > by
>> > > > > > > >> > > sending
>> > > > > > > >> > > > > back
>> > > > > > > >> > > > > > a NOT_LEADER_FOR_PARTITION.", it is probably not
>> > true.
>> > > > The
>> > > > > > > >> problem
>> > > > > > > >> > > here is
>> > > > > > > >> > > > > > that the old leader X may still think it is the
>> > leader
>> > > > of
>> > > > > > the
>> > > > > > > >> > > partition
>> > > > > > > >> > > > > and
>> > > > > > > >> > > > > > thus it will not send back
>> NOT_LEADER_FOR_PARTITION.
>> > > The
>> > > > > > reason
>> > > > > > > >> is
>> > > > > > > >> > > > > provided
>> > > > > > > >> > > > > > in KAFKA-6262. Can you check if that makes sense?
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > This is solvable with a timeout, right?  If the
>> leader
>> > > > can't
>> > > > > > > >> > > communicate
>> > > > > > > >> > > > > with the controller for a certain period of time,
>> it
>> > > > should
>> > > > > > stop
>> > > > > > > >> > > acting as
>> > > > > > > >> > > > > the leader.  We have to solve this problem,
>> anyway, in
>> > > > order
>> > > > > > to
>> > > > > > > >> fix
>> > > > > > > >> > > all the
>> > > > > > > >> > > > > corner cases.
>> > > > > > > >> > > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > > Not sure if I fully understand your proposal. The
>> > proposal
>> > > > > > seems to
>> > > > > > > >> > > require
>> > > > > > > >> > > > non-trivial changes to our existing leadership
>> election
>> > > > > > mechanism.
>> > > > > > > >> Could
>> > > > > > > >> > > > you provide more detail regarding how it works? For
>> > > example,
>> > > > > how
>> > > > > > > >> should
>> > > > > > > >> > > > user choose this timeout, how leader determines
>> whether
>> > it
>> > > > can
>> > > > > > still
>> > > > > > > >> > > > communicate with controller, and how this triggers
>> > > > controller
>> > > > > to
>> > > > > > > >> elect
>> > > > > > > >> > > new
>> > > > > > > >> > > > leader?
>> > > > > > > >> > >
>> > > > > > > >> > > Before I come up with any proposal, let me make sure I
>> > > > > understand
>> > > > > > the
>> > > > > > > >> > > problem correctly.  My big question was, what prevents
>> > > > > split-brain
>> > > > > > > >> here?
>> > > > > > > >> > >
>> > > > > > > >> > > Let's say I have a partition which is on nodes A, B,
>> and
>> > C,
>> > > > with
>> > > > > > > >> min-ISR
>> > > > > > > >> > > 2.  The controller is D.  At some point, there is a
>> > network
>> > > > > > partition
>> > > > > > > >> > > between A and B and the rest of the cluster.  The
>> > Controller
>> > > > > > > >> re-assigns the
>> > > > > > > >> > > partition to nodes C, D, and E.  But A and B keep
>> chugging
>> > > > away,
>> > > > > > even
>> > > > > > > >> > > though they can no longer communicate with the
>> controller.
>> > > > > > > >> > >
>> > > > > > > >> > > At some point, a client with stale metadata writes to
>> the
>> > > > > > partition.
>> > > > > > > >> It
>> > > > > > > >> > > still thinks the partition is on node A, B, and C, so
>> > that's
>> > > > > > where it
>> > > > > > > >> sends
>> > > > > > > >> > > the data.  It's unable to talk to C, but A and B reply
>> > back
>> > > > that
>> > > > > > all
>> > > > > > > >> is
>> > > > > > > >> > > well.
>> > > > > > > >> > >
>> > > > > > > >> > > Is this not a case where we could lose data due to
>> split
>> > > > brain?
>> > > > > > Or is
>> > > > > > > >> > > there a mechanism for preventing this that I missed?
>> If
>> > it
>> > > > is,
>> > > > > it
>> > > > > > > >> seems
>> > > > > > > >> > > like a pretty serious failure case that we should be
>> > > handling
>> > > > > > with our
>> > > > > > > >> > > metadata rework.  And I think epoch numbers and
>> timeouts
>> > > might
>> > > > > be
>> > > > > > > >> part of
>> > > > > > > >> > > the solution.
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >> > Right, split brain can happen if RF=4 and minIsr=2.
>> > > > > > > >> > However, I am not sure it is a pretty serious issue which
>> > > > > > > >> > we need to address today. This can be prevented by
>> > > > > > > >> > configuring the Kafka topic so that minIsr > RF/2.
>> > > > > > > >> > Actually, if a user sets minIsr=2, is there any reason that
>> > > > > > > >> > the user wants to set RF=4 instead of 3?
>> > > > > > > >> >
>> > > > > > > >> > Introducing timeout in leader election mechanism is
>> > > > non-trivial. I
>> > > > > > > >> think we
>> > > > > > > >> > probably want to do that only if there is good use-case
>> that
>> > > can
>> > > > > not
>> > > > > > > >> > otherwise be addressed with the current mechanism.
>> > > > > > > >>
>> > > > > > > >> I still would like to think about these corner cases more.
>> > But
>> > > > > > perhaps
>> > > > > > > >> it's not directly related to this KIP.
>> > > > > > > >>
>> > > > > > > >> regards,
>> > > > > > > >> Colin
>> > > > > > > >>
>> > > > > > > >>
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > > best,
>> > > > > > > >> > > Colin
>> > > > > > > >> > >
>> > > > > > > >> > >
>> > > > > > > >> > > >
>> > > > > > > >> > > >
>> > > > > > > >> > > > > best,
>> > > > > > > >> > > > > Colin
>> > > > > > > >> > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > Regards,
>> > > > > > > >> > > > > > Dong
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > On Wed, Jan 24, 2018 at 10:39 AM, Colin McCabe <
>> > > > > > > >> cmccabe@apache.org>
>> > > > > > > >> > > > > wrote:
>> > > > > > > >> > > > > >
>> > > > > > > >> > > > > > > Hi Dong,
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Thanks for proposing this KIP.  I think a
>> metadata
>> > > > epoch
>> > > > > > is a
>> > > > > > > >> > > really
>> > > > > > > >> > > > > good
>> > > > > > > >> > > > > > > idea.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > I read through the DISCUSS thread, but I still
>> > don't
>> > > > > have
>> > > > > > a
>> > > > > > > >> clear
>> > > > > > > >> > > > > picture
>> > > > > > > >> > > > > > > of why the proposal uses a metadata epoch per
>> > > > partition
>> > > > > > rather
>> > > > > > > >> > > than a
>> > > > > > > >> > > > > > > global metadata epoch.  A metadata epoch per
>> > > partition
>> > > > > is
>> > > > > > > >> kind of
>> > > > > > > >> > > > > > > unpleasant-- it's at least 4 extra bytes per
>> > > partition
>> > > > > > that we
>> > > > > > > >> > > have to
>> > > > > > > >> > > > > send
>> > > > > > > >> > > > > > > over the wire in every full metadata request,
>> > which
>> > > > > could
>> > > > > > > >> become
>> > > > > > > >> > > extra
>> > > > > > > >> > > > > > > kilobytes on the wire when the number of
>> > partitions
>> > > > > > becomes
>> > > > > > > >> large.
>> > > > > > > >> > > > > Plus,
>> > > > > > > >> > > > > > > we have to update all the auxiliary classes to
>> > > include
>> > > > > an
>> > > > > > > >> epoch.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > We need to have a global metadata epoch anyway
>> to
>> > > > handle
>> > > > > > > >> partition
>> > > > > > > >> > > > > > > addition and deletion.  For example, if I give
>> you
>> > > > > > > >> > > > > > > MetadataResponse{part1,epoch 1, part2, epoch 1}
>> > and
>> > > > > > {part1,
>> > > > > > > >> > > epoch1},
>> > > > > > > >> > > > > which
>> > > > > > > >> > > > > > > MetadataResponse is newer?  You have no way of
>> > > > knowing.
>> > > > > > It
>> > > > > > > >> could
>> > > > > > > >> > > be
>> > > > > > > >> > > > > that
>> > > > > > > >> > > > > > > part2 has just been created, and the response
>> > with 2
>> > > > > > > >> partitions is
>> > > > > > > >> > > > > newer.
>> > > > > > > >> > > > > > > Or it could be that part2 has just been
>> deleted,
>> > and
>> > > > > > > >> therefore the
>> > > > > > > >> > > > > response
>> > > > > > > >> > > > > > > with 1 partition is newer.  You must have a
>> global
>> > > > epoch
>> > > > > > to
>> > > > > > > >> > > > > disambiguate
>> > > > > > > >> > > > > > > these two cases.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > Previously, I worked on the Ceph distributed
>> > > > filesystem.
>> > > > > > > >> Ceph had
>> > > > > > > >> > > the
>> > > > > > > >> > > > > > > concept of a map of the whole cluster,
>> maintained
>> > > by a
>> > > > > few
>> > > > > > > >> servers
>> > > > > > > >> > > > > doing
>> > > > > > > >> > > > > > > paxos.  This map was versioned by a single
>> 64-bit
>> > > > epoch
>> > > > > > number
>> > > > > > > >> > > which
>> > > > > > > >> > > > > > > increased on every change.  It was propagated
>> to
>> > > > clients
>> > > > > > > >> through
>> > > > > > > >> > > > > gossip.  I
>> > > > > > > >> > > > > > > wonder if something similar could work here?
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > It seems like the Kafka MetadataResponse
>> > serves
>> > > > two
>> > > > > > > >> somewhat
>> > > > > > > >> > > > > unrelated
>> > > > > > > >> > > > > > > purposes.  Firstly, it lets clients know what
>> > > > partitions
>> > > > > > > >> exist in
>> > > > > > > >> > > the
>> > > > > > > >> > > > > > > system and where they live.  Secondly, it lets
>> > > clients
>> > > > > > know
>> > > > > > > >> which
>> > > > > > > >> > > nodes
>> > > > > > > >> > > > > > > within the partition are in-sync (in the ISR)
>> and
>> > > > which
>> > > > > > node
>> > > > > > > >> is the
>> > > > > > > >> > > > > leader.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > The first purpose is what you really need a
>> > metadata
>> > > > > epoch
>> > > > > > > >> for, I
>> > > > > > > >> > > > > think.
>> > > > > > > >> > > > > > > You want to know whether a partition exists or
>> > not,
>> > > or
>> > > > > you
>> > > > > > > >> want to
>> > > > > > > >> > > know
>> > > > > > > >> > > > > > > which nodes you should talk to in order to
>> write
>> > to
>> > > a
>> > > > > > given
>> > > > > > > >> > > > > partition.  A
>> > > > > > > >> > > > > > > single metadata epoch for the whole response
>> > should
>> > > be
>> > > > > > > >> adequate
>> > > > > > > >> > > here.
>> > > > > > > >> > > > > We
>> > > > > > > >> > > > > > > should not change the partition assignment
>> without
>> > > > going
>> > > > > > > >> through
>> > > > > > > >> > > > > zookeeper
>> > > > > > > >> > > > > > > (or a similar system), and this inherently
>> > > serializes
>> > > > > > updates
>> > > > > > > >> into
>> > > > > > > >> > > a
>> > > > > > > >> > > > > > > numbered stream.  Brokers should also stop
>> > > responding
>> > > > to
>> > > > > > > >> requests
>> > > > > > > >> > > when
>> > > > > > > >> > > > > they
>> > > > > > > >> > > > > > > are unable to contact ZK for a certain time
>> > period.
>> > > > > This
>> > > > > > > >> prevents
>> > > > > > > >> > > the
>> > > > > > > >> > > > > case
>> > > > > > > >> > > > > > > where a given partition has been moved off some
>> > set
>> > > of
>> > > > > > nodes,
>> > > > > > > >> but a
>> > > > > > > >> > > > > client
>> > > > > > > >> > > > > > > still ends up talking to those nodes and
>> writing
>> > > data
>> > > > > > there.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > For the second purpose, this is "soft state"
>> > anyway.
>> > > > If
>> > > > > > the
>> > > > > > > >> client
>> > > > > > > >> > > > > thinks
>> > > > > > > >> > > > > > > X is the leader but Y is really the leader, the
>> > > client
>> > > > > > will
>> > > > > > > >> talk
>> > > > > > > >> > > to X,
>> > > > > > > >> > > > > and
>> > > > > > > >> > > > > > > X will point out its mistake by sending back a
>> > > > > > > >> > > > > NOT_LEADER_FOR_PARTITION.
>> > > > > > > >> > > > > > > Then the client can update its metadata again
>> and
>> > > find
>> > > > > > the new
>> > > > > > > >> > > leader,
>> > > > > > > >> > > > > if
>> > > > > > > >> > > > > > > there is one.  There is no need for an epoch to
>> > > handle
>> > > > > > this.
>> > > > > > > >> > > > > Similarly, I
>> > > > > > > >> > > > > > > can't think of a reason why changing the
>> in-sync
>> > > > replica
>> > > > > > set
>> > > > > > > >> needs
>> > > > > > > >> > > to
>> > > > > > > >> > > > > bump
>> > > > > > > >> > > > > > > the epoch.
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > best,
>> > > > > > > >> > > > > > > Colin
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > > > > On Wed, Jan 24, 2018, at 09:45, Dong Lin wrote:
>> > > > > > > >> > > > > > > > Thanks much for reviewing the KIP!
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > Dong
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > On Wed, Jan 24, 2018 at 7:10 AM, Guozhang
>> Wang <
>> > > > > > > >> > > wangguoz@gmail.com>
>> > > > > > > >> > > > > > > wrote:
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > > Yeah that makes sense, again I'm just
>> making
>> > > sure
>> > > > we
>> > > > > > > >> understand
>> > > > > > > >> > > > > all the
>> > > > > > > >> > > > > > > > > scenarios and what to expect.
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > I agree that if, more generally speaking,
>> say
>> > > > users
>> > > > > > have
>> > > > > > > >> only
>> > > > > > > >> > > > > consumed
>> > > > > > > >> > > > > > > to
>> > > > > > > >> > > > > > > > > offset 8, and then call seek(16) to "jump"
>> to
>> > a
>> > > > > > further
>> > > > > > > >> > > position,
>> > > > > > > >> > > > > then
>> > > > > > > >> > > > > > > she
>> > > > > > > >> > > > > > > > > needs to be aware that OORE may be thrown
>> and
>> > she
>> > > > > > needs to
>> > > > > > > >> > > handle
>> > > > > > > >> > > > > it or
>> > > > > > > >> > > > > > > rely
>> > > > > > > >> > > > > > > > > on reset policy which should not surprise
>> her.
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > I'm +1 on the KIP.
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > Guozhang
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > On Wed, Jan 24, 2018 at 12:31 AM, Dong Lin
>> <
>> > > > > > > >> > > lindong28@gmail.com>
>> > > > > > > >> > > > > > > wrote:
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > > Yes, in general we can not prevent
>> > > > > > > >> OffsetOutOfRangeException
>> > > > > > > >> > > if
>> > > > > > > >> > > > > user
>> > > > > > > >> > > > > > > > > seeks
>> > > > > > > >> > > > > > > > > > to a wrong offset. The main goal is to
>> > prevent
>> > > > > > > >> > > > > > > OffsetOutOfRangeException
>> > > > > > > >> > > > > > > > > if
>> > > > > > > >> > > > > > > > > > user has done things in the right way,
>> e.g.
>> > > user
>> > > > > > should
>> > > > > > > >> know
>> > > > > > > >> > > that
>> > > > > > > >> > > > > > > there
>> > > > > > > >> > > > > > > > > is
>> > > > > > > >> > > > > > > > > > a message with this offset.
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > > For example, if user calls seek(..) right
>> > > after
>> > > > > > > >> > > construction, the
>> > > > > > > >> > > > > > > only
>> > > > > > > >> > > > > > > > > > reason I can think of is that user stores
>> > > offset
>> > > > > > > >> externally.
>> > > > > > > >> > > In
>> > > > > > > >> > > > > this
>> > > > > > > >> > > > > > > > > case,
>> > > > > > > >> > > > > > > > > > user currently needs to use the offset
>> which
>> > > is
>> > > > > > obtained
>> > > > > > > >> > > using
>> > > > > > > >> > > > > > > > > position(..)
>> > > > > > > >> > > > > > > > > > from the last run. With this KIP, user
>> needs
>> > > to
>> > > > > get
>> > > > > > the
>> > > > > > > >> > > offset
>> > > > > > > >> > > > > and
>> > > > > > > >> > > > > > > the
>> > > > > > > >> > > > > > > > > > offsetEpoch using
>> > positionAndOffsetEpoch(...)
>> > > > and
>> > > > > > stores
>> > > > > > > >> > > these
>> > > > > > > >> > > > > > > > > information
>> > > > > > > >> > > > > > > > > > externally. The next time user starts
>> > > consumer,
>> > > > > > he/she
>> > > > > > > >> needs
>> > > > > > > >> > > to
>> > > > > > > >> > > > > call
>> > > > > > > >> > > > > > > > > > seek(..., offset, offsetEpoch) right
>> after
>> > > > > > construction.
>> > > > > > > >> > > Then KIP
>> > > > > > > >> > > > > > > should
>> > > > > > > >> > > > > > > > > be
>> > > > > > > >> > > > > > > > > > able to ensure that we don't throw
>> > > > > > > >> OffsetOutOfRangeException
>> > > > > > > >> > > if
>> > > > > > > >> > > > > > > there is
>> > > > > > > >> > > > > > > > > no
>> > > > > > > >> > > > > > > > > > unclean leader election.
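>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > > Roughly, the flow with the methods proposed
>> > > > > > > >> > > > > > > > > > in this KIP would be (the return type and
>> > > > > > > >> > > > > > > > > > the external store below are hypothetical):
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > >   // after consuming, remember both values externally
>> > > > > > > >> > > > > > > > > >   OffsetAndEpoch p = consumer.positionAndOffsetEpoch(tp);
>> > > > > > > >> > > > > > > > > >   store.save(tp, p.offset(), p.offsetEpoch());
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > >   // on restart, right after constructing the consumer
>> > > > > > > >> > > > > > > > > >   consumer.seek(tp, store.offset(tp), store.offsetEpoch(tp));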
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > > Does this sound OK?
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > > Regards,
>> > > > > > > >> > > > > > > > > > Dong
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > > On Tue, Jan 23, 2018 at 11:44 PM,
>> Guozhang
>> > > Wang
>> > > > <
>> > > > > > > >> > > > > wangguoz@gmail.com>
>> > > > > > > >> > > > > > > > > > wrote:
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > "If consumer wants to consume message
>> with
>> > > > > offset
>> > > > > > 16,
>> > > > > > > >> then
>> > > > > > > >> > > > > consumer
>> > > > > > > >> > > > > > > > > must
>> > > > > > > >> > > > > > > > > > > have
>> > > > > > > >> > > > > > > > > > > already fetched message with offset 15"
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > --> this may not always be true, right?
>> > > > > > > >> > > > > > > > > > > What if the consumer just calls seek(16)
>> > > > > > > >> > > > > > > > > > > after construction and then polls without
>> > > > > > > >> > > > > > > > > > > a committed offset ever stored before?
>> > > > > > > >> > > > > > > > > > > Admittedly it is rare but we do not
>> > > > > > > >> > > > > > > > > > > programmatically disallow it.
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > Guozhang
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > On Tue, Jan 23, 2018 at 10:42 PM, Dong
>> > Lin <
>> > > > > > > >> > > > > lindong28@gmail.com>
>> > > > > > > >> > > > > > > > > wrote:
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > Hey Guozhang,
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > Thanks much for reviewing the KIP!
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > In the scenario you described, let's
>> > > assume
>> > > > > that
>> > > > > > > >> broker
>> > > > > > > >> > > A has
>> > > > > > > >> > > > > > > > > messages
>> > > > > > > >> > > > > > > > > > > with
>> > > > > > > >> > > > > > > > > > > > offset up to 10, and broker B has
>> > messages
>> > > > > with
>> > > > > > > >> offset
>> > > > > > > >> > > up to
>> > > > > > > >> > > > > 20.
>> > > > > > > >> > > > > > > If
>> > > > > > > >> > > > > > > > > > > > consumer wants to consume message
>> with
>> > > > offset
>> > > > > > 9, it
>> > > > > > > >> will
>> > > > > > > >> > > not
>> > > > > > > >> > > > > > > receive
>> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException
>> > > > > > > >> > > > > > > > > > > > from broker A.
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > If consumer wants to consume message
>> > with
>> > > > > offset
>> > > > > > > >> 16, then
>> > > > > > > >> > > > > > > consumer
>> > > > > > > >> > > > > > > > > must
>> > > > > > > >> > > > > > > > > > > > have already fetched message with
>> offset
>> > > 15,
>> > > > > > which
>> > > > > > > >> can
>> > > > > > > >> > > only
>> > > > > > > >> > > > > come
>> > > > > > > >> > > > > > > from
>> > > > > > > >> > > > > > > > > > > > broker B. Because consumer will fetch
>> > from
>> > > > > > broker B
>> > > > > > > >> only
>> > > > > > > >> > > if
>> > > > > > > >> > > > > > > > > leaderEpoch
>> > > > > > > >> > > > > > > > > > > >=
>> > > > > > > >> > > > > > > > > > > > 2, then the current consumer
>> leaderEpoch
>> > > can
>> > > > > > not be
>> > > > > > > >> 1
>> > > > > > > >> > > since
>> > > > > > > >> > > > > this
>> > > > > > > >> > > > > > > KIP
>> > > > > > > >> > > > > > > > > > > > prevents leaderEpoch rewind. Thus we
>> > will
>> > > > not
>> > > > > > have
>> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException
>> > > > > > > >> > > > > > > > > > > > in this case.
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > Does this address your question, or
>> > maybe
>> > > > > there
>> > > > > > is
>> > > > > > > >> more
>> > > > > > > >> > > > > advanced
>> > > > > > > >> > > > > > > > > > scenario
>> > > > > > > >> > > > > > > > > > > > that the KIP does not handle?
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > Thanks,
>> > > > > > > >> > > > > > > > > > > > Dong
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > On Tue, Jan 23, 2018 at 9:43 PM,
>> > Guozhang
>> > > > > Wang <
>> > > > > > > >> > > > > > > wangguoz@gmail.com>
>> > > > > > > >> > > > > > > > > > > wrote:
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > Thanks Dong, I made a pass over the
>> > wiki
>> > > > and
>> > > > > > it
>> > > > > > > >> lgtm.
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > Just a quick question: can we
>> > completely
>> > > > > > > >> eliminate the
>> > > > > > > >> > > > > > > > > > > > > OffsetOutOfRangeException with this
>> > > > > approach?
>> > > > > > Say
>> > > > > > > >> if
>> > > > > > > >> > > there
>> > > > > > > >> > > > > are
>> > > > > > > >> > > > > > > > > > > consecutive
>> > > > > > > >> > > > > > > > > > > > > leader changes such that the cached
>> > > > > metadata's
>> > > > > > > >> > > partition
>> > > > > > > >> > > > > epoch
>> > > > > > > >> > > > > > > is
>> > > > > > > >> > > > > > > > > 1,
>> > > > > > > >> > > > > > > > > > > and
>> > > > > > > >> > > > > > > > > > > > > the metadata fetch response returns
>> > > with
>> > > > > > > >> partition
>> > > > > > > >> > > epoch 2
>> > > > > > > >> > > > > > > > > pointing
>> > > > > > > >> > > > > > > > > > to
>> > > > > > > >> > > > > > > > > > > > > leader broker A, while the actual
>> > > > up-to-date
>> > > > > > > >> metadata
>> > > > > > > >> > > has
>> > > > > > > >> > > > > > > partition
>> > > > > > > >> > > > > > > > > > > > epoch 3
>> > > > > > > >> > > > > > > > > > > > > whose leader is now broker B, the
>> > > metadata
>> > > > > > > >> refresh will
>> > > > > > > >> > > > > still
>> > > > > > > >> > > > > > > > > succeed
>> > > > > > > >> > > > > > > > > > > and
>> > > > > > > >> > > > > > > > > > > > > the follow-up fetch request may
>> still
>> > > see
>> > > > > > OORE?
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > Guozhang
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > On Tue, Jan 23, 2018 at 3:47 PM,
>> Dong
>> > > Lin
>> > > > <
>> > > > > > > >> > > > > lindong28@gmail.com
>> > > > > > > >> > > > > > > >
>> > > > > > > >> > > > > > > > > > wrote:
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > > Hi all,
>> > > > > > > >> > > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > > I would like to start the voting
>> > > process
>> > > > > for
>> > > > > > > >> KIP-232:
>> > > > > > > >> > > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/
>> > > > > > > >> > > confluence/display/KAFKA/KIP-
>> > > > > > > >> > > > > > > > > > > > > > 232%3A+Detect+outdated+metadat
>> > > > > > > >> a+using+leaderEpoch+
>> > > > > > > >> > > > > > > > > > and+partitionEpoch
>> > > > > > > >> > > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > > The KIP will help fix a
>> concurrency
>> > > > issue
>> > > > > in
>> > > > > > > >> Kafka
>> > > > > > > >> > > which
>> > > > > > > >> > > > > > > > > currently
>> > > > > > > >> > > > > > > > > > > can
>> > > > > > > >> > > > > > > > > > > > > > cause message loss or message
>> > > > duplication
>> > > > > in
>> > > > > > > >> > > consumer.
>> > > > > > > >> > > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > > Regards,
>> > > > > > > >> > > > > > > > > > > > > > Dong
>> > > > > > > >> > > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > > > --
>> > > > > > > >> > > > > > > > > > > > > -- Guozhang
>> > > > > > > >> > > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > > > --
>> > > > > > > >> > > > > > > > > > > -- Guozhang
>> > > > > > > >> > > > > > > > > > >
>> > > > > > > >> > > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > > > > --
>> > > > > > > >> > > > > > > > > -- Guozhang
>> > > > > > > >> > > > > > > > >
>> > > > > > > >> > > > > > >
>> > > > > > > >> > > > >
>> > > > > > > >> > >
>> > > > > > > >>
>> > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Ah thanks. I did not check the wiki but just browsed through all open
KIP discussions following the email threads.


-Matthias

On 9/30/18 12:06 PM, Dong Lin wrote:
> Hey Matthias,
> 
> Thanks for checking back on the status. The KIP has been marked as
> replaced by KIP-320 in the KIP list wiki page and the status has been
> updated in the discussion and voting email thread.
> 
> Thanks,
> Dong
> 
> On Sun, 30 Sep 2018 at 11:51 AM Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> It seems that KIP-320 was accepted. Thus, I am wondering what the status
>> of this KIP is?
>>
>> -Matthias
>>
>> On 7/11/18 10:59 AM, Dong Lin wrote:
>>> Hey Jun,
>>>
>>> Certainly. We can discuss later after KIP-320 settles.
>>>
>>> Thanks!
>>> Dong
>>>
>>>
>>> On Wed, Jul 11, 2018 at 8:54 AM, Jun Rao <ju...@confluent.io> wrote:
>>>
>>>> Hi, Dong,
>>>>
>>>> Sorry for the late response. Since KIP-320 is covering some of the
>> similar
>>>> problems described in this KIP, perhaps we can wait until KIP-320
>> settles
>>>> and see what's still left uncovered in this KIP.
>>>>
>>>> Thanks,
>>>>
>>>> Jun
>>>>
>>>> On Mon, Jun 4, 2018 at 7:03 PM, Dong Lin <li...@gmail.com> wrote:
>>>>
>>>>> Hey Jun,
>>>>>
>>>>> It seems that we have made considerable progress on the discussion of
>>>>> KIP-253 since February. Do you think we should continue the discussion
>>>>> there, or can we continue the voting for this KIP? I am happy to submit
>>>> the
>>>>> PR and move forward the progress for this KIP.
>>>>>
>>>>> Thanks!
>>>>> Dong
>>>>>
>>>>>
>>>>> On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <li...@gmail.com> wrote:
>>>>>
>>>>>> Hey Jun,
>>>>>>
>>>>>> Sure, I will come up with a KIP this week. I think there is a way to
>>>>> allow
>>>>>> partition expansion to arbitrary number without introducing new
>>>> concepts
>>>>>> such as read-only partition or repartition epoch.
>>>>>>
>>>>>> Thanks,
>>>>>> Dong
>>>>>>
>>>>>> On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <ju...@confluent.io> wrote:
>>>>>>
>>>>>>> Hi, Dong,
>>>>>>>
>>>>>>> Thanks for the reply. The general idea that you had for adding
>>>>> partitions
>>>>>>> is similar to what we had in mind. It would be useful to make this
>>>> more
>>>>>>> general, allowing adding an arbitrary number of partitions (instead
>> of
>>>>>>> just
>>>>>>> doubling) and potentially removing partitions as well. The following
>>>> is
>>>>>>> the
>>>>>>> high level idea from the discussion with Colin, Jason and Ismael.
>>>>>>>
>>>>>>> * To change the number of partitions from X to Y in a topic, the
>>>>>>> controller
>>>>>>> marks all existing X partitions as read-only and creates Y new
>>>>> partitions.
>>>>>>> The new partitions are writable and are tagged with a higher
>>>> repartition
>>>>>>> epoch (RE).
>>>>>>>
>>>>>>> * The controller propagates the new metadata to every broker. Once
>> the
>>>>>>> leader of a partition is marked as read-only, it rejects the produce
>>>>>>> requests on this partition. The producer will then refresh the
>>>> metadata
>>>>>>> and
>>>>>>> start publishing to the new writable partitions.
>>>>>>>
>>>>>>> * The consumers will then be consuming messages in RE order. The
>>>>> consumer
>>>>>>> coordinator will only assign partitions in the same RE to consumers.
>>>>> Only
>>>>>>> after all messages in an RE are consumed, will partitions in a higher
>>>> RE
>>>>>>> be
>>>>>>> assigned to consumers.
>>>>>>>
>>>>>>> As Colin mentioned, if we do the above, we could potentially (1) use
>> a
>>>>>>> globally unique partition id, or (2) use a globally unique topic id
>> to
>>>>>>> distinguish recreated partitions due to topic deletion.
>>>>>>>
>>>>>>> So, perhaps we can sketch out the re-partitioning KIP a bit more and
>>>> see
>>>>>>> if
>>>>>>> there is any overlap with KIP-232. Would you be interested in doing
>>>>> that?
>>>>>>> If not, we can do that next week.
>>>>>>>
>>>>>>> Jun
>>>>>>>
>>>>>>>
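
The RE-ordering rule sketched above can be illustrated with a small, purely hypothetical example. Nothing below is existing Kafka code: "repartition epoch", the PartitionInfo record and the assignable() helper are made-up names that only mirror the idea that the coordinator hands out the oldest unconsumed epoch first.

import java.util.*;

public class RepartitionEpochAssignment {

    // A partition tagged with the proposed repartition epoch (RE).
    record PartitionInfo(int partition, int repartitionEpoch, boolean fullyConsumed) {}

    // Only partitions in the lowest RE that still has unconsumed messages are
    // assignable; partitions in higher REs are held back until that RE is drained.
    static List<PartitionInfo> assignable(List<PartitionInfo> all) {
        OptionalInt lowestPending = all.stream()
                .filter(p -> !p.fullyConsumed())
                .mapToInt(PartitionInfo::repartitionEpoch)
                .min();
        if (lowestPending.isEmpty()) {
            return List.of();
        }
        return all.stream()
                .filter(p -> p.repartitionEpoch() == lowestPending.getAsInt()
                        && !p.fullyConsumed())
                .toList();
    }

    public static void main(String[] args) {
        List<PartitionInfo> partitions = List.of(
                new PartitionInfo(0, 1, false),  // old read-only partition, not drained yet
                new PartitionInfo(1, 1, true),   // old read-only partition, fully consumed
                new PartitionInfo(2, 2, false),  // new writable partition, held back
                new PartitionInfo(3, 2, false)); // new writable partition, held back
        System.out.println(assignable(partitions)); // only partition 0 (RE 1)
    }
}

Once partition 0 is drained, the same call returns the RE-2 partitions, which matches the "only after all messages in an RE are consumed" rule above.
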
>>>>>>> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <li...@gmail.com>
>>>> wrote:
>>>>>>>
>>>>>>>> Hey Jun,
>>>>>>>>
>>>>>>>> Interestingly I am also planning to sketch a KIP to allow partition
>>>>>>>> expansion for keyed topics after this KIP. Since you are already
>>>> doing
>>>>>>>> that, I guess I will just share my high level idea here in case it
>>>> is
>>>>>>>> helpful.
>>>>>>>>
>>>>>>>> The motivation for the KIP is that we currently lose order guarantee
>>>>> for
>>>>>>>> messages with the same key if we expand partitions of keyed topic.
>>>>>>>>
>>>>>>>> The solution can probably be built upon the following ideas:
>>>>>>>>
>>>>>>>> - Partition number of the keyed topic should always be doubled (or
>>>>>>>> multiplied by a power of 2). Given that we select a partition based on
>>>>>>>> hash(key) % partitionNum, this should help us ensure that a message
>>>>>>>> assigned to an existing partition will not be mapped to another
>>>>> existing
>>>>>>>> partition after partition expansion.
>>>>>>>>
>>>>>>>> - Producer includes in the ProduceRequest some information that
>>>> helps
>>>>>>>> ensure that messages produced to a partition will monotonically
>>>>>>> increase in
>>>>>>>> the partitionNum of the topic. In other words, if broker receives a
>>>>>>>> ProduceRequest and notices that the producer does not know the
>>>>> partition
>>>>>>>> number has increased, broker should reject this request. That
>>>>>>> "information"
>>>>>>>> maybe leaderEpoch, max partitionEpoch of the partitions of the
>>>> topic,
>>>>> or
>>>>>>>> simply partitionNum of the topic. The benefit of this property is
>>>> that
>>>>>>> we
>>>>>>>> can keep the new logic for in-order message consumption entirely in
>>>>> how
>>>>>>>> consumer leader determines the partition -> consumer mapping.
>>>>>>>>
>>>>>>>> - When consumer leader determines partition -> consumer mapping,
>>>>> leader
>>>>>>>> first reads the start position for each partition using
>>>>>>> OffsetFetchRequest.
>>>>>>>> If start positions are all non-zero, then assignment can be done in
>>>> its
>>>>>>>> current manner. The assumption is that, a message in the new
>>>> partition
>>>>>>>> should only be consumed after all messages with the same key
>>>> produced
>>>>>>>> before it has been consumed. Since some messages in the new
>>>> partition
>>>>>>> have
>>>>>>>> been consumed, we should not worry about consuming messages
>>>>>>> out-of-order.
>>>>>>>> The benefit of this approach is that we can avoid unnecessary
>>>>> overhead
>>>>>>> in
>>>>>>>> the common case.
>>>>>>>>
>>>>>>>> - If the consumer leader finds that the start position for some
>>>>>>> partition
>>>>>>>> is 0, say the current partition number is 18 and the partition index
>>>>>>>> is 12, then the consumer leader should ensure that the messages
>>>>>>>> produced to partition 12 - 18/2 = 3 before the first message of
>>>>>>>> partition 12 was produced have been consumed, before it assigns
>>>>>>>> partition 12 to any consumer in the consumer group. Since
>>>> we
>>>>>>> have
>>>>>>>> an "information" that is monotonically increasing per partition,
>>>>> consumer
>>>>>>>> can read the value of this information from the first message in
>>>>>>> partition
>>>>>>>> 12, get the offset corresponding to this value in partition 3,
>>>> assign
>>>>>>>> partition except for partition 12 (and probably other new
>>>> partitions)
>>>>> to
>>>>>>>> the existing consumers, waiting for the committed offset to go
>>>> beyond
>>>>>>> this
>>>>>>>> offset for partition 3, and trigger rebalance again so that
>>>> partition
>>>>> 3
>>>>>>> can
>>>>>>>> be reassigned to some consumer.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dong
>>>>>>>>
>>>>>>>>
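
A quick way to see why doubling (rather than adding an arbitrary number of partitions) preserves the key-to-partition mapping is the modular-arithmetic check below. This is only an illustrative sketch of the argument, not Kafka's actual partitioner; the class and method names are hypothetical.

import java.util.stream.IntStream;

public class DoublingCheck {

    // Emulates hash(key) % partitionNum with a non-negative result.
    static int partitionFor(int hash, int partitionNum) {
        return Math.floorMod(hash, partitionNum);
    }

    public static void main(String[] args) {
        int oldCount = 9;              // partitions before expansion
        int newCount = 2 * oldCount;   // partitions after doubling

        // For any hash value, the new partition is either the old one (p) or a
        // brand-new one (p + oldCount); it is never a *different* old partition.
        boolean holds = IntStream.range(0, 1_000_000).allMatch(hash -> {
            int oldP = partitionFor(hash, oldCount);
            int newP = partitionFor(hash, newCount);
            return newP == oldP || newP == oldP + oldCount;
        });
        System.out.println("doubling keeps keys out of other old partitions: " + holds); // true
    }
}

The same argument applies to any multiplication by a power of 2, since that is just repeated doubling.
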
>>>>>>>> On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <ju...@confluent.io> wrote:
>>>>>>>>
>>>>>>>>> Hi, Dong,
>>>>>>>>>
>>>>>>>>> Thanks for the KIP. It looks good overall. We are working on a
>>>>>>> separate
>>>>>>>> KIP
>>>>>>>>> for adding partitions while preserving the ordering guarantees.
>>>> That
>>>>>>> may
>>>>>>>>> require another flavor of partition epoch. It's not very clear
>>>>> whether
>>>>>>>> that
>>>>>>>>> partition epoch can be merged with the partition epoch in this
>>>> KIP.
>>>>>>> So,
>>>>>>>>> perhaps you can wait on this a bit until we post the other KIP in
>>>>> the
>>>>>>>> next
>>>>>>>>> few days.
>>>>>>>>>
>>>>>>>>> Jun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <be...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 on the KIP.
>>>>>>>>>>
>>>>>>>>>> I think the KIP is mainly about adding the capability of
>>>> tracking
>>>>>>> the
>>>>>>>>>> system state change lineage. It does not seem necessary to
>>>> bundle
>>>>>>> this
>>>>>>>>> KIP
>>>>>>>>>> with replacing the topic partition with partition epoch in
>>>>>>>> produce/fetch.
>>>>>>>>>> Replacing topic-partition string with partition epoch is
>>>>>>> essentially a
>>>>>>>>>> performance improvement on top of this KIP. That can probably be
>>>>>>> done
>>>>>>>>>> separately.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <lindong28@gmail.com
>>>>>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe <
>>>>>>> cmccabe@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin <
>>>>>>> lindong28@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand that the KIP will add overhead by
>>>>> introducing
>>>>>>>>>>>> per-partition
>>>>>>>>>>>>>> partitionEpoch. I am open to alternative solutions that
>>>>> do
>>>>>>>> not
>>>>>>>>>>> incur
>>>>>>>>>>>>>> additional overhead. But I don't see a better way now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IMO the overhead in the FetchResponse may not be that
>>>>> much.
>>>>>>> We
>>>>>>>>>>> probably
>>>>>>>>>>>>>> should discuss the percentage increase rather than the
>>>>>>> absolute
>>>>>>>>>>> number
>>>>>>>>>>>>>> increase. Currently after KIP-227, per-partition header
>>>>> has
>>>>>>> 23
>>>>>>>>>> bytes.
>>>>>>>>>>>> This
>>>>>>>>>>>>>> KIP adds another 4 bytes. Assume the records size is
>>>> 10KB,
>>>>>>> the
>>>>>>>>>>>> percentage
>>>>>>>>>>>>>> increase is 4 / (23 + 10000) = 0.04%. It seems
>>>> negligible,
>>>>>>>> right?
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the response.  I agree that the FetchRequest /
>>>>>>>>> FetchResponse
>>>>>>>>>>>> overhead should be OK, now that we have incremental fetch
>>>>>>> requests
>>>>>>>>> and
>>>>>>>>>>>> responses.  However, there are a lot of cases where the
>>>>>>> percentage
>>>>>>>>>>> increase
>>>>>>>>>>>> is much greater.  For example, if a client is doing full
>>>>>>>>>>> MetadataRequests /
>>>>>>>>>>>> Responses, we have some math kind of like this per
>>>> partition:
>>>>>>>>>>>>
>>>>>>>>>>>>> UpdateMetadataRequestPartitionState => topic partition
>>>>>>>>>>> controller_epoch
>>>>>>>>>>>> leader  leader_epoch partition_epoch isr zk_version replicas
>>>>>>>>>>>> offline_replicas
>>>>>>>>>>>>> 14 bytes:  topic => string (assuming about 10 byte topic
>>>>>>> names)
>>>>>>>>>>>>> 4 bytes:  partition => int32
>>>>>>>>>>>>> 4  bytes: controller_epoch => int32
>>>>>>>>>>>>> 4  bytes: leader => int32
>>>>>>>>>>>>> 4  bytes: leader_epoch => int32
>>>>>>>>>>>>> +4 EXTRA bytes: partition_epoch => int32        <-- NEW
>>>>>>>>>>>>> 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR)
>>>>>>>>>>>>> 4 bytes: zk_version => int32
>>>>>>>>>>>>> 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas)
>>>>>>>>>>>>> 2  offline_replicas => [int32] (assuming no offline
>>>>> replicas)
>>>>>>>>>>>>
>>>>>>>>>>>> Assuming I added that up correctly, the per-partition
>>>> overhead
>>>>>>> goes
>>>>>>>>>> from
>>>>>>>>>>>> 64 bytes per partition to 68, a 6.2% increase.
>>>>>>>>>>>>
>>>>>>>>>>>> We could do similar math for a lot of the other RPCs.  And
>>>> you
>>>>>>> will
>>>>>>>>>> have
>>>>>>>>>>> a
>>>>>>>>>>>> similar memory and garbage collection impact on the brokers
>>>>>>> since
>>>>>>>> you
>>>>>>>>>>> have
>>>>>>>>>>>> to store all this extra state as well.
>>>>>>>>>>>>
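
For reference, the two size estimates in this thread can be reproduced with the back-of-the-envelope calculation below; the field sizes simply restate the per-partition breakdown quoted above and are assumptions, not numbers taken from the wire protocol code.

public class OverheadEstimate {
    public static void main(String[] args) {
        // UpdateMetadata per-partition entry, assuming a ~10-byte topic name,
        // 3 replicas, 3 ISR members and no offline replicas (as in the breakdown above).
        int topic = 14;                  // string length prefix + topic name
        int fixedInt32s = 5 * 4;         // partition, controller_epoch, leader, leader_epoch, zk_version
        int isr = 2 + 3 * 4;
        int replicas = 2 + 3 * 4;
        int offlineReplicas = 2;
        int before = topic + fixedInt32s + isr + replicas + offlineReplicas;  // 64 bytes
        int after = before + 4;                                               // + partition_epoch
        System.out.printf("UpdateMetadata: %d -> %d bytes (+%.2f%%)%n",
                before, after, 100.0 * 4 / before);                           // ~6.25%

        // FetchResponse: ~23-byte partition header plus ~10KB of record data.
        System.out.printf("FetchResponse: +%.2f%%%n", 100.0 * 4 / (23 + 10_000)); // ~0.04%
    }
}

Both numbers line up with the rough figures discussed above: the absolute cost per partition is small, and the relative cost depends heavily on which RPC carries it.
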
>>>>>>>>>>>
>>>>>>>>>>> That is correct. IMO the Metadata is only updated periodically,
>>>>>>>>>>> so it is probably not a big deal if we increase it by 6%. The
>>>>> FetchResponse
>>>>>>>> and
>>>>>>>>>>> ProduceRequest are probably the only requests that are bounded
>>>>> by
>>>>>>> the
>>>>>>>>>>> bandwidth throughput.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree that we can probably save more space by using
>>>>>>> partition
>>>>>>>>> ID
>>>>>>>>>> so
>>>>>>>>>>>> that
>>>>>>>>>>>>>> we no longer need the string topic name. The similar idea
>>>> idea
>>>>>>> has
>>>>>>>>> also
>>>>>>>>>>>> been
>>>>>>>>>>>>>> put in the Rejected Alternative section in KIP-227.
>>>> While
>>>>>>> this
>>>>>>>>> idea
>>>>>>>>>>> is
>>>>>>>>>>>>>> promising, it seems orthogonal to the goal of this KIP.
>>>>>>> Given
>>>>>>>>> that
>>>>>>>>>>>> there is
>>>>>>>>>>>>>> already a lot of work to do in this KIP, maybe we can do the
>>>>>>>>> partition
>>>>>>>>>> ID
>>>>>>>>>>>> in a
>>>>>>>>>>>>>> separate KIP?
>>>>>>>>>>>>
>>>>>>>>>>>> I guess my thinking is that the goal here is to replace an
>>>>>>>> identifier
>>>>>>>>>>>> which can be re-used (the tuple of topic name, partition ID)
>>>>>>> with
>>>>>>>> an
>>>>>>>>>>>> identifier that cannot be re-used (the tuple of topic name,
>>>>>>>> partition
>>>>>>>>>> ID,
>>>>>>>>>>>> partition epoch) in order to gain better semantics.  As long
>>>>> as
>>>>>>> we
>>>>>>>>> are
>>>>>>>>>>>> replacing the identifier, why not replace it with an
>>>>> identifier
>>>>>>>> that
>>>>>>>>>> has
>>>>>>>>>>>> important performance advantages?  The KIP freeze for the
>>>> next
>>>>>>>>> release
>>>>>>>>>>> has
>>>>>>>>>>>> already passed, so there is time to do this.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In general it can be easier for discussion and implementation
>>>> if
>>>>>>> we
>>>>>>>> can
>>>>>>>>>>> split a larger task into smaller and independent tasks. For
>>>>>>> example,
>>>>>>>>>>> KIP-112 and KIP-113 both deal with the JBOD support. KIP-31,
>>>>>>> KIP-32
>>>>>>>>> and
>>>>>>>>>>> KIP-33 are about timestamp support. The opinion on this can be
>>>>>>>>>>> subjective, though.
>>>>>>>>>>>
>>>>>>>>>>> IMO the change to switch from (topic, partition ID) to
>>>>>>> partitionEpoch
>>>>>>>> in
>>>>>>>>>> all
>>>>>>>>>>> request/response requires us to go through all requests one
>>>> by
>>>>>>> one.
>>>>>>>>> It
>>>>>>>>>>> may not be hard but it can be time-consuming and tedious. At a
>>>>> high
>>>>>>>> level
>>>>>>>>>> the
>>>>>>>>>>> goal and the change for that will be orthogonal to the changes
>>>>>>>> required
>>>>>>>>>> in
>>>>>>>>>>> this KIP. That is the main reason I think we can split them
>>>> into
>>>>>>> two
>>>>>>>>>> KIPs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote:
>>>>>>>>>>>>> I think it is possible to move to entirely use
>>>>> partitionEpoch
>>>>>>>>> instead
>>>>>>>>>>> of
>>>>>>>>>>>>> (topic, partition) to identify a partition. Client can
>>>>> obtain
>>>>>>> the
>>>>>>>>>>>>> partitionEpoch -> (topic, partition) mapping from
>>>>>>>> MetadataResponse.
>>>>>>>>>> We
>>>>>>>>>>>>> probably need to figure out a way to assign partitionEpoch
>>>>> to
>>>>>>>>>> existing
>>>>>>>>>>>>> partitions in the cluster. But this should be doable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a good idea. I think it will save us some space in
>>>>> the
>>>>>>>>>>>>> request/response. The actual space saving in percentage
>>>>>>> probably
>>>>>>>>>>> depends
>>>>>>>>>>>> on
>>>>>>>>>>>>> the amount of data and the number of partitions of the
>>>> same
>>>>>>>> topic.
>>>>>>>>> I
>>>>>>>>>>> just
>>>>>>>>>>>>> think we can do it in a separate KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Hmm.  How much extra work would be required?  It seems like
>>>> we
>>>>>>> are
>>>>>>>>>>> already
>>>>>>>>>>>> changing almost every RPC that involves topics and
>>>> partitions,
>>>>>>>>> already
>>>>>>>>>>>> adding new per-partition state to ZooKeeper, already
>>>> changing
>>>>>>> how
>>>>>>>>>> clients
>>>>>>>>>>>> interact with partitions.  Is there some other big piece of
>>>>> work
>>>>>>>> we'd
>>>>>>>>>>> have
>>>>>>>>>>>> to do to move to partition IDs that we wouldn't need for
>>>>>>> partition
>>>>>>>>>>> epochs?
>>>>>>>>>>>> I guess we'd have to find a way to support regular
>>>>>>> expression-based
>>>>>>>>>> topic
>>>>>>>>>>>> subscriptions.  If we split this into multiple KIPs,
>>>> wouldn't
>>>>> we
>>>>>>>> end
>>>>>>>>> up
>>>>>>>>>>>> changing all that RPCs and ZK state a second time?  Also,
>>>> I'm
>>>>>>>> curious
>>>>>>>>>> if
>>>>>>>>>>>> anyone has done any proof of concept GC, memory, and network
>>>>>>> usage
>>>>>>>>>>>> measurements on switching topic names for topic IDs.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We will need to go over all requests/responses to check how to
>>>>>>>> replace
>>>>>>>>>>> (topic, partition ID) with partition epoch. It requires
>>>>>>> non-trivial
>>>>>>>>> work
>>>>>>>>>>> and could take time. As you mentioned, we may want to see how
>>>>> much
>>>>>>>>> saving
>>>>>>>>>>> we can get by switching from topic names to partition epoch.
>>>>> That
>>>>>>>>> itself
>>>>>>>>>>> requires time and experiment. It seems that the new idea does
>>>>> not
>>>>>>>>>> rollback
>>>>>>>>>>> any change proposed in this KIP. So I am not sure we can get
>>>>> much
>>>>>>> by
>>>>>>>>>>> putting them into the same KIP.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, if more people are interested in seeing the new idea
>>>> in
>>>>>>> the
>>>>>>>>> same
>>>>>>>>>>> KIP, I can try that.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> best,
>>>>>>>>>>>> Colin
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe <
>>>>>>>>> cmccabe@apache.org
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
>>>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <
>>>>>>>>>>> cmccabe@apache.org>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
>>>>>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the comment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <
>>>>>>>>>>>> cmccabe@apache.org>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
>>>>>>>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for reviewing the KIP.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If I understand you right, you may be
>>>> suggesting
>>>>>>> that
>>>>>>>>> we
>>>>>>>>>>> can
>>>>>>>>>>>> use
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> global
>>>>>>>>>>>>>>>>>>>> metadataEpoch that is incremented every time
>>>>>>>>> controller
>>>>>>>>>>>> updates
>>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>>>> The problem with this solution is that, if a
>>>>>>> topic
>>>>>>>> is
>>>>>>>>>>>> deleted
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>>>> again, user will not know that the
>>>>> offset
>>>>>>>>> which
>>>>>>>>>> is
>>>>>>>>>>>>>>> stored
>>>>>>>>>>>>>>>>> before
>>>>>>>>>>>>>>>>>>>> the topic deletion is no longer valid. This
>>>>>>>> motivates
>>>>>>>>>> the
>>>>>>>>>>>> idea
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> include
>>>>>>>>>>>>>>>>>>>> per-partition partitionEpoch. Does this sound
>>>>>>>>>> reasonable?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Perhaps we can store the last valid offset of
>>>>> each
>>>>>>>>> deleted
>>>>>>>>>>>> topic
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> ZooKeeper.  Then, when a topic with one of
>>>> those
>>>>>>> names
>>>>>>>>>> gets
>>>>>>>>>>>>>>>>> re-created, we
>>>>>>>>>>>>>>>>>>> can start the topic at the previous end offset
>>>>>>> rather
>>>>>>>>> than
>>>>>>>>>>> at
>>>>>>>>>>>> 0.
>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>> preserves immutability.  It is no more
>>>> burdensome
>>>>>>> than
>>>>>>>>>>> having
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> preserve a
>>>>>>>>>>>>>>>>>>> "last epoch" for the deleted partition
>>>> somewhere,
>>>>>>>> right?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My concern with this solution is that the number
>>>> of
>>>>>>>>>> zookeeper
>>>>>>>>>>>>>> gets
>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>> more and more over time if some users keep
>>>> deleting
>>>>>>> and
>>>>>>>>>>> creating
>>>>>>>>>>>>>>> topics.
>>>>>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>>>> you think this can be a problem?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We could expire the "partition tombstones" after an
>>>>>>> hour
>>>>>>>> or
>>>>>>>>>> so.
>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>> practice this would solve the issue for clients
>>>> that
>>>>>>> like
>>>>>>>> to
>>>>>>>>>>>> destroy
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> re-create topics all the time.  In any case,
>>>> doesn't
>>>>>>> the
>>>>>>>>>> current
>>>>>>>>>>>>>>> proposal
>>>>>>>>>>>>>>>>> add per-partition znodes as well that we have to
>>>>> track
>>>>>>>> even
>>>>>>>>>>> after
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> partition is deleted?  Or did I misunderstand that?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Actually the current KIP does not add per-partition
>>>>>>> znodes.
>>>>>>>>>> Could
>>>>>>>>>>>> you
>>>>>>>>>>>>>>>> double check? I can fix the KIP wiki if there is
>>>>> anything
>>>>>>>>>>>> misleading.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I double-checked the KIP, and I can see that you are in
>>>>>>> fact
>>>>>>>>>> using a
>>>>>>>>>>>>>>> global counter for initializing partition epochs.  So,
>>>>> you
>>>>>>> are
>>>>>>>>>>>> correct, it
>>>>>>>>>>>>>>> doesn't add per-partition znodes for partitions that no
>>>>>>> longer
>>>>>>>>>>> exist.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If we expire the "partition tombstones" after an hour,
>>>>> and
>>>>>>>> the
>>>>>>>>>>> topic
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> re-created after more than an hour since the topic
>>>>>>> deletion,
>>>>>>>>>> then
>>>>>>>>>>>> we are
>>>>>>>>>>>>>>>> back to the situation where user can not tell whether
>>>>> the
>>>>>>>>> topic
>>>>>>>>>>> has
>>>>>>>>>>>> been
>>>>>>>>>>>>>>>> re-created or not, right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, with an expiration period, it would not ensure
>>>>>>>>> immutability--
>>>>>>>>>>> you
>>>>>>>>>>>>>>> could effectively reuse partition names and they would
>>>>> look
>>>>>>>> the
>>>>>>>>>>> same.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's not really clear to me what should happen
>>>> when a
>>>>>>>> topic
>>>>>>>>> is
>>>>>>>>>>>>>>> destroyed
>>>>>>>>>>>>>>>>> and re-created with new data.  Should consumers
>>>>>>> continue
>>>>>>>> to
>>>>>>>>> be
>>>>>>>>>>>> able to
>>>>>>>>>>>>>>>>> consume?  We don't know where they stopped
>>>> consuming
>>>>>>> from
>>>>>>>>> the
>>>>>>>>>>>> previous
>>>>>>>>>>>>>>>>> incarnation of the topic, so messages may have been
>>>>>>> lost.
>>>>>>>>>>>> Certainly
>>>>>>>>>>>>>>>>> consuming data from offset X of the new incarnation
>>>>> of
>>>>>>> the
>>>>>>>>>> topic
>>>>>>>>>>>> may
>>>>>>>>>>>>>>> give
>>>>>>>>>>>>>>>>> something totally different from what you would
>>>> have
>>>>>>>> gotten
>>>>>>>>>> from
>>>>>>>>>>>>>>> offset X
>>>>>>>>>>>>>>>>> of the previous incarnation of the topic.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With the current KIP, if a consumer consumes a topic
>>>>>>> based
>>>>>>>> on
>>>>>>>>>> the
>>>>>>>>>>>> last
>>>>>>>>>>>>>>>> remembered (offset, partitionEpoch, leaderEpoch), and
>>>>> if
>>>>>>> the
>>>>>>>>>> topic
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> re-created, the consumer will throw
>>>>>>>> InvalidPartitionEpochException
>>>>>>>>>>>> because
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> previous partitionEpoch will be different from the
>>>>>>> current
>>>>>>>>>>>>>>> partitionEpoch.
>>>>>>>>>>>>>>>> This is described in the Proposed Changes ->
>>>>> Consumption
>>>>>>>> after
>>>>>>>>>>> topic
>>>>>>>>>>>>>>>> deletion in the KIP. I can improve the KIP if there
>>>> is
>>>>>>>>> anything
>>>>>>>>>>> not
>>>>>>>>>>>>>>> clear.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the clarification.  It sounds like what you
>>>>>>> really
>>>>>>>>> want
>>>>>>>>>>> is
>>>>>>>>>>>>>>> immutability-- i.e., to never "really" reuse partition
>>>>>>>>>> identifiers.
>>>>>>>>>>>> And
>>>>>>>>>>>>>>> you do this by making the partition name no longer the
>>>>>>> "real"
>>>>>>>>>>>> identifier.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My big concern about this KIP is that it seems like an
>>>>>>>>>>>> anti-scalability
>>>>>>>>>>>>>>> feature.  Now we are adding 4 extra bytes for every
>>>>>>> partition
>>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>>>>>> FetchResponse and Request, for example.  That could be
>>>> 40
>>>>>>> kb
>>>>>>>> per
>>>>>>>>>>>> request,
>>>>>>>>>>>>>>> if the user has 10,000 partitions.  And of course, the
>>>>> KIP
>>>>>>>> also
>>>>>>>>>>> makes
>>>>>>>>>>>>>>> massive changes to UpdateMetadataRequest,
>>>>> MetadataResponse,
>>>>>>>>>>>>>>> OffsetCommitRequest, OffsetFetchResponse,
>>>>>>> LeaderAndIsrRequest,
>>>>>>>>>>>>>>> ListOffsetResponse, etc. which will also increase their
>>>>>>> size
>>>>>>>> on
>>>>>>>>>> the
>>>>>>>>>>>> wire
>>>>>>>>>>>>>>> and in memory.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One thing that we talked a lot about in the past is
>>>>>>> replacing
>>>>>>>>>>>> partition
>>>>>>>>>>>>>>> names with IDs.  IDs have a lot of really nice
>>>> features.
>>>>>>> They
>>>>>>>>>> take
>>>>>>>>>>>> up much
>>>>>>>>>>>>>>> less space in memory than strings (especially 2-byte
>>>> Java
>>>>>>>>>> strings).
>>>>>>>>>>>> They
>>>>>>>>>>>>>>> can often be allocated on the stack rather than the
>>>> heap
>>>>>>>>>> (important
>>>>>>>>>>>> when
>>>>>>>>>>>>>>> you are dealing with hundreds of thousands of them).
>>>>> They
>>>>>>> can
>>>>>>>>> be
>>>>>>>>>>>>>>> efficiently deserialized and serialized.  If we use
>>>>> 64-bit
>>>>>>>> ones,
>>>>>>>>>> we
>>>>>>>>>>>> will
>>>>>>>>>>>>>>> never run out of IDs, which means that they can always
>>>> be
>>>>>>>> unique
>>>>>>>>>> per
>>>>>>>>>>>>>>> partition.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Given that the partition name is no longer the "real"
>>>>>>>> identifier
>>>>>>>>>> for
>>>>>>>>>>>>>>> partitions in the current KIP-232 proposal, why not
>>>> just
>>>>>>> move
>>>>>>>> to
>>>>>>>>>>> using
>>>>>>>>>>>>>>> partition IDs entirely instead of strings?  You have to
>>>>>>> change
>>>>>>>>> all
>>>>>>>>>>> the
>>>>>>>>>>>>>>> messages anyway.  There isn't much point any more to
>>>>>>> carrying
>>>>>>>>>> around
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> partition name in every RPC, since you really need
>>>> (name,
>>>>>>>> epoch)
>>>>>>>>>> to
>>>>>>>>>>>>>>> identify the partition.
>>>>>>>>>>>>>>> Probably the metadata response and a few other messages
>>>>>>> would
>>>>>>>>> have
>>>>>>>>>>> to
>>>>>>>>>>>>>>> still carry the partition name, to allow clients to go
>>>>> from
>>>>>>>> name
>>>>>>>>>> to
>>>>>>>>>>>> id.
>>>>>>>>>>>>>>> But we could mostly forget about the strings.  And then
>>>>>>> this
>>>>>>>>> would
>>>>>>>>>>> be
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> scalability improvement rather than a scalability
>>>>> problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> By choosing to reuse the same (topic, partition,
>>>>>>> offset)
>>>>>>>>>>> 3-tuple,
>>>>>>>>>>>> we
>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> chosen to give up immutability.  That was a really
>>>> bad
>>>>>>>>> decision.
>>>>>>>>>>>> And
>>>>>>>>>>>>>>> now
>>>>>>>>>>>>>>>>> we have to worry about time dependencies, stale
>>>>> cached
>>>>>>>> data,
>>>>>>>>>> and
>>>>>>>>>>>> all
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> rest.  We can't completely fix this inside Kafka no
>>>>>>> matter
>>>>>>>>>> what
>>>>>>>>>>>> we do,
>>>>>>>>>>>>>>>>> because not all that cached data is inside Kafka
>>>>>>> itself.
>>>>>>>>> Some
>>>>>>>>>>> of
>>>>>>>>>>>> it
>>>>>>>>>>>>>>> may be
>>>>>>>>>>>>>>>>> in systems that Kafka has sent data to, such as
>>>> other
>>>>>>>>> daemons,
>>>>>>>>>>> SQL
>>>>>>>>>>>>>>>>> databases, streams, and so forth.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current KIP will uniquely identify a message
>>>> using
>>>>>>>> (topic,
>>>>>>>>>>>>>>> partition,
>>>>>>>>>>>>>>>> offset, partitionEpoch) 4-tuple. This addresses the
>>>>>>> message
>>>>>>>>>>>> immutability
>>>>>>>>>>>>>>>> issue that you mentioned. Is there any corner case
>>>>> where
>>>>>>> the
>>>>>>>>>>> message
>>>>>>>>>>>>>>>> immutability is still not preserved with the current
>>>>> KIP?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I guess the idea here is that mirror maker should
>>>>> work
>>>>>>> as
>>>>>>>>>>> expected
>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> users destroy a topic and re-create it with the
>>>> same
>>>>>>> name.
>>>>>>>>>>> That's
>>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>> tough, though, since in that scenario, mirror maker
>>>>>>>> probably
>>>>>>>>>>>> should
>>>>>>>>>>>>>>> destroy
>>>>>>>>>>>>>>>>> and re-create the topic on the other end, too,
>>>> right?
>>>>>>>>>>> Otherwise,
>>>>>>>>>>>>>>> what you
>>>>>>>>>>>>>>>>> end up with on the other end could be half of one
>>>>>>>>> incarnation
>>>>>>>>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> topic,
>>>>>>>>>>>>>>>>> and half of another.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What mirror maker really needs is to be able to
>>>>> follow
>>>>>>> a
>>>>>>>>>> stream
>>>>>>>>>>> of
>>>>>>>>>>>>>>> events
>>>>>>>>>>>>>>>>> about the kafka cluster itself.  We could have some
>>>>>>> master
>>>>>>>>>> topic
>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>>>> always present and which contains data about all
>>>>> topic
>>>>>>>>>>> deletions,
>>>>>>>>>>>>>>>>> creations, etc.  Then MM can simply follow this
>>>> topic
>>>>>>> and
>>>>>>>> do
>>>>>>>>>>> what
>>>>>>>>>>>> is
>>>>>>>>>>>>>>> needed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Then the next question maybe, should we use a
>>>>>>> global
>>>>>>>>>>>>>>> metadataEpoch +
>>>>>>>>>>>>>>>>>>>> per-partition partitionEpoch, instead of
>>>> using
>>>>>>>>>>> per-partition
>>>>>>>>>>>>>>>>> leaderEpoch
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> per-partition leaderEpoch. The former
>>>> solution
>>>>>>> using
>>>>>>>>>>>>>>> metadataEpoch
>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>> not work due to the following scenario
>>>>> (provided
>>>>>>> by
>>>>>>>>>> Jun):
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> "Consider the following scenario. In metadata
>>>>> v1,
>>>>>>>> the
>>>>>>>>>>> leader
>>>>>>>>>>>>>>> for a
>>>>>>>>>>>>>>>>>>>> partition is at broker 1. In metadata v2,
>>>>> leader
>>>>>>> is
>>>>>>>> at
>>>>>>>>>>>> broker
>>>>>>>>>>>>>>> 2. In
>>>>>>>>>>>>>>>>>>>> metadata v3, leader is at broker 1 again. The
>>>>>>> last
>>>>>>>>>>> committed
>>>>>>>>>>>>>>> offset
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> v1,
>>>>>>>>>>>>>>>>>>>> v2 and v3 are 10, 20 and 30, respectively. A
>>>>>>>> consumer
>>>>>>>>> is
>>>>>>>>>>>>>>> started and
>>>>>>>>>>>>>>>>>>> reads
>>>>>>>>>>>>>>>>>>>> metadata v1 and reads messages from offset 0
>>>> to
>>>>>>> 25
>>>>>>>>> from
>>>>>>>>>>>> broker
>>>>>>>>>>>>>>> 1. My
>>>>>>>>>>>>>>>>>>>> understanding is that in the current
>>>> proposal,
>>>>>>> the
>>>>>>>>>>> metadata
>>>>>>>>>>>>>>> version
>>>>>>>>>>>>>>>>>>>> associated with offset 25 is v1. The consumer
>>>>> is
>>>>>>>> then
>>>>>>>>>>>> restarted
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> fetches
>>>>>>>>>>>>>>>>>>>> metadata v2. The consumer tries to read from
>>>>>>> broker
>>>>>>>> 2,
>>>>>>>>>>>> which is
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> old
>>>>>>>>>>>>>>>>>>>> leader with the last offset at 20. In this
>>>>> case,
>>>>>>> the
>>>>>>>>>>>> consumer
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>>> get OffsetOutOfRangeException incorrectly."
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regarding your comment "For the second
>>>> purpose,
>>>>>>> this
>>>>>>>>> is
>>>>>>>>>>>> "soft
>>>>>>>>>>>>>>> state"
>>>>>>>>>>>>>>>>>>>> anyway.  If the client thinks X is the leader
>>>>>>> but Y
>>>>>>>> is
>>>>>>>>>>>> really
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> leader,
>>>>>>>>>>>>>>>>>>>> the client will talk to X, and X will point
>>>> out
>>>>>>> its
>>>>>>>>>>> mistake
>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> sending
>>>>>>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>>>>> a NOT_LEADER_FOR_PARTITION.", it is probably not
>>>> no
>>>>>>>> true.
>>>>>>>>>> The
>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>> here is
>>>>>>>>>>>>>>>>>>>> that the old leader X may still think it is
>>>> the
>>>>>>>> leader
>>>>>>>>>> of
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> thus it will not send back
>>>>>>> NOT_LEADER_FOR_PARTITION.
>>>>>>>>> The
>>>>>>>>>>>> reason
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> provided
>>>>>>>>>>>>>>>>>>>> in KAFKA-6262. Can you check if that makes
>>>>> sense?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is solvable with a timeout, right?  If the
>>>>>>> leader
>>>>>>>>>> can't
>>>>>>>>>>>>>>>>> communicate
>>>>>>>>>>>>>>>>>>> with the controller for a certain period of
>>>> time,
>>>>>>> it
>>>>>>>>>> should
>>>>>>>>>>>> stop
>>>>>>>>>>>>>>>>> acting as
>>>>>>>>>>>>>>>>>>> the leader.  We have to solve this problem,
>>>>>>> anyway, in
>>>>>>>>>> order
>>>>>>>>>>>> to
>>>>>>>>>>>>>>> fix
>>>>>>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>>>> corner cases.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Not sure if I fully understand your proposal. The
>>>>>>>> proposal
>>>>>>>>>>>> seems to
>>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>>>> non-trivial changes to our existing leadership
>>>>>>> election
>>>>>>>>>>>> mechanism.
>>>>>>>>>>>>>>> Could
>>>>>>>>>>>>>>>>>> you provide more detail regarding how it works?
>>>> For
>>>>>>>>> example,
>>>>>>>>>>> how
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> user choose this timeout, how leader determines
>>>>>>> whether
>>>>>>>> it
>>>>>>>>>> can
>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>> communicate with controller, and how this
>>>> triggers
>>>>>>>>>> controller
>>>>>>>>>>> to
>>>>>>>>>>>>>>> elect
>>>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>>>> leader?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Before I come up with any proposal, let me make
>>>> sure
>>>>> I
>>>>>>>>>>> understand
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> problem correctly.  My big question was, what
>>>>> prevents
>>>>>>>>>>> split-brain
>>>>>>>>>>>>>>> here?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let's say I have a partition which is on nodes A,
>>>> B,
>>>>>>> and
>>>>>>>> C,
>>>>>>>>>> with
>>>>>>>>>>>>>>> min-ISR
>>>>>>>>>>>>>>>>> 2.  The controller is D.  At some point, there is a
>>>>>>>> network
>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>> between A and B and the rest of the cluster.  The
>>>>>>>> Controller
>>>>>>>>>>>>>>> re-assigns the
>>>>>>>>>>>>>>>>> partition to nodes C, D, and E.  But A and B keep
>>>>>>> chugging
>>>>>>>>>> away,
>>>>>>>>>>>> even
>>>>>>>>>>>>>>>>> though they can no longer communicate with the
>>>>>>> controller.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> At some point, a client with stale metadata writes
>>>> to
>>>>>>> the
>>>>>>>>>>>> partition.
>>>>>>>>>>>>>>> It
>>>>>>>>>>>>>>>>> still thinks the partition is on node A, B, and C,
>>>> so
>>>>>>>> that's
>>>>>>>>>>>> where it
>>>>>>>>>>>>>>> sends
>>>>>>>>>>>>>>>>> the data.  It's unable to talk to C, but A and B
>>>>> reply
>>>>>>>> back
>>>>>>>>>> that
>>>>>>>>>>>> all
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is this not a case where we could lose data due to
>>>>>>> split
>>>>>>>>>> brain?
>>>>>>>>>>>> Or is
>>>>>>>>>>>>>>>>> there a mechanism for preventing this that I
>>>> missed?
>>>>>>> If
>>>>>>>> it
>>>>>>>>>> is,
>>>>>>>>>>> it
>>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>> like a pretty serious failure case that we should
>>>> be
>>>>>>>>> handling
>>>>>>>>>>>> with our
>>>>>>>>>>>>>>>>> metadata rework.  And I think epoch numbers and
>>>>>>> timeouts
>>>>>>>>> might
>>>>>>>>>>> be
>>>>>>>>>>>>>>> part of
>>>>>>>>>>>>>>>>> the solution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Right, split brain can happen if RF=4 and minIsr=2.
>>>>>>>> However, I
>>>>>>>>>> am
>>>>>>>>>>>> not
>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>> it is a pretty serious issue which we need to address
>>>>>>> today.
>>>>>>>>>> This
>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>> prevented by configuring the Kafka topic so that
>>>>> minIsr >
>>>>>>>>> RF/2.
>>>>>>>>>>>>>>> Actually,
>>>>>>>>>>>>>>>> if user sets minIsr=2, is there any reason that
>>>>> user
>>>>>>>>> wants
>>>>>>>>>> to
>>>>>>>>>>>> set
>>>>>>>>>>>>>>> RF=4
>>>>>>>>>>>>>>>> instead of 3?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Introducing timeout in leader election mechanism is
>>>>>>>>>> non-trivial. I
>>>>>>>>>>>>>>> think we
>>>>>>>>>>>>>>>> probably want to do that only if there is good
>>>> use-case
>>>>>>> that
>>>>>>>>> can
>>>>>>>>>>> not
>>>>>>>>>>>>>>>> otherwise be addressed with the current mechanism.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I still would like to think about these corner cases
>>>>> more.
>>>>>>>> But
>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>> it's not directly related to this KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 10:39 AM, Colin
>>>> McCabe
>>>>> <
>>>>>>>>>>>>>>> cmccabe@apache.org>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for proposing this KIP.  I think a
>>>>>>> metadata
>>>>>>>>>> epoch
>>>>>>>>>>>> is a
>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I read through the DISCUSS thread, but I
>>>>> still
>>>>>>>> don't
>>>>>>>>>>> have
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> clear
>>>>>>>>>>>>>>>>>>> picture
>>>>>>>>>>>>>>>>>>>>> of why the proposal uses a metadata epoch
>>>> per
>>>>>>>>>> partition
>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>> than a
>>>>>>>>>>>>>>>>>>>>> global metadata epoch.  A metadata epoch
>>>> per
>>>>>>>>> partition
>>>>>>>>>>> is
>>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>>>>> unpleasant-- it's at least 4 extra bytes
>>>> per
>>>>>>>>> partition
>>>>>>>>>>>> that we
>>>>>>>>>>>>>>>>> have to
>>>>>>>>>>>>>>>>>>> send
>>>>>>>>>>>>>>>>>>>>> over the wire in every full metadata
>>>> request,
>>>>>>>> which
>>>>>>>>>>> could
>>>>>>>>>>>>>>> become
>>>>>>>>>>>>>>>>> extra
>>>>>>>>>>>>>>>>>>>>> kilobytes on the wire when the number of
>>>>>>>> partitions
>>>>>>>>>>>> becomes
>>>>>>>>>>>>>>> large.
>>>>>>>>>>>>>>>>>>> Plus,
>>>>>>>>>>>>>>>>>>>>> we have to update all the auxillary classes
>>>>> to
>>>>>>>>> include
>>>>>>>>>>> an
>>>>>>>>>>>>>>> epoch.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> We need to have a global metadata epoch
>>>>> anyway
>>>>>>> to
>>>>>>>>>> handle
>>>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>>>>>> addition and deletion.  For example, if I
>>>>> give
>>>>>>> you
>>>>>>>>>>>>>>>>>>>>> MetadataResponse{part1,epoch 1, part2,
>>>> epoch
>>>>> 1}
>>>>>>>> and
>>>>>>>>>>>> {part1,
>>>>>>>>>>>>>>>>> epoch1},
>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>> MetadataResponse is newer?  You have no way
>>>>> of
>>>>>>>>>> knowing.
>>>>>>>>>>>> It
>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> part2 has just been created, and the
>>>> response
>>>>>>>> with 2
>>>>>>>>>>>>>>> partitions is
>>>>>>>>>>>>>>>>>>> newer.
>>>>>>>>>>>>>>>>>>>>> Or it could be that part2 has just been
>>>>>>> deleted,
>>>>>>>> and
>>>>>>>>>>>>>>> therefore the
>>>>>>>>>>>>>>>>>>> response
>>>>>>>>>>>>>>>>>>>>> with 1 partition is newer.  You must have a
>>>>>>> global
>>>>>>>>>> epoch
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> disambiguate
>>>>>>>>>>>>>>>>>>>>> these two cases.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Previously, I worked on the Ceph
>>>> distributed
>>>>>>>>>> filesystem.
>>>>>>>>>>>>>>> Ceph had
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> concept of a map of the whole cluster,
>>>>>>> maintained
>>>>>>>>> by a
>>>>>>>>>>> few
>>>>>>>>>>>>>>> servers
>>>>>>>>>>>>>>>>>>> doing
>>>>>>>>>>>>>>>>>>>>> paxos.  This map was versioned by a single
>>>>>>> 64-bit
>>>>>>>>>> epoch
>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>> increased on every change.  It was
>>>> propagated
>>>>>>> to
>>>>>>>>>> clients
>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>> gossip.  I
>>>>>>>>>>>>>>>>>>>>> wonder if something similar could work
>>>> here?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It seems like the Kafka
>>>> MetadataResponse
>>>>>>>> serves
>>>>>>>>>> two
>>>>>>>>>>>>>>> somewhat
>>>>>>>>>>>>>>>>>>> unrelated
>>>>>>>>>>>>>>>>>>>>> purposes.  Firstly, it lets clients know
>>>> what
>>>>>>>>>> partitions
>>>>>>>>>>>>>>> exist in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> system and where they live.  Secondly, it
>>>>> lets
>>>>>>>>> clients
>>>>>>>>>>>> know
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>>> within the partition are in-sync (in the
>>>> ISR)
>>>>>>> and
>>>>>>>>>> which
>>>>>>>>>>>> node
>>>>>>>>>>>>>>> is the
>>>>>>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The first purpose is what you really need a
>>>>>>>> metadata
>>>>>>>>>>> epoch
>>>>>>>>>>>>>>> for, I
>>>>>>>>>>>>>>>>>>> think.
>>>>>>>>>>>>>>>>>>>>> You want to know whether a partition exists
>>>>> or
>>>>>>>> not,
>>>>>>>>> or
>>>>>>>>>>> you
>>>>>>>>>>>>>>> want to
>>>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>>>>>>>> which nodes you should talk to in order to
>>>>>>> write
>>>>>>>> to
>>>>>>>>> a
>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>> partition.  A
>>>>>>>>>>>>>>>>>>>>> single metadata epoch for the whole
>>>> response
>>>>>>>> should
>>>>>>>>> be
>>>>>>>>>>>>>>> adequate
>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>>>> should not change the partition assignment
>>>>>>> without
>>>>>>>>>> going
>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>> zookeeper
>>>>>>>>>>>>>>>>>>>>> (or a similar system), and this inherently
>>>>>>>>> serializes
>>>>>>>>>>>> updates
>>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> numbered stream.  Brokers should also stop
>>>>>>>>> responding
>>>>>>>>>> to
>>>>>>>>>>>>>>> requests
>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>>> are unable to contact ZK for a certain time
>>>>>>>> period.
>>>>>>>>>>> This
>>>>>>>>>>>>>>> prevents
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>>>> where a given partition has been moved off
>>>>> some
>>>>>>>> set
>>>>>>>>> of
>>>>>>>>>>>> nodes,
>>>>>>>>>>>>>>> but a
>>>>>>>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>>>>> still ends up talking to those nodes and
>>>>>>> writing
>>>>>>>>> data
>>>>>>>>>>>> there.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For the second purpose, this is "soft
>>>> state"
>>>>>>>> anyway.
>>>>>>>>>> If
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>>> thinks
>>>>>>>>>>>>>>>>>>>>> X is the leader but Y is really the leader,
>>>>> the
>>>>>>>>> client
>>>>>>>>>>>> will
>>>>>>>>>>>>>>> talk
>>>>>>>>>>>>>>>>> to X,
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> X will point out its mistake by sending
>>>> back
>>>>> a
>>>>>>>>>>>>>>>>>>> NOT_LEADER_FOR_PARTITION.
>>>>>>>>>>>>>>>>>>>>> Then the client can update its metadata
>>>> again
>>>>>>> and
>>>>>>>>> find
>>>>>>>>>>>> the new
>>>>>>>>>>>>>>>>> leader,
>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>> there is one.  There is no need for an
>>>> epoch
>>>>> to
>>>>>>>>> handle
>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>> Similarly, I
>>>>>>>>>>>>>>>>>>>>> can't think of a reason why changing the
>>>>>>> in-sync
>>>>>>>>>> replica
>>>>>>>>>>>> set
>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> bump
>>>>>>>>>>>>>>>>>>>>> the epoch.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018, at 09:45, Dong Lin
>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> Thanks much for reviewing the KIP!
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 7:10 AM, Guozhang
>>>>>>> Wang <
>>>>>>>>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Yeah that makes sense, again I'm just
>>>>>>> making
>>>>>>>>> sure
>>>>>>>>>> we
>>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>>>>>>>> scenarios and what to expect.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I agree that if, more generally
>>>> speaking,
>>>>>>> say
>>>>>>>>>> users
>>>>>>>>>>>> have
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> consumed
>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> offset 8, and then call seek(16) to
>>>>> "jump"
>>>>>>> to
>>>>>>>> a
>>>>>>>>>>>> further
>>>>>>>>>>>>>>>>> position,
>>>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>>>> she
>>>>>>>>>>>>>>>>>>>>>>> needs to be aware that OORE may be
>>>> thrown
>>>>>>> and
>>>>>>>> she
>>>>>>>>>>>> needs to
>>>>>>>>>>>>>>>>> handle
>>>>>>>>>>>>>>>>>>> it or
>>>>>>>>>>>>>>>>>>>>> rely
>>>>>>>>>>>>>>>>>>>>>>> on reset policy which should not
>>>> surprise
>>>>>>> her.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm +1 on the KIP.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 12:31 AM, Dong
>>>>> Lin
>>>>>>> <
>>>>>>>>>>>>>>>>> lindong28@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yes, in general we can not prevent
>>>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>>>> seeks
>>>>>>>>>>>>>>>>>>>>>>>> to a wrong offset. The main goal is
>>>> to
>>>>>>>> prevent
>>>>>>>>>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>>> user has done things in the right
>>>> way,
>>>>>>> e.g.
>>>>>>>>> user
>>>>>>>>>>>> should
>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>> a message with this offset.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For example, if user calls seek(..)
>>>>> right
>>>>>>>>> after
>>>>>>>>>>>>>>>>> construction, the
>>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>> reason I can think of is that user
>>>>> stores
>>>>>>>>> offset
>>>>>>>>>>>>>>> externally.
>>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>>>>>> user currently needs to use the
>>>> offset
>>>>>>> which
>>>>>>>>> is
>>>>>>>>>>>> obtained
>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>>>>>> position(..)
>>>>>>>>>>>>>>>>>>>>>>>> from the last run. With this KIP,
>>>> user
>>>>>>> needs
>>>>>>>>> to
>>>>>>>>>>> get
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> offset
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> offsetEpoch using
>>>>>>>> positionAndOffsetEpoch(...)
>>>>>>>>>> and
>>>>>>>>>>>> stores
>>>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>>>>>>>>>> externally. The next time user starts
>>>>>>>>> consumer,
>>>>>>>>>>>> he/she
>>>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>>>> to
>>>>>>>
> 


Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Posted by Dong Lin <li...@gmail.com>.
Hey Matthias,

Thanks for checking back on the status. The KIP has been marked as
replaced by KIP-320 in the KIP list wiki page and the status has been
updated in the discussion and voting email thread.

Thanks,
Dong

On Sun, 30 Sep 2018 at 11:51 AM Matthias J. Sax <ma...@confluent.io>
wrote:

> It seems that KIP-320 was accepted. Thus, I am wondering what the status
> of this KIP is?
>
> -Matthias
>
> On 7/11/18 10:59 AM, Dong Lin wrote:
> > Hey Jun,
> >
> > Certainly. We can discuss later after KIP-320 settles.
> >
> > Thanks!
> > Dong
> >
> >
> > On Wed, Jul 11, 2018 at 8:54 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> >> Hi, Dong,
> >>
> >> Sorry for the late response. Since KIP-320 is covering some of the
> similar
> >> problems described in this KIP, perhaps we can wait until KIP-320
> settles
> >> and see what's still left uncovered in this KIP.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Mon, Jun 4, 2018 at 7:03 PM, Dong Lin <li...@gmail.com> wrote:
> >>
> >>> Hey Jun,
> >>>
> >>> It seems that we have made considerable progress on the discussion of
> >>> KIP-253 since February. Do you think we should continue the discussion
> >>> there, or can we continue the voting for this KIP? I am happy to submit
> >> the
> >>> PR and move forward the progress for this KIP.
> >>>
> >>> Thanks!
> >>> Dong
> >>>
> >>>
> >>> On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <li...@gmail.com> wrote:
> >>>
> >>>> Hey Jun,
> >>>>
> >>>> Sure, I will come up with a KIP this week. I think there is a way to
> >>> allow
> >>>> partition expansion to arbitrary number without introducing new
> >> concepts
> >>>> such as read-only partition or repartition epoch.
> >>>>
> >>>> Thanks,
> >>>> Dong
> >>>>
> >>>> On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <ju...@confluent.io> wrote:
> >>>>
> >>>>> Hi, Dong,
> >>>>>
> >>>>> Thanks for the reply. The general idea that you had for adding
> >>> partitions
> >>>>> is similar to what we had in mind. It would be useful to make this
> >> more
> >>>>> general, allowing adding an arbitrary number of partitions (instead
> of
> >>>>> just
> >>>>> doubling) and potentially removing partitions as well. The following
> >> is
> >>>>> the
> >>>>> high level idea from the discussion with Colin, Jason and Ismael.
> >>>>>
> >>>>> * To change the number of partitions from X to Y in a topic, the
> >>>>> controller
> >>>>> marks all existing X partitions as read-only and creates Y new
> >>> partitions.
> >>>>> The new partitions are writable and are tagged with a higher
> >> repartition
> >>>>> epoch (RE).
> >>>>>
> >>>>> * The controller propagates the new metadata to every broker. Once
> the
> >>>>> leader of a partition is marked as read-only, it rejects the produce
> >>>>> requests on this partition. The producer will then refresh the
> >> metadata
> >>>>> and
> >>>>> start publishing to the new writable partitions.
> >>>>>
> >>>>> * The consumers will then be consuming messages in RE order. The
> >>> consumer
> >>>>> coordinator will only assign partitions in the same RE to consumers.
> >>> Only
> >>>>> after all messages in an RE are consumed, will partitions in a higher
> >> RE
> >>>>> be
> >>>>> assigned to consumers.
> >>>>>
> >>>>> As Colin mentioned, if we do the above, we could potentially (1) use
> a
> >>>>> globally unique partition id, or (2) use a globally unique topic id
> to
> >>>>> distinguish recreated partitions due to topic deletion.
> >>>>>
> >>>>> So, perhaps we can sketch out the re-partitioning KIP a bit more and
> >> see
> >>>>> if
> >>>>> there is any overlap with KIP-232. Would you be interested in doing
> >>> that?
> >>>>> If not, we can do that next week.
> >>>>>
> >>>>> Jun
> >>>>>
> >>>>>
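
To make the read-only idea above concrete, here is a deliberately simplified sketch of the broker-side check. The PartitionMode enum, the epoch arguments and the return strings are all hypothetical and are not part of Kafka's produce path; the epoch comparison only echoes the producer-side check proposed further down in this thread.

public class ReadOnlyPartitionCheck {

    enum PartitionMode { WRITABLE, READ_ONLY }

    // Schematic leader-side decision for a produce request against one partition.
    static String handleProduce(PartitionMode mode, int brokerRepartitionEpoch, int producerRepartitionEpoch) {
        if (mode == PartitionMode.READ_ONLY) {
            // The partition was frozen by a repartition; the producer must refresh
            // its metadata and publish to the new writable partitions instead.
            return "REJECT: partition is read-only, refresh metadata";
        }
        if (producerRepartitionEpoch < brokerRepartitionEpoch) {
            // The producer does not yet know that the partition count changed.
            return "REJECT: stale repartition epoch " + producerRepartitionEpoch
                    + " < " + brokerRepartitionEpoch;
        }
        return "APPEND";
    }

    public static void main(String[] args) {
        System.out.println(handleProduce(PartitionMode.READ_ONLY, 2, 1)); // old partition, stale producer
        System.out.println(handleProduce(PartitionMode.WRITABLE, 2, 2));  // new partition, refreshed producer
    }
}
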
> >>>>> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <li...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> Hey Jun,
> >>>>>>
> >>>>>> Interestingly I am also planning to sketch a KIP to allow partition
> >>>>>> expansion for keyed topics after this KIP. Since you are already
> >> doing
> >>>>>> that, I guess I will just share my high level idea here in case it
> >> is
> >>>>>> helpful.
> >>>>>>
> >>>>>> The motivation for the KIP is that we currently lose order guarantee
> >>> for
> >>>>>> messages with the same key if we expand partitions of keyed topic.
> >>>>>>
> >>>>>> The solution can probably be built upon the following ideas:
> >>>>>>
> >>>>>> - Partition number of the keyed topic should always be doubled (or
> >>>>>> multiplied by a power of 2). Given that we select a partition based on
> >>>>>> hash(key) % partitionNum, this should help us ensure that a message
> >>>>>> assigned to an existing partition will not be mapped to another
> >>> existing
> >>>>>> partition after partition expansion.
> >>>>>>
> >>>>>> - Producer includes in the ProduceRequest some information that
> >> helps
> >>>>>> ensure that messages produced to a partition will monotonically
> >>>>> increase in
> >>>>>> the partitionNum of the topic. In other words, if broker receives a
> >>>>>> ProduceRequest and notices that the producer does not know the
> >>> partition
> >>>>>> number has increased, broker should reject this request. That
> >>>>> "information"
> >>>>>> maybe leaderEpoch, max partitionEpoch of the partitions of the
> >> topic,
> >>> or
> >>>>>> simply partitionNum of the topic. The benefit of this property is
> >> that
> >>>>> we
> >>>>>> can keep the new logic for in-order message consumption entirely in
> >>> how
> >>>>>> consumer leader determines the partition -> consumer mapping.
> >>>>>>
> >>>>>> - When consumer leader determines partition -> consumer mapping,
> >>> leader
> >>>>>> first reads the start position for each partition using
> >>>>> OffsetFetchRequest.
> >>>>>> If start positions are all non-zero, then assignment can be done in
> >> its
> >>>>>> current manner. The assumption is that, a message in the new
> >> partition
> >>>>>> should only be consumed after all messages with the same key
> >> produced
> >>>>>> before it has been consumed. Since some messages in the new
> >> partition
> >>>>> have
> >>>>>> been consumed, we should not worry about consuming messages
> >>>>> out-of-order.
> >>>>>> The benefit of this approach is that we can avoid unnecessary
> >>> overhead
> >>>>> in
> >>>>>> the common case.
> >>>>>>
> >>>>>> - If the consumer leader finds that the start position for some
> >>>>> partition
> >>>>>> is 0, say the current partition number is 18 and the partition index
> >>>>>> is 12, then the consumer leader should ensure that the messages
> >>>>>> produced to partition 12 - 18/2 = 3 before the first message of
> >>>>>> partition 12 was produced have been consumed, before it assigns
> >>>>>> partition 12 to any consumer in the consumer group. Since
> >> we
> >>>>> have
> >>>>>> an "information" that is monotonically increasing per partition,
> >>> consumer
> >>>>>> can read the value of this information from the first message in
> >>>>> partition
> >>>>>> 12, get the offset corresponding to this value in partition 3,
> >> assign
> >>>>>> partition except for partition 12 (and probably other new
> >> partitions)
> >>> to
> >>>>>> the existing consumers, waiting for the committed offset to go
> >> beyond
> >>>>> this
> >>>>>> offset for partition 3, and trigger rebalance again so that
> >> partition
> >>> 3
> >>>>> can
> >>>>>> be reassigned to some consumer.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Dong
> >>>>>>
> >>>>>>
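
The "12 - 18/2 = 3" rule above, and the way the consumer leader holds a new partition back, can be sketched as follows. This is only an illustration under the stated doubling assumption; the method names and the gating condition are hypothetical, not consumer-group code.

public class ParentPartitionGate {

    // After doubling from newCount/2 to newCount partitions, a new partition p
    // (p >= newCount/2) inherits keys from partition p - newCount/2.
    static int parentPartition(int partition, int newCount) {
        return partition - newCount / 2;
    }

    // The consumer leader may assign the new partition only once the committed
    // offset of its parent partition has passed the parent offset recorded when
    // the new partition received its first message.
    static boolean canAssignNewPartition(long parentCommittedOffset, long parentOffsetAtFirstNewMessage) {
        return parentCommittedOffset >= parentOffsetAtFirstNewMessage;
    }

    public static void main(String[] args) {
        System.out.println(parentPartition(12, 18));           // 3, as in the example above
        System.out.println(canAssignNewPartition(95, 120));    // false: keep partition 12 unassigned
        System.out.println(canAssignNewPartition(130, 120));   // true: partition 12 can be assigned now
    }
}
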
> >>>>>> On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <ju...@confluent.io> wrote:
> >>>>>>
> >>>>>>> Hi, Dong,
> >>>>>>>
> >>>>>>> Thanks for the KIP. It looks good overall. We are working on a
> >>>>> separate
> >>>>>> KIP
> >>>>>>> for adding partitions while preserving the ordering guarantees.
> >> That
> >>>>> may
> >>>>>>> require another flavor of partition epoch. It's not very clear
> >>> whether
> >>>>>> that
> >>>>>>> partition epoch can be merged with the partition epoch in this
> >> KIP.
> >>>>> So,
> >>>>>>> perhaps you can wait on this a bit until we post the other KIP in
> >>> the
> >>>>>> next
> >>>>>>> few days.
> >>>>>>>
> >>>>>>> Jun
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <be...@gmail.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> +1 on the KIP.
> >>>>>>>>
> >>>>>>>> I think the KIP is mainly about adding the capability of
> >> tracking
> >>>>> the
> >>>>>>>> system state change lineage. It does not seem necessary to
> >> bundle
> >>>>> this
> >>>>>>> KIP
> >>>>>>>> with replacing the topic partition with partition epoch in
> >>>>>> produce/fetch.
> >>>>>>>> Replacing topic-partition string with partition epoch is
> >>>>> essentially a
> >>>>>>>> performance improvement on top of this KIP. That can probably be
> >>>>> done
> >>>>>>>> separately.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>
> >>>>>>>> On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <lindong28@gmail.com
> >>>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Colin,
> >>>>>>>>>
> >>>>>>>>> On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe <
> >>>>> cmccabe@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>>> On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin <
> >>>>> lindong28@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hey Colin,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I understand that the KIP will add overhead by
> >>> introducing
> >>>>>>>>>> per-partition
> >>>>>>>>>>>> partitionEpoch. I am open to alternative solutions that
> >>> do
> >>>>>> not
> >>>>>>>>> incur
> >>>>>>>>>>>> additional overhead. But I don't see a better way now.
> >>>>>>>>>>>>
> >>>>>>>>>>>> IMO the overhead in the FetchResponse may not be that
> >>> much.
> >>>>> We
> >>>>>>>>> probably
> >>>>>>>>>>>> should discuss the percentage increase rather than the
> >>>>> absolute
> >>>>>>>>> number
> >>>>>>>>>>>> increase. Currently after KIP-227, per-partition header
> >>> has
> >>>>> 23
> >>>>>>>> bytes.
> >>>>>>>>>> This
> >>>>>>>>>>>> KIP adds another 4 bytes. Assume the records size is
> >> 10KB,
> >>>>> the
> >>>>>>>>>> percentage
> >>>>>>>>>>>> increase is 4 / (23 + 10000) ≈ 0.04%. It seems
> >> negligible,
> >>>>>> right?
> >>>>>>>>>>
> >>>>>>>>>> Hi Dong,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the response.  I agree that the FetchRequest /
> >>>>>>> FetchResponse
> >>>>>>>>>> overhead should be OK, now that we have incremental fetch
> >>>>> requests
> >>>>>>> and
> >>>>>>>>>> responses.  However, there are a lot of cases where the
> >>>>> percentage
> >>>>>>>>> increase
> >>>>>>>>>> is much greater.  For example, if a client is doing full
> >>>>>>>>> MetadataRequests /
> >>>>>>>>>> Responses, we have some math kind of like this per
> >> partition:
> >>>>>>>>>>
> >>>>>>>>>>> UpdateMetadataRequestPartitionState => topic partition
> >>>>>>>>> controller_epoch
> >>>>>>>>>> leader  leader_epoch partition_epoch isr zk_version replicas
> >>>>>>>>>> offline_replicas
> >>>>>>>>>>> 14 bytes:  topic => string (assuming about 10 byte topic
> >>>>> names)
> >>>>>>>>>>> 4 bytes:  partition => int32
> >>>>>>>>>>> 4  bytes: controller_epoch => int32
> >>>>>>>>>>> 4  bytes: leader => int32
> >>>>>>>>>>> 4  bytes: leader_epoch => int32
> >>>>>>>>>>> +4 EXTRA bytes: partition_epoch => int32        <-- NEW
> >>>>>>>>>>> 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR)
> >>>>>>>>>>> 4 bytes: zk_version => int32
> >>>>>>>>>>> 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas)
> >>>>>>>>>>> 2  offline_replicas => [int32] (assuming no offline
> >>> replicas)
> >>>>>>>>>>
> >>>>>>>>>> Assuming I added that up correctly, the per-partition
> >> overhead
> >>>>> goes
> >>>>>>>> from
> >>>>>>>>>> 64 bytes per partition to 68, a 6.2% increase.
> >>>>>>>>>>
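
For reference, the arithmetic above can be reproduced with a quick back-of-the-envelope check, using the field sizes exactly as listed (purely illustrative):

    public class MetadataOverheadEstimate {
        public static void main(String[] args) {
            int topic = 14;                // string with an ~10 byte topic name, as listed above
            int partition = 4, controllerEpoch = 4, leader = 4, leaderEpoch = 4;
            int isr = 2 + 3 * 4;           // array size + 3 replicas in the ISR
            int zkVersion = 4;
            int replicas = 2 + 3 * 4;      // array size + 3 replicas
            int offlineReplicas = 2;       // array size, no offline replicas
            int before = topic + partition + controllerEpoch + leader + leaderEpoch
                    + isr + zkVersion + replicas + offlineReplicas;              // 64 bytes
            int after = before + 4;                                              // + partition_epoch
            System.out.printf("%d -> %d bytes per partition (%.2f%% increase)%n",
                    before, after, 100.0 * (after - before) / before);           // 6.25%, i.e. the ~6.2% cited above
        }
    }
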
> >>>>>>>>>> We could do similar math for a lot of the other RPCs.  And
> >> you
> >>>>> will
> >>>>>>>> have
> >>>>>>>>> a
> >>>>>>>>>> similar memory and garbage collection impact on the brokers
> >>>>> since
> >>>>>> you
> >>>>>>>>> have
> >>>>>>>>>> to store all this extra state as well.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> That is correct. IMO the Metadata is only updated periodically
> >>>>> and it is
> >>>>>>>>> probably not a big deal if we increase it by 6%. The
> >>> FetchResponse
> >>>>>> and
> >>>>>>>>> ProduceRequest are probably the only requests that are bounded
> >>> by
> >>>>> the
> >>>>>>>>> bandwidth throughput.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree that we can probably save more space by using
> >>>>> partition
> >>>>>>> ID
> >>>>>>>> so
> >>>>>>>>>> that
> >>>>>>>>>>>> we no longer need the string topic name. A similar
> >> idea
> >>>>> has
> >>>>>>> also
> >>>>>>>>>> been
> >>>>>>>>>>>> put in the Rejected Alternative section in KIP-227.
> >> While
> >>>>> this
> >>>>>>> idea
> >>>>>>>>> is
> >>>>>>>>>>>> promising, it seems orthogonal to the goal of this KIP.
> >>>>> Given
> >>>>>>> that
> >>>>>>>>>> there is
> >>>>>>>>>>>> already much work to do in this KIP, maybe we can do the
> >>>>>>> partition
> >>>>>>>> ID
> >>>>>>>>>> in a
> >>>>>>>>>>>> separate KIP?
> >>>>>>>>>>
> >>>>>>>>>> I guess my thinking is that the goal here is to replace an
> >>>>>> identifier
> >>>>>>>>>> which can be re-used (the tuple of topic name, partition ID)
> >>>>> with
> >>>>>> an
> >>>>>>>>>> identifier that cannot be re-used (the tuple of topic name,
> >>>>>> partition
> >>>>>>>> ID,
> >>>>>>>>>> partition epoch) in order to gain better semantics.  As long
> >>> as
> >>>>> we
> >>>>>>> are
> >>>>>>>>>> replacing the identifier, why not replace it with an
> >>> identifier
> >>>>>> that
> >>>>>>>> has
> >>>>>>>>>> important performance advantages?  The KIP freeze for the
> >> next
> >>>>>>> release
> >>>>>>>>> has
> >>>>>>>>>> already passed, so there is time to do this.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> In general it can be easier for discussion and implementation
> >> if
> >>>>> we
> >>>>>> can
> >>>>>>>>> split a larger task into smaller and independent tasks. For
> >>>>> example,
> >>>>>>>>> KIP-112 and KIP-113 both deal with the JBOD support. KIP-31,
> >>>>> KIP-32
> >>>>>>> and
> >>>>>>>>> KIP-33 are about timestamp support. The opinion on this can be
> >>>>>>>>> subjective though.
> >>>>>>>>>
> >>>>>>>>> IMO the change to switch from (topic, partition ID) to
> >>>>> partitionEpoch
> >>>>>> in
> >>>>>>>> all
> >>>>>>>>> request/response requires us to go through all requests one
> >> by
> >>>>> one.
> >>>>>>> It
> >>>>>>>>> may not be hard but it can be time consuming and tedious. At
> >>> high
> >>>>>> level
> >>>>>>>> the
> >>>>>>>>> goal and the change for that will be orthogonal to the changes
> >>>>>> required
> >>>>>>>> in
> >>>>>>>>> this KIP. That is the main reason I think we can split them
> >> into
> >>>>> two
> >>>>>>>> KIPs.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote:
> >>>>>>>>>>> I think it is possible to move to entirely use
> >>> partitionEpoch
> >>>>>>> instead
> >>>>>>>>> of
> >>>>>>>>>>> (topic, partition) to identify a partition. Client can
> >>> obtain
> >>>>> the
> >>>>>>>>>>> partitionEpoch -> (topic, partition) mapping from
> >>>>>> MetadataResponse.
> >>>>>>>> We
> >>>>>>>>>>> probably need to figure out a way to assign partitionEpoch
> >>> to
> >>>>>>>> existing
> >>>>>>>>>>> partitions in the cluster. But this should be doable.
> >>>>>>>>>>>
> >>>>>>>>>>> This is a good idea. I think it will save us some space in
> >>> the
> >>>>>>>>>>> request/response. The actual space saving in percentage
> >>>>> probably
> >>>>>>>>> depends
> >>>>>>>>>> on
> >>>>>>>>>>> the amount of data and the number of partitions of the
> >> same
> >>>>>> topic.
> >>>>>>> I
> >>>>>>>>> just
> >>>>>>>>>>> think we can do it in a separate KIP.
> >>>>>>>>>>
> >>>>>>>>>> Hmm.  How much extra work would be required?  It seems like
> >> we
> >>>>> are
> >>>>>>>>> already
> >>>>>>>>>> changing almost every RPC that involves topics and
> >> partitions,
> >>>>>>> already
> >>>>>>>>>> adding new per-partition state to ZooKeeper, already
> >> changing
> >>>>> how
> >>>>>>>> clients
> >>>>>>>>>> interact with partitions.  Is there some other big piece of
> >>> work
> >>>>>> we'd
> >>>>>>>>> have
> >>>>>>>>>> to do to move to partition IDs that we wouldn't need for
> >>>>> partition
> >>>>>>>>> epochs?
> >>>>>>>>>> I guess we'd have to find a way to support regular
> >>>>> expression-based
> >>>>>>>> topic
> >>>>>>>>>> subscriptions.  If we split this into multiple KIPs,
> >> wouldn't
> >>> we
> >>>>>> end
> >>>>>>> up
> >>>>>>>>>> changing all those RPCs and ZK state a second time?  Also,
> >> I'm
> >>>>>> curious
> >>>>>>>> if
> >>>>>>>>>> anyone has done any proof of concept GC, memory, and network
> >>>>> usage
> >>>>>>>>>> measurements on switching topic names for topic IDs.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> We will need to go over all requests/responses to check how to
> >>>>>> replace
> >>>>>>>>> (topic, partition ID) with partition epoch. It requires
> >>>>> non-trivial
> >>>>>>> work
> >>>>>>>>> and could take time. As you mentioned, we may want to see how
> >>> much
> >>>>>>> saving
> >>>>>>>>> we can get by switching from topic names to partition epoch.
> >>> That
> >>>>>>> itself
> >>>>>>>>> requires time and experiment. It seems that the new idea does
> >>> not
> >>>>>>>> roll back
> >>>>>>>>> any change proposed in this KIP. So I am not sure we can get
> >>> much
> >>>>> by
> >>>>>>>>> putting them into the same KIP.
> >>>>>>>>>
> >>>>>>>>> Anyway, if more people are interested in seeing the new idea
> >> in
> >>>>> the
> >>>>>>> same
> >>>>>>>>> KIP, I can try that.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> best,
> >>>>>>>>>> Colin
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe <
> >>>>>>> cmccabe@apache.org
> >>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
> >>>>>>>>>>>>>> Hey Colin,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <
> >>>>>>>>> cmccabe@apache.org>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
> >>>>>>>>>>>>>>>> Hey Colin,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the comment.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <
> >>>>>>>>>> cmccabe@apache.org>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
> >>>>>>>>>>>>>>>>>> Hey Colin,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for reviewing the KIP.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If I understand you right, you may be
> >> suggesting
> >>>>> that
> >>>>>>> we
> >>>>>>>>> can
> >>>>>>>>>> use
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>> global
> >>>>>>>>>>>>>>>>>> metadataEpoch that is incremented every time
> >>>>>>> controller
> >>>>>>>>>> updates
> >>>>>>>>>>>>>>> metadata.
> >>>>>>>>>>>>>>>>>> The problem with this solution is that, if a
> >>>>> topic
> >>>>>> is
> >>>>>>>>>> deleted
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> created
> >>>>>>>>>>>>>>>>>> again, user will not know that the
> >>> offset
> >>>>>>> which
> >>>>>>>> is
> >>>>>>>>>>>>> stored
> >>>>>>>>>>>>>>> before
> >>>>>>>>>>>>>>>>>> the topic deletion is no longer valid. This
> >>>>>> motivates
> >>>>>>>> the
> >>>>>>>>>> idea
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> include
> >>>>>>>>>>>>>>>>>> per-partition partitionEpoch. Does this sound
> >>>>>>>> reasonable?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Dong,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Perhaps we can store the last valid offset of
> >>> each
> >>>>>>> deleted
> >>>>>>>>>> topic
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> ZooKeeper.  Then, when a topic with one of
> >> those
> >>>>> names
> >>>>>>>> gets
> >>>>>>>>>>>>>>> re-created, we
> >>>>>>>>>>>>>>>>> can start the topic at the previous end offset
> >>>>> rather
> >>>>>>> than
> >>>>>>>>> at
> >>>>>>>>>> 0.
> >>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>>> preserves immutability.  It is no more
> >> burdensome
> >>>>> than
> >>>>>>>>> having
> >>>>>>>>>> to
> >>>>>>>>>>>>>>> preserve a
> >>>>>>>>>>>>>>>>> "last epoch" for the deleted partition
> >> somewhere,
> >>>>>> right?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My concern with this solution is that the number
> >> of
> >>>>>>>> zookeeper
> >>>>>>>>>> nodes
> >>>>>>>>>>>>> gets
> >>>>>>>>>>>>>>>> more and more over time if some users keep
> >> deleting
> >>>>> and
> >>>>>>>>> creating
> >>>>>>>>>>>>> topics.
> >>>>>>>>>>>>>>> Do
> >>>>>>>>>>>>>>>> you think this can be a problem?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Dong,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We could expire the "partition tombstones" after an
> >>>>> hour
> >>>>>> or
> >>>>>>>> so.
> >>>>>>>>>> In
> >>>>>>>>>>>>>>> practice this would solve the issue for clients
> >> that
> >>>>> like
> >>>>>> to
> >>>>>>>>>> destroy
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> re-create topics all the time.  In any case,
> >> doesn't
> >>>>> the
> >>>>>>>> current
> >>>>>>>>>>>>> proposal
> >>>>>>>>>>>>>>> add per-partition znodes as well that we have to
> >>> track
> >>>>>> even
> >>>>>>>>> after
> >>>>>>>>>> the
> >>>>>>>>>>>>>>> partition is deleted?  Or did I misunderstand that?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Actually the current KIP does not add per-partition
> >>>>> znodes.
> >>>>>>>> Could
> >>>>>>>>>> you
> >>>>>>>>>>>>>> double check? I can fix the KIP wiki if there is
> >>> anything
> >>>>>>>>>> misleading.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Dong,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I double-checked the KIP, and I can see that you are in
> >>>>> fact
> >>>>>>>> using a
> >>>>>>>>>>>>> global counter for initializing partition epochs.  So,
> >>> you
> >>>>> are
> >>>>>>>>>> correct, it
> >>>>>>>>>>>>> doesn't add per-partition znodes for partitions that no
> >>>>> longer
> >>>>>>>>> exist.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If we expire the "partition tombstones" after an hour,
> >>> and
> >>>>>> the
> >>>>>>>>> topic
> >>>>>>>>>> is
> >>>>>>>>>>>>>> re-created after more than an hour since the topic
> >>>>> deletion,
> >>>>>>>> then
> >>>>>>>>>> we are
> >>>>>>>>>>>>>> back to the situation where user can not tell whether
> >>> the
> >>>>>>> topic
> >>>>>>>>> has
> >>>>>>>>>> been
> >>>>>>>>>>>>>> re-created or not, right?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes, with an expiration period, it would not ensure
> >>>>>>> immutability--
> >>>>>>>>> you
> >>>>>>>>>>>>> could effectively reuse partition names and they would
> >>> look
> >>>>>> the
> >>>>>>>>> same.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It's not really clear to me what should happen
> >> when a
> >>>>>> topic
> >>>>>>> is
> >>>>>>>>>>>>> destroyed
> >>>>>>>>>>>>>>> and re-created with new data.  Should consumers
> >>>>> continue
> >>>>>> to
> >>>>>>> be
> >>>>>>>>>> able to
> >>>>>>>>>>>>>>> consume?  We don't know where they stopped
> >> consuming
> >>>>> from
> >>>>>>> the
> >>>>>>>>>> previous
> >>>>>>>>>>>>>>> incarnation of the topic, so messages may have been
> >>>>> lost.
> >>>>>>>>>> Certainly
> >>>>>>>>>>>>>>> consuming data from offset X of the new incarnation
> >>> of
> >>>>> the
> >>>>>>>> topic
> >>>>>>>>>> may
> >>>>>>>>>>>>> give
> >>>>>>>>>>>>>>> something totally different from what you would
> >> have
> >>>>>> gotten
> >>>>>>>> from
> >>>>>>>>>>>>> offset X
> >>>>>>>>>>>>>>> of the previous incarnation of the topic.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With the current KIP, if a consumer consumes a topic
> >>>>> based
> >>>>>> on
> >>>>>>>> the
> >>>>>>>>>> last
> >>>>>>>>>>>>>> remembered (offset, partitionEpoch, leaderEpoch), and
> >>> if
> >>>>> the
> >>>>>>>> topic
> >>>>>>>>>> is
> >>>>>>>>>>>>>> re-created, the consumer will throw
> >>>>>> InvalidPartitionEpochException
> >>>>>>>>>> because
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> previous partitionEpoch will be different from the
> >>>>> current
> >>>>>>>>>>>>> partitionEpoch.
> >>>>>>>>>>>>>> This is described in the Proposed Changes ->
> >>> Consumption
> >>>>>> after
> >>>>>>>>> topic
> >>>>>>>>>>>>>> deletion in the KIP. I can improve the KIP if there
> >> is
> >>>>>>> anything
> >>>>>>>>> not
> >>>>>>>>>>>>> clear.
> >>>>>>>>>>>>>
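
A minimal, self-contained sketch of the epoch check being described here; the holder types are hypothetical, and a real client would raise the KIP's proposed InvalidPartitionEpochException rather than a generic exception:

    public class PartitionEpochCheck {

        record StoredPosition(long offset, int partitionEpoch) {}
        record CurrentPartitionInfo(int partitionEpoch) {}

        // Compares the epoch remembered with the stored offset against the epoch
        // reported by the current metadata for the same (topic, partition).
        static void validate(StoredPosition stored, CurrentPartitionInfo current) {
            if (stored.partitionEpoch() != current.partitionEpoch()) {
                // The partition was re-created (e.g. topic deleted and created again),
                // so the stored offset cannot be trusted; surface the problem instead
                // of silently resetting or reading the wrong data.
                throw new IllegalStateException("Invalid partition epoch: stored="
                        + stored.partitionEpoch() + ", current=" + current.partitionEpoch());
            }
        }

        public static void main(String[] args) {
            validate(new StoredPosition(25L, 7), new CurrentPartitionInfo(7));   // passes
            validate(new StoredPosition(25L, 7), new CurrentPartitionInfo(9));   // throws
        }
    }
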
> >>>>>>>>>>>>> Thanks for the clarification.  It sounds like what you
> >>>>> really
> >>>>>>> want
> >>>>>>>>> is
> >>>>>>>>>>>>> immutability-- i.e., to never "really" reuse partition
> >>>>>>>> identifiers.
> >>>>>>>>>> And
> >>>>>>>>>>>>> you do this by making the partition name no longer the
> >>>>> "real"
> >>>>>>>>>> identifier.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> My big concern about this KIP is that it seems like an
> >>>>>>>>>> anti-scalability
> >>>>>>>>>>>>> feature.  Now we are adding 4 extra bytes for every
> >>>>> partition
> >>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>>> FetchResponse and Request, for example.  That could be
> >> 40
> >>>>> kb
> >>>>>> per
> >>>>>>>>>> request,
> >>>>>>>>>>>>> if the user has 10,000 partitions.  And of course, the
> >>> KIP
> >>>>>> also
> >>>>>>>>> makes
> >>>>>>>>>>>>> massive changes to UpdateMetadataRequest,
> >>> MetadataResponse,
> >>>>>>>>>>>>> OffsetCommitRequest, OffsetFetchResponse,
> >>>>> LeaderAndIsrRequest,
> >>>>>>>>>>>>> ListOffsetResponse, etc. which will also increase their
> >>>>> size
> >>>>>> on
> >>>>>>>> the
> >>>>>>>>>> wire
> >>>>>>>>>>>>> and in memory.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing that we talked a lot about in the past is
> >>>>> replacing
> >>>>>>>>>> partition
> >>>>>>>>>>>>> names with IDs.  IDs have a lot of really nice
> >> features.
> >>>>> They
> >>>>>>>> take
> >>>>>>>>>> up much
> >>>>>>>>>>>>> less space in memory than strings (especially 2-byte
> >> Java
> >>>>>>>> strings).
> >>>>>>>>>> They
> >>>>>>>>>>>>> can often be allocated on the stack rather than the
> >> heap
> >>>>>>>> (important
> >>>>>>>>>> when
> >>>>>>>>>>>>> you are dealing with hundreds of thousands of them).
> >>> They
> >>>>> can
> >>>>>>> be
> >>>>>>>>>>>>> efficiently deserialized and serialized.  If we use
> >>> 64-bit
> >>>>>> ones,
> >>>>>>>> we
> >>>>>>>>>> will
> >>>>>>>>>>>>> never run out of IDs, which means that they can always
> >> be
> >>>>>> unique
> >>>>>>>> per
> >>>>>>>>>>>>> partition.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Given that the partition name is no longer the "real"
> >>>>>> identifier
> >>>>>>>> for
> >>>>>>>>>>>>> partitions in the current KIP-232 proposal, why not
> >> just
> >>>>> move
> >>>>>> to
> >>>>>>>>> using
> >>>>>>>>>>>>> partition IDs entirely instead of strings?  You have to
> >>>>> change
> >>>>>>> all
> >>>>>>>>> the
> >>>>>>>>>>>>> messages anyway.  There isn't much point any more to
> >>>>> carrying
> >>>>>>>> around
> >>>>>>>>>> the
> >>>>>>>>>>>>> partition name in every RPC, since you really need
> >> (name,
> >>>>>> epoch)
> >>>>>>>> to
> >>>>>>>>>>>>> identify the partition.
> >>>>>>>>>>>>> Probably the metadata response and a few other messages
> >>>>> would
> >>>>>>> have
> >>>>>>>>> to
> >>>>>>>>>>>>> still carry the partition name, to allow clients to go
> >>> from
> >>>>>> name
> >>>>>>>> to
> >>>>>>>>>> id.
> >>>>>>>>>>>>> But we could mostly forget about the strings.  And then
> >>>>> this
> >>>>>>> would
> >>>>>>>>> be
> >>>>>>>>>> a
> >>>>>>>>>>>>> scalability improvement rather than a scalability
> >>> problem.
> >>>>>>>>>>>>>
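
To make the size argument concrete, a small illustration of the two identifier shapes being compared; the record types and byte counts are hypothetical (they assume a 10-character topic name) and are not part of the Kafka protocol:

    public class IdentifierSizes {
        // The shape used today plus the epoch proposed by the KIP.
        record ByName(String topic, int partition, int partitionEpoch) {}
        // The shape sketched above: one 64-bit id that is never reused.
        record ById(long partitionId) {}

        public static void main(String[] args) {
            int byNameWire = 2 + 10 + 4 + 4;   // length prefix + name + partition + epoch = 20 bytes
            int byIdWire = 8;                  // a single int64
            System.out.println(byNameWire + " bytes vs " + byIdWire + " bytes per partition reference");
        }
    }
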
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> By choosing to reuse the same (topic, partition,
> >>>>> offset)
> >>>>>>>>> 3-tuple,
> >>>>>>>>>> we
> >>>>>>>>>>>>> have
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> chosen to give up immutability.  That was a really
> >> bad
> >>>>>>> decision.
> >>>>>>>>>> And
> >>>>>>>>>>>>> now
> >>>>>>>>>>>>>>> we have to worry about time dependencies, stale
> >>> cached
> >>>>>> data,
> >>>>>>>> and
> >>>>>>>>>> all
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> rest.  We can't completely fix this inside Kafka no
> >>>>> matter
> >>>>>>>> what
> >>>>>>>>>> we do,
> >>>>>>>>>>>>>>> because not all that cached data is inside Kafka
> >>>>> itself.
> >>>>>>> Some
> >>>>>>>>> of
> >>>>>>>>>> it
> >>>>>>>>>>>>> may be
> >>>>>>>>>>>>>>> in systems that Kafka has sent data to, such as
> >> other
> >>>>>>> daemons,
> >>>>>>>>> SQL
> >>>>>>>>>>>>>>> databases, streams, and so forth.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The current KIP will uniquely identify a message
> >> using
> >>>>>> (topic,
> >>>>>>>>>>>>> partition,
> >>>>>>>>>>>>>> offset, partitionEpoch) 4-tuple. This addresses the
> >>>>> message
> >>>>>>>>>> immutability
> >>>>>>>>>>>>>> issue that you mentioned. Is there any corner case
> >>> where
> >>>>> the
> >>>>>>>>> message
> >>>>>>>>>>>>>> immutability is still not preserved with the current
> >>> KIP?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I guess the idea here is that mirror maker should
> >>> work
> >>>>> as
> >>>>>>>>> expected
> >>>>>>>>>>>>> when
> >>>>>>>>>>>>>>> users destroy a topic and re-create it with the
> >> same
> >>>>> name.
> >>>>>>>>> That's
> >>>>>>>>>>>>> kind of
> >>>>>>>>>>>>>>> tough, though, since in that scenario, mirror maker
> >>>>>> probably
> >>>>>>>>>> should
> >>>>>>>>>>>>> destroy
> >>>>>>>>>>>>>>> and re-create the topic on the other end, too,
> >> right?
> >>>>>>>>> Otherwise,
> >>>>>>>>>>>>> what you
> >>>>>>>>>>>>>>> end up with on the other end could be half of one
> >>>>>>> incarnation
> >>>>>>>> of
> >>>>>>>>>> the
> >>>>>>>>>>>>> topic,
> >>>>>>>>>>>>>>> and half of another.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What mirror maker really needs is to be able to
> >>> follow
> >>>>> a
> >>>>>>>> stream
> >>>>>>>>> of
> >>>>>>>>>>>>> events
> >>>>>>>>>>>>>>> about the kafka cluster itself.  We could have some
> >>>>> master
> >>>>>>>> topic
> >>>>>>>>>>>>> which is
> >>>>>>>>>>>>>>> always present and which contains data about all
> >>> topic
> >>>>>>>>> deletions,
> >>>>>>>>>>>>>>> creations, etc.  Then MM can simply follow this
> >> topic
> >>>>> and
> >>>>>> do
> >>>>>>>>> what
> >>>>>>>>>> is
> >>>>>>>>>>>>> needed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Then the next question may be: should we use a
> >>>>> global
> >>>>>>>>>>>>> metadataEpoch +
> >>>>>>>>>>>>>>>>>> per-partition partitionEpoch, instead of
> >> using
> >>>>>>>>> per-partition
> >>>>>>>>>>>>>>> partitionEpoch
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> per-partition leaderEpoch. The former
> >> solution
> >>>>> using
> >>>>>>>>>>>>> metadataEpoch
> >>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>> not work due to the following scenario
> >>> (provided
> >>>>> by
> >>>>>>>> Jun):
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> "Consider the following scenario. In metadata
> >>> v1,
> >>>>>> the
> >>>>>>>>> leader
> >>>>>>>>>>>>> for a
> >>>>>>>>>>>>>>>>>> partition is at broker 1. In metadata v2,
> >>> leader
> >>>>> is
> >>>>>> at
> >>>>>>>>>> broker
> >>>>>>>>>>>>> 2. In
> >>>>>>>>>>>>>>>>>> metadata v3, leader is at broker 1 again. The
> >>>>> last
> >>>>>>>>> committed
> >>>>>>>>>>>>> offset
> >>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> v1,
> >>>>>>>>>>>>>>>>>> v2 and v3 are 10, 20 and 30, respectively. A
> >>>>>> consumer
> >>>>>>> is
> >>>>>>>>>>>>> started and
> >>>>>>>>>>>>>>>>> reads
> >>>>>>>>>>>>>>>>>> metadata v1 and reads messages from offset 0
> >> to
> >>>>> 25
> >>>>>>> from
> >>>>>>>>>> broker
> >>>>>>>>>>>>> 1. My
> >>>>>>>>>>>>>>>>>> understanding is that in the current
> >> proposal,
> >>>>> the
> >>>>>>>>> metadata
> >>>>>>>>>>>>> version
> >>>>>>>>>>>>>>>>>> associated with offset 25 is v1. The consumer
> >>> is
> >>>>>> then
> >>>>>>>>>> restarted
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> fetches
> >>>>>>>>>>>>>>>>>> metadata v2. The consumer tries to read from
> >>>>> broker
> >>>>>> 2,
> >>>>>>>>>> which is
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> old
> >>>>>>>>>>>>>>>>>> leader with the last offset at 20. In this
> >>> case,
> >>>>> the
> >>>>>>>>>> consumer
> >>>>>>>>>>>>> will
> >>>>>>>>>>>>>>> still
> >>>>>>>>>>>>>>>>>> get OffsetOutOfRangeException incorrectly."
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regarding your comment "For the second
> >> purpose,
> >>>>> this
> >>>>>>> is
> >>>>>>>>>> "soft
> >>>>>>>>>>>>> state"
> >>>>>>>>>>>>>>>>>> anyway.  If the client thinks X is the leader
> >>>>> but Y
> >>>>>> is
> >>>>>>>>>> really
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> leader,
> >>>>>>>>>>>>>>>>>> the client will talk to X, and X will point
> >> out
> >>>>> its
> >>>>>>>>> mistake
> >>>>>>>>>> by
> >>>>>>>>>>>>>>> sending
> >>>>>>>>>>>>>>>>> back
> >>>>>>>>>>>>>>>>>> a NOT_LEADER_FOR_PARTITION.", it is probably
> >> no
> >>>>>> true.
> >>>>>>>> The
> >>>>>>>>>>>>> problem
> >>>>>>>>>>>>>>> here is
> >>>>>>>>>>>>>>>>>> that the old leader X may still think it is
> >> the
> >>>>>> leader
> >>>>>>>> of
> >>>>>>>>>> the
> >>>>>>>>>>>>>>> partition
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> thus it will not send back
> >>>>> NOT_LEADER_FOR_PARTITION.
> >>>>>>> The
> >>>>>>>>>> reason
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> provided
> >>>>>>>>>>>>>>>>>> in KAFKA-6262. Can you check if that makes
> >>> sense?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This is solvable with a timeout, right?  If the
> >>>>> leader
> >>>>>>>> can't
> >>>>>>>>>>>>>>> communicate
> >>>>>>>>>>>>>>>>> with the controller for a certain period of
> >> time,
> >>>>> it
> >>>>>>>> should
> >>>>>>>>>> stop
> >>>>>>>>>>>>>>> acting as
> >>>>>>>>>>>>>>>>> the leader.  We have to solve this problem,
> >>>>> anyway, in
> >>>>>>>> order
> >>>>>>>>>> to
> >>>>>>>>>>>>> fix
> >>>>>>>>>>>>>>> all the
> >>>>>>>>>>>>>>>>> corner cases.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
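
A minimal sketch of the timeout-based self-fencing idea floated here; it is purely illustrative and not how Kafka brokers are actually implemented:

    public class LeaderFence {
        private final long maxSilenceMs;                 // how long a leader may go without hearing from the controller
        private volatile long lastControllerContactMs;

        public LeaderFence(long maxSilenceMs) {
            this.maxSilenceMs = maxSilenceMs;
            this.lastControllerContactMs = System.currentTimeMillis();
        }

        public void onControllerContact() {              // called whenever a controller request is processed
            lastControllerContactMs = System.currentTimeMillis();
        }

        public boolean mayActAsLeader() {                // checked before accepting produce requests
            return System.currentTimeMillis() - lastControllerContactMs < maxSilenceMs;
        }
    }
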
> >>>>>>>>>>>>>>>> Not sure if I fully understand your proposal. The
> >>>>>> proposal
> >>>>>>>>>> seems to
> >>>>>>>>>>>>>>> require
> >>>>>>>>>>>>>>>> non-trivial changes to our existing leadership
> >>>>> election
> >>>>>>>>>> mechanism.
> >>>>>>>>>>>>> Could
> >>>>>>>>>>>>>>>> you provide more detail regarding how it works?
> >> For
> >>>>>>> example,
> >>>>>>>>> how
> >>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>> user choose this timeout, how leader determines
> >>>>> whether
> >>>>>> it
> >>>>>>>> can
> >>>>>>>>>> still
> >>>>>>>>>>>>>>>> communicate with controller, and how this
> >> triggers
> >>>>>>>> controller
> >>>>>>>>> to
> >>>>>>>>>>>>> elect
> >>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>> leader?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Before I come up with any proposal, let me make
> >> sure
> >>> I
> >>>>>>>>> understand
> >>>>>>>>>> the
> >>>>>>>>>>>>>>> problem correctly.  My big question was, what
> >>> prevents
> >>>>>>>>> split-brain
> >>>>>>>>>>>>> here?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Let's say I have a partition which is on nodes A,
> >> B,
> >>>>> and
> >>>>>> C,
> >>>>>>>> with
> >>>>>>>>>>>>> min-ISR
> >>>>>>>>>>>>>>> 2.  The controller is D.  At some point, there is a
> >>>>>> network
> >>>>>>>>>> partition
> >>>>>>>>>>>>>>> between A and B and the rest of the cluster.  The
> >>>>>> Controller
> >>>>>>>>>>>>> re-assigns the
> >>>>>>>>>>>>>>> partition to nodes C, D, and E.  But A and B keep
> >>>>> chugging
> >>>>>>>> away,
> >>>>>>>>>> even
> >>>>>>>>>>>>>>> though they can no longer communicate with the
> >>>>> controller.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> At some point, a client with stale metadata writes
> >> to
> >>>>> the
> >>>>>>>>>> partition.
> >>>>>>>>>>>>> It
> >>>>>>>>>>>>>>> still thinks the partition is on node A, B, and C,
> >> so
> >>>>>> that's
> >>>>>>>>>> where it
> >>>>>>>>>>>>> sends
> >>>>>>>>>>>>>>> the data.  It's unable to talk to C, but A and B
> >>> reply
> >>>>>> back
> >>>>>>>> that
> >>>>>>>>>> all
> >>>>>>>>>>>>> is
> >>>>>>>>>>>>>>> well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Is this not a case where we could lose data due to
> >>>>> split
> >>>>>>>> brain?
> >>>>>>>>>> Or is
> >>>>>>>>>>>>>>> there a mechanism for preventing this that I
> >> missed?
> >>>>> If
> >>>>>> it
> >>>>>>>> is,
> >>>>>>>>> it
> >>>>>>>>>>>>> seems
> >>>>>>>>>>>>>>> like a pretty serious failure case that we should
> >> be
> >>>>>>> handling
> >>>>>>>>>> with our
> >>>>>>>>>>>>>>> metadata rework.  And I think epoch numbers and
> >>>>> timeouts
> >>>>>>> might
> >>>>>>>>> be
> >>>>>>>>>>>>> part of
> >>>>>>>>>>>>>>> the solution.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Right, split brain can happen if RF=4 and minIsr=2.
> >>>>>> However, I
> >>>>>>>> am
> >>>>>>>>>> not
> >>>>>>>>>>>>> sure
> >>>>>>>>>>>>>> it is a pretty serious issue which we need to address
> >>>>> today.
> >>>>>>>> This
> >>>>>>>>>> can be
> >>>>>>>>>>>>>> prevented by configuring the Kafka topic so that
> >>> minIsr >
> >>>>>>> RF/2.
> >>>>>>>>>>>>> Actually,
> >>>>>>>>>>>>>> if user sets minIsr=2, is there any reason that
> >>> user
> >>>>>>> wants
> >>>>>>>> to
> >>>>>>>>>> set
> >>>>>>>>>>>>> RF=4
> >>>>>>>>>>>>>> instead of 3?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Introducing a timeout in the leader election mechanism is
> >>>>>>>> non-trivial. I
> >>>>>>>>>>>>> think we
> >>>>>>>>>>>>>> probably want to do that only if there is good
> >> use-case
> >>>>> that
> >>>>>>> can
> >>>>>>>>> not
> >>>>>>>>>>>>>> otherwise be addressed with the current mechanism.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I still would like to think about these corner cases
> >>> more.
> >>>>>> But
> >>>>>>>>>> perhaps
> >>>>>>>>>>>>> it's not directly related to this KIP.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> regards,
> >>>>>>>>>>>>> Colin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> best,
> >>>>>>>>>>>>>>> Colin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> best,
> >>>>>>>>>>>>>>>>> Colin
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>> Dong
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 10:39 AM, Colin
> >> McCabe
> >>> <
> >>>>>>>>>>>>> cmccabe@apache.org>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Dong,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for proposing this KIP.  I think a
> >>>>> metadata
> >>>>>>>> epoch
> >>>>>>>>>> is a
> >>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>>>>>> idea.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I read through the DISCUSS thread, but I
> >>> still
> >>>>>> don't
> >>>>>>>>> have
> >>>>>>>>>> a
> >>>>>>>>>>>>> clear
> >>>>>>>>>>>>>>>>> picture
> >>>>>>>>>>>>>>>>>>> of why the proposal uses a metadata epoch
> >> per
> >>>>>>>> partition
> >>>>>>>>>> rather
> >>>>>>>>>>>>>>> than a
> >>>>>>>>>>>>>>>>>>> global metadata epoch.  A metadata epoch
> >> per
> >>>>>>> partition
> >>>>>>>>> is
> >>>>>>>>>>>>> kind of
> >>>>>>>>>>>>>>>>>>> unpleasant-- it's at least 4 extra bytes
> >> per
> >>>>>>> partition
> >>>>>>>>>> that we
> >>>>>>>>>>>>>>> have to
> >>>>>>>>>>>>>>>>> send
> >>>>>>>>>>>>>>>>>>> over the wire in every full metadata
> >> request,
> >>>>>> which
> >>>>>>>>> could
> >>>>>>>>>>>>> become
> >>>>>>>>>>>>>>> extra
> >>>>>>>>>>>>>>>>>>> kilobytes on the wire when the number of
> >>>>>> partitions
> >>>>>>>>>> becomes
> >>>>>>>>>>>>> large.
> >>>>>>>>>>>>>>>>> Plus,
> >>>>>>>>>>>>>>>>>>> we have to update all the auxiliary classes
> >>> to
> >>>>>>> include
> >>>>>>>>> an
> >>>>>>>>>>>>> epoch.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We need to have a global metadata epoch
> >>> anyway
> >>>>> to
> >>>>>>>> handle
> >>>>>>>>>>>>> partition
> >>>>>>>>>>>>>>>>>>> addition and deletion.  For example, if I
> >>> give
> >>>>> you
> >>>>>>>>>>>>>>>>>>> MetadataResponse{part1,epoch 1, part2,
> >> epoch
> >>> 1}
> >>>>>> and
> >>>>>>>>>> {part1,
> >>>>>>>>>>>>>>> epoch1},
> >>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>> MetadataResponse is newer?  You have no way
> >>> of
> >>>>>>>> knowing.
> >>>>>>>>>> It
> >>>>>>>>>>>>> could
> >>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> part2 has just been created, and the
> >> response
> >>>>>> with 2
> >>>>>>>>>>>>> partitions is
> >>>>>>>>>>>>>>>>> newer.
> >>>>>>>>>>>>>>>>>>> Or it could be that part2 has just been
> >>>>> deleted,
> >>>>>> and
> >>>>>>>>>>>>> therefore the
> >>>>>>>>>>>>>>>>> response
> >>>>>>>>>>>>>>>>>>> with 1 partition is newer.  You must have a
> >>>>> global
> >>>>>>>> epoch
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>> disambiguate
> >>>>>>>>>>>>>>>>>>> these two cases.
> >>>>>>>>>>>>>>>>>>>
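
The ambiguity can be stated in one line: with only per-partition epochs there is no total order between two metadata snapshots. A tiny illustration with made-up values:

    import java.util.Map;

    public class SnapshotOrdering {
        public static void main(String[] args) {
            Map<String, Integer> responseA = Map.of("part1", 1, "part2", 1);  // part2 present, epoch 1
            Map<String, Integer> responseB = Map.of("part1", 1);              // part2 absent
            // Nothing in the two maps says whether part2 was just created (A is newer)
            // or just deleted (B is newer); a single global epoch per response would.
            System.out.println(responseA + " vs " + responseB);
        }
    }
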
> >>>>>>>>>>>>>>>>>>> Previously, I worked on the Ceph
> >> distributed
> >>>>>>>> filesystem.
> >>>>>>>>>>>>> Ceph had
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> concept of a map of the whole cluster,
> >>>>> maintained
> >>>>>>> by a
> >>>>>>>>> few
> >>>>>>>>>>>>> servers
> >>>>>>>>>>>>>>>>> doing
> >>>>>>>>>>>>>>>>>>> paxos.  This map was versioned by a single
> >>>>> 64-bit
> >>>>>>>> epoch
> >>>>>>>>>> number
> >>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>> increased on every change.  It was
> >> propagated
> >>>>> to
> >>>>>>>> clients
> >>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>> gossip.  I
> >>>>>>>>>>>>>>>>>>> wonder if something similar could work
> >> here?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It seems like the Kafka
> >> MetadataResponse
> >>>>>> serves
> >>>>>>>> two
> >>>>>>>>>>>>> somewhat
> >>>>>>>>>>>>>>>>> unrelated
> >>>>>>>>>>>>>>>>>>> purposes.  Firstly, it lets clients know
> >> what
> >>>>>>>> partitions
> >>>>>>>>>>>>> exist in
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> system and where they live.  Secondly, it
> >>> lets
> >>>>>>> clients
> >>>>>>>>>> know
> >>>>>>>>>>>>> which
> >>>>>>>>>>>>>>> nodes
> >>>>>>>>>>>>>>>>>>> within the partition are in-sync (in the
> >> ISR)
> >>>>> and
> >>>>>>>> which
> >>>>>>>>>> node
> >>>>>>>>>>>>> is the
> >>>>>>>>>>>>>>>>> leader.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> The first purpose is what you really need a
> >>>>>> metadata
> >>>>>>>>> epoch
> >>>>>>>>>>>>> for, I
> >>>>>>>>>>>>>>>>> think.
> >>>>>>>>>>>>>>>>>>> You want to know whether a partition exists
> >>> or
> >>>>>> not,
> >>>>>>> or
> >>>>>>>>> you
> >>>>>>>>>>>>> want to
> >>>>>>>>>>>>>>> know
> >>>>>>>>>>>>>>>>>>> which nodes you should talk to in order to
> >>>>> write
> >>>>>> to
> >>>>>>> a
> >>>>>>>>>> given
> >>>>>>>>>>>>>>>>> partition.  A
> >>>>>>>>>>>>>>>>>>> single metadata epoch for the whole
> >> response
> >>>>>> should
> >>>>>>> be
> >>>>>>>>>>>>> adequate
> >>>>>>>>>>>>>>> here.
> >>>>>>>>>>>>>>>>> We
> >>>>>>>>>>>>>>>>>>> should not change the partition assignment
> >>>>> without
> >>>>>>>> going
> >>>>>>>>>>>>> through
> >>>>>>>>>>>>>>>>> zookeeper
> >>>>>>>>>>>>>>>>>>> (or a similar system), and this inherently
> >>>>>>> serializes
> >>>>>>>>>> updates
> >>>>>>>>>>>>> into
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> numbered stream.  Brokers should also stop
> >>>>>>> responding
> >>>>>>>> to
> >>>>>>>>>>>>> requests
> >>>>>>>>>>>>>>> when
> >>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>> are unable to contact ZK for a certain time
> >>>>>> period.
> >>>>>>>>> This
> >>>>>>>>>>>>> prevents
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>> where a given partition has been moved off
> >>> some
> >>>>>> set
> >>>>>>> of
> >>>>>>>>>> nodes,
> >>>>>>>>>>>>> but a
> >>>>>>>>>>>>>>>>> client
> >>>>>>>>>>>>>>>>>>> still ends up talking to those nodes and
> >>>>> writing
> >>>>>>> data
> >>>>>>>>>> there.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For the second purpose, this is "soft
> >> state"
> >>>>>> anyway.
> >>>>>>>> If
> >>>>>>>>>> the
> >>>>>>>>>>>>> client
> >>>>>>>>>>>>>>>>> thinks
> >>>>>>>>>>>>>>>>>>> X is the leader but Y is really the leader,
> >>> the
> >>>>>>> client
> >>>>>>>>>> will
> >>>>>>>>>>>>> talk
> >>>>>>>>>>>>>>> to X,
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> X will point out its mistake by sending
> >> back
> >>> a
> >>>>>>>>>>>>>>>>> NOT_LEADER_FOR_PARTITION.
> >>>>>>>>>>>>>>>>>>> Then the client can update its metadata
> >> again
> >>>>> and
> >>>>>>> find
> >>>>>>>>>> the new
> >>>>>>>>>>>>>>> leader,
> >>>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>>> there is one.  There is no need for an
> >> epoch
> >>> to
> >>>>>>> handle
> >>>>>>>>>> this.
> >>>>>>>>>>>>>>>>> Similarly, I
> >>>>>>>>>>>>>>>>>>> can't think of a reason why changing the
> >>>>> in-sync
> >>>>>>>> replica
> >>>>>>>>>> set
> >>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> bump
> >>>>>>>>>>>>>>>>>>> the epoch.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> best,
> >>>>>>>>>>>>>>>>>>> Colin
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018, at 09:45, Dong Lin
> >>> wrote:
> >>>>>>>>>>>>>>>>>>>> Thanks much for reviewing the KIP!
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Dong
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 7:10 AM, Guozhang
> >>>>> Wang <
> >>>>>>>>>>>>>>> wangguoz@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Yeah that makes sense, again I'm just
> >>>>> making
> >>>>>>> sure
> >>>>>>>> we
> >>>>>>>>>>>>> understand
> >>>>>>>>>>>>>>>>> all the
> >>>>>>>>>>>>>>>>>>>>> scenarios and what to expect.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I agree that if, more generally
> >> speaking,
> >>>>> say
> >>>>>>>> users
> >>>>>>>>>> have
> >>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>> consumed
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>>> offset 8, and then call seek(16) to
> >>> "jump"
> >>>>> to
> >>>>>> a
> >>>>>>>>>> further
> >>>>>>>>>>>>>>> position,
> >>>>>>>>>>>>>>>>> then
> >>>>>>>>>>>>>>>>>>> she
> >>>>>>>>>>>>>>>>>>>>> needs to be aware that OORE maybe
> >> thrown
> >>>>> and
> >>>>>> she
> >>>>>>>>>> needs to
> >>>>>>>>>>>>>>> handle
> >>>>>>>>>>>>>>>>> it or
> >>>>>>>>>>>>>>>>>>> rely
> >>>>>>>>>>>>>>>>>>>>> on reset policy which should not
> >> surprise
> >>>>> her.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I'm +1 on the KIP.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 12:31 AM, Dong
> >>> Lin
> >>>>> <
> >>>>>>>>>>>>>>> lindong28@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Yes, in general we can not prevent
> >>>>>>>>>>>>> OffsetOutOfRangeException
> >>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>> user
> >>>>>>>>>>>>>>>>>>>>> seeks
> >>>>>>>>>>>>>>>>>>>>>> to a wrong offset. The main goal is
> >> to
> >>>>>> prevent
> >>>>>>>>>>>>>>>>>>> OffsetOutOfRangeException
> >>>>>>>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>>>>>>> user has done things in the right
> >> way,
> >>>>> e.g.
> >>>>>>> user
> >>>>>>>>>> should
> >>>>>>>>>>>>> know
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>>>>> message with this offset.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> For example, if user calls seek(..)
> >>> right
> >>>>>>> after
> >>>>>>>>>>>>>>> construction, the
> >>>>>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>>>>> reason I can think of is that user
> >>> stores
> >>>>>>> offset
> >>>>>>>>>>>>> externally.
> >>>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>> case,
> >>>>>>>>>>>>>>>>>>>>>> user currently needs to use the
> >> offset
> >>>>> which
> >>>>>>> is
> >>>>>>>>>> obtained
> >>>>>>>>>>>>>>> using
> >>>>>>>>>>>>>>>>>>>>> position(..)
> >>>>>>>>>>>>>>>>>>>>>> from the last run. With this KIP,
> >> user
> >>>>> needs
> >>>>>>> to
> >>>>>>>>> get
> >>>>>>>>>> the
> >>>>>>>>>>>>>>> offset
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>> offsetEpoch using
> >>>>>> positionAndOffsetEpoch(...)
> >>>>>>>> and
> >>>>>>>>>> stores
> >>>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>>>>>> information
> >>>>>>>>>>>>>>>>>>>>>> externally. The next time user starts
> >>>>>>> consumer,
> >>>>>>>>>> he/she
> >>>>>>>>>>>>> needs
> >>>>>>>>>>>>>>> to
> >>>>>

Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Posted by "Matthias J. Sax" <ma...@confluent.io>.
It seems that KIP-320 was accepted. Thus, I am wondering what the status
of this KIP is?

-Matthias

On 7/11/18 10:59 AM, Dong Lin wrote:
> Hey Jun,
> 
> Certainly. We can discuss later after KIP-320 settles.
> 
> Thanks!
> Dong
> 
> 
> On Wed, Jul 11, 2018 at 8:54 AM, Jun Rao <ju...@confluent.io> wrote:
> 
>> Hi, Dong,
>>
>> Sorry for the late response. Since KIP-320 is covering some of the similar
>> problems described in this KIP, perhaps we can wait until KIP-320 settles
>> and see what's still left uncovered in this KIP.
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Jun 4, 2018 at 7:03 PM, Dong Lin <li...@gmail.com> wrote:
>>
>>> Hey Jun,
>>>
>>> It seems that we have made considerable progress on the discussion of
>>> KIP-253 since February. Do you think we should continue the discussion
>>> there, or can we continue the voting for this KIP? I am happy to submit
>> the
>>> PR and move forward the progress for this KIP.
>>>
>>> Thanks!
>>> Dong
>>>
>>>
>>> On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <li...@gmail.com> wrote:
>>>
>>>> Hey Jun,
>>>>
>>>> Sure, I will come up with a KIP this week. I think there is a way to
>>> allow
>>>> partition expansion to arbitrary number without introducing new
>> concepts
>>>> such as read-only partition or repartition epoch.
>>>>
>>>> Thanks,
>>>> Dong
>>>>
>>>> On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <ju...@confluent.io> wrote:
>>>>
>>>>> Hi, Dong,
>>>>>
>>>>> Thanks for the reply. The general idea that you had for adding
>>> partitions
>>>>> is similar to what we had in mind. It would be useful to make this
>> more
>>>>> general, allowing adding an arbitrary number of partitions (instead of
>>>>> just
>>>>> doubling) and potentially removing partitions as well. The following
>> is
>>>>> the
>>>>> high level idea from the discussion with Colin, Jason and Ismael.
>>>>>
>>>>> * To change the number of partitions from X to Y in a topic, the
>>>>> controller
>>>>> marks all existing X partitions as read-only and creates Y new
>>> partitions.
>>>>> The new partitions are writable and are tagged with a higher
>> repartition
>>>>> epoch (RE).
>>>>>
>>>>> * The controller propagates the new metadata to every broker. Once the
>>>>> leader of a partition is marked as read-only, it rejects the produce
>>>>> requests on this partition. The producer will then refresh the
>> metadata
>>>>> and
>>>>> start publishing to the new writable partitions.
>>>>>
>>>>> * The consumers will then be consuming messages in RE order. The
>>> consumer
>>>>> coordinator will only assign partitions in the same RE to consumers.
>>> Only
>>>>> after all messages in an RE are consumed, will partitions in a higher
>> RE
>>>>> be
>>>>> assigned to consumers.
>>>>>
>>>>> As Colin mentioned, if we do the above, we could potentially (1) use a
>>>>> globally unique partition id, or (2) use a globally unique topic id to
>>>>> distinguish recreated partitions due to topic deletion.
>>>>>
>>>>> So, perhaps we can sketch out the re-partitioning KIP a bit more and
>> see
>>>>> if
>>>>> there is any overlap with KIP-232. Would you be interested in doing
>>> that?
>>>>> If not, we can do that next week.
>>>>>
>>>>> Jun
>>>>>
>>>>>
>>>>> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <li...@gmail.com>
>> wrote:
>>>>>
>>>>>> Hey Jun,
>>>>>>
>>>>>> Interestingly I am also planning to sketch a KIP to allow partition
>>>>>> expansion for keyed topics after this KIP. Since you are already
>> doing
>>>>>> that, I guess I will just share my high level idea here in case it
>> is
>>>>>> helpful.
>>>>>>
>>>>>> The motivation for the KIP is that we currently lose order guarantee
>>> for
>>>>>> messages with the same key if we expand partitions of keyed topic.
>>>>>>
>>>>>> The solution can probably be built upon the following ideas:
>>>>>>
>>>>>> - Partition number of the keyed topic should always be doubled (or
>>>>>> multiplied by power of 2). Given that we select a partition based on
>>>>>> hash(key) % partitionNum, this should help us ensure that, a message
>>>>>> assigned to an existing partition will not be mapped to another
>>> existing
>>>>>> partition after partition expansion.
>>>>>>
>>>>>> - Producer includes in the ProduceRequest some information that
>> helps
>>>>>> ensure that messages produced to a partition will monotonically
>>>>> increase in
>>>>>> the partitionNum of the topic. In other words, if broker receives a
>>>>>> ProduceRequest and notices that the producer does not know the
>>> partition
>>>>>> number has increased, broker should reject this request. That
>>>>> "information"
>>>>>> may be leaderEpoch, max partitionEpoch of the partitions of the
>> topic,
>>> or
>>>>>> simply partitionNum of the topic. The benefit of this property is
>> that
>>>>> we
>>>>>> can keep the new logic for in-order message consumption entirely in
>>> how
>>>>>> consumer leader determines the partition -> consumer mapping.
>>>>>>
>>>>>> - When consumer leader determines partition -> consumer mapping,
>>> leader
>>>>>> first reads the start position for each partition using
>>>>> OffsetFetchRequest.
>>>>>> If start positions are all non-zero, then assignment can be done in
>> its
>>>>>> current manner. The assumption is that, a message in the new
>> partition
>>>>>> should only be consumed after all messages with the same key
>> produced
>>>>>> before it have been consumed. Since some messages in the new
>> partition
>>>>> have
>>>>>> been consumed, we should not worry about consuming messages
>>>>> out-of-order.
>>>>>> The benefit of this approach is that we can avoid unnecessary
>>> overhead
>>>>> in
>>>>>> the common case.
>>>>>>
>>>>>> - If the consumer leader finds that the start position for some
>>>>> partition
>>>>>> is 0. Say the current partition number is 18 and the partition index
>>> is
>>>>> 12,
>>>>>> then the consumer leader should ensure that the messages produced to
>>>>>> partition 12 - 18/2 = 3 before the first message of partition 12 have
>>>>>> been consumed, before it assigns partition 12 to any consumer in the
>>>>>> consumer group. Since
>> we
>>>>> have
>>>>>> a "information" that is monotonically increasing per partition,
>>> consumer
>>>>>> can read the value of this information from the first message in
>>>>> partition
>>>>>> 12, get the offset corresponding to this value in partition 3,
>> assign
>>>>>> partitions except for partition 12 (and probably other new
>> partitions)
>>> to
>>>>>> the existing consumers, wait for the committed offset to go
>> beyond
>>>>> this
>>>>>> offset for partition 3, and trigger rebalance again so that
>> partition
>>> 3
>>>>> can
>>>>>> be reassigned to some consumer.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Dong
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <ju...@confluent.io> wrote:
>>>>>>
>>>>>>> Hi, Dong,
>>>>>>>
>>>>>>> Thanks for the KIP. It looks good overall. We are working on a
>>>>> separate
>>>>>> KIP
>>>>>>> for adding partitions while preserving the ordering guarantees.
>> That
>>>>> may
>>>>>>> require another flavor of partition epoch. It's not very clear
>>> whether
>>>>>> that
>>>>>>> partition epoch can be merged with the partition epoch in this
>> KIP.
>>>>> So,
>>>>>>> perhaps you can wait on this a bit until we post the other KIP in
>>> the
>>>>>> next
>>>>>>> few days.
>>>>>>>
>>>>>>> Jun
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <be...@gmail.com>
>>>>> wrote:
>>>>>>>
>>>>>>>> +1 on the KIP.
>>>>>>>>
>>>>>>>> I think the KIP is mainly about adding the capability of
>> tracking
>>>>> the
>>>>>>>> system state change lineage. It does not seem necessary to
>> bundle
>>>>> this
>>>>>>> KIP
>>>>>>>> with replacing the topic partition with partition epoch in
>>>>>> produce/fetch.
>>>>>>>> Replacing topic-partition string with partition epoch is
>>>>> essentially a
>>>>>>>> performance improvement on top of this KIP. That can probably be
>>>>> done
>>>>>>>> separately.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>
>>>>>>>> On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <lindong28@gmail.com
>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Colin,
>>>>>>>>>
>>>>>>>>> On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe <
>>>>> cmccabe@apache.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin <
>>>>> lindong28@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>
>>>>>>>>>>>> I understand that the KIP will adds overhead by
>>> introducing
>>>>>>>>>> per-partition
>>>>>>>>>>>> partitionEpoch. I am open to alternative solutions that
>>> does
>>>>>> not
>>>>>>>>> incur
>>>>>>>>>>>> additional overhead. But I don't see a better way now.
>>>>>>>>>>>>
>>>>>>>>>>>> IMO the overhead in the FetchResponse may not be that
>>> much.
>>>>> We
>>>>>>>>> probably
>>>>>>>>>>>> should discuss the percentage increase rather than the
>>>>> absolute
>>>>>>>>> number
>>>>>>>>>>>> increase. Currently after KIP-227, per-partition header
>>> has
>>>>> 23
>>>>>>>> bytes.
>>>>>>>>>> This
>>>>>>>>>>>> KIP adds another 4 bytes. Assume the records size is
>> 10KB,
>>>>> the
>>>>>>>>>> percentage
>>>>>>>>>>>> increase is 4 / (23 + 10000) = 0.03%. It seems
>> negligible,
>>>>>> right?
>>>>>>>>>>
>>>>>>>>>> Hi Dong,
>>>>>>>>>>
>>>>>>>>>> Thanks for the response.  I agree that the FetchRequest /
>>>>>>> FetchResponse
>>>>>>>>>> overhead should be OK, now that we have incremental fetch
>>>>> requests
>>>>>>> and
>>>>>>>>>> responses.  However, there are a lot of cases where the
>>>>> percentage
>>>>>>>>> increase
>>>>>>>>>> is much greater.  For example, if a client is doing full
>>>>>>>>> MetadataRequests /
>>>>>>>>>> Responses, we have some math kind of like this per
>> partition:
>>>>>>>>>>
>>>>>>>>>>> UpdateMetadataRequestPartitionState => topic partition
>>>>>>>>> controller_epoch
>>>>>>>>>> leader  leader_epoch partition_epoch isr zk_version replicas
>>>>>>>>>> offline_replicas
>>>>>>>>>>> 14 bytes:  topic => string (assuming about 10 byte topic
>>>>> names)
>>>>>>>>>>> 4 bytes:  partition => int32
>>>>>>>>>>> 4  bytes: conroller_epoch => int32
>>>>>>>>>>> 4  bytes: leader => int32
>>>>>>>>>>> 4  bytes: leader_epoch => int32
>>>>>>>>>>> +4 EXTRA bytes: partition_epoch => int32        <-- NEW
>>>>>>>>>>> 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR)
>>>>>>>>>>> 4 bytes: zk_version => int32
>>>>>>>>>>> 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas)
>>>>>>>>>>> 2  offline_replicas => [int32] (assuming no offline
>>> replicas)
>>>>>>>>>>
>>>>>>>>>> Assuming I added that up correctly, the per-partition
>> overhead
>>>>> goes
>>>>>>>> from
>>>>>>>>>> 64 bytes per partition to 68, a 6.2% increase.
>>>>>>>>>>
>>>>>>>>>> We could do similar math for a lot of the other RPCs.  And
>> you
>>>>> will
>>>>>>>> have
>>>>>>>>> a
>>>>>>>>>> similar memory and garbage collection impact on the brokers
>>>>> since
>>>>>> you
>>>>>>>>> have
>>>>>>>>>> to store all this extra state as well.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That is correct. IMO the Metadata is only updated periodically
>>>>> and is
>>>>>>>>> probably not a big deal if we increase it by 6%. The
>>> FetchResponse
>>>>>> and
>>>>>>>>> ProduceRequest are probably the only requests that are bounded
>>> by
>>>>> the
>>>>>>>>> bandwidth throughput.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I agree that we can probably save more space by using
>>>>> partition
>>>>>>> ID
>>>>>>>> so
>>>>>>>>>> that
>>>>>>>>>>>> we no longer needs the string topic name. The similar
>> idea
>>>>> has
>>>>>>> also
>>>>>>>>>> been
>>>>>>>>>>>> put in the Rejected Alternative section in KIP-227.
>> While
>>>>> this
>>>>>>> idea
>>>>>>>>> is
>>>>>>>>>>>> promising, it seems orthogonal to the goal of this KIP.
>>>>> Given
>>>>>>> that
>>>>>>>>>> there is
>>>>>>>>>>>> already many work to do in this KIP, maybe we can do the
>>>>>>> partition
>>>>>>>> ID
>>>>>>>>>> in a
>>>>>>>>>>>> separate KIP?
>>>>>>>>>>
>>>>>>>>>> I guess my thinking is that the goal here is to replace an
>>>>>> identifier
>>>>>>>>>> which can be re-used (the tuple of topic name, partition ID)
>>>>> with
>>>>>> an
>>>>>>>>>> identifier that cannot be re-used (the tuple of topic name,
>>>>>> partition
>>>>>>>> ID,
>>>>>>>>>> partition epoch) in order to gain better semantics.  As long
>>> as
>>>>> we
>>>>>>> are
>>>>>>>>>> replacing the identifier, why not replace it with an
>>> identifier
>>>>>> that
>>>>>>>> has
>>>>>>>>>> important performance advantages?  The KIP freeze for the
>> next
>>>>>>> release
>>>>>>>>> has
>>>>>>>>>> already passed, so there is time to do this.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In general it can be easier for discussion and implementation
>> if
>>>>> we
>>>>>> can
>>>>>>>>> split a larger task into smaller and independent tasks. For
>>>>> example,
>>>>>>>>> KIP-112 and KIP-113 both deals with the JBOD support. KIP-31,
>>>>> KIP-32
>>>>>>> and
>>>>>>>>> KIP-33 are about timestamp support. The option on this can be
>>>>> subject
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>> IMO the change to switch from (topic, partition ID) to
>>>>> partitionEpch
>>>>>> in
>>>>>>>> all
>>>>>>>>> request/response requires us to going through all request one
>> by
>>>>> one.
>>>>>>> It
>>>>>>>>> may not be hard but it can be time consuming and tedious. At
>>> high
>>>>>> level
>>>>>>>> the
>>>>>>>>> goal and the change for that will be orthogonal to the changes
>>>>>> required
>>>>>>>> in
>>>>>>>>> this KIP. That is the main reason I think we can split them
>> into
>>>>> two
>>>>>>>> KIPs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote:
>>>>>>>>>>> I think it is possible to move to entirely use
>>> partitionEpoch
>>>>>>> instead
>>>>>>>>> of
>>>>>>>>>>> (topic, partition) to identify a partition. Client can
>>> obtain
>>>>> the
>>>>>>>>>>> partitionEpoch -> (topic, partition) mapping from
>>>>>> MetadataResponse.
>>>>>>>> We
>>>>>>>>>>> probably need to figure out a way to assign partitionEpoch
>>> to
>>>>>>>> existing
>>>>>>>>>>> partitions in the cluster. But this should be doable.
>>>>>>>>>>>
>>>>>>>>>>> This is a good idea. I think it will save us some space in
>>> the
>>>>>>>>>>> request/response. The actual space saving in percentage
>>>>> probably
>>>>>>>>> depends
>>>>>>>>>> on
>>>>>>>>>>> the amount of data and the number of partitions of the
>> same
>>>>>> topic.
>>>>>>> I
>>>>>>>>> just
>>>>>>>>>>> think we can do it in a separate KIP.
>>>>>>>>>>
>>>>>>>>>> Hmm.  How much extra work would be required?  It seems like
>> we
>>>>> are
>>>>>>>>> already
>>>>>>>>>> changing almost every RPC that involves topics and
>> partitions,
>>>>>>> already
>>>>>>>>>> adding new per-partition state to ZooKeeper, already
>> changing
>>>>> how
>>>>>>>> clients
>>>>>>>>>> interact with partitions.  Is there some other big piece of
>>> work
>>>>>> we'd
>>>>>>>>> have
>>>>>>>>>> to do to move to partition IDs that we wouldn't need for
>>>>> partition
>>>>>>>>> epochs?
>>>>>>>>>> I guess we'd have to find a way to support regular
>>>>> expression-based
>>>>>>>> topic
>>>>>>>>>> subscriptions.  If we split this into multiple KIPs,
>> wouldn't
>>> we
>>>>>> end
>>>>>>> up
>>>>>>>>>> changing all that RPCs and ZK state a second time?  Also,
>> I'm
>>>>>> curious
>>>>>>>> if
>>>>>>>>>> anyone has done any proof of concept GC, memory, and network
>>>>> usage
>>>>>>>>>> measurements on switching topic names for topic IDs.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We will need to go over all requests/responses to check how to
>>>>>> replace
>>>>>>>>> (topic, partition ID) with partition epoch. It requires
>>>>> non-trivial
>>>>>>> work
>>>>>>>>> and could take time. As you mentioned, we may want to see how
>>> much
>>>>>>> saving
>>>>>>>>> we can get by switching from topic names to partition epoch.
>>> That
>>>>>>> itself
>>>>>>>>> requires time and experiment. It seems that the new idea does
>>> not
>>>>>>>> rollback
>>>>>>>>> any change proposed in this KIP. So I am not sure we can get
>>> much
>>>>> by
>>>>>>>>> putting them into the same KIP.
>>>>>>>>>
>>>>>>>>> Anyway, if more people are interested in seeing the new idea
>> in
>>>>> the
>>>>>>> same
>>>>>>>>> KIP, I can try that.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> best,
>>>>>>>>>> Colin
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe <
>>>>>>> cmccabe@apache.org
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <
>>>>>>>>> cmccabe@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
>>>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the comment.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <
>>>>>>>>>> cmccabe@apache.org>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
>>>>>>>>>>>>>>>>>> Hey Colin,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for reviewing the KIP.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I understand you right, you maybe
>> suggesting
>>>>> that
>>>>>>> we
>>>>>>>>> can
>>>>>>>>>> use
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> global
>>>>>>>>>>>>>>>>>> metadataEpoch that is incremented every time
>>>>>>> controller
>>>>>>>>>> updates
>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>> The problem with this solution is that, if a
>>>>> topic
>>>>>> is
>>>>>>>>>> deleted
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> created
>>>>>>>>>>>>>>>>>> again, user will not know whether that the
>>> offset
>>>>>>> which
>>>>>>>> is
>>>>>>>>>>>>> stored
>>>>>>>>>>>>>>> before
>>>>>>>>>>>>>>>>>> the topic deletion is no longer valid. This
>>>>>> motivates
>>>>>>>> the
>>>>>>>>>> idea
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> include
>>>>>>>>>>>>>>>>>> per-partition partitionEpoch. Does this sound
>>>>>>>> reasonable?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Perhaps we can store the last valid offset of
>>> each
>>>>>>> deleted
>>>>>>>>>> topic
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> ZooKeeper.  Then, when a topic with one of
>> those
>>>>> names
>>>>>>>> gets
>>>>>>>>>>>>>>> re-created, we
>>>>>>>>>>>>>>>>> can start the topic at the previous end offset
>>>>> rather
>>>>>>> than
>>>>>>>>> at
>>>>>>>>>> 0.
>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>> preserves immutability.  It is no more
>> burdensome
>>>>> than
>>>>>>>>> having
>>>>>>>>>> to
>>>>>>>>>>>>>>> preserve a
>>>>>>>>>>>>>>>>> "last epoch" for the deleted partition
>> somewhere,
>>>>>> right?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My concern with this solution is that the number
>> of
>>>>>>>> zookeeper
>>>>>>>>>> nodes
>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>> more and more over time if some users keep
>> deleting
>>>>> and
>>>>>>>>> creating
>>>>>>>>>>>>> topics.
>>>>>>>>>>>>>>> Do
>>>>>>>>>>>>>>>> you think this can be a problem?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We could expire the "partition tombstones" after an
>>>>> hour
>>>>>> or
>>>>>>>> so.
>>>>>>>>>> In
>>>>>>>>>>>>>>> practice this would solve the issue for clients
>> that
>>>>> like
>>>>>> to
>>>>>>>>>> destroy
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> re-create topics all the time.  In any case,
>> doesn't
>>>>> the
>>>>>>>> current
>>>>>>>>>>>>> proposal
>>>>>>>>>>>>>>> add per-partition znodes as well that we have to
>>> track
>>>>>> even
>>>>>>>>> after
>>>>>>>>>> the
>>>>>>>>>>>>>>> partition is deleted?  Or did I misunderstand that?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Actually the current KIP does not add per-partition
>>>>> znodes.
>>>>>>>> Could
>>>>>>>>>> you
>>>>>>>>>>>>>> double check? I can fix the KIP wiki if there is
>>> anything
>>>>>>>>>> misleading.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I double-checked the KIP, and I can see that you are in
>>>>> fact
>>>>>>>> using a
>>>>>>>>>>>>> global counter for initializing partition epochs.  So,
>>> you
>>>>> are
>>>>>>>>>> correct, it
>>>>>>>>>>>>> doesn't add per-partition znodes for partitions that no
>>>>> longer
>>>>>>>>> exist.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we expire the "partition tombstones" after an hour,
>>> and
>>>>>> the
>>>>>>>>> topic
>>>>>>>>>> is
>>>>>>>>>>>>>> re-created after more than an hour since the topic
>>>>> deletion,
>>>>>>>> then
>>>>>>>>>> we are
>>>>>>>>>>>>>> back to the situation where user can not tell whether
>>> the
>>>>>>> topic
>>>>>>>>> has
>>>>>>>>>> been
>>>>>>>>>>>>>> re-created or not, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, with an expiration period, it would not ensure
>>>>>>> immutability--
>>>>>>>>> you
>>>>>>>>>>>>> could effectively reuse partition names and they would
>>> look
>>>>>> the
>>>>>>>>> same.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's not really clear to me what should happen
>> when a
>>>>>> topic
>>>>>>> is
>>>>>>>>>>>>> destroyed
>>>>>>>>>>>>>>> and re-created with new data.  Should consumers
>>>>> continue
>>>>>> to
>>>>>>> be
>>>>>>>>>> able to
>>>>>>>>>>>>>>> consume?  We don't know where they stopped
>> consuming
>>>>> from
>>>>>>> the
>>>>>>>>>> previous
>>>>>>>>>>>>>>> incarnation of the topic, so messages may have been
>>>>> lost.
>>>>>>>>>> Certainly
>>>>>>>>>>>>>>> consuming data from offset X of the new incarnation
>>> of
>>>>> the
>>>>>>>> topic
>>>>>>>>>> may
>>>>>>>>>>>>> give
>>>>>>>>>>>>>>> something totally different from what you would
>> have
>>>>>> gotten
>>>>>>>> from
>>>>>>>>>>>>> offset X
>>>>>>>>>>>>>>> of the previous incarnation of the topic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With the current KIP, if a consumer consumes a topic
>>>>> based
>>>>>> on
>>>>>>>> the
>>>>>>>>>> last
>>>>>>>>>>>>>> remembered (offset, partitionEpoch, leaderEpoch), and
>>> if
>>>>> the
>>>>>>>> topic
>>>>>>>>>> is
>>>>>>>>>>>>>> re-created, consumer will throw
>>>>>> InvalidPartitionEpochException
>>>>>>>>>> because
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> previous partitionEpoch will be different from the
>>>>> current
>>>>>>>>>>>>> partitionEpoch.
>>>>>>>>>>>>>> This is described in the Proposed Changes ->
>>> Consumption
>>>>>> after
>>>>>>>>> topic
>>>>>>>>>>>>>> deletion in the KIP. I can improve the KIP if there
>> is
>>>>>>> anything
>>>>>>>>> not
>>>>>>>>>>>>> clear.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the clarification.  It sounds like what you
>>>>> really
>>>>>>> want
>>>>>>>>> is
>>>>>>>>>>>>> immutability-- i.e., to never "really" reuse partition
>>>>>>>> identifiers.
>>>>>>>>>> And
>>>>>>>>>>>>> you do this by making the partition name no longer the
>>>>> "real"
>>>>>>>>>> identifier.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My big concern about this KIP is that it seems like an
>>>>>>>>>> anti-scalability
>>>>>>>>>>>>> feature.  Now we are adding 4 extra bytes for every
>>>>> partition
>>>>>> in
>>>>>>>> the
>>>>>>>>>>>>> FetchResponse and Request, for example.  That could be
>> 40
>>>>> kb
>>>>>> per
>>>>>>>>>> request,
>>>>>>>>>>>>> if the user has 10,000 partitions.  And of course, the
>>> KIP
>>>>>> also
>>>>>>>>> makes
>>>>>>>>>>>>> massive changes to UpdateMetadataRequest,
>>> MetadataResponse,
>>>>>>>>>>>>> OffsetCommitRequest, OffsetFetchResponse,
>>>>> LeaderAndIsrRequest,
>>>>>>>>>>>>> ListOffsetResponse, etc. which will also increase their
>>>>> size
>>>>>> on
>>>>>>>> the
>>>>>>>>>> wire
>>>>>>>>>>>>> and in memory.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing that we talked a lot about in the past is
>>>>> replacing
>>>>>>>>>> partition
>>>>>>>>>>>>> names with IDs.  IDs have a lot of really nice
>> features.
>>>>> They
>>>>>>>> take
>>>>>>>>>> up much
>>>>>>>>>>>>> less space in memory than strings (especially 2-byte
>> Java
>>>>>>>> strings).
>>>>>>>>>> They
>>>>>>>>>>>>> can often be allocated on the stack rather than the
>> heap
>>>>>>>> (important
>>>>>>>>>> when
>>>>>>>>>>>>> you are dealing with hundreds of thousands of them).
>>> They
>>>>> can
>>>>>>> be
>>>>>>>>>>>>> efficiently deserialized and serialized.  If we use
>>> 64-bit
>>>>>> ones,
>>>>>>>> we
>>>>>>>>>> will
>>>>>>>>>>>>> never run out of IDs, which means that they can always
>> be
>>>>>> unique
>>>>>>>> per
>>>>>>>>>>>>> partition.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given that the partition name is no longer the "real"
>>>>>> identifier
>>>>>>>> for
>>>>>>>>>>>>> partitions in the current KIP-232 proposal, why not
>> just
>>>>> move
>>>>>> to
>>>>>>>>> using
>>>>>>>>>>>>> partition IDs entirely instead of strings?  You have to
>>>>> change
>>>>>>> all
>>>>>>>>> the
>>>>>>>>>>>>> messages anyway.  There isn't much point any more to
>>>>> carrying
>>>>>>>> around
>>>>>>>>>> the
>>>>>>>>>>>>> partition name in every RPC, since you really need
>> (name,
>>>>>> epoch)
>>>>>>>> to
>>>>>>>>>>>>> identify the partition.
>>>>>>>>>>>>> Probably the metadata response and a few other messages
>>>>> would
>>>>>>> have
>>>>>>>>> to
>>>>>>>>>>>>> still carry the partition name, to allow clients to go
>>> from
>>>>>> name
>>>>>>>> to
>>>>>>>>>> id.
>>>>>>>>>>>>> But we could mostly forget about the strings.  And then
>>>>> this
>>>>>>> would
>>>>>>>>> be
>>>>>>>>>> a
>>>>>>>>>>>>> scalability improvement rather than a scalability
>>> problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> By choosing to reuse the same (topic, partition,
>>>>> offset)
>>>>>>>>> 3-tuple,
>>>>>>>>>> we
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> chosen to give up immutability.  That was a really
>> bad
>>>>>>> decision.
>>>>>>>>>> And
>>>>>>>>>>>>> now
>>>>>>>>>>>>>>> we have to worry about time dependencies, stale
>>> cached
>>>>>> data,
>>>>>>>> and
>>>>>>>>>> all
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> rest.  We can't completely fix this inside Kafka no
>>>>> matter
>>>>>>>> what
>>>>>>>>>> we do,
>>>>>>>>>>>>>>> because not all that cached data is inside Kafka
>>>>> itself.
>>>>>>> Some
>>>>>>>>> of
>>>>>>>>>> it
>>>>>>>>>>>>> may be
>>>>>>>>>>>>>>> in systems that Kafka has sent data to, such as
>> other
>>>>>>> daemons,
>>>>>>>>> SQL
>>>>>>>>>>>>>>> databases, streams, and so forth.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current KIP will uniquely identify a message
>> using
>>>>>> (topic,
>>>>>>>>>>>>> partition,
>>>>>>>>>>>>>> offset, partitionEpoch) 4-tuple. This addresses the
>>>>> message
>>>>>>>>>> immutability
>>>>>>>>>>>>>> issue that you mentioned. Is there any corner case
>>> where
>>>>> the
>>>>>>>>> message
>>>>>>>>>>>>>> immutability is still not preserved with the current
>>> KIP?
>>>>>>>>>>>>>>
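
To spell out the identity being claimed here, the 4-tuple can be written as a small value type (MessageId is a made-up name for illustration; the fields are exactly the ones listed above):

    // Illustrative sketch only: under this KIP a message is uniquely
    // identified by this 4-tuple, even if the topic is deleted and
    // re-created under the same name, because the partitionEpoch differs
    // across incarnations of the partition.
    public record MessageId(String topic, int partition, long offset, int partitionEpoch) {}

Two ids that agree on (topic, partition, offset) but not on partitionEpoch refer to different incarnations of the partition.
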
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I guess the idea here is that mirror maker should
>>> work
>>>>> as
>>>>>>>>> expected
>>>>>>>>>>>>> when
>>>>>>>>>>>>>>> users destroy a topic and re-create it with the
>> same
>>>>> name.
>>>>>>>>> That's
>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>> tough, though, since in that scenario, mirror maker
>>>>>> probably
>>>>>>>>>> should
>>>>>>>>>>>>> destroy
>>>>>>>>>>>>>>> and re-create the topic on the other end, too,
>> right?
>>>>>>>>> Otherwise,
>>>>>>>>>>>>> what you
>>>>>>>>>>>>>>> end up with on the other end could be half of one
>>>>>>> incarnation
>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>>>>> topic,
>>>>>>>>>>>>>>> and half of another.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What mirror maker really needs is to be able to
>>> follow
>>>>> a
>>>>>>>> stream
>>>>>>>>> of
>>>>>>>>>>>>> events
>>>>>>>>>>>>>>> about the kafka cluster itself.  We could have some
>>>>> master
>>>>>>>> topic
>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>> always present and which contains data about all
>>> topic
>>>>>>>>> deletions,
>>>>>>>>>>>>>>> creations, etc.  Then MM can simply follow this
>> topic
>>>>> and
>>>>>> do
>>>>>>>>> what
>>>>>>>>>> is
>>>>>>>>>>>>> needed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then the next question maybe, should we use a
>>>>> global
>>>>>>>>>>>>> metadataEpoch +
>>>>>>>>>>>>>>>>>> per-partition partitionEpoch, instead of
>> using
>>>>>>>>> per-partition
>>>>>>>>>>>>>>> leaderEpoch
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> per-partition leaderEpoch. The former
>> solution
>>>>> using
>>>>>>>>>>>>> metadataEpoch
>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>> not work due to the following scenario
>>> (provided
>>>>> by
>>>>>>>> Jun):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "Consider the following scenario. In metadata
>>> v1,
>>>>>> the
>>>>>>>>> leader
>>>>>>>>>>>>> for a
>>>>>>>>>>>>>>>>>> partition is at broker 1. In metadata v2,
>>> leader
>>>>> is
>>>>>> at
>>>>>>>>>> broker
>>>>>>>>>>>>> 2. In
>>>>>>>>>>>>>>>>>> metadata v3, leader is at broker 1 again. The
>>>>> last
>>>>>>>>> committed
>>>>>>>>>>>>> offset
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> v1,
>>>>>>>>>>>>>>>>>> v2 and v3 are 10, 20 and 30, respectively. A
>>>>>> consumer
>>>>>>> is
>>>>>>>>>>>>> started and
>>>>>>>>>>>>>>>>> reads
>>>>>>>>>>>>>>>>>> metadata v1 and reads messages from offset 0
>> to
>>>>> 25
>>>>>>> from
>>>>>>>>>> broker
>>>>>>>>>>>>> 1. My
>>>>>>>>>>>>>>>>>> understanding is that in the current
>> proposal,
>>>>> the
>>>>>>>>> metadata
>>>>>>>>>>>>> version
>>>>>>>>>>>>>>>>>> associated with offset 25 is v1. The consumer
>>> is
>>>>>> then
>>>>>>>>>> restarted
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> fetches
>>>>>>>>>>>>>>>>>> metadata v2. The consumer tries to read from
>>>>> broker
>>>>>> 2,
>>>>>>>>>> which is
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> old
>>>>>>>>>>>>>>>>>> leader with the last offset at 20. In this
>>> case,
>>>>> the
>>>>>>>>>> consumer
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>> get OffsetOutOfRangeException incorrectly."
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regarding your comment "For the second
>> purpose,
>>>>> this
>>>>>>> is
>>>>>>>>>> "soft
>>>>>>>>>>>>> state"
>>>>>>>>>>>>>>>>>> anyway.  If the client thinks X is the leader
>>>>> but Y
>>>>>> is
>>>>>>>>>> really
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> leader,
>>>>>>>>>>>>>>>>>> the client will talk to X, and X will point
>> out
>>>>> its
>>>>>>>>> mistake
>>>>>>>>>> by
>>>>>>>>>>>>>>> sending
>>>>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>>> a NOT_LEADER_FOR_PARTITION.", it is probably not
>> no
>>>>>> true.
>>>>>>>> The
>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>> here is
>>>>>>>>>>>>>>>>>> that the old leader X may still think it is
>> the
>>>>>> leader
>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> thus it will not send back
>>>>> NOT_LEADER_FOR_PARTITION.
>>>>>>> The
>>>>>>>>>> reason
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> provided
>>>>>>>>>>>>>>>>>> in KAFKA-6262. Can you check if that makes
>>> sense?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is solvable with a timeout, right?  If the
>>>>> leader
>>>>>>>> can't
>>>>>>>>>>>>>>> communicate
>>>>>>>>>>>>>>>>> with the controller for a certain period of
>> time,
>>>>> it
>>>>>>>> should
>>>>>>>>>> stop
>>>>>>>>>>>>>>> acting as
>>>>>>>>>>>>>>>>> the leader.  We have to solve this problem,
>>>>> anyway, in
>>>>>>>> order
>>>>>>>>>> to
>>>>>>>>>>>>> fix
>>>>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>> corner cases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Not sure if I fully understand your proposal. The
>>>>>> proposal
>>>>>>>>>> seems to
>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>> non-trivial changes to our existing leadership
>>>>> election
>>>>>>>>>> mechanism.
>>>>>>>>>>>>> Could
>>>>>>>>>>>>>>>> you provide more detail regarding how it works?
>> For
>>>>>>> example,
>>>>>>>>> how
>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> user choose this timeout, how leader determines
>>>>> whether
>>>>>> it
>>>>>>>> can
>>>>>>>>>> still
>>>>>>>>>>>>>>>> communicate with controller, and how this
>> triggers
>>>>>>>> controller
>>>>>>>>> to
>>>>>>>>>>>>> elect
>>>>>>>>>>>>>>> new
>>>>>>>>>>>>>>>> leader?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Before I come up with any proposal, let me make
>> sure
>>> I
>>>>>>>>> understand
>>>>>>>>>> the
>>>>>>>>>>>>>>> problem correctly.  My big question was, what
>>> prevents
>>>>>>>>> split-brain
>>>>>>>>>>>>> here?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let's say I have a partition which is on nodes A,
>> B,
>>>>> and
>>>>>> C,
>>>>>>>> with
>>>>>>>>>>>>> min-ISR
>>>>>>>>>>>>>>> 2.  The controller is D.  At some point, there is a
>>>>>> network
>>>>>>>>>> partition
>>>>>>>>>>>>>>> between A and B and the rest of the cluster.  The
>>>>>> Controller
>>>>>>>>>>>>> re-assigns the
>>>>>>>>>>>>>>> partition to nodes C, D, and E.  But A and B keep
>>>>> chugging
>>>>>>>> away,
>>>>>>>>>> even
>>>>>>>>>>>>>>> though they can no longer communicate with the
>>>>> controller.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At some point, a client with stale metadata writes
>> to
>>>>> the
>>>>>>>>>> partition.
>>>>>>>>>>>>> It
>>>>>>>>>>>>>>> still thinks the partition is on node A, B, and C,
>> so
>>>>>> that's
>>>>>>>>>> where it
>>>>>>>>>>>>> sends
>>>>>>>>>>>>>>> the data.  It's unable to talk to C, but A and B
>>> reply
>>>>>> back
>>>>>>>> that
>>>>>>>>>> all
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is this not a case where we could lose data due to
>>>>> split
>>>>>>>> brain?
>>>>>>>>>> Or is
>>>>>>>>>>>>>>> there a mechanism for preventing this that I
>> missed?
>>>>> If
>>>>>> it
>>>>>>>> is,
>>>>>>>>> it
>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>> like a pretty serious failure case that we should
>> be
>>>>>>> handling
>>>>>>>>>> with our
>>>>>>>>>>>>>>> metadata rework.  And I think epoch numbers and
>>>>> timeouts
>>>>>>> might
>>>>>>>>> be
>>>>>>>>>>>>> part of
>>>>>>>>>>>>>>> the solution.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, split brain can happen if RF=4 and minIsr=2.
>>>>>> However, I
>>>>>>>> am
>>>>>>>>>> not
>>>>>>>>>>>>> sure
>>>>>>>>>>>>>> it is a pretty serious issue which we need to address
>>>>> today.
>>>>>>>> This
>>>>>>>>>> can be
>>>>>>>>>>>>>> prevented by configuring the Kafka topic so that
>>> minIsr >
>>>>>>> RF/2.
>>>>>>>>>>>>> Actually,
>>>>>>>>>>>>>> if user sets minIsr=2, is there any reason that
>>> user
>>>>>>> wants
>>>>>>>> to
>>>>>>>>>> set
>>>>>>>>>>>>> RF=4
>>>>>>>>>>>>>> instead of 3?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Introducing timeout in leader election mechanism is
>>>>>>>> non-trivial. I
>>>>>>>>>>>>> think we
>>>>>>>>>>>>>> probably want to do that only if there is good
>> use-case
>>>>> that
>>>>>>> can
>>>>>>>>> not
>>>>>>>>>>>>>> otherwise be addressed with the current mechanism.
>>>>>>>>>>>>>
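
As a quick sanity check of the minIsr > RF/2 point above: two disjoint replica sets can each reach minIsr only when 2 * minIsr <= RF, so the suggested configuration rule does rule the scenario out. A throwaway helper (hypothetical, just to spell out the inequality):

    // True when two disjoint ISRs of size >= minIsr could coexist,
    // e.g. RF=4, minIsr=2 -> true; RF=3, minIsr=2 -> false.
    static boolean splitBrainPossible(int replicationFactor, int minIsr) {
        return 2 * minIsr <= replicationFactor;
    }
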
>>>>>>>>>>>>> I still would like to think about these corner cases
>>> more.
>>>>>> But
>>>>>>>>>> perhaps
>>>>>>>>>>>>> it's not directly related to this KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards,
>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 10:39 AM, Colin
>> McCabe
>>> <
>>>>>>>>>>>>> cmccabe@apache.org>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Dong,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for proposing this KIP.  I think a
>>>>> metadata
>>>>>>>> epoch
>>>>>>>>>> is a
>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>> idea.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I read through the DISCUSS thread, but I
>>> still
>>>>>> don't
>>>>>>>>> have
>>>>>>>>>> a
>>>>>>>>>>>>> clear
>>>>>>>>>>>>>>>>> picture
>>>>>>>>>>>>>>>>>>> of why the proposal uses a metadata epoch
>> per
>>>>>>>> partition
>>>>>>>>>> rather
>>>>>>>>>>>>>>> than a
>>>>>>>>>>>>>>>>>>> global metadata epoch.  A metadata epoch
>> per
>>>>>>> partition
>>>>>>>>> is
>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>>> unpleasant-- it's at least 4 extra bytes
>> per
>>>>>>> partition
>>>>>>>>>> that we
>>>>>>>>>>>>>>> have to
>>>>>>>>>>>>>>>>> send
>>>>>>>>>>>>>>>>>>> over the wire in every full metadata
>> request,
>>>>>> which
>>>>>>>>> could
>>>>>>>>>>>>> become
>>>>>>>>>>>>>>> extra
>>>>>>>>>>>>>>>>>>> kilobytes on the wire when the number of
>>>>>> partitions
>>>>>>>>>> becomes
>>>>>>>>>>>>> large.
>>>>>>>>>>>>>>>>> Plus,
>>>>>>>>>>>>>>>>>>> we have to update all the auxiliary classes
>>> to
>>>>>>> include
>>>>>>>>> an
>>>>>>>>>>>>> epoch.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We need to have a global metadata epoch
>>> anyway
>>>>> to
>>>>>>>> handle
>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>>>> addition and deletion.  For example, if I
>>> give
>>>>> you
>>>>>>>>>>>>>>>>>>> MetadataResponse{part1,epoch 1, part2,
>> epoch
>>> 1}
>>>>>> and
>>>>>>>>>> {part1,
>>>>>>>>>>>>>>> epoch1},
>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>> MetadataResponse is newer?  You have no way
>>> of
>>>>>>>> knowing.
>>>>>>>>>> It
>>>>>>>>>>>>> could
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> part2 has just been created, and the
>> response
>>>>>> with 2
>>>>>>>>>>>>> partitions is
>>>>>>>>>>>>>>>>> newer.
>>>>>>>>>>>>>>>>>>> Or it could be that part2 has just been
>>>>> deleted,
>>>>>> and
>>>>>>>>>>>>> therefore the
>>>>>>>>>>>>>>>>> response
>>>>>>>>>>>>>>>>>>> with 1 partition is newer.  You must have a
>>>>> global
>>>>>>>> epoch
>>>>>>>>>> to
>>>>>>>>>>>>>>>>> disambiguate
>>>>>>>>>>>>>>>>>>> these two cases.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Previously, I worked on the Ceph
>> distributed
>>>>>>>> filesystem.
>>>>>>>>>>>>> Ceph had
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> concept of a map of the whole cluster,
>>>>> maintained
>>>>>>> by a
>>>>>>>>> few
>>>>>>>>>>>>> servers
>>>>>>>>>>>>>>>>> doing
>>>>>>>>>>>>>>>>>>> paxos.  This map was versioned by a single
>>>>> 64-bit
>>>>>>>> epoch
>>>>>>>>>> number
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>> increased on every change.  It was
>> propagated
>>>>> to
>>>>>>>> clients
>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>> gossip.  I
>>>>>>>>>>>>>>>>>>> wonder if something similar could work
>> here?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It seems like the the Kafka
>> MetadataResponse
>>>>>> serves
>>>>>>>> two
>>>>>>>>>>>>> somewhat
>>>>>>>>>>>>>>>>> unrelated
>>>>>>>>>>>>>>>>>>> purposes.  Firstly, it lets clients know
>> what
>>>>>>>> partitions
>>>>>>>>>>>>> exist in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> system and where they live.  Secondly, it
>>> lets
>>>>>>> clients
>>>>>>>>>> know
>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>> within the partition are in-sync (in the
>> ISR)
>>>>> and
>>>>>>>> which
>>>>>>>>>> node
>>>>>>>>>>>>> is the
>>>>>>>>>>>>>>>>> leader.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The first purpose is what you really need a
>>>>>> metadata
>>>>>>>>> epoch
>>>>>>>>>>>>> for, I
>>>>>>>>>>>>>>>>> think.
>>>>>>>>>>>>>>>>>>> You want to know whether a partition exists
>>> or
>>>>>> not,
>>>>>>> or
>>>>>>>>> you
>>>>>>>>>>>>> want to
>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>>>>>> which nodes you should talk to in order to
>>>>> write
>>>>>> to
>>>>>>> a
>>>>>>>>>> given
>>>>>>>>>>>>>>>>> partition.  A
>>>>>>>>>>>>>>>>>>> single metadata epoch for the whole
>> response
>>>>>> should
>>>>>>> be
>>>>>>>>>>>>> adequate
>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>> We
>>>>>>>>>>>>>>>>>>> should not change the partition assignment
>>>>> without
>>>>>>>> going
>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>> zookeeper
>>>>>>>>>>>>>>>>>>> (or a similar system), and this inherently
>>>>>>> serializes
>>>>>>>>>> updates
>>>>>>>>>>>>> into
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> numbered stream.  Brokers should also stop
>>>>>>> responding
>>>>>>>> to
>>>>>>>>>>>>> requests
>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>> are unable to contact ZK for a certain time
>>>>>> period.
>>>>>>>>> This
>>>>>>>>>>>>> prevents
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>>>> where a given partition has been moved off
>>> some
>>>>>> set
>>>>>>> of
>>>>>>>>>> nodes,
>>>>>>>>>>>>> but a
>>>>>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>>>> still ends up talking to those nodes and
>>>>> writing
>>>>>>> data
>>>>>>>>>> there.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the second purpose, this is "soft
>> state"
>>>>>> anyway.
>>>>>>>> If
>>>>>>>>>> the
>>>>>>>>>>>>> client
>>>>>>>>>>>>>>>>> thinks
>>>>>>>>>>>>>>>>>>> X is the leader but Y is really the leader,
>>> the
>>>>>>> client
>>>>>>>>>> will
>>>>>>>>>>>>> talk
>>>>>>>>>>>>>>> to X,
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> X will point out its mistake by sending
>> back
>>> a
>>>>>>>>>>>>>>>>> NOT_LEADER_FOR_PARTITION.
>>>>>>>>>>>>>>>>>>> Then the client can update its metadata
>> again
>>>>> and
>>>>>>> find
>>>>>>>>>> the new
>>>>>>>>>>>>>>> leader,
>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>> there is one.  There is no need for an
>> epoch
>>> to
>>>>>>> handle
>>>>>>>>>> this.
>>>>>>>>>>>>>>>>> Similarly, I
>>>>>>>>>>>>>>>>>>> can't think of a reason why changing the
>>>>> in-sync
>>>>>>>> replica
>>>>>>>>>> set
>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> bump
>>>>>>>>>>>>>>>>>>> the epoch.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>>>>>> Colin
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018, at 09:45, Dong Lin
>>> wrote:
>>>>>>>>>>>>>>>>>>>> Thanks much for reviewing the KIP!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 7:10 AM, Guozhang
>>>>> Wang <
>>>>>>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yeah that makes sense, again I'm just
>>>>> making
>>>>>>> sure
>>>>>>>> we
>>>>>>>>>>>>> understand
>>>>>>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>>>>>> scenarios and what to expect.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I agree that if, more generally
>> speaking,
>>>>> say
>>>>>>>> users
>>>>>>>>>> have
>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>> consumed
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> offset 8, and then call seek(16) to
>>> "jump"
>>>>> to
>>>>>> a
>>>>>>>>>> further
>>>>>>>>>>>>>>> position,
>>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>> she
>>>>>>>>>>>>>>>>>>>>> needs to be aware that OORE maybe
>> thrown
>>>>> and
>>>>>> she
>>>>>>>>>> needs to
>>>>>>>>>>>>>>> handle
>>>>>>>>>>>>>>>>> it or
>>>>>>>>>>>>>>>>>>> rely
>>>>>>>>>>>>>>>>>>>>> on reset policy which should not
>> surprise
>>>>> her.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm +1 on the KIP.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 24, 2018 at 12:31 AM, Dong
>>> Lin
>>>>> <
>>>>>>>>>>>>>>> lindong28@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, in general we can not prevent
>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>>>>>>> seeks
>>>>>>>>>>>>>>>>>>>>>> to a wrong offset. The main goal is
>> to
>>>>>> prevent
>>>>>>>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>> user has done things in the right
>> way,
>>>>> e.g.
>>>>>>> user
>>>>>>>>>> should
>>>>>>>>>>>>> know
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> message with this offset.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For example, if user calls seek(..)
>>> right
>>>>>>> after
>>>>>>>>>>>>>>> construction, the
>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>> reason I can think of is that user
>>> stores
>>>>>>> offset
>>>>>>>>>>>>> externally.
>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>>>>>>>>>> user currently needs to use the
>> offset
>>>>> which
>>>>>>> is
>>>>>>>>>> obtained
>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>>>> position(..)
>>>>>>>>>>>>>>>>>>>>>> from the last run. With this KIP,
>> user
>>>>> needs
>>>>>>> to
>>>>>>>>> get
>>>>>>>>>> the
>>>>>>>>>>>>>>> offset
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> offsetEpoch using
>>>>>> positionAndOffsetEpoch(...)
>>>>>>>> and
>>>>>>>>>> stores
>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>>>>>>>> externally. The next time user starts
>>>>>>> consumer,
>>>>>>>>>> he/she
>>>>>>>>>>>>> needs
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> call
>>>>>>>>>>>>>>>>>>>>>> seek(..., offset, offsetEpoch) right
>>>>> after
>>>>>>>>>> construction.
>>>>>>>>>>>>>>> Then KIP
>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>>> able to ensure that we don't throw
>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>> there is
>>>>>>>>>>>>>>>>>>>>> no
>>>>>>>>>>>>>>>>>>>>>> unclean leader election.
>>>>>>>>>>>>>>>>>>>>>>
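
A minimal sketch of the flow just described, assuming the API shape proposed in this KIP: positionAndOffsetEpoch() and the three-argument seek() are the proposed additions (not existing consumer APIs), and OffsetAndEpoch plus the external offsetStore are placeholders for whatever storage the application uses.

    // Sketch only; the epoch-aware calls below are the KIP's proposal.
    TopicPartition tp = new TopicPartition("my-topic", 0);

    // First run: persist the position together with its offsetEpoch.
    OffsetAndEpoch pos = consumer.positionAndOffsetEpoch(tp);
    offsetStore.save(tp, pos.offset(), pos.offsetEpoch());

    // Next run: seek with both values right after construction, so a fetch
    // against a re-created partition is rejected (InvalidPartitionEpochException)
    // instead of silently resetting or returning data from the wrong incarnation.
    consumer.assign(Collections.singleton(tp));
    consumer.seek(tp, offsetStore.offset(tp), offsetStore.offsetEpoch(tp));
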
>>>>>>>>>>>>>>>>>>>>>> Does this sound OK?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Jan 23, 2018 at 11:44 PM,
>>>>> Guozhang
>>>>>>> Wang
>>>>>>>> <
>>>>>>>>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> "If consumer wants to consume
>> message
>>>>> with
>>>>>>>>> offset
>>>>>>>>>> 16,
>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>> consumer
>>>>>>>>>>>>>>>>>>>>> must
>>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>> already fetched message with offset
>>> 15"
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> --> this may not be always true
>>> right?
>>>>>> What
>>>>>>> if
>>>>>>>>>>>>> consumer
>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>> call
>>>>>>>>>>>>>>>>>>>>>> seek(16)
>>>>>>>>>>>>>>>>>>>>>>> after construction and then poll
>>>>> without
>>>>>>>>> committed
>>>>>>>>>>>>> offset
>>>>>>>>>>>>>>> ever
>>>>>>>>>>>>>>>>>>> stored
>>>>>>>>>>>>>>>>>>>>>>> before? Admittedly it is rare but
>> we
>>> do
>>>>>> not
>>>>>>>>>>>>> programmably
>>>>>>>>>>>>>>>>> disallow
>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jan 23, 2018 at 10:42 PM,
>>> Dong
>>>>>> Lin <
>>>>>>>>>>>>>>>>> lindong28@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks much for reviewing the
>> KIP!
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> In the scenario you described,
>>> let's
>>>>>>> assume
>>>>>>>>> that
>>>>>>>>>>>>> broker
>>>>>>>>>>>>>>> A has
>>>>>>>>>>>>>>>>>>>>> messages
>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>> offset up to 10, and broker B has
>>>>>> messages
>>>>>>>>> with
>>>>>>>>>>>>> offset
>>>>>>>>>>>>>>> up to
>>>>>>>>>>>>>>>>> 20.
>>>>>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>>>>>>>>>> consumer wants to consume message
>>>>> with
>>>>>>>> offset
>>>>>>>>>> 9, it
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>> receive
>>>>>>>>>>>>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>>>>>>>>>>> from broker A.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If consumer wants to consume
>>> message
>>>>>> with
>>>>>>>>> offset
>>>>>>>>>>>>> 16, then
>>>>>>>>>>>>>>>>>>> consumer
>>>>>>>>>>>>>>>>>>>>> must
>>>>>>>>>>>>>>>>>>>>>>>> have already fetched message with
>>>>> offset
>>>>>>> 15,
>>>>>>>>>> which
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>> come
>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>> broker B. Because consumer will
>>> fetch
>>>>>> from
>>>>>>>>>> broker B
>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>> leaderEpoch
>>>>>>>>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>>>>>> 2, then the current consumer
>>>>> leaderEpoch
>>>>>>> can
>>>>>>>>>> not be
>>>>>>>>>>>>> 1
>>>>>>>>>>>>>>> since
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>> KIP
>>>>>>>>>>>>>>>>>>>>>>>> prevents leaderEpoch rewind. Thus
>>> we
>>>>>> will
>>>>>>>> not
>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>> OffsetOutOfRangeException
>>>>>>>>>>>>>>>>>>>>>>>> in this case.
>>>>>>>>>>>>>>>>>>>>>>>>
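
The property doing the work in this example is that the leaderEpoch attached to consumed offsets never goes backwards, so the consumer can tell stale metadata (or a stale leader) apart from a genuinely out-of-range offset. A hedged sketch of that comparison, with assumed names:

    // Metadata (or a fetch target) is only trusted once its leaderEpoch has
    // caught up with the epoch recorded for the offset the consumer holds.
    static boolean epochIsCurrentEnough(int epochOfHeldOffset, int epochInMetadata) {
        return epochInMetadata >= epochOfHeldOffset;
    }
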
>>>>>>>>>>>>>>>>>>>>>>>> Does this address your question,
>> or
>>>>>> maybe
>>>>>>>>> there
>>>>>>>>>> is
>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> advanced
>>>>>>>>>>>>>>>>>>>>>> scenario
>>>>>>>>>>>>>>>>>>>>>>>> that the KIP does not handle?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jan 23, 2018 at 9:43 PM,
>>>>>> Guozhang
>>>>>>>>> Wang <
>>>>>>>>>>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Dong, I made a pass over
>>> the
>>>>>> wiki
>>>>>>>> and
>>>>>>>>>> it
>>>>>>>>>>>>> lgtm.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Just a quick question: can we
>>>>>> completely
>>>>>>>>>>>>> eliminate the
>>>>>>>>>>>>>>>>>>>>>>>>> OffsetOutOfRangeException with
>>> this
>>>>>>>>> approach?
>>>>>>>>>> Say
>>>>>>>>>>>>> if
>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> consecutive
>>>>>>>>>>>>>>>>>>>>>>>>> leader changes such that the
>>> cached
>>>>>>>>> metadata's
>>>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>> epoch
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> 1,
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>> the metadata fetch response
>>> returns
>>>>>>> with
>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>> epoch 2
>>>>>>>>>>>>>>>>>>>>> pointing
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>> leader broker A, while the
>> actual
>>>>>>>> up-to-date
>>>>>>>>>>>>> metadata
>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>> partition
>>>>>>>>>>>>>>>>>>>>>>>> epoch 3
>>>>>>>>>>>>>>>>>>>>>>>>> whose leader is now broker B,
>> the
>>>>>>> metadata
>>>>>>>>>>>>> refresh will
>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>>>> succeed
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>> the follow-up fetch request may
>>>>> still
>>>>>>> see
>>>>>>>>>> OORE?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Jan 23, 2018 at 3:47
>> PM,
>>>>> Dong
>>>>>>> Lin
>>>>>>>> <
>>>>>>>>>>>>>>>>> lindong28@gmail.com
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start the
>>> voting
>>>>>>> process
>>>>>>>>> for
>>>>>>>>>>>>> KIP-232:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/
>>>>>>>>>>>>>>> confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>>>>>>>>>>>
>> 232%3A+Detect+outdated+metadat
>>>>>>>>>>>>> a+using+leaderEpoch+
>>>>>>>>>>>>>>>>>>>>>> and+partitionEpoch
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The KIP will help fix a
>>>>> concurrency
>>>>>>>> issue
>>>>>>>>> in
>>>>>>>>>>>>> Kafka
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>>>>>> cause message loss or message
>>>>>>>> duplication
>>>>>>>>> in
>>>>>>>>>>>>>>> consumer.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
> 


Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Posted by Dong Lin <li...@gmail.com>.
For the record, this KIP is closed as its design has been merged into
KIP-320. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation
.

Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Posted by Dong Lin <li...@gmail.com>.
Hey Jun,

Certainly. We can discuss later after KIP-320 settles.

Thanks!
Dong


On Wed, Jul 11, 2018 at 8:54 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Dong,
>
> Sorry for the late response. Since KIP-320 is covering some of the similar
> problems described in this KIP, perhaps we can wait until KIP-320 settles
> and see what's still left uncovered in this KIP.
>
> Thanks,
>
> Jun
>
> On Mon, Jun 4, 2018 at 7:03 PM, Dong Lin <li...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > It seems that we have made considerable progress on the discussion of
> > KIP-253 since February. Do you think we should continue the discussion
> > there, or can we continue the voting for this KIP? I am happy to submit
> the
> > PR and move forward the progress for this KIP.
> >
> > Thanks!
> > Dong
> >
> >
> > On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <li...@gmail.com> wrote:
> >
> > > Hey Jun,
> > >
> > > Sure, I will come up with a KIP this week. I think there is a way to
> > allow
> > > partition expansion to arbitrary number without introducing new
> concepts
> > > such as read-only partition or repartition epoch.
> > >
> > > Thanks,
> > > Dong
> > >
> > > On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > >> Hi, Dong,
> > >>
> > >> Thanks for the reply. The general idea that you had for adding
> > partitions
> > >> is similar to what we had in mind. It would be useful to make this
> more
> > >> general, allowing adding an arbitrary number of partitions (instead of
> > >> just
> > >> doubling) and potentially removing partitions as well. The following
> is
> > >> the
> > >> high level idea from the discussion with Colin, Jason and Ismael.
> > >>
> > >> * To change the number of partitions from X to Y in a topic, the
> > >> controller
> > >> marks all existing X partitions as read-only and creates Y new
> > partitions.
> > >> The new partitions are writable and are tagged with a higher
> repartition
> > >> epoch (RE).
> > >>
> > >> * The controller propagates the new metadata to every broker. Once the
> > >> leader of a partition is marked as read-only, it rejects the produce
> > >> requests on this partition. The producer will then refresh the
> metadata
> > >> and
> > >> start publishing to the new writable partitions.
> > >>
> > >> * The consumers will then be consuming messages in RE order. The
> > consumer
> > >> coordinator will only assign partitions in the same RE to consumers.
> > Only
> > >> after all messages in an RE are consumed, will partitions in a higher
> RE
> > >> be
> > >> assigned to consumers.
> > >>
> > >> As Colin mentioned, if we do the above, we could potentially (1) use a
> > >> globally unique partition id, or (2) use a globally unique topic id to
> > >> distinguish recreated partitions due to topic deletion.
> > >>
> > >> So, perhaps we can sketch out the re-partitioning KIP a bit more and
> see
> > >> if
> > >> there is any overlap with KIP-232. Would you be interested in doing
> > that?
> > >> If not, we can do that next week.
> > >>
> > >> Jun
> > >>
> > >>
> > >> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <li...@gmail.com>
> wrote:
> > >>
> > >> > Hey Jun,
> > >> >
> > >> > Interestingly I am also planning to sketch a KIP to allow partition
> > >> > expansion for keyed topics after this KIP. Since you are already
> doing
> > >> > that, I guess I will just share my high level idea here in case it
> is
> > >> > helpful.
> > >> >
> > >> > The motivation for the KIP is that we currently lose order guarantee
> > for
> > >> > messages with the same key if we expand partitions of keyed topic.
> > >> >
> > >> > The solution can probably be built upon the following ideas:
> > >> >
> > >> > - Partition number of the keyed topic should always be doubled (or
> > >> > multiplied by power of 2). Given that we select a partition based on
> > >> > hash(key) % partitionNum, this should help us ensure that, a message
> > >> > assigned to an existing partition will not be mapped to another
> > existing
> > >> > partition after partition expansion.
> > >> >
> > >> > - Producer includes in the ProduceRequest some information that
> helps
> > >> > ensure that messages produced to a partition will monotonically
> > >> increase in
> > >> > the partitionNum of the topic. In other words, if broker receives a
> > >> > ProduceRequest and notices that the producer does not know the
> > partition
> > >> > number has increased, broker should reject this request. That
> > >> "information"
> > >> > may be leaderEpoch, max partitionEpoch of the partitions of the
> topic,
> > or
> > >> > simply partitionNum of the topic. The benefit of this property is
> that
> > >> we
> > >> > can keep the new logic for in-order message consumption entirely in
> > how
> > >> > consumer leader determines the partition -> consumer mapping.
> > >> >
> > >> > - When consumer leader determines partition -> consumer mapping,
> > leader
> > >> > first reads the start position for each partition using
> > >> OffsetFetchRequest.
> > >> > If start position are all non-zero, then assignment can be done in
> its
> > >> > current manner. The assumption is that, a message in the new
> partition
> > >> > should only be consumed after all messages with the same key
> produced
> > >> > before it has been consumed. Since some messages in the new
> partition
> > >> has
> > >> > been consumed, we should not worry about consuming messages
> > >> out-of-order.
> > >> > This benefit of this approach is that we can avoid unnecessary
> > overhead
> > >> in
> > >> > the common case.
> > >> >
> > >> > - If the consumer leader finds that the start position for some
> > >> partition
> > >> > is 0. Say the current partition number is 18 and the partition index
> > is
> > >> 12,
> > >> > then consumer leader should ensure that messages produced to
> partition
> > >> 12 -
> > >> > 18/2 = 3 before the first message of partition 12 is consumed,
> before
> > it
> > >> > assigned partition 12 to any consumer in the consumer group. Since
> we
> > >> have
> > >> > a "information" that is monotonically increasing per partition,
> > consumer
> > >> > can read the value of this information from the first message in
> > >> partition
> > >> > 12, get the offset corresponding to this value in partition 3,
> assign
> > >> > partition except for partition 12 (and probably other new
> partitions)
> > to
> > >> > the existing consumers, waiting for the committed offset to go
> beyond
> > >> this
> > >> > offset for partition 3, and trigger rebalance again so that
> partition
> > 3
> > >> can
> > >> > be reassigned to some consumer.
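> > >> >
A small sketch of the arithmetic used in the last two points, assuming the usual hash(key) % partitionCount placement (murmur2, as in the default partitioner); RepartitionMath and parentPartition() are made-up names for illustration:

    import org.apache.kafka.common.utils.Utils;

    final class RepartitionMath {
        // With hash(key) % count, doubling from N to 2N partitions means a key
        // either stays in its old partition p or moves to p + N, never to a
        // different pre-existing partition.
        static int partitionFor(byte[] keyBytes, int partitionCount) {
            return (Utils.murmur2(keyBytes) & 0x7fffffff) % partitionCount;
        }

        // The "parent" whose backlog must be consumed before a new partition is
        // assigned: with 18 partitions, the parent of partition 12 is 12 - 9 = 3.
        static int parentPartition(int newIndex, int newPartitionCount) {
            int oldCount = newPartitionCount / 2;
            return newIndex >= oldCount ? newIndex - oldCount : newIndex;
        }
    }
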
> > >> >
> > >> >
> > >> > Thanks,
> > >> > Dong
> > >> >
> > >> >
> > >> > On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <ju...@confluent.io> wrote:
> > >> >
> > >> > > Hi, Dong,
> > >> > >
> > >> > > Thanks for the KIP. It looks good overall. We are working on a
> > >> separate
> > >> > KIP
> > >> > > for adding partitions while preserving the ordering guarantees.
> That
> > >> may
> > >> > > require another flavor of partition epoch. It's not very clear
> > whether
> > >> > that
> > >> > > partition epoch can be merged with the partition epoch in this
> KIP.
> > >> So,
> > >> > > perhaps you can wait on this a bit until we post the other KIP in
> > the
> > >> > next
> > >> > > few days.
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <be...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > +1 on the KIP.
> > >> > > >
> > >> > > > I think the KIP is mainly about adding the capability of
> tracking
> > >> the
> > >> > > > system state change lineage. It does not seem necessary to
> bundle
> > >> this
> > >> > > KIP
> > >> > > > with replacing the topic partition with partition epoch in
> > >> > produce/fetch.
> > >> > > > Replacing topic-partition string with partition epoch is
> > >> essentially a
> > >> > > > performance improvement on top of this KIP. That can probably be
> > >> done
> > >> > > > separately.
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jiangjie (Becket) Qin
> > >> > > >
> > >> > > > On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <lindong28@gmail.com
> >
> > >> > wrote:
> > >> > > >
> > >> > > > > Hey Colin,
> > >> > > > >
> > >> > > > > On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe <
> > >> cmccabe@apache.org>
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > > > On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin <
> > >> lindong28@gmail.com>
> > >> > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hey Colin,
> > >> > > > > > > >
> > >> > > > > > > > I understand that the KIP will adds overhead by
> > introducing
> > >> > > > > > per-partition
> > >> > > > > > > > partitionEpoch. I am open to alternative solutions that
> > does
> > >> > not
> > >> > > > > incur
> > >> > > > > > > > additional overhead. But I don't see a better way now.
> > >> > > > > > > >
> > >> > > > > > > > IMO the overhead in the FetchResponse may not be that
> > much.
> > >> We
> > >> > > > > probably
> > >> > > > > > > > should discuss the percentage increase rather than the
> > >> absolute
> > >> > > > > number
> > >> > > > > > > > increase. Currently after KIP-227, per-partition header
> > has
> > >> 23
> > >> > > > bytes.
> > >> > > > > > This
> > >> > > > > > > > KIP adds another 4 bytes. Assume the records size is
> 10KB,
> > >> the
> > >> > > > > > percentage
> > >> > > > > > > > increase is 4 / (23 + 10000) = 0.03%. It seems
> negligible,
> > >> > right?
> > >> > > > > >
> > >> > > > > > Hi Dong,
> > >> > > > > >
> > >> > > > > > Thanks for the response.  I agree that the FetchRequest /
> > >> > > FetchResponse
> > >> > > > > > overhead should be OK, now that we have incremental fetch
> > >> requests
> > >> > > and
> > >> > > > > > responses.  However, there are a lot of cases where the
> > >> percentage
> > >> > > > > increase
> > >> > > > > > is much greater.  For example, if a client is doing full
> > >> > > > > MetadataRequests /
> > >> > > > > > Responses, we have some math kind of like this per
> partition:
> > >> > > > > >
> > >> > > > > > > UpdateMetadataRequestPartitionState => topic partition
> > >> > > > > controller_epoch
> > >> > > > > > leader  leader_epoch partition_epoch isr zk_version replicas
> > >> > > > > > offline_replicas
> > >> > > > > > > 14 bytes:  topic => string (assuming about 10 byte topic
> > >> names)
> > >> > > > > > > 4 bytes:  partition => int32
> > >> > > > > > > 4  bytes: controller_epoch => int32
> > >> > > > > > > 4  bytes: leader => int32
> > >> > > > > > > 4  bytes: leader_epoch => int32
> > >> > > > > > > +4 EXTRA bytes: partition_epoch => int32        <-- NEW
> > >> > > > > > > 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR)
> > >> > > > > > > 4 bytes: zk_version => int32
> > >> > > > > > > 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas)
> > >> > > > > > > 2  offline_replicas => [int32] (assuming no offline
> > replicas)
> > >> > > > > >
> > >> > > > > > Assuming I added that up correctly, the per-partition
> overhead
> > >> goes
> > >> > > > from
> > >> > > > > > 64 bytes per partition to 68, a 6.2% increase.
> > >> > > > > >
> > >> > > > > > We could do similar math for a lot of the other RPCs.  And
> you
> > >> will
> > >> > > > have
> > >> > > > > a
> > >> > > > > > similar memory and garbage collection impact on the brokers
> > >> since
> > >> > you
> > >> > > > > have
> > >> > > > > > to store all this extra state as well.
> > >> > > > > >
> > >> > > > >
> > >> > > > > That is correct. IMO the Metadata is only updated periodically
> > >> and is
> > >> > > > > probably not a big deal if we increase it by 6%. The
> > FetchResponse
> > >> > and
> > >> > > > > ProduceRequest are probably the only requests that are bounded
> > by
> > >> the
> > >> > > > > bandwidth throughput.
> > >> > > > >
> > >> > > > >
> > >> > > > > >
> > >> > > > > > > >
> > >> > > > > > > > I agree that we can probably save more space by using
> > >> partition
> > >> > > ID
> > >> > > > so
> > >> > > > > > that
> > >> > > > > > > > we no longer needs the string topic name. The similar
> idea
> > >> has
> > >> > > also
> > >> > > > > > been
> > >> > > > > > > > put in the Rejected Alternative section in KIP-227.
> While
> > >> this
> > >> > > idea
> > >> > > > > is
> > >> > > > > > > > promising, it seems orthogonal to the goal of this KIP.
> > >> Given
> > >> > > that
> > >> > > > > > there is
> > >> > > > > > > > already many work to do in this KIP, maybe we can do the
> > >> > > partition
> > >> > > > ID
> > >> > > > > > in a
> > >> > > > > > > > separate KIP?
> > >> > > > > >
> > >> > > > > > I guess my thinking is that the goal here is to replace an
> > >> > identifier
> > >> > > > > > which can be re-used (the tuple of topic name, partition ID)
> > >> with
> > >> > an
> > >> > > > > > identifier that cannot be re-used (the tuple of topic name,
> > >> > partition
> > >> > > > ID,
> > >> > > > > > partition epoch) in order to gain better semantics.  As long
> > as
> > >> we
> > >> > > are
> > >> > > > > > replacing the identifier, why not replace it with an
> > identifier
> > >> > that
> > >> > > > has
> > >> > > > > > important performance advantages?  The KIP freeze for the
> next
> > >> > > release
> > >> > > > > has
> > >> > > > > > already passed, so there is time to do this.
> > >> > > > > >
> > >> > > > >
> > >> > > > > In general it can be easier for discussion and implementation
> if
> > >> we
> > >> > can
> > >> > > > > split a larger task into smaller and independent tasks. For
> > >> example,
> > >> > > > > KIP-112 and KIP-113 both deal with JBOD support. KIP-31,
> > >> KIP-32
> > >> > > and
> > >> > > > > KIP-33 are about timestamp support. The opinion on this can be
> > >> subjective
> > >> > > > > though.
> > >> > > > >
> > >> > > > > IMO the change to switch from (topic, partition ID) to
> > >> partitionEpoch
> > >> > in
> > >> > > > all
> > >> > > > > request/response requires us to go through all requests one
> by
> > >> one.
> > >> > > It
> > >> > > > > may not be hard but it can be time consuming and tedious. At
> > high
> > >> > level
> > >> > > > the
> > >> > > > > goal and the change for that will be orthogonal to the changes
> > >> > required
> > >> > > > in
> > >> > > > > this KIP. That is the main reason I think we can split them
> into
> > >> two
> > >> > > > KIPs.
> > >> > > > >
> > >> > > > >
> > >> > > > > > On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote:
> > >> > > > > > > I think it is possible to move to entirely use
> > partitionEpoch
> > >> > > instead
> > >> > > > > of
> > >> > > > > > > (topic, partition) to identify a partition. Client can
> > obtain
> > >> the
> > >> > > > > > > partitionEpoch -> (topic, partition) mapping from
> > >> > MetadataResponse.
> > >> > > > We
> > >> > > > > > > probably need to figure out a way to assign partitionEpoch
> > to
> > >> > > > existing
> > >> > > > > > > partitions in the cluster. But this should be doable.
> > >> > > > > > >
> > >> > > > > > > This is a good idea. I think it will save us some space in
> > the
> > >> > > > > > > request/response. The actual space saving in percentage
> > >> probably
> > >> > > > > depends
> > >> > > > > > on
> > >> > > > > > > the amount of data and the number of partitions of the
> same
> > >> > topic.
> > >> > > I
> > >> > > > > just
> > >> > > > > > > think we can do it in a separate KIP.
> > >> > > > > >
> > >> > > > > > Hmm.  How much extra work would be required?  It seems like
> we
> > >> are
> > >> > > > > already
> > >> > > > > > changing almost every RPC that involves topics and
> partitions,
> > >> > > already
> > >> > > > > > adding new per-partition state to ZooKeeper, already
> changing
> > >> how
> > >> > > > clients
> > >> > > > > > interact with partitions.  Is there some other big piece of
> > work
> > >> > we'd
> > >> > > > > have
> > >> > > > > > to do to move to partition IDs that we wouldn't need for
> > >> partition
> > >> > > > > epochs?
> > >> > > > > > I guess we'd have to find a way to support regular
> > >> expression-based
> > >> > > > topic
> > >> > > > > > subscriptions.  If we split this into multiple KIPs,
> wouldn't
> > we
> > >> > end
> > >> > > up
> > >> > > > > > changing all that RPCs and ZK state a second time?  Also,
> I'm
> > >> > curious
> > >> > > > if
> > >> > > > > > anyone has done any proof of concept GC, memory, and network
> > >> usage
> > >> > > > > > measurements on switching topic names for topic IDs.
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > We will need to go over all requests/responses to check how to
> > >> > replace
> > >> > > > > (topic, partition ID) with partition epoch. It requires
> > >> non-trivial
> > >> > > work
> > >> > > > > and could take time. As you mentioned, we may want to see how
> > much
> > >> > > saving
> > >> > > > > we can get by switching from topic names to partition epoch.
> > That
> > >> > > itself
> > >> > > > > requires time and experiment. It seems that the new idea does
> > not
> > >> > > > rollback
> > >> > > > > any change proposed in this KIP. So I am not sure we can get
> > much
> > >> by
> > >> > > > > putting them into the same KIP.
> > >> > > > >
> > >> > > > > Anyway, if more people are interested in seeing the new idea
> in
> > >> the
> > >> > > same
> > >> > > > > KIP, I can try that.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > >
> > >> > > > > > best,
> > >> > > > > > Colin
> > >> > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe <
> > >> > > cmccabe@apache.org
> > >> > > > >
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > >> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
> > >> > > > > > > >> > Hey Colin,
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <
> > >> > > > > cmccabe@apache.org>
> > >> > > > > > > >> wrote:
> > >> > > > > > > >> >
> > >> > > > > > > >> > > On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
> > >> > > > > > > >> > > > Hey Colin,
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > Thanks for the comment.
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <
> > >> > > > > > cmccabe@apache.org>
> > >> > > > > > > >> > > wrote:
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > > On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
> > >> > > > > > > >> > > > > > Hey Colin,
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > Thanks for reviewing the KIP.
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > If I understand you right, you maybe
> suggesting
> > >> that
> > >> > > we
> > >> > > > > can
> > >> > > > > > use
> > >> > > > > > > >> a
> > >> > > > > > > >> > > global
> > >> > > > > > > >> > > > > > metadataEpoch that is incremented every time
> > >> > > controller
> > >> > > > > > updates
> > >> > > > > > > >> > > metadata.
> > >> > > > > > > >> > > > > > The problem with this solution is that, if a
> > >> topic
> > >> > is
> > >> > > > > > deleted
> > >> > > > > > > >> and
> > >> > > > > > > >> > > created
> > >> > > > > > > >> > > > > > again, user will not know whether that the
> > offset
> > >> > > which
> > >> > > > is
> > >> > > > > > > >> stored
> > >> > > > > > > >> > > before
> > >> > > > > > > >> > > > > > the topic deletion is no longer valid. This
> > >> > motivates
> > >> > > > the
> > >> > > > > > idea
> > >> > > > > > > >> to
> > >> > > > > > > >> > > include
> > >> > > > > > > >> > > > > > per-partition partitionEpoch. Does this sound
> > >> > > > reasonable?
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > Hi Dong,
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > Perhaps we can store the last valid offset of
> > each
> > >> > > deleted
> > >> > > > > > topic
> > >> > > > > > > >> in
> > >> > > > > > > >> > > > > ZooKeeper.  Then, when a topic with one of
> those
> > >> names
> > >> > > > gets
> > >> > > > > > > >> > > re-created, we
> > >> > > > > > > >> > > > > can start the topic at the previous end offset
> > >> rather
> > >> > > than
> > >> > > > > at
> > >> > > > > > 0.
> > >> > > > > > > >> This
> > >> > > > > > > >> > > > > preserves immutability.  It is no more
> burdensome
> > >> than
> > >> > > > > having
> > >> > > > > > to
> > >> > > > > > > >> > > preserve a
> > >> > > > > > > >> > > > > "last epoch" for the deleted partition
> somewhere,
> > >> > right?
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > My concern with this solution is that the number
> of
> > >> > > > zookeeper
> > >> > > > > > nodes
> > >> > > > > > > >> get
> > >> > > > > > > >> > > > more and more over time if some users keep
> deleting
> > >> and
> > >> > > > > creating
> > >> > > > > > > >> topics.
> > >> > > > > > > >> > > Do
> > >> > > > > > > >> > > > you think this can be a problem?
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > Hi Dong,
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > We could expire the "partition tombstones" after an
> > >> hour
> > >> > or
> > >> > > > so.
> > >> > > > > > In
> > >> > > > > > > >> > > practice this would solve the issue for clients
> that
> > >> like
> > >> > to
> > >> > > > > > destroy
> > >> > > > > > > >> and
> > >> > > > > > > >> > > re-create topics all the time.  In any case,
> doesn't
> > >> the
> > >> > > > current
> > >> > > > > > > >> proposal
> > >> > > > > > > >> > > add per-partition znodes as well that we have to
> > track
> > >> > even
> > >> > > > > after
> > >> > > > > > the
> > >> > > > > > > >> > > partition is deleted?  Or did I misunderstand that?
> > >> > > > > > > >> > >
> > >> > > > > > > >> >
> > >> > > > > > > >> > Actually the current KIP does not add per-partition
> > >> znodes.
> > >> > > > Could
> > >> > > > > > you
> > >> > > > > > > >> > double check? I can fix the KIP wiki if there is
> > anything
> > >> > > > > > misleading.
> > >> > > > > > > >>
> > >> > > > > > > >> Hi Dong,
> > >> > > > > > > >>
> > >> > > > > > > >> I double-checked the KIP, and I can see that you are in
> > >> fact
> > >> > > > using a
> > >> > > > > > > >> global counter for initializing partition epochs.  So,
> > you
> > >> are
> > >> > > > > > correct, it
> > >> > > > > > > >> doesn't add per-partition znodes for partitions that no
> > >> longer
> > >> > > > > exist.
> > >> > > > > > > >>
> > >> > > > > > > >> >
> > >> > > > > > > >> > If we expire the "partition tombstones" after an hour,
> > and
> > >> > the
> > >> > > > > topic
> > >> > > > > > is
> > >> > > > > > > >> > re-created after more than an hour since the topic
> > >> deletion,
> > >> > > > then
> > >> > > > > > we are
> > >> > > > > > > >> > back to the situation where user can not tell whether
> > the
> > >> > > topic
> > >> > > > > has
> > >> > > > > > been
> > >> > > > > > > >> > re-created or not, right?
> > >> > > > > > > >>
> > >> > > > > > > >> Yes, with an expiration period, it would not ensure immutability-- you
> > >> > > > > > > >> could effectively reuse partition names and they would look the same.
> > >> > > > > > > >>
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > It's not really clear to me what should happen when a topic is destroyed
> > >> > > > > > > >> > > and re-created with new data.  Should consumers continue to be able to
> > >> > > > > > > >> > > consume?  We don't know where they stopped consuming from the previous
> > >> > > > > > > >> > > incarnation of the topic, so messages may have been lost.  Certainly
> > >> > > > > > > >> > > consuming data from offset X of the new incarnation of the topic may
> > >> > > > > > > >> > > give something totally different from what you would have gotten from
> > >> > > > > > > >> > > offset X of the previous incarnation of the topic.
> > >> > > > > > > >> > >
> > >> > > > > > > >> >
> > >> > > > > > > >> > With the current KIP, if a consumer consumes a topic based on the last
> > >> > > > > > > >> > remembered (offset, partitionEpoch, leaderEpoch), and if the topic is
> > >> > > > > > > >> > re-created, the consumer will throw InvalidPartitionEpochException
> > >> > > > > > > >> > because the previous partitionEpoch will be different from the current
> > >> > > > > > > >> > partitionEpoch. This is described in the Proposed Changes -> Consumption
> > >> > > > > > > >> > after topic deletion section of the KIP. I can improve the KIP if there
> > >> > > > > > > >> > is anything not clear.
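A minimal client-side sketch of the flow described above, assuming the APIs proposed by
this KIP -- positionAndOffsetEpoch(), a seek() overload that takes an offsetEpoch, and
InvalidPartitionEpochException -- none of which exist in the released consumer API;
OffsetStore is a made-up application-side store:

    // Sketch only: the epoch-aware calls below are the KIP's proposal, not the
    // current KafkaConsumer interface, and offsetStore is a hypothetical
    // application-side store for externally managed offsets.
    TopicPartition tp = new TopicPartition("my-topic", 0);

    // On shutdown, persist the position together with its offsetEpoch instead of
    // the bare offset.
    OffsetAndEpoch position = consumer.positionAndOffsetEpoch(tp);   // proposed API
    offsetStore.save(tp, position.offset(), position.offsetEpoch());

    // On restart, seek with the stored epoch so the re-created partition can be
    // detected instead of silently reading unrelated data.
    OffsetAndEpoch stored = offsetStore.load(tp);
    try {
        consumer.seek(tp, stored.offset(), stored.offsetEpoch());    // proposed API
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
    } catch (InvalidPartitionEpochException e) {                     // proposed exception
        // The stored partitionEpoch no longer matches: the topic was deleted and
        // re-created, so let the application decide what to do.
        handleRecreatedTopic(tp);
    }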
> > >> > > > > > > >>
> > >> > > > > > > >> Thanks for the clarification.  It sounds like what you really want is
> > >> > > > > > > >> immutability-- i.e., to never "really" reuse partition identifiers.  And
> > >> > > > > > > >> you do this by making the partition name no longer the "real" identifier.
> > >> > > > > > > >>
> > >> > > > > > > >> My big concern about this KIP is that it seems like an anti-scalability
> > >> > > > > > > >> feature.  Now we are adding 4 extra bytes for every partition in the
> > >> > > > > > > >> FetchResponse and Request, for example.  That could be 40 kb per request,
> > >> > > > > > > >> if the user has 10,000 partitions.  And of course, the KIP also makes
> > >> > > > > > > >> massive changes to UpdateMetadataRequest, MetadataResponse,
> > >> > > > > > > >> OffsetCommitRequest, OffsetFetchResponse, LeaderAndIsrRequest,
> > >> > > > > > > >> ListOffsetResponse, etc. which will also increase their size on the wire
> > >> > > > > > > >> and in memory.
> > >> > > > > > > >>
> > >> > > > > > > >> One thing that we talked a lot about in the past is replacing partition
> > >> > > > > > > >> names with IDs.  IDs have a lot of really nice features.  They take up
> > >> > > > > > > >> much less space in memory than strings (especially 2-byte Java strings).
> > >> > > > > > > >> They can often be allocated on the stack rather than the heap (important
> > >> > > > > > > >> when you are dealing with hundreds of thousands of them).  They can be
> > >> > > > > > > >> efficiently deserialized and serialized.  If we use 64-bit ones, we will
> > >> > > > > > > >> never run out of IDs, which means that they can always be unique per
> > >> > > > > > > >> partition.
> > >> > > > > > > >>
> > >> > > > > > > >> Given that the partition name is no longer the "real" identifier for
> > >> > > > > > > >> partitions in the current KIP-232 proposal, why not just move to using
> > >> > > > > > > >> partition IDs entirely instead of strings?  You have to change all the
> > >> > > > > > > >> messages anyway.  There isn't much point any more to carrying around the
> > >> > > > > > > >> partition name in every RPC, since you really need (name, epoch) to
> > >> > > > > > > >> identify the partition.
> > >> > > > > > > >> Probably the metadata response and a few other messages would have to
> > >> > > > > > > >> still carry the partition name, to allow clients to go from name to id.
> > >> > > > > > > >> But we could mostly forget about the strings.  And then this would be a
> > >> > > > > > > >> scalability improvement rather than a scalability problem.
> > >> > > > > > > >>
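As a rough illustration of the numeric-identifier idea (PartitionId below is
hypothetical -- neither an existing Kafka class nor something this KIP proposes):

    // Rough illustration of the 64-bit identifier idea discussed above.  A fixed-size
    // primitive id is cheap to compare, serialize and key maps by, and -- unlike a
    // (topicName, partitionIndex) pair -- would never be reused once the partition
    // is gone.
    final class PartitionId {
        private final long id;                       // 8 bytes on the wire, never recycled
        PartitionId(long id) { this.id = id; }
        long value() { return id; }
        @Override public boolean equals(Object o) {
            return o instanceof PartitionId && ((PartitionId) o).id == id;
        }
        @Override public int hashCode() { return Long.hashCode(id); }
    }
    // Compare with today's identifier: a heap-allocated topic-name String plus an int,
    // which is larger, slower to hash, and can refer to a different partition after a
    // delete-and-recreate cycle.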
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > > By choosing to reuse the same (topic, partition, offset) 3-tuple, we
> > >> > > > > > > >> > > have chosen to give up immutability.  That was a really bad decision.
> > >> > > > > > > >> > > And now we have to worry about time dependencies, stale cached data,
> > >> > > > > > > >> > > and all the rest.  We can't completely fix this inside Kafka no matter
> > >> > > > > > > >> > > what we do, because not all that cached data is inside Kafka itself.
> > >> > > > > > > >> > > Some of it may be in systems that Kafka has sent data to, such as other
> > >> > > > > > > >> > > daemons, SQL databases, streams, and so forth.
> > >> > > > > > > >> > >
> > >> > > > > > > >> >
> > >> > > > > > > >> > The current KIP will uniquely identify a message using (topic,
> > >> > > > > > > >> > partition, offset, partitionEpoch) 4-tuple. This addresses the message
> > >> > > > > > > >> > immutability issue that you mentioned. Is there any corner case where
> > >> > > > > > > >> > the message immutability is still not preserved with the current KIP?
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > I guess the idea here is that mirror maker should work as expected when
> > >> > > > > > > >> > > users destroy a topic and re-create it with the same name.  That's kind
> > >> > > > > > > >> > > of tough, though, since in that scenario, mirror maker probably should
> > >> > > > > > > >> > > destroy and re-create the topic on the other end, too, right?  Otherwise,
> > >> > > > > > > >> > > what you end up with on the other end could be half of one incarnation
> > >> > > > > > > >> > > of the topic, and half of another.
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > What mirror maker really needs is to be able to follow a stream of
> > >> > > > > > > >> > > events about the kafka cluster itself.  We could have some master topic
> > >> > > > > > > >> > > which is always present and which contains data about all topic
> > >> > > > > > > >> > > deletions, creations, etc.  Then MM can simply follow this topic and do
> > >> > > > > > > >> > > what is needed.
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > Then the next question may be, should we use a global metadataEpoch +
> > >> > > > > > > >> > > > > > per-partition partitionEpoch, instead of using per-partition
> > >> > > > > > > >> > > > > > leaderEpoch + per-partition partitionEpoch. The former solution using
> > >> > > > > > > >> > > > > > metadataEpoch would not work due to the following scenario (provided
> > >> > > > > > > >> > > > > > by Jun):
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > "Consider the following scenario. In metadata v1, the leader for a
> > >> > > > > > > >> > > > > > partition is at broker 1. In metadata v2, leader is at broker 2. In
> > >> > > > > > > >> > > > > > metadata v3, leader is at broker 1 again. The last committed offset
> > >> > > > > > > >> > > > > > in v1, v2 and v3 are 10, 20 and 30, respectively. A consumer is
> > >> > > > > > > >> > > > > > started and reads metadata v1 and reads messages from offset 0 to 25
> > >> > > > > > > >> > > > > > from broker 1. My understanding is that in the current proposal, the
> > >> > > > > > > >> > > > > > metadata version associated with offset 25 is v1. The consumer is
> > >> > > > > > > >> > > > > > then restarted and fetches metadata v2. The consumer tries to read
> > >> > > > > > > >> > > > > > from broker 2, which is the old leader with the last offset at 20. In
> > >> > > > > > > >> > > > > > this case, the consumer will still get OffsetOutOfRangeException
> > >> > > > > > > >> > > > > > incorrectly."
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > Regarding your comment "For the second purpose, this is "soft state"
> > >> > > > > > > >> > > > > > anyway.  If the client thinks X is the leader but Y is really the
> > >> > > > > > > >> > > > > > leader, the client will talk to X, and X will point out its mistake
> > >> > > > > > > >> > > > > > by sending back a NOT_LEADER_FOR_PARTITION.", it is probably not
> > >> > > > > > > >> > > > > > true. The problem here is that the old leader X may still think it is
> > >> > > > > > > >> > > > > > the leader of the partition and thus it will not send back
> > >> > > > > > > >> > > > > > NOT_LEADER_FOR_PARTITION. The reason is provided in KAFKA-6262. Can
> > >> > > > > > > >> > > > > > you check if that makes sense?
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > This is solvable with a timeout, right?  If the leader can't
> > >> > > > > > > >> > > > > communicate with the controller for a certain period of time, it
> > >> > > > > > > >> > > > > should stop acting as the leader.  We have to solve this problem,
> > >> > > > > > > >> > > > > anyway, in order to fix all the corner cases.
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > Not sure if I fully understand your proposal. The proposal seems to
> > >> > > > > > > >> > > > require non-trivial changes to our existing leadership election
> > >> > > > > > > >> > > > mechanism. Could you provide more detail regarding how it works? For
> > >> > > > > > > >> > > > example, how should user choose this timeout, how leader determines
> > >> > > > > > > >> > > > whether it can still communicate with controller, and how this triggers
> > >> > > > > > > >> > > > controller to elect new leader?
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > Before I come up with any proposal, let me make sure I understand the
> > >> > > > > > > >> > > problem correctly.  My big question was, what prevents split-brain here?
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > Let's say I have a partition which is on nodes A, B, and C, with min-ISR
> > >> > > > > > > >> > > 2.  The controller is D.  At some point, there is a network partition
> > >> > > > > > > >> > > between A and B and the rest of the cluster.  The Controller re-assigns
> > >> > > > > > > >> > > the partition to nodes C, D, and E.  But A and B keep chugging away, even
> > >> > > > > > > >> > > though they can no longer communicate with the controller.
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > At some point, a client with stale metadata writes to the partition.  It
> > >> > > > > > > >> > > still thinks the partition is on node A, B, and C, so that's where it
> > >> > > > > > > >> > > sends the data.  It's unable to talk to C, but A and B reply back that
> > >> > > > > > > >> > > all is well.
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > Is this not a case where we could lose data due to split brain?  Or is
> > >> > > > > > > >> > > there a mechanism for preventing this that I missed?  If it is, it seems
> > >> > > > > > > >> > > like a pretty serious failure case that we should be handling with our
> > >> > > > > > > >> > > metadata rework.  And I think epoch numbers and timeouts might be part
> > >> > > > > > > >> > > of the solution.
> > >> > > > > > > >> > >
> > >> > > > > > > >> >
> > >> > > > > > > >> > Right, split brain can happen if RF=4 and minIsr=2. However, I am not
> > >> > > > > > > >> > sure it is a pretty serious issue which we need to address today. This
> > >> > > > > > > >> > can be prevented by configuring the Kafka topic so that minIsr > RF/2.
> > >> > > > > > > >> > Actually, if user sets minIsr=2, is there any reason that user wants to
> > >> > > > > > > >> > set RF=4 instead of 3?
> > >> > > > > > > >> >
> > >> > > > > > > >> > Introducing timeout in leader election mechanism is non-trivial. I think
> > >> > > > > > > >> > we probably want to do that only if there is a good use-case that can not
> > >> > > > > > > >> > otherwise be addressed with the current mechanism.
> > >> > > > > > > >>
> > >> > > > > > > >> I still would like to think about these corner cases more.  But perhaps
> > >> > > > > > > >> it's not directly related to this KIP.
> > >> > > > > > > >>
> > >> > > > > > > >> regards,
> > >> > > > > > > >> Colin
> > >> > > > > > > >>
> > >> > > > > > > >>
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > > best,
> > >> > > > > > > >> > > Colin
> > >> > > > > > > >> > >
> > >> > > > > > > >> > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > >
> > >> > > > > > > >> > > > > best,
> > >> > > > > > > >> > > > > Colin
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > Regards,
> > >> > > > > > > >> > > > > > Dong
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > On Wed, Jan 24, 2018 at 10:39 AM, Colin
> McCabe
> > <
> > >> > > > > > > >> cmccabe@apache.org>
> > >> > > > > > > >> > > > > wrote:
> > >> > > > > > > >> > > > > >
> > >> > > > > > > >> > > > > > > Hi Dong,
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > Thanks for proposing this KIP.  I think a
> > >> metadata
> > >> > > > epoch
> > >> > > > > > is a
> > >> > > > > > > >> > > really
> > >> > > > > > > >> > > > > good
> > >> > > > > > > >> > > > > > > idea.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > I read through the DISCUSS thread, but I
> > still
> > >> > don't
> > >> > > > > have
> > >> > > > > > a
> > >> > > > > > > >> clear
> > >> > > > > > > >> > > > > picture
> > >> > > > > > > >> > > > > > > of why the proposal uses a metadata epoch
> per
> > >> > > > partition
> > >> > > > > > rather
> > >> > > > > > > >> > > than a
> > >> > > > > > > >> > > > > > > global metadata epoch.  A metadata epoch
> per
> > >> > > partition
> > >> > > > > is
> > >> > > > > > > >> kind of
> > >> > > > > > > >> > > > > > > unpleasant-- it's at least 4 extra bytes
> per
> > >> > > partition
> > >> > > > > > that we
> > >> > > > > > > >> > > have to
> > >> > > > > > > >> > > > > send
> > >> > > > > > > >> > > > > > > over the wire in every full metadata
> request,
> > >> > which
> > >> > > > > could
> > >> > > > > > > >> become
> > >> > > > > > > >> > > extra
> > >> > > > > > > >> > > > > > > kilobytes on the wire when the number of
> > >> > partitions
> > >> > > > > > becomes
> > >> > > > > > > >> large.
> > >> > > > > > > >> > > > > Plus,
> > >> > > > > > > >> > > > > > > we have to update all the auxillary classes
> > to
> > >> > > include
> > >> > > > > an
> > >> > > > > > > >> epoch.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > We need to have a global metadata epoch
> > anyway
> > >> to
> > >> > > > handle
> > >> > > > > > > >> partition
> > >> > > > > > > >> > > > > > > addition and deletion.  For example, if I
> > give
> > >> you
> > >> > > > > > > >> > > > > > > MetadataResponse{part1,epoch 1, part2,
> epoch
> > 1}
> > >> > and
> > >> > > > > > {part1,
> > >> > > > > > > >> > > epoch1},
> > >> > > > > > > >> > > > > which
> > >> > > > > > > >> > > > > > > MetadataResponse is newer?  You have no way
> > of
> > >> > > > knowing.
> > >> > > > > > It
> > >> > > > > > > >> could
> > >> > > > > > > >> > > be
> > >> > > > > > > >> > > > > that
> > >> > > > > > > >> > > > > > > part2 has just been created, and the
> response
> > >> > with 2
> > >> > > > > > > >> partitions is
> > >> > > > > > > >> > > > > newer.
> > >> > > > > > > >> > > > > > > Or it coudl be that part2 has just been
> > >> deleted,
> > >> > and
> > >> > > > > > > >> therefore the
> > >> > > > > > > >> > > > > response
> > >> > > > > > > >> > > > > > > with 1 partition is newer.  You must have a
> > >> global
> > >> > > > epoch
> > >> > > > > > to
> > >> > > > > > > >> > > > > disambiguate
> > >> > > > > > > >> > > > > > > these two cases.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > Previously, I worked on the Ceph
> distributed
> > >> > > > filesystem.
> > >> > > > > > > >> Ceph had
> > >> > > > > > > >> > > the
> > >> > > > > > > >> > > > > > > concept of a map of the whole cluster,
> > >> maintained
> > >> > > by a
> > >> > > > > few
> > >> > > > > > > >> servers
> > >> > > > > > > >> > > > > doing
> > >> > > > > > > >> > > > > > > paxos.  This map was versioned by a single
> > >> 64-bit
> > >> > > > epoch
> > >> > > > > > number
> > >> > > > > > > >> > > which
> > >> > > > > > > >> > > > > > > increased on every change.  It was
> propagated
> > >> to
> > >> > > > clients
> > >> > > > > > > >> through
> > >> > > > > > > >> > > > > gossip.  I
> > >> > > > > > > >> > > > > > > wonder if something similar could work
> here?
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > It seems like the the Kafka
> MetadataResponse
> > >> > serves
> > >> > > > two
> > >> > > > > > > >> somewhat
> > >> > > > > > > >> > > > > unrelated
> > >> > > > > > > >> > > > > > > purposes.  Firstly, it lets clients know
> what
> > >> > > > partitions
> > >> > > > > > > >> exist in
> > >> > > > > > > >> > > the
> > >> > > > > > > >> > > > > > > system and where they live.  Secondly, it
> > lets
> > >> > > clients
> > >> > > > > > know
> > >> > > > > > > >> which
> > >> > > > > > > >> > > nodes
> > >> > > > > > > >> > > > > > > within the partition are in-sync (in the
> ISR)
> > >> and
> > >> > > > which
> > >> > > > > > node
> > >> > > > > > > >> is the
> > >> > > > > > > >> > > > > leader.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > The first purpose is what you really need a
> > >> > metadata
> > >> > > > > epoch
> > >> > > > > > > >> for, I
> > >> > > > > > > >> > > > > think.
> > >> > > > > > > >> > > > > > > You want to know whether a partition exists
> > or
> > >> > not,
> > >> > > or
> > >> > > > > you
> > >> > > > > > > >> want to
> > >> > > > > > > >> > > know
> > >> > > > > > > >> > > > > > > which nodes you should talk to in order to
> > >> write
> > >> > to
> > >> > > a
> > >> > > > > > given
> > >> > > > > > > >> > > > > partition.  A
> > >> > > > > > > >> > > > > > > single metadata epoch for the whole
> response
> > >> > should
> > >> > > be
> > >> > > > > > > >> adequate
> > >> > > > > > > >> > > here.
> > >> > > > > > > >> > > > > We
> > >> > > > > > > >> > > > > > > should not change the partition assignment
> > >> without
> > >> > > > going
> > >> > > > > > > >> through
> > >> > > > > > > >> > > > > zookeeper
> > >> > > > > > > >> > > > > > > (or a similar system), and this inherently
> > >> > > serializes
> > >> > > > > > updates
> > >> > > > > > > >> into
> > >> > > > > > > >> > > a
> > >> > > > > > > >> > > > > > > numbered stream.  Brokers should also stop
> > >> > > responding
> > >> > > > to
> > >> > > > > > > >> requests
> > >> > > > > > > >> > > when
> > >> > > > > > > >> > > > > they
> > >> > > > > > > >> > > > > > > are unable to contact ZK for a certain time
> > >> > period.
> > >> > > > > This
> > >> > > > > > > >> prevents
> > >> > > > > > > >> > > the
> > >> > > > > > > >> > > > > case
> > >> > > > > > > >> > > > > > > where a given partition has been moved off
> > some
> > >> > set
> > >> > > of
> > >> > > > > > nodes,
> > >> > > > > > > >> but a
> > >> > > > > > > >> > > > > client
> > >> > > > > > > >> > > > > > > still ends up talking to those nodes and
> > >> writing
> > >> > > data
> > >> > > > > > there.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > For the second purpose, this is "soft
> state"
> > >> > anyway.
> > >> > > > If
> > >> > > > > > the
> > >> > > > > > > >> client
> > >> > > > > > > >> > > > > thinks
> > >> > > > > > > >> > > > > > > X is the leader but Y is really the leader,
> > the
> > >> > > client
> > >> > > > > > will
> > >> > > > > > > >> talk
> > >> > > > > > > >> > > to X,
> > >> > > > > > > >> > > > > and
> > >> > > > > > > >> > > > > > > X will point out its mistake by sending
> back
> > a
> > >> > > > > > > >> > > > > NOT_LEADER_FOR_PARTITION.
> > >> > > > > > > >> > > > > > > Then the client can update its metadata
> again
> > >> and
> > >> > > find
> > >> > > > > > the new
> > >> > > > > > > >> > > leader,
> > >> > > > > > > >> > > > > if
> > >> > > > > > > >> > > > > > > there is one.  There is no need for an
> epoch
> > to
> > >> > > handle
> > >> > > > > > this.
> > >> > > > > > > >> > > > > Similarly, I
> > >> > > > > > > >> > > > > > > can't think of a reason why changing the
> > >> in-sync
> > >> > > > replica
> > >> > > > > > set
> > >> > > > > > > >> needs
> > >> > > > > > > >> > > to
> > >> > > > > > > >> > > > > bump
> > >> > > > > > > >> > > > > > > the epoch.
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > best,
> > >> > > > > > > >> > > > > > > Colin
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > > > > On Wed, Jan 24, 2018, at 09:45, Dong Lin
> > wrote:
> > >> > > > > > > >> > > > > > > > Thanks much for reviewing the KIP!
> > >> > > > > > > >> > > > > > > >
> > >> > > > > > > >> > > > > > > > Dong
> > >> > > > > > > >> > > > > > > >
> > >> > > > > > > >> > > > > > > > On Wed, Jan 24, 2018 at 7:10 AM, Guozhang
> > >> Wang <
> > >> > > > > > > >> > > wangguoz@gmail.com>
> > >> > > > > > > >> > > > > > > wrote:
> > >> > > > > > > >> > > > > > > >
> > >> > > > > > > >> > > > > > > > > Yeah that makes sense, again I'm just
> > >> making
> > >> > > sure
> > >> > > > we
> > >> > > > > > > >> understand
> > >> > > > > > > >> > > > > all the
> > >> > > > > > > >> > > > > > > > > scenarios and what to expect.
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > > I agree that if, more generally
> speaking,
> > >> say
> > >> > > > users
> > >> > > > > > have
> > >> > > > > > > >> only
> > >> > > > > > > >> > > > > consumed
> > >> > > > > > > >> > > > > > > to
> > >> > > > > > > >> > > > > > > > > offset 8, and then call seek(16) to
> > "jump"
> > >> to
> > >> > a
> > >> > > > > > further
> > >> > > > > > > >> > > position,
> > >> > > > > > > >> > > > > then
> > >> > > > > > > >> > > > > > > she
> > >> > > > > > > >> > > > > > > > > needs to be aware that OORE maybe
> thrown
> > >> and
> > >> > she
> > >> > > > > > needs to
> > >> > > > > > > >> > > handle
> > >> > > > > > > >> > > > > it or
> > >> > > > > > > >> > > > > > > rely
> > >> > > > > > > >> > > > > > > > > on reset policy which should not
> surprise
> > >> her.
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > > I'm +1 on the KIP.
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > > Guozhang
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > > On Wed, Jan 24, 2018 at 12:31 AM, Dong
> > Lin
> > >> <
> > >> > > > > > > >> > > lindong28@gmail.com>
> > >> > > > > > > >> > > > > > > wrote:
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > Yes, in general we can not prevent
> > >> > > > > > > >> OffsetOutOfRangeException
> > >> > > > > > > >> > > if
> > >> > > > > > > >> > > > > user
> > >> > > > > > > >> > > > > > > > > seeks
> > >> > > > > > > >> > > > > > > > > > to a wrong offset. The main goal is
> to
> > >> > prevent
> > >> > > > > > > >> > > > > > > OffsetOutOfRangeException
> > >> > > > > > > >> > > > > > > > > if
> > >> > > > > > > >> > > > > > > > > > user has done things in the right
> way,
> > >> e.g.
> > >> > > user
> > >> > > > > > should
> > >> > > > > > > >> know
> > >> > > > > > > >> > > that
> > >> > > > > > > >> > > > > > > there
> > >> > > > > > > >> > > > > > > > > is
> > >> > > > > > > >> > > > > > > > > > message with this offset.
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > For example, if user calls seek(..)
> > right
> > >> > > after
> > >> > > > > > > >> > > construction, the
> > >> > > > > > > >> > > > > > > only
> > >> > > > > > > >> > > > > > > > > > reason I can think of is that user
> > stores
> > >> > > offset
> > >> > > > > > > >> externally.
> > >> > > > > > > >> > > In
> > >> > > > > > > >> > > > > this
> > >> > > > > > > >> > > > > > > > > case,
> > >> > > > > > > >> > > > > > > > > > user currently needs to use the
> offset
> > >> which
> > >> > > is
> > >> > > > > > obtained
> > >> > > > > > > >> > > using
> > >> > > > > > > >> > > > > > > > > position(..)
> > >> > > > > > > >> > > > > > > > > > from the last run. With this KIP,
> user
> > >> needs
> > >> > > to
> > >> > > > > get
> > >> > > > > > the
> > >> > > > > > > >> > > offset
> > >> > > > > > > >> > > > > and
> > >> > > > > > > >> > > > > > > the
> > >> > > > > > > >> > > > > > > > > > offsetEpoch using
> > >> > positionAndOffsetEpoch(...)
> > >> > > > and
> > >> > > > > > stores
> > >> > > > > > > >> > > these
> > >> > > > > > > >> > > > > > > > > information
> > >> > > > > > > >> > > > > > > > > > externally. The next time user starts
> > >> > > consumer,
> > >> > > > > > he/she
> > >> > > > > > > >> needs
> > >> > > > > > > >> > > to
> > >> > > > > > > >> > > > > call
> > >> > > > > > > >> > > > > > > > > > seek(..., offset, offsetEpoch) right
> > >> after
> > >> > > > > > construction.
> > >> > > > > > > >> > > Then KIP
> > >> > > > > > > >> > > > > > > should
> > >> > > > > > > >> > > > > > > > > be
> > >> > > > > > > >> > > > > > > > > > able to ensure that we don't throw
> > >> > > > > > > >> OffsetOutOfRangeException
> > >> > > > > > > >> > > if
> > >> > > > > > > >> > > > > > > there is
> > >> > > > > > > >> > > > > > > > > no
> > >> > > > > > > >> > > > > > > > > > unclean leader election.
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > Does this sound OK?
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > Regards,
> > >> > > > > > > >> > > > > > > > > > Dong
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > On Tue, Jan 23, 2018 at 11:44 PM,
> > >> Guozhang
> > >> > > Wang
> > >> > > > <
> > >> > > > > > > >> > > > > wangguoz@gmail.com>
> > >> > > > > > > >> > > > > > > > > > wrote:
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > "If consumer wants to consume
> message
> > >> with
> > >> > > > > offset
> > >> > > > > > 16,
> > >> > > > > > > >> then
> > >> > > > > > > >> > > > > consumer
> > >> > > > > > > >> > > > > > > > > must
> > >> > > > > > > >> > > > > > > > > > > have
> > >> > > > > > > >> > > > > > > > > > > already fetched message with offset
> > 15"
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > --> this may not be always true
> > right?
> > >> > What
> > >> > > if
> > >> > > > > > > >> consumer
> > >> > > > > > > >> > > just
> > >> > > > > > > >> > > > > call
> > >> > > > > > > >> > > > > > > > > > seek(16)
> > >> > > > > > > >> > > > > > > > > > > after construction and then poll
> > >> without
> > >> > > > > committed
> > >> > > > > > > >> offset
> > >> > > > > > > >> > > ever
> > >> > > > > > > >> > > > > > > stored
> > >> > > > > > > >> > > > > > > > > > > before? Admittedly it is rare but
> we
> > do
> > >> > not
> > >> > > > > > > >> programmably
> > >> > > > > > > >> > > > > disallow
> > >> > > > > > > >> > > > > > > it.
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > Guozhang
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > On Tue, Jan 23, 2018 at 10:42 PM,
> > Dong
> > >> > Lin <
> > >> > > > > > > >> > > > > lindong28@gmail.com>
> > >> > > > > > > >> > > > > > > > > wrote:
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > Hey Guozhang,
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > Thanks much for reviewing the
> KIP!
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > In the scenario you described,
> > let's
> > >> > > assume
> > >> > > > > that
> > >> > > > > > > >> broker
> > >> > > > > > > >> > > A has
> > >> > > > > > > >> > > > > > > > > messages
> > >> > > > > > > >> > > > > > > > > > > with
> > >> > > > > > > >> > > > > > > > > > > > offset up to 10, and broker B has
> > >> > messages
> > >> > > > > with
> > >> > > > > > > >> offset
> > >> > > > > > > >> > > up to
> > >> > > > > > > >> > > > > 20.
> > >> > > > > > > >> > > > > > > If
> > >> > > > > > > >> > > > > > > > > > > > consumer wants to consume message
> > >> with
> > >> > > > offset
> > >> > > > > > 9, it
> > >> > > > > > > >> will
> > >> > > > > > > >> > > not
> > >> > > > > > > >> > > > > > > receive
> > >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException
> > >> > > > > > > >> > > > > > > > > > > > from broker A.
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > If consumer wants to consume
> > message
> > >> > with
> > >> > > > > offset
> > >> > > > > > > >> 16, then
> > >> > > > > > > >> > > > > > > consumer
> > >> > > > > > > >> > > > > > > > > must
> > >> > > > > > > >> > > > > > > > > > > > have already fetched message with
> > >> offset
> > >> > > 15,
> > >> > > > > > which
> > >> > > > > > > >> can
> > >> > > > > > > >> > > only
> > >> > > > > > > >> > > > > come
> > >> > > > > > > >> > > > > > > from
> > >> > > > > > > >> > > > > > > > > > > > broker B. Because consumer will
> > fetch
> > >> > from
> > >> > > > > > broker B
> > >> > > > > > > >> only
> > >> > > > > > > >> > > if
> > >> > > > > > > >> > > > > > > > > leaderEpoch
> > >> > > > > > > >> > > > > > > > > > > >=
> > >> > > > > > > >> > > > > > > > > > > > 2, then the current consumer
> > >> leaderEpoch
> > >> > > can
> > >> > > > > > not be
> > >> > > > > > > >> 1
> > >> > > > > > > >> > > since
> > >> > > > > > > >> > > > > this
> > >> > > > > > > >> > > > > > > KIP
> > >> > > > > > > >> > > > > > > > > > > > prevents leaderEpoch rewind. Thus
> > we
> > >> > will
> > >> > > > not
> > >> > > > > > have
> > >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException
> > >> > > > > > > >> > > > > > > > > > > > in this case.
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > Does this address your question,
> or
> > >> > maybe
> > >> > > > > there
> > >> > > > > > is
> > >> > > > > > > >> more
> > >> > > > > > > >> > > > > advanced
> > >> > > > > > > >> > > > > > > > > > scenario
> > >> > > > > > > >> > > > > > > > > > > > that the KIP does not handle?
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > Thanks,
> > >> > > > > > > >> > > > > > > > > > > > Dong
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > On Tue, Jan 23, 2018 at 9:43 PM,
> > >> > Guozhang
> > >> > > > > Wang <
> > >> > > > > > > >> > > > > > > wangguoz@gmail.com>
> > >> > > > > > > >> > > > > > > > > > > wrote:
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > Thanks Dong, I made a pass over
> > the
> > >> > wiki
> > >> > > > and
> > >> > > > > > it
> > >> > > > > > > >> lgtm.
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > Just a quick question: can we
> > >> > completely
> > >> > > > > > > >> eliminate the
> > >> > > > > > > >> > > > > > > > > > > > > OffsetOutOfRangeException with
> > this
> > >> > > > > approach?
> > >> > > > > > Say
> > >> > > > > > > >> if
> > >> > > > > > > >> > > there
> > >> > > > > > > >> > > > > is
> > >> > > > > > > >> > > > > > > > > > > consecutive
> > >> > > > > > > >> > > > > > > > > > > > > leader changes such that the
> > cached
> > >> > > > > metadata's
> > >> > > > > > > >> > > partition
> > >> > > > > > > >> > > > > epoch
> > >> > > > > > > >> > > > > > > is
> > >> > > > > > > >> > > > > > > > > 1,
> > >> > > > > > > >> > > > > > > > > > > and
> > >> > > > > > > >> > > > > > > > > > > > > the metadata fetch response
> > returns
> > >> > > with
> > >> > > > > > > >> partition
> > >> > > > > > > >> > > epoch 2
> > >> > > > > > > >> > > > > > > > > pointing
> > >> > > > > > > >> > > > > > > > > > to
> > >> > > > > > > >> > > > > > > > > > > > > leader broker A, while the
> actual
> > >> > > > up-to-date
> > >> > > > > > > >> metadata
> > >> > > > > > > >> > > has
> > >> > > > > > > >> > > > > > > partition
> > >> > > > > > > >> > > > > > > > > > > > epoch 3
> > >> > > > > > > >> > > > > > > > > > > > > whose leader is now broker B,
> the
> > >> > > metadata
> > >> > > > > > > >> refresh will
> > >> > > > > > > >> > > > > still
> > >> > > > > > > >> > > > > > > > > succeed
> > >> > > > > > > >> > > > > > > > > > > and
> > >> > > > > > > >> > > > > > > > > > > > > the follow-up fetch request may
> > >> still
> > >> > > see
> > >> > > > > > OORE?
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > Guozhang
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > On Tue, Jan 23, 2018 at 3:47
> PM,
> > >> Dong
> > >> > > Lin
> > >> > > > <
> > >> > > > > > > >> > > > > lindong28@gmail.com
> > >> > > > > > > >> > > > > > > >
> > >> > > > > > > >> > > > > > > > > > wrote:
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > > Hi all,
> > >> > > > > > > >> > > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > > I would like to start the
> > voting
> > >> > > process
> > >> > > > > for
> > >> > > > > > > >> KIP-232:
> > >> > > > > > > >> > > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/
> > >> > > > > > > >> > > confluence/display/KAFKA/KIP-
> > >> > > > > > > >> > > > > > > > > > > > > >
> 232%3A+Detect+outdated+metadat
> > >> > > > > > > >> a+using+leaderEpoch+
> > >> > > > > > > >> > > > > > > > > > and+partitionEpoch
> > >> > > > > > > >> > > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > > The KIP will help fix a
> > >> concurrency
> > >> > > > issue
> > >> > > > > in
> > >> > > > > > > >> Kafka
> > >> > > > > > > >> > > which
> > >> > > > > > > >> > > > > > > > > currently
> > >> > > > > > > >> > > > > > > > > > > can
> > >> > > > > > > >> > > > > > > > > > > > > > cause message loss or message
> > >> > > > duplication
> > >> > > > > in
> > >> > > > > > > >> > > consumer.
> > >> > > > > > > >> > > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > > Regards,
> > >> > > > > > > >> > > > > > > > > > > > > > Dong
> > >> > > > > > > >> > > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > > > --
> > >> > > > > > > >> > > > > > > > > > > > > -- Guozhang
> > >> > > > > > > >> > > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > > > --
> > >> > > > > > > >> > > > > > > > > > > -- Guozhang
> > >> > > > > > > >> > > > > > > > > > >
> > >> > > > > > > >> > > > > > > > > >
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > > > > --
> > >> > > > > > > >> > > > > > > > > -- Guozhang
> > >> > > > > > > >> > > > > > > > >
> > >> > > > > > > >> > > > > > >
> > >> > > > > > > >> > > > >
> > >> > > > > > > >> > >
> > >> > > > > > > >>
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [VOTE] KIP-232: Detect outdated metadata using leaderEpoch and partitionEpoch

Posted by Jun Rao <ju...@confluent.io>.
Hi, Dong,

Sorry for the late response. Since KIP-320 is covering some of the similar
problems described in this KIP, perhaps we can wait until KIP-320 settles
and see what's still left uncovered in this KIP.

Thanks,

Jun

On Mon, Jun 4, 2018 at 7:03 PM, Dong Lin <li...@gmail.com> wrote:

> Hey Jun,
>
> It seems that we have made considerable progress on the discussion of
> KIP-253 since February. Do you think we should continue the discussion
> there, or can we continue the voting for this KIP? I am happy to submit the
> PR and move forward the progress for this KIP.
>
> Thanks!
> Dong
>
>
> On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <li...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > Sure, I will come up with a KIP this week. I think there is a way to
> allow
> > partition expansion to arbitrary number without introducing new concepts
> > such as read-only partition or repartition epoch.
> >
> > Thanks,
> > Dong
> >
> > On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> >> Hi, Dong,
> >>
> >> Thanks for the reply. The general idea that you had for adding
> partitions
> >> is similar to what we had in mind. It would be useful to make this more
> >> general, allowing adding an arbitrary number of partitions (instead of
> >> just
> >> doubling) and potentially removing partitions as well. The following is
> >> the
> >> high level idea from the discussion with Colin, Jason and Ismael.
> >>
> >> * To change the number of partitions from X to Y in a topic, the
> >> controller
> >> marks all existing X partitions as read-only and creates Y new
> partitions.
> >> The new partitions are writable and are tagged with a higher repartition
> >> epoch (RE).
> >>
> >> * The controller propagates the new metadata to every broker. Once the
> >> leader of a partition is marked as read-only, it rejects the produce
> >> requests on this partition. The producer will then refresh the metadata
> >> and
> >> start publishing to the new writable partitions.
> >>
> >> * The consumers will then be consuming messages in RE order. The
> consumer
> >> coordinator will only assign partitions in the same RE to consumers.
> Only
> >> after all messages in an RE are consumed, will partitions in a higher RE
> >> be
> >> assigned to consumers.
> >>
> >> As Colin mentioned, if we do the above, we could potentially (1) use a
> >> globally unique partition id, or (2) use a globally unique topic id to
> >> distinguish recreated partitions due to topic deletion.
> >>
> >> So, perhaps we can sketch out the re-partitioning KIP a bit more and see
> >> if
> >> there is any overlap with KIP-232. Would you be interested in doing
> that?
> >> If not, we can do that next week.
> >>
> >> Jun
> >>
> >>
> >> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <li...@gmail.com> wrote:
> >>
> >> > Hey Jun,
> >> >
> >> > Interestingly I am also planning to sketch a KIP to allow partition
> >> > expansion for keyed topics after this KIP. Since you are already doing
> >> > that, I guess I will just share my high level idea here in case it is
> >> > helpful.
> >> >
> >> > The motivation for the KIP is that we currently lose order guarantee
> for
> >> > messages with the same key if we expand partitions of keyed topic.
> >> >
> >> > The solution can probably be built upon the following ideas:
> >> >
> >> > - Partition number of the keyed topic should always be doubled (or
> >> > multiplied by power of 2). Given that we select a partition based on
> >> > hash(key) % partitionNum, this should help us ensure that, a message
> >> > assigned to an existing partition will not be mapped to another
> existing
> >> > partition after partition expansion.
> >> >
> >> > - Producer includes in the ProduceRequest some information that helps
> >> > ensure that messages produced ti a partition will monotonically
> >> increase in
> >> > the partitionNum of the topic. In other words, if broker receives a
> >> > ProduceRequest and notices that the producer does not know the
> partition
> >> > number has increased, broker should reject this request. That
> >> "information"
> >> > maybe leaderEpoch, max partitionEpoch of the partitions of the topic,
> or
> >> > simply partitionNum of the topic. The benefit of this property is that
> >> we
> >> > can keep the new logic for in-order message consumption entirely in
> how
> >> > consumer leader determines the partition -> consumer mapping.
> >> >
> >> > - When consumer leader determines partition -> consumer mapping,
> leader
> >> > first reads the start position for each partition using
> >> OffsetFetchRequest.
> >> > If start position are all non-zero, then assignment can be done in its
> >> > current manner. The assumption is that, a message in the new partition
> >> > should only be consumed after all messages with the same key produced
> >> > before it has been consumed. Since some messages in the new partition
> >> has
> >> > been consumed, we should not worry about consuming messages
> >> out-of-order.
> >> > This benefit of this approach is that we can avoid unnecessary
> overhead
> >> in
> >> > the common case.
> >> >
> >> > - If the consumer leader finds that the start position for some partition
> >> > is 0: say the current partition number is 18 and the partition index is
> >> > 12, then the consumer leader should ensure that the messages produced to
> >> > partition 12 - 18/2 = 3 before the first message of partition 12 was
> >> > produced have all been consumed, before it assigns partition 12 to any
> >> > consumer in the consumer group. Since we have a piece of "information"
> >> > that is monotonically increasing per partition, the consumer leader can
> >> > read the value of this information from the first message in partition
> >> > 12, get the offset corresponding to this value in partition 3, assign all
> >> > partitions except partition 12 (and probably other new partitions) to the
> >> > existing consumers, wait for the committed offset of partition 3 to go
> >> > beyond this offset, and then trigger rebalance again so that the withheld
> >> > partition can be assigned to some consumer.
> >> >
> >> >
> >> > Thanks,
> >> > Dong
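The two observations above -- that doubling keeps every key on either its old partition
or a brand-new one, and that each new partition has exactly one "parent" old partition --
can be checked with a small self-contained sketch; the hash below is a stand-in for the
producer's actual partitioner, and the class and method names are made up:

    // Sketch: key placement before and after doubling the partition count.
    public class PartitionDoublingSketch {
        static int partitionFor(String key, int partitionCount) {
            // Stand-in for the producer's partitioner; only the modulo step matters here.
            return Math.floorMod(key.hashCode(), partitionCount);
        }

        public static void main(String[] args) {
            int oldCount = 9;
            int newCount = 18;                       // doubled, as in the example above
            for (String key : new String[]{"k1", "k2", "order-42", "user-7"}) {
                int oldP = partitionFor(key, oldCount);
                int newP = partitionFor(key, newCount);
                // A key either stays on its old partition or moves to the new partition
                // oldP + oldCount; it never moves between two pre-existing partitions.
                if (newP != oldP && newP != oldP + oldCount) {
                    throw new AssertionError("unexpected move for " + key);
                }
            }
            // Each new partition p (p >= oldCount) splits off exactly one old partition:
            int p = 12;
            int parent = p - newCount / 2;           // 12 - 18/2 = 3, matching the example
            System.out.println("partition " + p + " splits from partition " + parent);
        }
    }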
> >> >
> >> >
> >> > On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <ju...@confluent.io> wrote:
> >> >
> >> > > Hi, Dong,
> >> > >
> >> > > Thanks for the KIP. It looks good overall. We are working on a
> >> separate
> >> > KIP
> >> > > for adding partitions while preserving the ordering guarantees. That
> >> may
> >> > > require another flavor of partition epoch. It's not very clear
> whether
> >> > that
> >> > > partition epoch can be merged with the partition epoch in this KIP.
> >> So,
> >> > > perhaps you can wait on this a bit until we post the other KIP in
> the
> >> > next
> >> > > few days.
> >> > >
> >> > > Jun
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <be...@gmail.com>
> >> wrote:
> >> > >
> >> > > > +1 on the KIP.
> >> > > >
> >> > > > I think the KIP is mainly about adding the capability of tracking
> >> the
> >> > > > system state change lineage. It does not seem necessary to bundle
> >> this
> >> > > KIP
> >> > > > with replacing the topic partition with partition epoch in
> >> > produce/fetch.
> >> > > > Replacing topic-partition string with partition epoch is
> >> essentially a
> >> > > > performance improvement on top of this KIP. That can probably be
> >> done
> >> > > > separately.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jiangjie (Becket) Qin
> >> > > >
> >> > > > On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <li...@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > Hey Colin,
> >> > > > >
> >> > > > > On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe <
> >> cmccabe@apache.org>
> >> > > > wrote:
> >> > > > >
> >> > > > > > > On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin <
> >> lindong28@gmail.com>
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hey Colin,
> >> > > > > > > >
> >> > > > > > > > I understand that the KIP will adds overhead by
> introducing
> >> > > > > > per-partition
> >> > > > > > > > partitionEpoch. I am open to alternative solutions that
> does
> >> > not
> >> > > > > incur
> >> > > > > > > > additional overhead. But I don't see a better way now.
> >> > > > > > > >
> >> > > > > > > > IMO the overhead in the FetchResponse may not be that
> much.
> >> We
> >> > > > > probably
> >> > > > > > > > should discuss the percentage increase rather than the
> >> absolute
> >> > > > > number
> >> > > > > > > > increase. Currently after KIP-227, per-partition header
> has
> >> 23
> >> > > > bytes.
> >> > > > > > This
> >> > > > > > > > KIP adds another 4 bytes. Assume the records size is 10KB,
> >> the
> >> > > > > > percentage
> >> > > > > > > > increase is 4 / (23 + 10000) = 0.03%. It seems negligible,
> >> > right?
> >> > > > > >
> >> > > > > > Hi Dong,
> >> > > > > >
> >> > > > > > Thanks for the response.  I agree that the FetchRequest /
> >> > > FetchResponse
> >> > > > > > overhead should be OK, now that we have incremental fetch
> >> requests
> >> > > and
> >> > > > > > responses.  However, there are a lot of cases where the
> >> percentage
> >> > > > > increase
> >> > > > > > is much greater.  For example, if a client is doing full
> >> > > > > MetadataRequests /
> >> > > > > > Responses, we have some math kind of like this per partition:
> >> > > > > >
> >> > > > > > > UpdateMetadataRequestPartitionState => topic partition controller_epoch leader leader_epoch partition_epoch isr zk_version replicas offline_replicas
> >> > > > > > > 14 bytes: topic => string (assuming about 10 byte topic names)
> >> > > > > > > 4 bytes: partition => int32
> >> > > > > > > 4 bytes: controller_epoch => int32
> >> > > > > > > 4 bytes: leader => int32
> >> > > > > > > 4 bytes: leader_epoch => int32
> >> > > > > > > +4 EXTRA bytes: partition_epoch => int32        <-- NEW
> >> > > > > > > 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR)
> >> > > > > > > 4 bytes: zk_version => int32
> >> > > > > > > 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas)
> >> > > > > > > 2 bytes: offline_replicas => [int32] (assuming no offline replicas)
> >> > > > > >
> >> > > > > > Assuming I added that up correctly, the per-partition overhead goes from
> >> > > > > > 64 bytes per partition to 68, a 6.2% increase.
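A quick back-of-the-envelope check of that arithmetic, using the rough per-field
estimates quoted above rather than the exact wire format:

    // Sums the estimated field sizes listed above; prints 64 -> 68 (+6.25%).
    class PartitionStateSizeCheck {
        public static void main(String[] args) {
            int int32 = 4;
            int topic = 14;                          // ~10 byte name plus string overhead
            int isr = 2 + 3 * int32;                 // 3 replicas in the ISR
            int replicas = 2 + 3 * int32;            // 3 assigned replicas
            int offline = 2;                         // empty offline_replicas array
            // partition, controller_epoch, leader, leader_epoch are four int32 fields.
            int before = topic + 4 * int32 + isr + int32 /*zk_version*/ + replicas + offline;
            int after = before + int32;              // plus the new partition_epoch
            System.out.printf("%d -> %d bytes (+%.2f%%)%n", before, after,
                    100.0 * int32 / before);
        }
    }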
> >> > > > > >
> >> > > > > > We could do similar math for a lot of the other RPCs.  And you
> >> will
> >> > > > have
> >> > > > > a
> >> > > > > > similar memory and garbage collection impact on the brokers
> >> since
> >> > you
> >> > > > > have
> >> > > > > > to store all this extra state as well.
> >> > > > > >
> >> > > > >
> >> > > > > That is correct. IMO the Metadata is only updated periodically
> >> and is
> >> > > > > probably not a big deal if we increase it by 6%. The
> FetchResponse
> >> > and
> >> > > > > ProduceRequest are probably the only requests that are bounded
> by
> >> the
> >> > > > > bandwidth throughput.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > > >
> >> > > > > > > > I agree that we can probably save more space by using
> >> partition
> >> > > ID
> >> > > > so
> >> > > > > > that
> >> > > > > > > > we no longer needs the string topic name. The similar idea
> >> has
> >> > > also
> >> > > > > > been
> >> > > > > > > > put in the Rejected Alternative section in KIP-227. While
> >> this
> >> > > idea
> >> > > > > is
> >> > > > > > > > promising, it seems orthogonal to the goal of this KIP.
> >> Given
> >> > > that
> >> > > > > > there is
> >> > > > > > > > already many work to do in this KIP, maybe we can do the
> >> > > partition
> >> > > > ID
> >> > > > > > in a
> >> > > > > > > > separate KIP?
> >> > > > > >
> >> > > > > > I guess my thinking is that the goal here is to replace an
> >> > identifier
> >> > > > > > which can be re-used (the tuple of topic name, partition ID)
> >> with
> >> > an
> >> > > > > > identifier that cannot be re-used (the tuple of topic name,
> >> > partition
> >> > > > ID,
> >> > > > > > partition epoch) in order to gain better semantics.  As long
> as
> >> we
> >> > > are
> >> > > > > > replacing the identifier, why not replace it with an
> identifier
> >> > that
> >> > > > has
> >> > > > > > important performance advantages?  The KIP freeze for the next
> >> > > release
> >> > > > > has
> >> > > > > > already passed, so there is time to do this.
> >> > > > > >
> >> > > > >
> >> > > > > In general it can be easier for discussion and implementation if we can
> >> > > > > split a larger task into smaller and independent tasks. For example,
> >> > > > > KIP-112 and KIP-113 both deal with the JBOD support. KIP-31, KIP-32 and
> >> > > > > KIP-33 are about timestamp support. The opinion on this can be
> >> > > > > subjective, though.
> >> > > > >
> >> > > > > IMO the change to switch from (topic, partition ID) to partitionEpoch in
> >> > > > > all request/response requires us to go through all requests one by one.
> >> > > > > It may not be hard but it can be time consuming and tedious. At high
> >> > > > > level the goal and the change for that will be orthogonal to the changes
> >> > > > > required in this KIP. That is the main reason I think we can split them
> >> > > > > into two KIPs.
> >> > > > >
> >> > > > >
> >> > > > > > On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote:
> >> > > > > > > I think it is possible to move to entirely use
> partitionEpoch
> >> > > instead
> >> > > > > of
> >> > > > > > > (topic, partition) to identify a partition. Client can
> obtain
> >> the
> >> > > > > > > partitionEpoch -> (topic, partition) mapping from
> >> > MetadataResponse.
> >> > > > We
> >> > > > > > > probably need to figure out a way to assign partitionEpoch
> to
> >> > > > existing
> >> > > > > > > partitions in the cluster. But this should be doable.
> >> > > > > > >
> >> > > > > > > This is a good idea. I think it will save us some space in
> the
> >> > > > > > > request/response. The actual space saving in percentage
> >> probably
> >> > > > > depends
> >> > > > > > on
> >> > > > > > > the amount of data and the number of partitions of the same
> >> > topic.
> >> > > I
> >> > > > > just
> >> > > > > > > think we can do it in a separate KIP.
> >> > > > > >
> >> > > > > > Hmm.  How much extra work would be required?  It seems like we
> >> are
> >> > > > > already
> >> > > > > > changing almost every RPC that involves topics and partitions,
> >> > > already
> >> > > > > > adding new per-partition state to ZooKeeper, already changing
> >> how
> >> > > > clients
> >> > > > > > interact with partitions.  Is there some other big piece of
> work
> >> > we'd
> >> > > > > have
> >> > > > > > to do to move to partition IDs that we wouldn't need for
> >> partition
> >> > > > > epochs?
> >> > > > > > I guess we'd have to find a way to support regular
> >> expression-based
> >> > > > topic
> >> > > > > > subscriptions.  If we split this into multiple KIPs, wouldn't
> we
> >> > end
> >> > > up
> >> > > > > > changing all that RPCs and ZK state a second time?  Also, I'm
> >> > curious
> >> > > > if
> >> > > > > > anyone has done any proof of concept GC, memory, and network
> >> usage
> >> > > > > > measurements on switching topic names for topic IDs.
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > > We will need to go over all requests/responses to check how to
> >> > > > > replace (topic, partition ID) with partition epoch. It requires
> >> > > > > non-trivial work and could take time. As you mentioned, we may want
> >> > > > > to see how much we can save by switching from topic names to
> >> > > > > partition epoch. That itself requires time and experimentation. It
> >> > > > > seems that the new idea does not roll back any change proposed in
> >> > > > > this KIP. So I am not sure we gain much by putting them into the
> >> > > > > same KIP.
> >> > > > >
> >> > > > > Anyway, if more people are interested in seeing the new idea in
> >> the
> >> > > same
> >> > > > > KIP, I can try that.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > best,
> >> > > > > > Colin
> >> > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe <
> >> > > cmccabe@apache.org
> >> > > > >
> >> > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > >> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
> >> > > > > > > >> > Hey Colin,
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <
> >> > > > > cmccabe@apache.org>
> >> > > > > > > >> wrote:
> >> > > > > > > >> >
> >> > > > > > > >> > > On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
> >> > > > > > > >> > > > Hey Colin,
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > Thanks for the comment.
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <
> >> > > > > > cmccabe@apache.org>
> >> > > > > > > >> > > wrote:
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > > On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
> >> > > > > > > >> > > > > > Hey Colin,
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > Thanks for reviewing the KIP.
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > If I understand you right, you may be suggesting
> >> > > > > > > >> > > > > > that we can use a global metadataEpoch that is
> >> > > > > > > >> > > > > > incremented every time the controller updates the
> >> > > > > > > >> > > > > > metadata. The problem with this solution is that, if
> >> > > > > > > >> > > > > > a topic is deleted and created again, the user will
> >> > > > > > > >> > > > > > not know whether the offset stored before the topic
> >> > > > > > > >> > > > > > deletion is still valid. This motivates the idea to
> >> > > > > > > >> > > > > > include a per-partition partitionEpoch. Does this
> >> > > > > > > >> > > > > > sound reasonable?
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > Hi Dong,
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > Perhaps we can store the last valid offset of
> each
> >> > > deleted
> >> > > > > > topic
> >> > > > > > > >> in
> >> > > > > > > >> > > > > ZooKeeper.  Then, when a topic with one of those
> >> names
> >> > > > gets
> >> > > > > > > >> > > re-created, we
> >> > > > > > > >> > > > > can start the topic at the previous end offset
> >> rather
> >> > > than
> >> > > > > at
> >> > > > > > 0.
> >> > > > > > > >> This
> >> > > > > > > >> > > > > preserves immutability.  It is no more burdensome
> >> than
> >> > > > > having
> >> > > > > > to
> >> > > > > > > >> > > preserve a
> >> > > > > > > >> > > > > "last epoch" for the deleted partition somewhere,
> >> > right?
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > My concern with this solution is that the number of
> >> > > > > > > >> > > > zookeeper nodes keeps growing over time if some users
> >> > > > > > > >> > > > keep deleting and creating topics. Do you think this can
> >> > > > > > > >> > > > be a problem?
> >> > > > > > > >> > >
> >> > > > > > > >> > > Hi Dong,
> >> > > > > > > >> > >
> >> > > > > > > >> > > We could expire the "partition tombstones" after an
> >> hour
> >> > or
> >> > > > so.
> >> > > > > > In
> >> > > > > > > >> > > practice this would solve the issue for clients that
> >> like
> >> > to
> >> > > > > > destroy
> >> > > > > > > >> and
> >> > > > > > > >> > > re-create topics all the time.  In any case, doesn't
> >> the
> >> > > > current
> >> > > > > > > >> proposal
> >> > > > > > > >> > > add per-partition znodes as well that we have to
> track
> >> > even
> >> > > > > after
> >> > > > > > the
> >> > > > > > > >> > > partition is deleted?  Or did I misunderstand that?
> >> > > > > > > >> > >
> >> > > > > > > >> >
> >> > > > > > > >> > Actually the current KIP does not add per-partition
> >> znodes.
> >> > > > Could
> >> > > > > > you
> >> > > > > > > >> > double check? I can fix the KIP wiki if there is
> anything
> >> > > > > > misleading.
> >> > > > > > > >>
> >> > > > > > > >> Hi Dong,
> >> > > > > > > >>
> >> > > > > > > >> I double-checked the KIP, and I can see that you are in
> >> fact
> >> > > > using a
> >> > > > > > > >> global counter for initializing partition epochs.  So,
> you
> >> are
> >> > > > > > correct, it
> >> > > > > > > >> doesn't add per-partition znodes for partitions that no
> >> longer
> >> > > > > exist.
> >> > > > > > > >>
> >> > > > > > > >> >
> >> > > > > > > >> > If we expire the "partition tombstones" after an hour, and
> >> > > > > > > >> > the topic is re-created more than an hour after the topic
> >> > > > > > > >> > deletion, then we are back to the situation where the user
> >> > > > > > > >> > cannot tell whether the topic has been re-created or not,
> >> > > > > > > >> > right?
> >> > > > > > > >>
> >> > > > > > > >> Yes, with an expiration period, it would not ensure
> >> > > immutability--
> >> > > > > you
> >> > > > > > > >> could effectively reuse partition names and they would
> look
> >> > the
> >> > > > > same.
> >> > > > > > > >>
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > >
> >> > > > > > > >> > > It's not really clear to me what should happen when a
> >> > topic
> >> > > is
> >> > > > > > > >> destroyed
> >> > > > > > > >> > > and re-created with new data.  Should consumers
> >> continue
> >> > to
> >> > > be
> >> > > > > > able to
> >> > > > > > > >> > > consume?  We don't know where they stopped consuming
> >> from
> >> > > the
> >> > > > > > previous
> >> > > > > > > >> > > incarnation of the topic, so messages may have been
> >> lost.
> >> > > > > > Certainly
> >> > > > > > > >> > > consuming data from offset X of the new incarnation
> of
> >> the
> >> > > > topic
> >> > > > > > may
> >> > > > > > > >> give
> >> > > > > > > >> > > something totally different from what you would have
> >> > gotten
> >> > > > from
> >> > > > > > > >> offset X
> >> > > > > > > >> > > of the previous incarnation of the topic.
> >> > > > > > > >> > >
> >> > > > > > > >> >
> >> > > > > > > >> > With the current KIP, if a consumer consumes a topic based
> >> > > > > > > >> > on the last remembered (offset, partitionEpoch, leaderEpoch),
> >> > > > > > > >> > and the topic is re-created, the consumer will throw
> >> > > > > > > >> > InvalidPartitionEpochException because the previous
> >> > > > > > > >> > partitionEpoch will be different from the current
> >> > > > > > > >> > partitionEpoch. This is described in the Proposed Changes ->
> >> > > > > > > >> > Consumption after topic deletion section of the KIP. I can
> >> > > > > > > >> > improve the KIP if anything there is unclear.
> >> > > > > > > >>
> >> > > > > > > >> Thanks for the clarification.  It sounds like what you
> >> really
> >> > > want
> >> > > > > is
> >> > > > > > > >> immutability-- i.e., to never "really" reuse partition
> >> > > > identifiers.
> >> > > > > > And
> >> > > > > > > >> you do this by making the partition name no longer the
> >> "real"
> >> > > > > > identifier.
> >> > > > > > > >>
> >> > > > > > > >> My big concern about this KIP is that it seems like an
> >> > > > > > anti-scalability
> >> > > > > > > >> feature.  Now we are adding 4 extra bytes for every
> >> partition
> >> > in
> >> > > > the
> >> > > > > > > >> FetchResponse and FetchRequest, for example.  That could be
> >> > > > > > > >> 40 KB per request,
> >> > > > > > > >> if the user has 10,000 partitions.  And of course, the
> KIP
> >> > also
> >> > > > > makes
> >> > > > > > > >> massive changes to UpdateMetadataRequest,
> MetadataResponse,
> >> > > > > > > >> OffsetCommitRequest, OffsetFetchResponse,
> >> LeaderAndIsrRequest,
> >> > > > > > > >> ListOffsetResponse, etc. which will also increase their
> >> size
> >> > on
> >> > > > the
> >> > > > > > wire
> >> > > > > > > >> and in memory.
> >> > > > > > > >>
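> >> > > > > > > >> To make the overhead concrete, a quick back-of-envelope sketch
> >> > > > > > > >> using the numbers above (the 4-byte field is the proposed
> >> > > > > > > >> per-partition epoch):
> >> > > > > > > >>
> >> > > > > > > >>     public class EpochOverheadEstimate {
> >> > > > > > > >>         public static void main(String[] args) {
> >> > > > > > > >>             int partitions = 10_000;   // partitions covered by one response
> >> > > > > > > >>             int epochFieldBytes = 4;   // one 32-bit epoch per partition
> >> > > > > > > >>             System.out.println(partitions * epochFieldBytes + " extra bytes (~40 KB)");
> >> > > > > > > >>         }
> >> > > > > > > >>     }
> >> > > > > > > >>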
> >> > > > > > > >> One thing that we talked a lot about in the past is
> >> replacing
> >> > > > > > partition
> >> > > > > > > >> names with IDs.  IDs have a lot of really nice features.
> >> They
> >> > > > take
> >> > > > > > up much
> >> > > > > > > >> less space in memory than strings (especially 2-byte Java
> >> > > > strings).
> >> > > > > > They
> >> > > > > > > >> can often be allocated on the stack rather than the heap
> >> > > > (important
> >> > > > > > when
> >> > > > > > > >> you are dealing with hundreds of thousands of them).
> They
> >> can
> >> > > be
> >> > > > > > > >> efficiently deserialized and serialized.  If we use
> 64-bit
> >> > ones,
> >> > > > we
> >> > > > > > will
> >> > > > > > > >> never run out of IDs, which means that they can always be
> >> > unique
> >> > > > per
> >> > > > > > > >> partition.
> >> > > > > > > >>
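> >> > > > > > > >> As a rough illustration of the difference in shape (not an
> >> > > > > > > >> actual Kafka API change, just the two representations side by
> >> > > > > > > >> side):
> >> > > > > > > >>
> >> > > > > > > >>     import org.apache.kafka.common.TopicPartition;
> >> > > > > > > >>
> >> > > > > > > >>     public class IdentifierShapes {
> >> > > > > > > >>         public static void main(String[] args) {
> >> > > > > > > >>             long partitionId = 42L;  // a 64-bit id: one primitive, trivially serialized
> >> > > > > > > >>             TopicPartition byName = new TopicPartition("my-topic", 0); // heap object holding a string
> >> > > > > > > >>             System.out.println(partitionId + " vs " + byName);
> >> > > > > > > >>         }
> >> > > > > > > >>     }
> >> > > > > > > >>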
> >> > > > > > > >> Given that the partition name is no longer the "real"
> >> > identifier
> >> > > > for
> >> > > > > > > >> partitions in the current KIP-232 proposal, why not just
> >> move
> >> > to
> >> > > > > using
> >> > > > > > > >> partition IDs entirely instead of strings?  You have to
> >> change
> >> > > all
> >> > > > > the
> >> > > > > > > >> messages anyway.  There isn't much point any more to
> >> carrying
> >> > > > around
> >> > > > > > the
> >> > > > > > > >> partition name in every RPC, since you really need (name,
> >> > epoch)
> >> > > > to
> >> > > > > > > >> identify the partition.
> >> > > > > > > >> Probably the metadata response and a few other messages
> >> would
> >> > > have
> >> > > > > to
> >> > > > > > > >> still carry the partition name, to allow clients to go
> from
> >> > name
> >> > > > to
> >> > > > > > id.
> >> > > > > > > >> But we could mostly forget about the strings.  And then
> >> this
> >> > > would
> >> > > > > be
> >> > > > > > a
> >> > > > > > > >> scalability improvement rather than a scalability
> problem.
> >> > > > > > > >>
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > > By choosing to reuse the same (topic, partition,
> >> offset)
> >> > > > > 3-tuple,
> >> > > > > > we
> >> > > > > > > >> have
> >> > > > > > > >> >
> >> > > > > > > >> > chosen to give up immutability.  That was a really bad
> >> > > decision.
> >> > > > > > And
> >> > > > > > > >> now
> >> > > > > > > >> > > we have to worry about time dependencies, stale
> cached
> >> > data,
> >> > > > and
> >> > > > > > all
> >> > > > > > > >> the
> >> > > > > > > >> > > rest.  We can't completely fix this inside Kafka no
> >> matter
> >> > > > what
> >> > > > > > we do,
> >> > > > > > > >> > > because not all that cached data is inside Kafka
> >> itself.
> >> > > Some
> >> > > > > of
> >> > > > > > it
> >> > > > > > > >> may be
> >> > > > > > > >> > > in systems that Kafka has sent data to, such as other
> >> > > daemons,
> >> > > > > SQL
> >> > > > > > > >> > > databases, streams, and so forth.
> >> > > > > > > >> > >
> >> > > > > > > >> >
> >> > > > > > > >> > The current KIP will uniquely identify a message using
> >> > (topic,
> >> > > > > > > >> partition,
> >> > > > > > > >> > offset, partitionEpoch) 4-tuple. This addresses the
> >> message
> >> > > > > > immutability
> >> > > > > > > >> > issue that you mentioned. Is there any corner case
> where
> >> the
> >> > > > > message
> >> > > > > > > >> > immutability is still not preserved with the current
> KIP?
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > >
> >> > > > > > > >> > > I guess the idea here is that mirror maker should
> work
> >> as
> >> > > > > expected
> >> > > > > > > >> when
> >> > > > > > > >> > > users destroy a topic and re-create it with the same
> >> name.
> >> > > > > That's
> >> > > > > > > >> kind of
> >> > > > > > > >> > > tough, though, since in that scenario, mirror maker
> >> > probably
> >> > > > > > should
> >> > > > > > > >> destroy
> >> > > > > > > >> > > and re-create the topic on the other end, too, right?
> >> > > > > Otherwise,
> >> > > > > > > >> what you
> >> > > > > > > >> > > end up with on the other end could be half of one
> >> > > incarnation
> >> > > > of
> >> > > > > > the
> >> > > > > > > >> topic,
> >> > > > > > > >> > > and half of another.
> >> > > > > > > >> > >
> >> > > > > > > >> > > What mirror maker really needs is to be able to
> follow
> >> a
> >> > > > stream
> >> > > > > of
> >> > > > > > > >> events
> >> > > > > > > >> > > about the kafka cluster itself.  We could have some
> >> master
> >> > > > topic
> >> > > > > > > >> which is
> >> > > > > > > >> > > always present and which contains data about all
> topic
> >> > > > > deletions,
> >> > > > > > > >> > > creations, etc.  Then MM can simply follow this topic
> >> and
> >> > do
> >> > > > > what
> >> > > > > > is
> >> > > > > > > >> needed.
> >> > > > > > > >> > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > Then the next question may be: should we use a
> >> > > > > > > >> > > > > > global metadataEpoch + per-partition partitionEpoch,
> >> > > > > > > >> > > > > > instead of using per-partition leaderEpoch +
> >> > > > > > > >> > > > > > per-partition partitionEpoch? The former solution
> >> > > > > > > >> > > > > > using metadataEpoch would not work due to the
> >> > > > > > > >> > > > > > following scenario (provided by Jun):
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > "Consider the following scenario. In metadata
> v1,
> >> > the
> >> > > > > leader
> >> > > > > > > >> for a
> >> > > > > > > >> > > > > > partition is at broker 1. In metadata v2,
> leader
> >> is
> >> > at
> >> > > > > > broker
> >> > > > > > > >> 2. In
> >> > > > > > > >> > > > > > metadata v3, leader is at broker 1 again. The
> >> last
> >> > > > > committed
> >> > > > > > > >> offset
> >> > > > > > > >> > > in
> >> > > > > > > >> > > > > v1,
> >> > > > > > > >> > > > > > v2 and v3 are 10, 20 and 30, respectively. A
> >> > consumer
> >> > > is
> >> > > > > > > >> started and
> >> > > > > > > >> > > > > reads
> >> > > > > > > >> > > > > > metadata v1 and reads messages from offset 0 to
> >> 25
> >> > > from
> >> > > > > > broker
> >> > > > > > > >> 1. My
> >> > > > > > > >> > > > > > understanding is that in the current proposal,
> >> the
> >> > > > > metadata
> >> > > > > > > >> version
> >> > > > > > > >> > > > > > associated with offset 25 is v1. The consumer
> is
> >> > then
> >> > > > > > restarted
> >> > > > > > > >> and
> >> > > > > > > >> > > > > fetches
> >> > > > > > > >> > > > > > metadata v2. The consumer tries to read from
> >> broker
> >> > 2,
> >> > > > > > which is
> >> > > > > > > >> the
> >> > > > > > > >> > > old
> >> > > > > > > >> > > > > > leader with the last offset at 20. In this
> case,
> >> the
> >> > > > > > consumer
> >> > > > > > > >> will
> >> > > > > > > >> > > still
> >> > > > > > > >> > > > > > get OffsetOutOfRangeException incorrectly."
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > Regarding your comment "For the second purpose,
> >> this
> >> > > is
> >> > > > > > "soft
> >> > > > > > > >> state"
> >> > > > > > > >> > > > > > anyway.  If the client thinks X is the leader
> >> but Y
> >> > is
> >> > > > > > really
> >> > > > > > > >> the
> >> > > > > > > >> > > leader,
> >> > > > > > > >> > > > > > the client will talk to X, and X will point out
> >> its
> >> > > > > mistake
> >> > > > > > by
> >> > > > > > > >> > > sending
> >> > > > > > > >> > > > > back
> >> > > > > > > >> > > > > > a NOT_LEADER_FOR_PARTITION.", it is probably not
> >> > > > > > > >> > > > > > true. The
> >> > > > > > > >> problem
> >> > > > > > > >> > > here is
> >> > > > > > > >> > > > > > that the old leader X may still think it is the
> >> > leader
> >> > > > of
> >> > > > > > the
> >> > > > > > > >> > > partition
> >> > > > > > > >> > > > > and
> >> > > > > > > >> > > > > > thus it will not send back
> >> NOT_LEADER_FOR_PARTITION.
> >> > > The
> >> > > > > > reason
> >> > > > > > > >> is
> >> > > > > > > >> > > > > provided
> >> > > > > > > >> > > > > > in KAFKA-6262. Can you check if that makes
> sense?
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > This is solvable with a timeout, right?  If the
> >> leader
> >> > > > can't
> >> > > > > > > >> > > communicate
> >> > > > > > > >> > > > > with the controller for a certain period of time,
> >> it
> >> > > > should
> >> > > > > > stop
> >> > > > > > > >> > > acting as
> >> > > > > > > >> > > > > the leader.  We have to solve this problem,
> >> anyway, in
> >> > > > order
> >> > > > > > to
> >> > > > > > > >> fix
> >> > > > > > > >> > > all the
> >> > > > > > > >> > > > > corner cases.
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > Not sure if I fully understand your proposal. The
> >> > proposal
> >> > > > > > seems to
> >> > > > > > > >> > > require
> >> > > > > > > >> > > > non-trivial changes to our existing leadership
> >> election
> >> > > > > > mechanism.
> >> > > > > > > >> Could
> >> > > > > > > >> > > > you provide more detail regarding how it works? For
> >> > > > > > > >> > > > example, how should the user choose this timeout, how
> >> > > > > > > >> > > > does the leader determine whether it can still
> >> > > > > > > >> > > > communicate with the controller, and how does this
> >> > > > > > > >> > > > trigger the controller to elect a new leader?
> >> > > > > > > >> > >
> >> > > > > > > >> > > Before I come up with any proposal, let me make sure
> I
> >> > > > > understand
> >> > > > > > the
> >> > > > > > > >> > > problem correctly.  My big question was, what
> prevents
> >> > > > > split-brain
> >> > > > > > > >> here?
> >> > > > > > > >> > >
> >> > > > > > > >> > > Let's say I have a partition which is on nodes A, B,
> >> and
> >> > C,
> >> > > > with
> >> > > > > > > >> min-ISR
> >> > > > > > > >> > > 2.  The controller is D.  At some point, there is a
> >> > network
> >> > > > > > partition
> >> > > > > > > >> > > between A and B and the rest of the cluster.  The
> >> > Controller
> >> > > > > > > >> re-assigns the
> >> > > > > > > >> > > partition to nodes C, D, and E.  But A and B keep
> >> chugging
> >> > > > away,
> >> > > > > > even
> >> > > > > > > >> > > though they can no longer communicate with the
> >> controller.
> >> > > > > > > >> > >
> >> > > > > > > >> > > At some point, a client with stale metadata writes to
> >> the
> >> > > > > > partition.
> >> > > > > > > >> It
> >> > > > > > > >> > > still thinks the partition is on node A, B, and C, so
> >> > that's
> >> > > > > > where it
> >> > > > > > > >> sends
> >> > > > > > > >> > > the data.  It's unable to talk to C, but A and B
> reply
> >> > back
> >> > > > that
> >> > > > > > all
> >> > > > > > > >> is
> >> > > > > > > >> > > well.
> >> > > > > > > >> > >
> >> > > > > > > >> > > Is this not a case where we could lose data due to
> >> split
> >> > > > brain?
> >> > > > > > Or is
> >> > > > > > > >> > > there a mechanism for preventing this that I missed?
> >> If
> >> > it
> >> > > > is,
> >> > > > > it
> >> > > > > > > >> seems
> >> > > > > > > >> > > like a pretty serious failure case that we should be
> >> > > handling
> >> > > > > > with our
> >> > > > > > > >> > > metadata rework.  And I think epoch numbers and
> >> timeouts
> >> > > might
> >> > > > > be
> >> > > > > > > >> part of
> >> > > > > > > >> > > the solution.
> >> > > > > > > >> > >
> >> > > > > > > >> >
> >> > > > > > > >> > Right, split brain can happen if RF=4 and minIsr=2. However,
> >> > > > > > > >> > I am not sure it is a pretty serious issue that we need to
> >> > > > > > > >> > address today. This can be prevented by configuring the Kafka
> >> > > > > > > >> > topic so that minIsr > RF/2. Actually, if the user sets
> >> > > > > > > >> > minIsr=2, is there any reason the user would want to set RF=4
> >> > > > > > > >> > instead of 3?
> >> > > > > > > >> >
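> >> > > > > > > >> > To make the condition concrete, a small sketch of the check
> >> > > > > > > >> > implied above (a hypothetical helper, not a Kafka API):
> >> > > > > > > >> >
> >> > > > > > > >> >     public class SplitBrainCheck {
> >> > > > > > > >> >         // With minIsr > RF/2, two disjoint sets of replicas cannot both
> >> > > > > > > >> >         // reach minIsr, so writes cannot be acknowledged on both sides.
> >> > > > > > > >> >         static boolean splitBrainPossible(int replicationFactor, int minIsr) {
> >> > > > > > > >> >             return 2 * minIsr <= replicationFactor;
> >> > > > > > > >> >         }
> >> > > > > > > >> >
> >> > > > > > > >> >         public static void main(String[] args) {
> >> > > > > > > >> >             System.out.println(splitBrainPossible(4, 2)); // true  (the case above)
> >> > > > > > > >> >             System.out.println(splitBrainPossible(3, 2)); // false (minIsr > RF/2)
> >> > > > > > > >> >         }
> >> > > > > > > >> >     }
> >> > > > > > > >> >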
> >> > > > > > > >> > Introducing a timeout into the leader election mechanism is
> >> > > > > > > >> > non-trivial. I think we probably want to do that only if
> >> > > > > > > >> > there is a good use case that cannot otherwise be addressed
> >> > > > > > > >> > with the current mechanism.
> >> > > > > > > >>
> >> > > > > > > >> I still would like to think about these corner cases
> more.
> >> > But
> >> > > > > > perhaps
> >> > > > > > > >> it's not directly related to this KIP.
> >> > > > > > > >>
> >> > > > > > > >> regards,
> >> > > > > > > >> Colin
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > > best,
> >> > > > > > > >> > > Colin
> >> > > > > > > >> > >
> >> > > > > > > >> > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > >
> >> > > > > > > >> > > > > best,
> >> > > > > > > >> > > > > Colin
> >> > > > > > > >> > > > >
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > Regards,
> >> > > > > > > >> > > > > > Dong
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > On Wed, Jan 24, 2018 at 10:39 AM, Colin McCabe
> <
> >> > > > > > > >> cmccabe@apache.org>
> >> > > > > > > >> > > > > wrote:
> >> > > > > > > >> > > > > >
> >> > > > > > > >> > > > > > > Hi Dong,
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > Thanks for proposing this KIP.  I think a
> >> metadata
> >> > > > epoch
> >> > > > > > is a
> >> > > > > > > >> > > really
> >> > > > > > > >> > > > > good
> >> > > > > > > >> > > > > > > idea.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > I read through the DISCUSS thread, but I
> still
> >> > don't
> >> > > > > have
> >> > > > > > a
> >> > > > > > > >> clear
> >> > > > > > > >> > > > > picture
> >> > > > > > > >> > > > > > > of why the proposal uses a metadata epoch per
> >> > > > partition
> >> > > > > > rather
> >> > > > > > > >> > > than a
> >> > > > > > > >> > > > > > > global metadata epoch.  A metadata epoch per
> >> > > partition
> >> > > > > is
> >> > > > > > > >> kind of
> >> > > > > > > >> > > > > > > unpleasant-- it's at least 4 extra bytes per
> >> > > partition
> >> > > > > > that we
> >> > > > > > > >> > > have to
> >> > > > > > > >> > > > > send
> >> > > > > > > >> > > > > > > over the wire in every full metadata request,
> >> > which
> >> > > > > could
> >> > > > > > > >> become
> >> > > > > > > >> > > extra
> >> > > > > > > >> > > > > > > kilobytes on the wire when the number of
> >> > partitions
> >> > > > > > becomes
> >> > > > > > > >> large.
> >> > > > > > > >> > > > > Plus,
> >> > > > > > > >> > > > > > > we have to update all the auxiliary classes
> to
> >> > > include
> >> > > > > an
> >> > > > > > > >> epoch.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > We need to have a global metadata epoch
> anyway
> >> to
> >> > > > handle
> >> > > > > > > >> partition
> >> > > > > > > >> > > > > > > addition and deletion.  For example, if I
> give
> >> you
> >> > > > > > > >> > > > > > > MetadataResponse{part1,epoch 1, part2, epoch
> 1}
> >> > and
> >> > > > > > {part1,
> >> > > > > > > >> > > epoch1},
> >> > > > > > > >> > > > > which
> >> > > > > > > >> > > > > > > MetadataResponse is newer?  You have no way
> of
> >> > > > knowing.
> >> > > > > > It
> >> > > > > > > >> could
> >> > > > > > > >> > > be
> >> > > > > > > >> > > > > that
> >> > > > > > > >> > > > > > > part2 has just been created, and the response
> >> > with 2
> >> > > > > > > >> partitions is
> >> > > > > > > >> > > > > newer.
> >> > > > > > > >> > > > > > > Or it could be that part2 has just been
> >> deleted,
> >> > and
> >> > > > > > > >> therefore the
> >> > > > > > > >> > > > > response
> >> > > > > > > >> > > > > > > with 1 partition is newer.  You must have a
> >> global
> >> > > > epoch
> >> > > > > > to
> >> > > > > > > >> > > > > disambiguate
> >> > > > > > > >> > > > > > > these two cases.
> >> > > > > > > >> > > > > > >
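> >> > > > > > > >> > > > > > > To illustrate the ambiguity with plain maps
> >> > > > > > > >> > > > > > > (illustrative only, these are not real response
> >> > > > > > > >> > > > > > > objects):
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >     import java.util.Map;
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >     public class EpochAmbiguity {
> >> > > > > > > >> > > > > > >         public static void main(String[] args) {
> >> > > > > > > >> > > > > > >             Map<String, Integer> twoPartitions = Map.of("part1", 1, "part2", 1);
> >> > > > > > > >> > > > > > >             Map<String, Integer> onePartition = Map.of("part1", 1);
> >> > > > > > > >> > > > > > >             // Per-partition epochs alone cannot order these responses: part2
> >> > > > > > > >> > > > > > >             // may have just been created (the first is newer) or just been
> >> > > > > > > >> > > > > > >             // deleted (the second is newer). A response-level epoch is needed.
> >> > > > > > > >> > > > > > >             System.out.println(twoPartitions + " vs " + onePartition);
> >> > > > > > > >> > > > > > >         }
> >> > > > > > > >> > > > > > >     }
> >> > > > > > > >> > > > > > >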
> >> > > > > > > >> > > > > > > Previously, I worked on the Ceph distributed
> >> > > > filesystem.
> >> > > > > > > >> Ceph had
> >> > > > > > > >> > > the
> >> > > > > > > >> > > > > > > concept of a map of the whole cluster,
> >> maintained
> >> > > by a
> >> > > > > few
> >> > > > > > > >> servers
> >> > > > > > > >> > > > > doing
> >> > > > > > > >> > > > > > > paxos.  This map was versioned by a single
> >> 64-bit
> >> > > > epoch
> >> > > > > > number
> >> > > > > > > >> > > which
> >> > > > > > > >> > > > > > > increased on every change.  It was propagated
> >> to
> >> > > > clients
> >> > > > > > > >> through
> >> > > > > > > >> > > > > gossip.  I
> >> > > > > > > >> > > > > > > wonder if something similar could work here?
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > It seems like the Kafka MetadataResponse
> >> > serves
> >> > > > two
> >> > > > > > > >> somewhat
> >> > > > > > > >> > > > > unrelated
> >> > > > > > > >> > > > > > > purposes.  Firstly, it lets clients know what
> >> > > > partitions
> >> > > > > > > >> exist in
> >> > > > > > > >> > > the
> >> > > > > > > >> > > > > > > system and where they live.  Secondly, it
> lets
> >> > > clients
> >> > > > > > know
> >> > > > > > > >> which
> >> > > > > > > >> > > nodes
> >> > > > > > > >> > > > > > > within the partition are in-sync (in the ISR)
> >> and
> >> > > > which
> >> > > > > > node
> >> > > > > > > >> is the
> >> > > > > > > >> > > > > leader.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > The first purpose is what you really need a
> >> > metadata
> >> > > > > epoch
> >> > > > > > > >> for, I
> >> > > > > > > >> > > > > think.
> >> > > > > > > >> > > > > > > You want to know whether a partition exists
> or
> >> > not,
> >> > > or
> >> > > > > you
> >> > > > > > > >> want to
> >> > > > > > > >> > > know
> >> > > > > > > >> > > > > > > which nodes you should talk to in order to
> >> write
> >> > to
> >> > > a
> >> > > > > > given
> >> > > > > > > >> > > > > partition.  A
> >> > > > > > > >> > > > > > > single metadata epoch for the whole response
> >> > should
> >> > > be
> >> > > > > > > >> adequate
> >> > > > > > > >> > > here.
> >> > > > > > > >> > > > > We
> >> > > > > > > >> > > > > > > should not change the partition assignment
> >> without
> >> > > > going
> >> > > > > > > >> through
> >> > > > > > > >> > > > > zookeeper
> >> > > > > > > >> > > > > > > (or a similar system), and this inherently
> >> > > serializes
> >> > > > > > updates
> >> > > > > > > >> into
> >> > > > > > > >> > > a
> >> > > > > > > >> > > > > > > numbered stream.  Brokers should also stop
> >> > > responding
> >> > > > to
> >> > > > > > > >> requests
> >> > > > > > > >> > > when
> >> > > > > > > >> > > > > they
> >> > > > > > > >> > > > > > > are unable to contact ZK for a certain time
> >> > period.
> >> > > > > This
> >> > > > > > > >> prevents
> >> > > > > > > >> > > the
> >> > > > > > > >> > > > > case
> >> > > > > > > >> > > > > > > where a given partition has been moved off
> some
> >> > set
> >> > > of
> >> > > > > > nodes,
> >> > > > > > > >> but a
> >> > > > > > > >> > > > > client
> >> > > > > > > >> > > > > > > still ends up talking to those nodes and
> >> writing
> >> > > data
> >> > > > > > there.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > For the second purpose, this is "soft state"
> >> > anyway.
> >> > > > If
> >> > > > > > the
> >> > > > > > > >> client
> >> > > > > > > >> > > > > thinks
> >> > > > > > > >> > > > > > > X is the leader but Y is really the leader,
> the
> >> > > client
> >> > > > > > will
> >> > > > > > > >> talk
> >> > > > > > > >> > > to X,
> >> > > > > > > >> > > > > and
> >> > > > > > > >> > > > > > > X will point out its mistake by sending back
> a
> >> > > > > > > >> > > > > NOT_LEADER_FOR_PARTITION.
> >> > > > > > > >> > > > > > > Then the client can update its metadata again
> >> and
> >> > > find
> >> > > > > > the new
> >> > > > > > > >> > > leader,
> >> > > > > > > >> > > > > if
> >> > > > > > > >> > > > > > > there is one.  There is no need for an epoch
> to
> >> > > handle
> >> > > > > > this.
> >> > > > > > > >> > > > > Similarly, I
> >> > > > > > > >> > > > > > > can't think of a reason why changing the
> >> in-sync
> >> > > > replica
> >> > > > > > set
> >> > > > > > > >> needs
> >> > > > > > > >> > > to
> >> > > > > > > >> > > > > bump
> >> > > > > > > >> > > > > > > the epoch.
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > best,
> >> > > > > > > >> > > > > > > Colin
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > > > > On Wed, Jan 24, 2018, at 09:45, Dong Lin
> wrote:
> >> > > > > > > >> > > > > > > > Thanks much for reviewing the KIP!
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > Dong
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > On Wed, Jan 24, 2018 at 7:10 AM, Guozhang
> >> Wang <
> >> > > > > > > >> > > wangguoz@gmail.com>
> >> > > > > > > >> > > > > > > wrote:
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > > Yeah that makes sense, again I'm just
> >> making
> >> > > sure
> >> > > > we
> >> > > > > > > >> understand
> >> > > > > > > >> > > > > all the
> >> > > > > > > >> > > > > > > > > scenarios and what to expect.
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > I agree that, more generally speaking, if a
> >> > > > > > > >> > > > > > > > > user has only consumed to offset 8 and then
> >> > > > > > > >> > > > > > > > > calls seek(16) to "jump" to a further position,
> >> > > > > > > >> > > > > > > > > then she needs to be aware that OORE may be
> >> > > > > > > >> > > > > > > > > thrown and she needs to handle it or rely on the
> >> > > > > > > >> > > > > > > > > reset policy, which should not surprise her.
> >> > > > > > > >> > > > > > > > >
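> >> > > > > > > >> > > > > > > > > For example, handling it explicitly might look
> >> > > > > > > >> > > > > > > > > roughly like this (a sketch only; the
> >> > > > > > > >> > > > > > > > > configuration values and the fallback choice are
> >> > > > > > > >> > > > > > > > > assumptions):
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >     import java.util.Collections;
> >> > > > > > > >> > > > > > > > >     import java.util.Properties;
> >> > > > > > > >> > > > > > > > >     import org.apache.kafka.clients.consumer.KafkaConsumer;
> >> > > > > > > >> > > > > > > > >     import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
> >> > > > > > > >> > > > > > > > >     import org.apache.kafka.common.TopicPartition;
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >     public class SeekAheadExample {
> >> > > > > > > >> > > > > > > > >         public static void main(String[] args) {
> >> > > > > > > >> > > > > > > > >             Properties props = new Properties();
> >> > > > > > > >> > > > > > > > >             props.put("bootstrap.servers", "localhost:9092");
> >> > > > > > > >> > > > > > > > >             props.put("auto.offset.reset", "none"); // surface the error
> >> > > > > > > >> > > > > > > > >             props.put("key.deserializer",
> >> > > > > > > >> > > > > > > > >                 "org.apache.kafka.common.serialization.ByteArrayDeserializer");
> >> > > > > > > >> > > > > > > > >             props.put("value.deserializer",
> >> > > > > > > >> > > > > > > > >                 "org.apache.kafka.common.serialization.ByteArrayDeserializer");
> >> > > > > > > >> > > > > > > > >             try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
> >> > > > > > > >> > > > > > > > >                 TopicPartition tp = new TopicPartition("my-topic", 0);
> >> > > > > > > >> > > > > > > > >                 consumer.assign(Collections.singleton(tp));
> >> > > > > > > >> > > > > > > > >                 consumer.seek(tp, 16L); // "jump" past what was actually consumed
> >> > > > > > > >> > > > > > > > >                 try {
> >> > > > > > > >> > > > > > > > >                     consumer.poll(100);
> >> > > > > > > >> > > > > > > > >                 } catch (OffsetOutOfRangeException e) {
> >> > > > > > > >> > > > > > > > >                     // handle it explicitly, e.g. fall back to the earliest offset
> >> > > > > > > >> > > > > > > > >                     consumer.seekToBeginning(Collections.singleton(tp));
> >> > > > > > > >> > > > > > > > >                 }
> >> > > > > > > >> > > > > > > > >             }
> >> > > > > > > >> > > > > > > > >         }
> >> > > > > > > >> > > > > > > > >     }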
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > I'm +1 on the KIP.
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > Guozhang
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > On Wed, Jan 24, 2018 at 12:31 AM, Dong
> Lin
> >> <
> >> > > > > > > >> > > lindong28@gmail.com>
> >> > > > > > > >> > > > > > > wrote:
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > > Yes, in general we cannot prevent
> >> > > > > > > >> > > > > > > > > > OffsetOutOfRangeException if the user seeks to
> >> > > > > > > >> > > > > > > > > > a wrong offset. The main goal is to prevent
> >> > > > > > > >> > > > > > > > > > OffsetOutOfRangeException if the user has done
> >> > > > > > > >> > > > > > > > > > things in the right way, e.g. the user should
> >> > > > > > > >> > > > > > > > > > know that there is a message with this offset.
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > For example, if user calls seek(..)
> right
> >> > > after
> >> > > > > > > >> > > construction, the
> >> > > > > > > >> > > > > > > only
> >> > > > > > > >> > > > > > > > > > reason I can think of is that user
> stores
> >> > > offset
> >> > > > > > > >> externally.
> >> > > > > > > >> > > In
> >> > > > > > > >> > > > > this
> >> > > > > > > >> > > > > > > > > case,
> >> > > > > > > >> > > > > > > > > > user currently needs to use the offset
> >> which
> >> > > is
> >> > > > > > obtained
> >> > > > > > > >> > > using
> >> > > > > > > >> > > > > > > > > position(..)
> >> > > > > > > >> > > > > > > > > > from the last run. With this KIP, the user
> >> > > > > > > >> > > > > > > > > > needs to get the offset and the offsetEpoch
> >> > > > > > > >> > > > > > > > > > using positionAndOffsetEpoch(...) and store
> >> > > > > > > >> > > > > > > > > > this information externally. The next time the
> >> > > > > > > >> > > > > > > > > > user starts the consumer, he/she needs to call
> >> > > > > > > >> > > > > > > > > > seek(..., offset, offsetEpoch) right
> >> after
> >> > > > > > construction.
> >> > > > > > > >> > > Then the KIP
> >> > > > > > > >> > > > > > > should
> >> > > > > > > >> > > > > > > > > be
> >> > > > > > > >> > > > > > > > > > able to ensure that we don't throw
> >> > > > > > > >> OffsetOutOfRangeException
> >> > > > > > > >> > > if
> >> > > > > > > >> > > > > > > there is
> >> > > > > > > >> > > > > > > > > no
> >> > > > > > > >> > > > > > > > > > unclean leader election.
> >> > > > > > > >> > > > > > > > > >
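> >> > > > > > > >> > > > > > > > > > To illustrate the flow, here is a rough
> >> > > > > > > >> > > > > > > > > > sketch; positionAndOffsetEpoch(...) and the
> >> > > > > > > >> > > > > > > > > > seek(..., offset, offsetEpoch) overload are
> >> > > > > > > >> > > > > > > > > > APIs proposed in this KIP rather than existing
> >> > > > > > > >> > > > > > > > > > methods, so they appear only in comments, and
> >> > > > > > > >> > > > > > > > > > the other names are illustrative:
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > >     import java.util.Collections;
> >> > > > > > > >> > > > > > > > > >     import org.apache.kafka.clients.consumer.KafkaConsumer;
> >> > > > > > > >> > > > > > > > > >     import org.apache.kafka.common.TopicPartition;
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > >     public class ExternalOffsetStoreFlow {
> >> > > > > > > >> > > > > > > > > >         public static void run(KafkaConsumer<byte[], byte[]> consumer) {
> >> > > > > > > >> > > > > > > > > >             TopicPartition tp = new TopicPartition("my-topic", 0);
> >> > > > > > > >> > > > > > > > > >             consumer.assign(Collections.singleton(tp));
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > >             // After processing, persist both pieces externally (proposed API):
> >> > > > > > > >> > > > > > > > > >             //   OffsetAndOffsetEpoch cur = consumer.positionAndOffsetEpoch(tp);
> >> > > > > > > >> > > > > > > > > >             //   externalStore.save(tp, cur.offset(), cur.offsetEpoch());
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > >             // On the next run, right after construction, restore both:
> >> > > > > > > >> > > > > > > > > >             //   consumer.seek(tp, externalStore.offset(tp), externalStore.offsetEpoch(tp));
> >> > > > > > > >> > > > > > > > > >             // With no unclean leader election, OffsetOutOfRangeException
> >> > > > > > > >> > > > > > > > > >             // should not be thrown for an offset/epoch pair obtained this way.
> >> > > > > > > >> > > > > > > > > >         }
> >> > > > > > > >> > > > > > > > > >     }
> >> > > > > > > >> > > > > > > > > >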
> >> > > > > > > >> > > > > > > > > > Does this sound OK?
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > Regards,
> >> > > > > > > >> > > > > > > > > > Dong
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > On Tue, Jan 23, 2018 at 11:44 PM,
> >> Guozhang
> >> > > Wang
> >> > > > <
> >> > > > > > > >> > > > > wangguoz@gmail.com>
> >> > > > > > > >> > > > > > > > > > wrote:
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > "If consumer wants to consume message
> >> with
> >> > > > > offset
> >> > > > > > 16,
> >> > > > > > > >> then
> >> > > > > > > >> > > > > consumer
> >> > > > > > > >> > > > > > > > > must
> >> > > > > > > >> > > > > > > > > > > have
> >> > > > > > > >> > > > > > > > > > > already fetched message with offset
> 15"
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > --> this may not always be true, right?
> >> > > > > > > >> > > > > > > > > > > What if the consumer just calls seek(16)
> >> > > > > > > >> > > > > > > > > > > after construction and then polls without
> >> > > > > > > >> > > > > > > > > > > any committed offset ever stored before?
> >> > > > > > > >> > > > > > > > > > > Admittedly it is rare, but we do not
> >> > > > > > > >> > > > > > > > > > > programmatically disallow it.
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > Guozhang
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > On Tue, Jan 23, 2018 at 10:42 PM,
> Dong
> >> > Lin <
> >> > > > > > > >> > > > > lindong28@gmail.com>
> >> > > > > > > >> > > > > > > > > wrote:
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > Hey Guozhang,
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > Thanks much for reviewing the KIP!
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > In the scenario you described,
> let's
> >> > > assume
> >> > > > > that
> >> > > > > > > >> broker
> >> > > > > > > >> > > A has
> >> > > > > > > >> > > > > > > > > messages
> >> > > > > > > >> > > > > > > > > > > with
> >> > > > > > > >> > > > > > > > > > > > offset up to 10, and broker B has
> >> > messages
> >> > > > > with
> >> > > > > > > >> offset
> >> > > > > > > >> > > up to
> >> > > > > > > >> > > > > 20.
> >> > > > > > > >> > > > > > > If
> >> > > > > > > >> > > > > > > > > > > > consumer wants to consume message
> >> with
> >> > > > offset
> >> > > > > > 9, it
> >> > > > > > > >> will
> >> > > > > > > >> > > not
> >> > > > > > > >> > > > > > > receive
> >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException
> >> > > > > > > >> > > > > > > > > > > > from broker A.
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > If the consumer wants to consume the
> >> > > > > > > >> > > > > > > > > > > > message with offset 16, then the consumer
> >> > > > > > > >> > > > > > > > > > > > must have already fetched the message with
> >> > > > > > > >> > > > > > > > > > > > offset 15, which can only come from broker
> >> > > > > > > >> > > > > > > > > > > > B. Because the consumer will fetch from
> >> > > > > > > >> > > > > > > > > > > > broker B only if leaderEpoch >= 2, the
> >> > > > > > > >> > > > > > > > > > > > current consumer leaderEpoch cannot be 1,
> >> > > > > > > >> > > > > > > > > > > > since this KIP prevents leaderEpoch rewind.
> >> > > > > > > >> > > > > > > > > > > > Thus we will not have
> >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException in this case.
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > Does this address your question, or
> >> > maybe
> >> > > > > there
> >> > > > > > is
> >> > > > > > > >> more
> >> > > > > > > >> > > > > advanced
> >> > > > > > > >> > > > > > > > > > scenario
> >> > > > > > > >> > > > > > > > > > > > that the KIP does not handle?
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > Thanks,
> >> > > > > > > >> > > > > > > > > > > > Dong
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > On Tue, Jan 23, 2018 at 9:43 PM,
> >> > Guozhang
> >> > > > > Wang <
> >> > > > > > > >> > > > > > > wangguoz@gmail.com>
> >> > > > > > > >> > > > > > > > > > > wrote:
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > Thanks Dong, I made a pass over
> the
> >> > wiki
> >> > > > and
> >> > > > > > it
> >> > > > > > > >> lgtm.
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > Just a quick question: can we
> >> > completely
> >> > > > > > > >> eliminate the
> >> > > > > > > >> > > > > > > > > > > > > OffsetOutOfRangeException with this
> >> > > > > > > >> > > > > > > > > > > > > approach? Say there are consecutive
> >> > > > > > > >> > > > > > > > > > > > > leader changes such that the cached
> >> > > > > > > >> > > > > > > > > > > > > metadata's
> >> > > > > > > >> > > partition
> >> > > > > > > >> > > > > epoch
> >> > > > > > > >> > > > > > > is
> >> > > > > > > >> > > > > > > > > 1,
> >> > > > > > > >> > > > > > > > > > > and
> >> > > > > > > >> > > > > > > > > > > > > the metadata fetch response
> returns
> >> > > with
> >> > > > > > > >> partition
> >> > > > > > > >> > > epoch 2
> >> > > > > > > >> > > > > > > > > pointing
> >> > > > > > > >> > > > > > > > > > to
> >> > > > > > > >> > > > > > > > > > > > > leader broker A, while the actual
> >> > > > up-to-date
> >> > > > > > > >> metadata
> >> > > > > > > >> > > has
> >> > > > > > > >> > > > > > > partition
> >> > > > > > > >> > > > > > > > > > > > epoch 3
> >> > > > > > > >> > > > > > > > > > > > > whose leader is now broker B, the
> >> > > metadata
> >> > > > > > > >> refresh will
> >> > > > > > > >> > > > > still
> >> > > > > > > >> > > > > > > > > succeed
> >> > > > > > > >> > > > > > > > > > > and
> >> > > > > > > >> > > > > > > > > > > > > the follow-up fetch request may
> >> still
> >> > > see
> >> > > > > > OORE?
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > Guozhang
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > On Tue, Jan 23, 2018 at 3:47 PM,
> >> Dong
> >> > > Lin
> >> > > > <
> >> > > > > > > >> > > > > lindong28@gmail.com
> >> > > > > > > >> > > > > > > >
> >> > > > > > > >> > > > > > > > > > wrote:
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > > Hi all,
> >> > > > > > > >> > > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > > I would like to start the
> voting
> >> > > process
> >> > > > > for
> >> > > > > > > >> KIP-232:
> >> > > > > > > >> > > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/
> >> > > > > > > >> > > confluence/display/KAFKA/KIP-
> >> > > > > > > >> > > > > > > > > > > > > > 232%3A+Detect+outdated+metadat
> >> > > > > > > >> a+using+leaderEpoch+
> >> > > > > > > >> > > > > > > > > > and+partitionEpoch
> >> > > > > > > >> > > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > > The KIP will help fix a
> >> concurrency
> >> > > > issue
> >> > > > > in
> >> > > > > > > >> Kafka
> >> > > > > > > >> > > which
> >> > > > > > > >> > > > > > > > > currently
> >> > > > > > > >> > > > > > > > > > > can
> >> > > > > > > >> > > > > > > > > > > > > > cause message loss or message
> >> > > > duplication
> >> > > > > in
> >> > > > > > > >> > > consumer.
> >> > > > > > > >> > > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > > Regards,
> >> > > > > > > >> > > > > > > > > > > > > > Dong
> >> > > > > > > >> > > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > > > --
> >> > > > > > > >> > > > > > > > > > > > > -- Guozhang
> >> > > > > > > >> > > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > > > --
> >> > > > > > > >> > > > > > > > > > > -- Guozhang
> >> > > > > > > >> > > > > > > > > > >
> >> > > > > > > >> > > > > > > > > >
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > > > > --
> >> > > > > > > >> > > > > > > > > -- Guozhang
> >> > > > > > > >> > > > > > > > >
> >> > > > > > > >> > > > > > >
> >> > > > > > > >> > > > >
> >> > > > > > > >> > >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>