Posted to dev@kafka.apache.org by Jason Gustafson <ja...@confluent.io> on 2019/04/04 22:47:37 UTC

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Hi Everyone,

Sorry for the long delay on this KIP. I have updated it to include the
handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no
further comments, I will plan to start a vote early next week.

Thanks!
Jason

On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <ad...@gmail.com>
wrote:

> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>
> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <ad...@gmail.com>
> wrote:
>
> > Hi John
> >
> > What is the status of this KIP?
> >
> > My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
> > 2.1.1 for a multitude of our internal topics, and I suspect that a proper
> > fix is needed.
> >
> > Adam
> >
> > On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com> wrote:
> >
> >> Thanks Jason. The proposed solution sounds good to me.
> >>
> >>
> >> Guozhang
> >>
> >> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
> >> wrote:
> >>
> >> > Hey Guozhang,
> >> >
> >> > Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
> >> > occurs following expiration of the producerId. It's possible that another
> >> > producerId has been installed in its place following expiration (if another
> >> > producer instance has become active), or the mapping is empty. We can
> >> > safely retry the InitProducerId with the logic in this KIP in order to
> >> > detect which case it is. So I'd suggest something like this:
> >> >
> >> > 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
> >> > InitProducerId using the current producerId and epoch.
> >> > 2. If no mapping exists, the coordinator can generate a new producerId and
> >> > return it. If a transaction is in progress on the client, it will have to
> >> > be aborted, but the producer can continue afterwards.
> >> > 3. Otherwise if a different producerId has been assigned, then we can
> >> > return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
> >> > probably raise this as ProducerFencedException since that is effectively
> >> > what has happened. Ideally this is the only fatal case that users have to
> >> > handle.
> >> >
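
To make the coordinator-side handling of this retried InitProducerId concrete, here is a minimal sketch in Java. All names here (InitPidSketch, Result, the in-memory map) are illustrative stand-ins, not the actual Kafka transaction coordinator code:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the InitProducerId retry handling described above.
    class InitPidSketch {
        record Result(long producerId, short epoch, String error) {}

        private final Map<String, long[]> mappings = new HashMap<>(); // txnId -> {pid, epoch}
        private long nextPid = 0L;

        Result handleInitProducerId(String txnId, long providedPid) {
            long[] m = mappings.get(txnId);
            if (m == null) {
                // Step 2: the mapping has expired -> generate a new producerId;
                // the client aborts any in-progress transaction, then continues.
                long pid = nextPid++;
                mappings.put(txnId, new long[]{pid, 0L});
                return new Result(pid, (short) 0, null);
            } else if (m[0] == providedPid) {
                // The same producer is re-initializing: bump its epoch.
                m[1] += 1;
                return new Result(m[0], (short) m[1], null);
            } else {
                // Step 3: a different producerId has been installed, so the
                // caller has effectively been fenced (ProducerFencedException).
                return new Result(-1L, (short) -1, "INVALID_PRODUCER_ID_MAPPING");
            }
        }
    }
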
> >> > I'll give it a little more thought and update the KIP.
> >> >
> >> > Thanks,
> >> > Jason
> >> >
> >> > On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
> >> wrote:
> >> >
> >> > > You're right about the dangling txn since it will actually block
> >> > > read-committed consumers from proceeding at all. I'd agree that since
> >> > > this is a very rare case, we can consider fixing it not via broker-side
> >> > > logic but via tooling in future work.
> >> > >
> >> > > I've also discovered some related error handling logic inside the
> >> > > producer that may be addressed together with this KIP (since it is
> >> > > mostly about internal implementation, the wiki itself does not need to
> >> > > be modified):
> >> > >
> >> > > https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> >> > >
> >> > > Guozhang
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <jason@confluent.io>
> >> > > wrote:
> >> > >
> >> > > > Hey Guozhang,
> >> > > >
> >> > > > To clarify, the broker does not actually use the ApiVersion API for
> >> > > > inter-broker communications. The use of an API and its corresponding
> >> > > > version is controlled by `inter.broker.protocol.version`.
> >> > > >
> >> > > > Nevertheless, it sounds like we're on the same page about removing
> >> > > > DescribeTransactionState. The impact of a dangling transaction is a
> >> > > > little worse than what you describe though. Consumers with the
> >> > > > read_committed isolation level will be stuck. Still, I think we agree
> >> > > > that this case should be rare and we can reconsider for future work.
> >> > > > Rather than preventing dangling transactions, perhaps we should
> >> > > > consider options which allow us to detect them and recover. Anyway,
> >> > > > this needs more thought. I will update the KIP.
> >> > > >
> >> > > > Best,
> >> > > > Jason
> >> > > >
> >> > > > On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangguoz@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > 0. My original question is about the implementation details
> >> > > > > primarily, since currently the handling logic of the
> >> > > > > ApiVersionsResponse is simply "use the highest supported version of
> >> > > > > the corresponding request", but if the returned response from
> >> > > > > ApiVersionsRequest says "I don't even know about the
> >> > > > > DescribeTransactionStateRequest at all", then we need additional
> >> > > > > fallback logic. Currently this logic is embedded in NetworkClient,
> >> > > > > which is shared by all clients, so I'd like to avoid making this
> >> > > > > logic more complicated.
> >> > > > >
> >> > > > > As for the general issue that a broker does not recognize a producer
> >> > > > > with sequence number 0, here's my thinking: as you mentioned in the
> >> > > > > wiki, this is only a concern for the transactional producer, since
> >> > > > > the idempotent producer can just bump the epoch and go. For the
> >> > > > > transactional producer, even if the produce request from a fenced
> >> > > > > producer gets accepted, its transaction will never be committed, and
> >> > > > > hence its messages are not exposed to read-committed consumers
> >> > > > > either. The drawbacks, though, are: 1) read-uncommitted consumers
> >> > > > > will still read those messages; 2) unnecessary storage for those
> >> > > > > fenced produce messages, though in practice this should not
> >> > > > > accumulate to a large amount, since the producer should soon try to
> >> > > > > commit, be told it is fenced, and then stop; 3) there will be no
> >> > > > > markers for those transactional messages ever. Looking at the list
> >> > > > > and thinking about the likelihood it may happen, assuming we retain
> >> > > > > the producer up to transactional.id.timeout (default is 7 days), I
> >> > > > > feel comfortable leaving it as is.
> >> > > > >
> >> > > > > Guozhang
> >> > > > >
> >> > > > > On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <jason@confluent.io>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hey Guozhang,
> >> > > > > >
> >> > > > > > Thanks for the comments. Responses below:
> >> > > > > >
> >> > > > > > 0. The new API is used between brokers, so we govern its usage
> >> > > > > > using `inter.broker.protocol.version`. If the other broker hasn't
> >> > > > > > upgraded, we will just fall back to the old logic, which is to
> >> > > > > > accept the write. This is similar to how we introduced the
> >> > > > > > OffsetsForLeaderEpoch API. Does that seem reasonable?
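
The version gating described here amounts to a simple check against the configured inter-broker protocol version. A sketch, with hypothetical names and version values (not the actual broker code):

    // Illustrative only: gate a new inter-broker API on the IBP setting.
    static boolean canUseNewValidationApi(int ibpVersion, int minIbpForNewApi) {
        // Until the whole cluster has upgraded, callers fall back to the old
        // logic (accept the write), as in the OffsetsForLeaderEpoch rollout.
        return ibpVersion >= minIbpForNewApi;
    }
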
> >> > > > > >
> >> > > > > > To tell the truth, after digging this KIP up and reading it over,
> >> > > > > > I am doubting how crucial this API is. It is attempting to protect
> >> > > > > > a write from a zombie which has just reset its sequence number
> >> > > > > > after that producer had had its state cleaned up. However, one of
> >> > > > > > the other improvements in this KIP is to maintain producer state
> >> > > > > > beyond its retention in the log. I think that makes this case
> >> > > > > > sufficiently unlikely that we can leave it for future work. I am
> >> > > > > > not 100% sure this is the only scenario where transaction state
> >> > > > > > and log state can diverge anyway, so it would be better to
> >> > > > > > consider this problem more generally. What do you think?
> >> > > > > >
> >> > > > > > 1. Thanks; from memory, the term changed after the first
> >> > > > > > iteration. I'll make a pass and try to clarify usage.
> >> > > > > > 2. I was attempting to handle some edge cases since this check
> >> > > > > > would be asynchronous. In any case, if we drop this validation as
> >> > > > > > suggested above, then we can ignore this.
> >> > > > > >
> >> > > > > > -Jason
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wangguoz@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hello Jason, thanks for the great write-up.
> >> > > > > > >
> >> > > > > > > 0. One question about the migration plan: "The new
> >> > > > > > > GetTransactionState API and the new version of the transaction
> >> > > > > > > state message will not be used until the inter-broker version
> >> > > > > > > supports it." I'm not so clear about the implementation details
> >> > > > > > > here: say a broker is on the newer version and the
> >> > > > > > > txn-coordinator is still on an older version. Today the
> >> > > > > > > APIVersionsRequest can only help upgrade / downgrade the request
> >> > > > > > > version, but not forbid sending it altogether. Are you
> >> > > > > > > suggesting we add additional logic on the broker side to handle
> >> > > > > > > the case of "not sending the request"? If yes, my concern is
> >> > > > > > > that this will be some tech-debt code that will live long before
> >> > > > > > > being removed.
> >> > > > > > >
> >> > > > > > > Some additional minor comments:
> >> > > > > > >
> >> > > > > > > 1. "last epoch" and "instance epoch" seem to be referring to the
> >> > > > > > > same thing in your wiki.
> >> > > > > > > 2. "The broker must verify after receiving the response that the
> >> > > > > > > producer state is still unknown.": not sure why we have to
> >> > > > > > > validate? If the metadata returned from the txn-coordinator can
> >> > > > > > > always be considered the source-of-truth, can't we just blindly
> >> > > > > > > use it to update the cache?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Guozhang
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <matthias@confluent.io>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > I am +1 on this :)
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > -Matthias
> >> > > > > > > >
> >> > > > > > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> >> > > > > > > > > Bump. Thanks to Magnus for noticing that I forgot to link to
> >> > > > > > > > > the KIP:
> >> > > > > > > > >
> >> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> >> > > > > > > > >
> >> > > > > > > > > -Jason
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <jason@confluent.io>
> >> > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Hi All,
> >> > > > > > > > >>
> >> > > > > > > > >> I have a proposal to improve the transactional/idempotent
> >> > > > > > > > >> producer's handling of the UNKNOWN_PRODUCER error, which is
> >> > > > > > > > >> the result of losing producer state following segment
> >> > > > > > > > >> removal. The current behavior is both complex and limited.
> >> > > > > > > > >> Please take a look and let me know what you think.
> >> > > > > > > > >>
> >> > > > > > > > >> Thanks in advance to Matthias Sax for feedback on the
> >> > > > > > > > >> initial draft.
> >> > > > > > > > >>
> >> > > > > > > > >> -Jason
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > > -- Guozhang
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > -- Guozhang
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > -- Guozhang
> >> > >
> >> >
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
>

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
Makes sense, thanks!

On Tue, Aug 27, 2019 at 10:38 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Guozhang,
>
> 1. I think there are still some retriable errors that could affect the
> transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.
> 2. Yes, this is right. The only fatal error is when the producer has been
> fenced by another instance.
>
> Thanks,
> Jason
>
> On Mon, Aug 26, 2019 at 6:05 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hi Jason,
> >
> > I've made another pass on the wiki page and it reads much better now. One
> > more clarification about the "Simplified error handling" section:
> >
> > 1. There will be no "retriable error" from the broker side regarding any
> > send requests and txn requests (to txn coordinators). All errors would
> > cause the corresponding txn to eventually be aborted.
> > 2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
> > producer to enter the ABORTABLE_ERROR state, but only the current txn to
> > be aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer
> > to enter the FATAL_ERROR state, plus it would cause all future txns to be
> > aborted.
> >
> > Is that right?
> >
> >
> > Guozhang
> >
> >
> > On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <ma...@confluent.io>
> > wrote:
> > >
> > > Thanks Jason!
> > >
> > > LGTM.
> > >
> > > On 8/21/19 3:07 PM, Jason Gustafson wrote:
> > > > Hi Matthias,
> > > >
> > > > Thanks, I appreciate the thorough review. I've revised the section to
> > > > make the logic clearer. I think you have it right except for (1). We only
> > > > generate a new PID if the epoch cannot be incremented without overflow.
> > > >
> > > > -Jason
> > > >
> > > > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <matthias@confluent.io>
> > > > wrote:
> > > >
> > > >> Thanks for the KIP. I just have some clarification questions to make
> > > >> sure I understand the proposal correctly:
> > > >>
> > > >> 1) "Safe Epoch Incrementing"
> > > >>
> > > >>> When the coordinator receives a new InitProducerId request, we will
> > > >>> use the following logic to update the epoch:
> > > >>>
> > > >>> 1. No epoch is provided: the current epoch will be bumped and the
> > > >>> last epoch will be set to -1.
> > > >>> 2. Epoch and producerId are provided, and the provided producerId
> > > >>> matches the current producerId, or the provided producerId matches
> > > >>> the previous producerId and the provided epoch is exhausted:
> > > >>>       a. Provided epoch matches current epoch: the last epoch will
> > > >>> be set to the current epoch, and the current epoch will be bumped.
> > > >>>       b. Provided epoch matches last epoch: the current epoch will
> > > >>> be returned.
> > > >>>       c. Else: return INVALID_PRODUCER_EPOCH
> > > >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
> > > >>
> > > >> Case (1) would be for a new producer. Hence, should we state that "no
> > > >> PID" is provided (instead of "no epoch" is provided)? That might be
> > > >> clearer, and it implies that there is no epoch anyway.
> > > >>
> > > >> Case (2) would be for a producer that received an `UNKNOWN_PRODUCER_ID`
> > > >> error and tries to re-initialize itself.
> > > >>
> > > >> Case (2a) implies that the producer sent its first request and is not
> > > >> fenced. Case (2b) implies that the producer re-tries to re-initialize
> > > >> itself, ie, its first request to re-initialize did not get a response
> > > >> but was processed by the transaction coordinator. Case (2c) implies
> > > >> that a producer was fenced (similar to case 3, even if I am not sure
> > > >> what case 3 actually would be?)
> > > >>
> > > >> Please let me know if my understanding is correct.
> > > >>
> > > >> What is still unclear to me is, why does case (2 -- or is it only 2b?)
> > > >> require that the "provided epoch is exhausted"?
> > > >>
> > > >> For case 2b:
> > > >>
> > > >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> > > >> `UNKNOWN_PRODUCER_ID`. It sends an initPidRequest with the
> > > >> corresponding PID/epoch pair. The TC processes the request and creates
> > > >> a new PID=101 with new epoch=0; however, the response to the producer
> > > >> is lost. The TC still stores `currentPid=101`, `currentEpoch=0` and
> > > >> `previousPid=100`, `previousEpoch=MAX_EPOCH`. When the producer
> > > >> re-tries, the sent PID/epoch still matches the previous PID/epoch pair
> > > >> and hence the TC knows it's a retry?
> > > >>
> > > >> If this reasoning is correct, should the logic be as follows:
> > > >>
> > > >> 1. No PID is provided: create a new PID with epoch=0 and set the last
> > > >> epoch to -1.
> > > >> 2. Epoch and producerId are provided
> > > >>    a) the provided producerId/epoch matches the current producerId/epoch:
> > > >>       i) if the epoch is not exhausted, bump the epoch
> > > >>       ii) if the epoch is exhausted, create a new PID with epoch=0
> > > >>    b) the provided producerId/epoch matches the previous
> > > >> producerId/epoch: respond with current PID/epoch
> > > >>    c) Otherwise, return INVALID_PRODUCER_EPOCH
> > > >>
> > > >>
> > > >>
> > > >> -Matthias
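
To make Matthias's proposed revision concrete, here is a minimal sketch of the coordinator's InitProducerId handling in Java. The State and Result types and all field names are hypothetical stand-ins, not the real TransactionCoordinator code:

    // Illustrative sketch of the revised "Safe Epoch Incrementing" logic.
    class EpochSketch {
        static final short MAX_EPOCH = Short.MAX_VALUE;

        static class State {
            long nextPid = 0;
            long currentPid = -1; short currentEpoch = -1;
            long prevPid = -1;    short prevEpoch = -1;
        }

        record Result(long pid, short epoch, String error) {}

        static Result initProducerId(State s, Long pid, Short epoch) {
            if (pid == null) {
                // 1. New producer: fresh PID with epoch=0, last epoch = -1.
                s.prevPid = -1; s.prevEpoch = -1;
                s.currentPid = s.nextPid++; s.currentEpoch = 0;
            } else if (pid == s.currentPid && epoch == s.currentEpoch) {
                // 2a. Remember the old pair so a retry can be recognized, then
                // bump the epoch, or issue a new PID if the epoch is exhausted.
                s.prevPid = s.currentPid; s.prevEpoch = s.currentEpoch;
                if (s.currentEpoch < MAX_EPOCH) s.currentEpoch++;          // 2a-i
                else { s.currentPid = s.nextPid++; s.currentEpoch = 0; }   // 2a-ii
            } else if (pid == s.prevPid && epoch == s.prevEpoch) {
                // 2b. A retry whose original response was lost: return the
                // current pair again without bumping anything.
            } else {
                // 2c. Anything else comes from a fenced (zombie) producer.
                return new Result(-1, (short) -1, "INVALID_PRODUCER_EPOCH");
            }
            return new Result(s.currentPid, s.currentEpoch, null);
        }
    }

Under this sketch, the lost-response example above works out: the retry with PID=100/epoch=MAX_EPOCH matches the stored previous pair (case 2b), so the coordinator simply returns PID=101/epoch=0 again.
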


-- 
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Guozhang,

1. I think there are still some retriable errors that could affect the
transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.
2. Yes, this is right. The only fatal error is when the producer has been
fenced by another instance.

Thanks,
Jason
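
A minimal sketch of the first point: a retriable error like COORDINATOR_LOAD_IN_PROGRESS is simply retried with backoff rather than failing the transaction. Method and type names here are illustrative, not the real producer/TransactionManager internals:

    import java.util.function.Supplier;

    class RetrySketch {
        static boolean isRetriable(String error) {
            // e.g. the transaction coordinator is still loading its state
            return "COORDINATOR_LOAD_IN_PROGRESS".equals(error);
        }

        // send.get() returns null on success or an error name on failure.
        static String sendWithRetries(Supplier<String> send, long backoffMs)
                throws InterruptedException {
            String error = send.get();
            while (error != null && isRetriable(error)) {
                Thread.sleep(backoffMs); // back off, then resend the request
                error = send.get();
            }
            return error; // null on success, or a non-retriable error
        }
    }
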


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Jason,

I've made another pass on the wiki page and it reads much better now. One
more clarification about the "Simplified error handling" section:

1. There will be no "retriable error" from the broker side regarding any
send requests and txn requests (to txn coordinators). All errors would
cause the corresponding txn to eventually be aborted.
2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
producer to enter the ABORTABLE_ERROR state, but only the current txn to be
aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer to
enter the FATAL_ERROR state, plus it would cause all future txns to be
aborted.

Is that right?


Guozhang
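
The classification Guozhang describes can be summarized in a small sketch. The enum and method are illustrative, not the actual producer state machine (which, per Jason's reply, also retries errors such as COORDINATOR_LOAD_IN_PROGRESS before classifying anything):

    // Illustrative sketch of the error-to-state mapping discussed above.
    class ErrorStateSketch {
        enum Outcome { ABORT_CURRENT_TXN, FATAL }

        static Outcome classify(String error) {
            switch (error) {
                case "UNKNOWN_PRODUCER_ID":
                case "INVALID_PRODUCER_ID_MAPPING":
                    // Abortable: only the in-flight transaction fails; the
                    // producer can re-initialize and continue afterwards.
                    return Outcome.ABORT_CURRENT_TXN;
                case "INVALID_PRODUCER_EPOCH":
                    // Fatal: fenced by a newer instance; all future
                    // transactions from this instance must fail too.
                    return Outcome.FATAL;
                default:
                    // Per the simplified handling, remaining non-retriable
                    // errors abort the current transaction.
                    return Outcome.ABORT_CURRENT_TXN;
            }
        }
    }
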


On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <ma...@confluent.io>
wrote:
>
> Thanks Jason!
>
> LGTM.
>
> On 8/21/19 3:07 PM, Jason Gustafson wrote:
> > Hi Matthias,
> >
> > Thanks, I appreciate the thorough review. I've revised the section to
make
> > the logic clearer. I think you have it right except for the 1). We only
> > generate a new PID if the epoch cannot be incremented without overflow.
> >
> > -Jason
> >
> > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <ma...@confluent.io>
> > wrote:
> >
> >> Thanks for the KIP. I just have some clarification questions to make
> >> sure I understand the proposal correctly:
> >>
> >> 1) "Safe Epoch Incrementing"
> >>
> >>> When the coordinator receives a new InitProducerId request, we will
use
> >> the following logic to update the epoch:
> >>>
> >>> 1. No epoch is provided: the current epoch will be bumped and the last
> >> epoch will be set to -1.
> >>> 2. Epoch and producerId are provided, and the provided producerId
> >> matches the current producerId or the provided producerId matches the
> >> previous producerId and the provided epoch is exhausted:
> >>>       a. Provided epoch matches current epoch: the last epoch will be
> >> set to the current epoch, and the current epoch will be bumped .
> >>>       b. Provided epoch matches last epoch: the current epoch will be
> >> returned
> >>>       c. Else: return INVALID_PRODUCER_EPOCH
> >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
> >>
> >> Case (1) would be for a new producer. Hence, should we state that "no
> >> PID" is provided (instead of "no epoch" is provided?). That might be
> >> clearer and it implies that there is no epoch anyway.
> >>
> >> Case (2) would be for a producer that received `UNKNOWN_PRODUCER_ID`
> >> error and tries to re-initialize itself.
> >>
> >> Case (2a) implies that the producer send its first request and is not
> >> fenced. Case (2b) implies that the producer re-tries to re-initialize
> >> itself, ie, it first request to re-initilize did not get a respond but
> >> was processed by the transaction coordinator. Case (2c) implies that a
> >> producer was fenced (similar case 3, even if I am not sure what case 3
> >> actually would be?)
> >>
> >> Please let me know if my understanding is correct.
> >>
> >> What is still unclear to me is, why case (2 -- or is it only 2b?)
> >> requires that the "provide epoch is exhausted"?
> >>
> >> For case 2b:
> >>
> >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> >> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding
> >> PID/epoch pair. The TC processes the request and creates a new PID=101
> >> with new epoch=0, however, the respond to the producer is lost. The TC
> >> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
> >> `previoudEpoch=MAX_EPOCH`. When the producer re-tries, the sent
> >> PID/epoch still matches the previous PID/epoch pair and hence the TC
> >> know it's a retry?
> >>
> >> If this reasoning is correct, should the logic be as follows:
> >>
> >> 1. No PID is provided: create a new PID with epoch=0 and set the last
> >> epoch to -1.
> >> 2. Epoch and producerId are provided
> >>    a) the provided producerId/epoch matches the current
producerId/epoch:
> >>       i) if the epoch is not exhausted, bump the epoch
> >>       ii) if the epoch is exhausted, create a new PID with epoch=0
> >>    b) the provided producerId/epoch matches the previous
> >> producerId/epoch: respond with current PID/epoch
> >>    c) Otherwise, return INVALID_PRODUCER_EPOCH
> >>
> >>
> >>
> >> -Matthias
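
(For illustration, here is a rough Java sketch of the epoch-update rules
quoted above. Everything in it is invented for exposition; it is not Kafka's
actual coordinator code, and Short.MAX_VALUE stands in for an exhausted
epoch.)

    // Illustrative sketch only; not Kafka's TransactionCoordinator code.
    class EpochRulesSketch {
        static class PidState {
            long currentPid;
            short currentEpoch;
            long previousPid;     // PID in use before the last reallocation
            short lastEpoch = -1; // epoch before the last bump; -1 if none
        }

        // Returns the epoch to send back, or -1 for INVALID_PRODUCER_EPOCH.
        // pid == null means no producerId/epoch was provided (case 1);
        // epoch is assumed non-null whenever pid is non-null (case 2).
        static short onInitProducerId(Long pid, Short epoch, PidState s) {
            if (pid == null) {                          // case 1
                s.lastEpoch = -1;
                return ++s.currentEpoch;
            }
            boolean matchesCurrent = pid == s.currentPid;
            boolean overflowRetry = pid == s.previousPid
                    && epoch == Short.MAX_VALUE;        // epoch exhausted
            if (matchesCurrent || overflowRetry) {      // case 2
                if (epoch == s.currentEpoch) {          // 2a: bump the epoch
                    s.lastEpoch = s.currentEpoch;
                    return ++s.currentEpoch;
                }
                if (epoch == s.lastEpoch)               // 2b: a retried init
                    return s.currentEpoch;
                return -1;                              // 2c: fenced
            }
            return -1;                                  // case 3: fenced
        }
    }
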
> >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <jason@confluent.io>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Guozhang,
> >>>>>>>>>
> >>>>>>>>> To clarify, the broker does not actually use the ApiVersion API
> >>>>>>>>> for inter-broker communications. The use of an API and its
> >>>>>>>>> corresponding version is controlled by
> >>>>>>>>> `inter.broker.protocol.version`.
> >>>>>>>>>
> >>>>>>>>> Nevertheless, it sounds like we're on the same page about removing
> >>>>>>>>> DescribeTransactionState. The impact of a dangling transaction is
> >>>>>>>>> a little worse than what you describe though. Consumers with the
> >>>>>>>>> read_committed isolation level will be stuck. Still, I think we
> >>>>>>>>> agree that this case should be rare and we can reconsider for
> >>>>>>>>> future work. Rather than preventing dangling transactions, perhaps
> >>>>>>>>> we should consider options which allow us to detect them and
> >>>>>>>>> recover. Anyway, this needs more thought. I will update the KIP.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jason
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> 0. My original question is about the implementation details
> >>>>>>>>>> primarily, since currently the handling logic of the
> >>>>>>>>>> APIVersionResponse is simply "use the highest supported version
> >>>>>>>>>> of the corresponding request", but if the returned response from
> >>>>>>>>>> APIVersionRequest says "I don't even know about the
> >>>>>>>>>> DescribeTransactionStateRequest at all", then we need additional
> >>>>>>>>>> fallback logic. Currently this logic is embedded in NetworkClient
> >>>>>>>>>> which is shared by all clients, so I'd like to avoid making this
> >>>>>>>>>> logic more complicated.
> >>>>>>>>>>
> >>>>>>>>>> As for the general issue that a broker does not recognize a
> >>>>>>>>>> producer with sequence number 0, here's my thinking: as you
> >>>>>>>>>> mentioned in the wiki, this is only a concern for the
> >>>>>>>>>> transactional producer, since the idempotent producer can just
> >>>>>>>>>> bump the epoch and go. For the transactional producer, even if a
> >>>>>>>>>> produce request from a fenced producer gets accepted, its
> >>>>>>>>>> transaction will never be committed, and hence its messages are
> >>>>>>>>>> never exposed to read-committed consumers either. The drawbacks,
> >>>>>>>>>> though, are: 1) read-uncommitted consumers will still read those
> >>>>>>>>>> messages, 2) unnecessary storage for those fenced produce
> >>>>>>>>>> messages, though in practice this should not accumulate to a
> >>>>>>>>>> large amount since the producer should soon try to commit, be
> >>>>>>>>>> told it is fenced, and then stop, 3) there will be no markers for
> >>>>>>>>>> those transactional messages ever. Looking at the list and
> >>>>>>>>>> thinking about the likelihood it may happen, assuming we retain
> >>>>>>>>>> the producer state up to transactional.id.timeout (default is 7
> >>>>>>>>>> days), I feel comfortable leaving it as is.
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <jason@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the comments. Responses below:
> >>>>>>>>>>>
> >>>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
> >>>>>>>>>>> using `inter.broker.protocol.version`. If the other broker
> >>>>>>>>>>> hasn't upgraded, we will just fall back to the old logic, which
> >>>>>>>>>>> is to accept the write. This is similar to how we introduced the
> >>>>>>>>>>> OffsetsForLeaderEpoch API. Does that seem reasonable?
> >>>>>>>>>>>
> >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it
> >>>>>>>>>>> over, I am doubting how crucial this API is. It is attempting
> >>>>>>>>>>> to protect a write from a zombie which has just reset its
> >>>>>>>>>>> sequence number after that producer had had its state cleaned
> >>>>>>>>>>> up. However, one of the other improvements in this KIP is to
> >>>>>>>>>>> maintain producer state beyond its retention in the log. I
> >>>>>>>>>>> think that makes this case sufficiently unlikely that we can
> >>>>>>>>>>> leave it for future work. I am not 100% sure this is the only
> >>>>>>>>>>> scenario where transaction state and log state can diverge
> >>>>>>>>>>> anyway, so it would be better to consider this problem more
> >>>>>>>>>>> generally. What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Thanks, from memory, the term changed after the first
> >>>>>>>>>>> iteration. I'll make a pass and try to clarify usage.
> >>>>>>>>>>> 2. I was attempting to handle some edge cases since this check
> >>>>>>>>>>> would be asynchronous. In any case, if we drop this validation
> >>>>>>>>>>> as suggested above, then we can ignore this.
> >>>>>>>>>>>
> >>>>>>>>>>> -Jason
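
(For illustration, a rough sketch of the version gating Jason describes; the
ordinal and names are invented, and the real broker code is Scala and
structured differently.)

    // Illustrative sketch only: gate the new inter-broker API on the
    // configured inter.broker.protocol.version, the same way
    // OffsetsForLeaderEpoch was introduced.
    class IbpGateSketch {
        static final int FIRST_IBP_WITH_VERIFICATION = 23; // placeholder

        static void onIdempotentWrite(int ibpOrdinal,
                                      Runnable verifyWithTxnCoordinator,
                                      Runnable acceptWrite) {
            if (ibpOrdinal >= FIRST_IBP_WITH_VERIFICATION)
                verifyWithTxnCoordinator.run(); // new, version-gated path
            else
                acceptWrite.run();              // old logic: accept the write
        }
    }
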
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello Jason, thanks for the great write-up.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0. One question about the migration plan: "The new
> >>>>>>>>>>>> GetTransactionState API and the new version of the transaction
> >>>>>>>>>>>> state message will not be used until the inter-broker version
> >>>>>>>>>>>> supports it." I'm not so clear about the implementation details
> >>>>>>>>>>>> here: say a broker is on the newer version and the
> >>>>>>>>>>>> txn-coordinator is still on an older version. Today the
> >>>>>>>>>>>> APIVersionsRequest can only help upgrade / downgrade the
> >>>>>>>>>>>> request version, but not forbid sending one altogether. Are you
> >>>>>>>>>>>> suggesting we add additional logic on the broker side to handle
> >>>>>>>>>>>> the case of "not sending the request"? If yes, my concern is
> >>>>>>>>>>>> that this will be some tech-debt code that will live long
> >>>>>>>>>>>> before being removed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Some additional minor comments:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
> >>>>>>>>>>>> the same thing in your wiki.
> >>>>>>>>>>>> 2. "The broker must verify after receiving the response that
> >>>>>>>>>>>> the producer state is still unknown.": not sure why we have to
> >>>>>>>>>>>> validate? If the metadata returned from the txn-coordinator can
> >>>>>>>>>>>> always be considered the source-of-truth, can't we just blindly
> >>>>>>>>>>>> use it to update the cache?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang
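
(For illustration, the re-check Guozhang questions in (2) amounts to a
race-safe conditional insert, since the coordinator lookup is asynchronous
and a concurrent write may fill the cache first; a rough sketch with
invented names.)

    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only: install the fetched state only if the
    // producer is still unknown locally.
    class RecheckSketch {
        final ConcurrentHashMap<Long, Object> producerStateCache =
                new ConcurrentHashMap<>();

        void onCoordinatorResponse(long producerId, Object fetchedState) {
            // putIfAbsent applies the response only when no concurrent
            // write has already installed state for this producerId.
            producerStateCache.putIfAbsent(producerId, fetchedState);
        }
    }
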
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <matthias@confluent.io>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I am +1 on this :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
> >>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to link
> >>>>>>>>>>>>>> to the KIP:
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Jason
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <jason@confluent.io>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have a proposal to improve the transactional/idempotent
> >>>>>>>>>>>>>>> producer's handling of the UNKNOWN_PRODUCER error, which is
> >>>>>>>>>>>>>>> the result of losing producer state following segment
> >>>>>>>>>>>>>>> removal. The current behavior is both complex and limited.
> >>>>>>>>>>>>>>> Please take a look and let me know what you think.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
> >>>>>>>>>>>>>>> initial draft.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Jason


--
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks Jason!

LGTM.

On 8/21/19 3:07 PM, Jason Gustafson wrote:
> Hi Matthias,
> 
> Thanks, I appreciate the thorough review. I've revised the section to make
> the logic clearer. I think you have it right except for (1). We only
> generate a new PID if the epoch cannot be incremented without overflow.
> 
> -Jason
> 


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Matthias,

Thanks, I appreciate the thorough review. I've revised the section to make
the logic clearer. I think you have it right except for (1). We only
generate a new PID if the epoch cannot be incremented without overflow.

-Jason
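
(For illustration of the overflow case mentioned above: producer epochs are
16-bit on the wire, so a sketch of the check is as simple as the following;
the names are invented.)

    // Illustrative sketch only: an epoch is "exhausted" once it reaches
    // Short.MAX_VALUE, at which point the coordinator must allocate a
    // fresh producerId instead of bumping.
    class EpochOverflowSketch {
        static boolean canBumpEpoch(short epoch) {
            return epoch < Short.MAX_VALUE;
        }
    }
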


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for the KIP. I just have some clarification questions to make
sure I understand the proposal correctly:

1) "Safe Epoch Incrementing"

> When the coordinator receives a new InitProducerId request, we will use the following logic to update the epoch:
> 
> 1. No epoch is provided: the current epoch will be bumped and the last epoch will be set to -1.
> 2. Epoch and producerId are provided, and the provided producerId matches the current producerId or the provided producerId matches the previous producerId and the provided epoch is exhausted:
> 	a. Provided epoch matches current epoch: the last epoch will be set to the current epoch, and the current epoch will be bumped.
> 	b. Provided epoch matches last epoch: the current epoch will be returned
> 	c. Else: return INVALID_PRODUCER_EPOCH
> 3. Otherwise, return INVALID_PRODUCER_EPOCH

Case (1) would be for a new producer. Hence, should we state that "no
PID" is provided (instead of "no epoch")? That might be clearer, and it
implies that there is no epoch anyway.

Case (2) would be for a producer that received the `UNKNOWN_PRODUCER_ID`
error and tries to re-initialize itself.

Case (2a) implies that the producer sent its first request and is not
fenced. Case (2b) implies that the producer re-tries to re-initialize
itself, i.e., its first request to re-initialize did not get a response
but was processed by the transaction coordinator. Case (2c) implies that
a producer was fenced (similar to case 3, even if I am not sure what
case 3 actually would be?)

Please let me know if my understanding is correct.

What is still unclear to me is why case (2 -- or is it only 2b?)
requires that the "provided epoch is exhausted"?

For case 2b:

Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
`UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding
PID/epoch pair. The TC processes the request and creates a new PID=101
with new epoch=0; however, the response to the producer is lost. The TC
still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
`previousEpoch=MAX_EPOCH`. When the producer re-tries, the sent
PID/epoch still matches the previous PID/epoch pair and hence the TC
knows it's a retry?

If this reasoning is correct, should the logic be as follows:

1. No PID is provided: create a new PID with epoch=0 and set the last
epoch to -1.
2. Epoch and producerId are provided
   a) the provided producerId/epoch matches the current producerId/epoch:
      i) if the epoch is not exhausted, bump the epoch
      ii) if the epoch is exhausted, create a new PID with epoch=0
   b) the provided producerId/epoch matches the previous
producerId/epoch: respond with current PID/epoch
   c) Otherwise, return INVALID_PRODUCER_EPOCH



-Matthias
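
(For illustration, the alternative above rendered as a rough Java sketch;
the names are invented and this is not the KIP's wording. The
previousPid/previousEpoch branch is what makes the lost-response retry in
the MAX_EPOCH scenario safe: the coordinator recognizes the old pair and
simply re-sends the current one.)

    // Illustrative sketch only; not actual Kafka code.
    class AlternativeRulesSketch {
        static final short MAX_EPOCH = Short.MAX_VALUE;

        static class PidState {
            long currentPid;
            short currentEpoch;
            long previousPid = -1;
            short previousEpoch = -1;
            long nextPid;         // stand-in for a real PID allocator
        }

        // Returns {pid, epoch} on success, null for INVALID_PRODUCER_EPOCH.
        static long[] onInitProducerId(Long pid, Short epoch, PidState s) {
            if (pid == null) {                                // rule 1
                s.currentPid = s.nextPid++;
                s.currentEpoch = 0;
                s.previousPid = -1;
                s.previousEpoch = -1;
            } else if (pid == s.currentPid && epoch == s.currentEpoch) {
                if (s.currentEpoch < MAX_EPOCH) {             // rule 2a-i
                    s.currentEpoch++;
                } else {                                      // rule 2a-ii
                    s.previousPid = s.currentPid;             // remember old pair
                    s.previousEpoch = s.currentEpoch;
                    s.currentPid = s.nextPid++;
                    s.currentEpoch = 0;
                }
            } else if (pid != s.previousPid || epoch != s.previousEpoch) {
                return null;                                  // rule 2c
            }                                                 // rule 2b: retry, no-op
            return new long[] {s.currentPid, s.currentEpoch};
        }
    }
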




On 4/4/19 3:47 PM, Jason Gustafson wrote:
> Hi Everyone,
> 
> Sorry for the long delay on this KIP. I have updated it to include the
> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no
> further comments, I will plan to start a vote early next week.
> 
> Thanks!
> Jason
> 
> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <ad...@gmail.com>
> wrote:
> 
>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>>
>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <ad...@gmail.com>
>> wrote:
>>
>>> Hi John
>>>
>>> What is the status of this KIP?
>>>
>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
>>> 2.1.1 for a multitude of our internal topics, and I suspect that a proper
>>> fix is needed.
>>>
>>> Adam
>>>
>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com> wrote:
>>>
>>>> Thanks Jason. The proposed solution sounds good to me.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>
>>>>> Hey Guozhang,
>>>>>
>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
>>>>> occurs following expiration of the producerId. It's possible that
>>>> another
>>>>> producerId has been installed in its place following expiration (if
>>>> another
>>>>> producer instance has become active), or the mapping is empty. We can
>>>>> safely retry the InitProducerId with the logic in this KIP in order to
>>>>> detect which case it is. So I'd suggest something like this:
>>>>>
>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
>>>>> InitProducerId using the current producerId and epoch.
>>>>> 2. If no mapping exists, the coordinator can generate a new producerId
>>>> and
>>>>> return it. If a transaction is in progress on the client, it will have
>>>> to
>>>>> be aborted, but the producer can continue afterwards.
>>>>> 3. Otherwise if a different producerId has been assigned, then we can
>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
>>>>> probably raise this as ProducerFencedException since that is
>> effectively
>>>>> what has happened. Ideally this is the only fatal case that users have
>>>> to
>>>>> handle.
>>>>>
>>>>> I'll give it a little more thought and update the KIP.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>>
>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> You're right about the dangling txn since it will actually block
>>>>>> read-committed consumers from proceeding at all. I'd agree that
>> since
>>>>> this
>>>>>> is a very rare case, we can consider fixing it not via broker-side
>>>> logic
>>>>>> but via tooling in a future work.
>>>>>>
>>>>>> I've also discovered some related error handling logic inside
>> producer
>>>>> that
>>>>>> may be addressed together with this KIP (since it is mostly for
>>>> internal
>>>>>> implementations the wiki itself does not need to be modified):
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <jason@confluent.io
>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> To clarify, the broker does not actually use the ApiVersion API
>> for
>>>>>>> inter-broker communications. The use of an API and its
>> corresponding
>>>>>>> version is controlled by `inter.broker.protocol.version`.
>>>>>>>
>>>>>>> Nevertheless, it sounds like we're on the same page about removing
>>>>>>> DescribeTransactionState. The impact of a dangling transaction is
>> a
>>>>>> little
>>>>>>> worse than what you describe though. Consumers with the
>>>> read_committed
>>>>>>> isolation level will be stuck. Still, I think we agree that this
>>>> case
>>>>>>> should be rare and we can reconsider for future work. Rather than
>>>>>>> preventing dangling transactions, perhaps we should consider
>> options
>>>>>> which
>>>>>>> allows us to detect them and recover. Anyway, this needs more
>>>> thought.
>>>>> I
>>>>>>> will update the KIP.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jason
>>>>>>>
>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangguoz@gmail.com
>>>
>>>>>> wrote:
>>>>>>>
>>>>>>>> 0. My original question is about the implementation details
>>>>> primarily,
>>>>>>>> since current the handling logic of the APIVersionResponse is
>>>> simply
>>>>>> "use
>>>>>>>> the highest supported version of the corresponding request", but
>>>> if
>>>>> the
>>>>>>>> returned response from APIVersionRequest says "I don't even know
>>>>> about
>>>>>>> the
>>>>>>>> DescribeTransactionStateRequest at all", then we need additional
>>>>> logic
>>>>>>> for
>>>>>>>> the falling back logic. Currently this logic is embedded in
>>>>>> NetworkClient
>>>>>>>> which is shared by all clients, so I'd like to avoid making this
>>>>> logic
>>>>>>> more
>>>>>>>> complicated.
>>>>>>>>
>>>>>>>> As for the general issue that a broker does not recognize a
>>>> producer
>>>>>> with
>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
>>>> wiki,
>>>>>> this
>>>>>>>> is only a concern for transactional producer since for
>> idempotent
>>>>>>> producer
>>>>>>>> it can just bump the epoch and go. For transactional producer,
>>>> even
>>>>> if
>>>>>>> the
>>>>>>>> producer request from a fenced producer gets accepted, its
>>>>> transaction
>>>>>>> will
>>>>>>>> never be committed and hence messages not exposed to
>>>> read-committed
>>>>>>>> consumers as well. The drawback is though, 1) read-uncommitted
>>>>>> consumers
>>>>>>>> will still read those messages, 2) unnecessary storage for those
>>>>> fenced
>>>>>>>> produce messages, but in practice should not accumulate to a
>> large
>>>>>> amount
>>>>>>>> since producer should soon try to commit and be told it is
>> fenced
>>>> and
>>>>>>> then
>>>>>>>> stop, 3) there will be no markers for those transactional
>> messages
>>>>>> ever.
>>>>>>>> Looking at the list and thinking about the likelihood it may
>>>> happen
>>>>>>>> assuming we retain the producer up to transactional.id.timeout
>>>>> (default
>>>>>>> is
>>>>>>>> 7 days), I feel comfortable leaving it as is.
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
>>>> jason@confluent.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang,
>>>>>>>>>
>>>>>>>>> Thanks for the comments. Responses below:
>>>>>>>>>
>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
>>>>> using
>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
>>>>> upgraded,
>>>>>>> we
>>>>>>>>> will just fallback to the old logic, which is to accept the
>>>> write.
>>>>>> This
>>>>>>>> is
>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
>> Does
>>>>> that
>>>>>>>> seem
>>>>>>>>> reasonable?
>>>>>>>>>
>>>>>>>>> To tell the truth, after digging this KIP up and reading it
>>>> over, I
>>>>>> am
>>>>>>>>> doubting how crucial this API is. It is attempting to protect
>> a
>>>>> write
>>>>>>>> from
>>>>>>>>> a zombie which has just reset its sequence number after that
>>>>> producer
>>>>>>> had
>>>>>>>>> had its state cleaned up. However, one of the other
>>>> improvements in
>>>>>>> this
>>>>>>>>> KIP is to maintain producer state beyond its retention in the
>>>> log.
>>>>> I
>>>>>>>> think
>>>>>>>>> that makes this case sufficiently unlikely that we can leave
>> it
>>>> for
>>>>>>>> future
>>>>>>>>> work. I am not 100% sure this is the only scenario where
>>>>> transaction
>>>>>>>> state
>>>>>>>>> and log state can diverge anyway, so it would be better to
>>>> consider
>>>>>>> this
>>>>>>>>> problem more generally. What do you think?
>>>>>>>>>
>>>>>>>>> 1. Thanks, from memory, the term changed after the first
>>>> iteration.
>>>>>>> I'll
>>>>>>>>> make a pass and try to clarify usage.
>>>>>>>>> 2. I was attempting to handle some edge cases since this check
>>>>> would
>>>>>> be
>>>>>>>>> asynchronous. In any case, if we drop this validation as
>>>> suggested
>>>>>>> above,
>>>>>>>>> then we can ignore this.
>>>>>>>>>
>>>>>>>>> -Jason
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
>>>> wangguoz@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Jason, thanks for the great write-up.
>>>>>>>>>>
>>>>>>>>>> 0. One question about the migration plan: "The new
>>>>>>>>>> GetTransactionState API and the new version of the transaction
>>>>>>>>>> state message will not be used until the inter-broker version
>>>>>>>>>> supports it." I'm not so clear about the implementation details
>>>>>>>>>> here: say a broker is on the newer version and the
>>>>>>>>>> txn-coordinator is still on the older version. Today the
>>>>>>>>>> ApiVersionsRequest can only help upgrade / downgrade the request
>>>>>>>>>> version, but not forbid sending one altogether. Are you
>>>>>>>>>> suggesting we add additional logic on the broker side to handle
>>>>>>>>>> the case of "not sending the request"? If yes, my concern is
>>>>>>>>>> that this will be some tech-debt code that will live for a long
>>>>>>>>>> time before being removed.
>>>>>>>>>>
>>>>>>>>>> Some additional minor comments:
>>>>>>>>>>
>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
>>>> the
>>>>>> same
>>>>>>>>> thing
>>>>>>>>>> in your wiki.
>>>>>>>>>> 2. "The broker must verify after receiving the response that
>>>> the
>>>>>>>> producer
>>>>>>>>>> state is still unknown.": not sure why we have to validate?
>> If
>>>>> the
>>>>>>>>> metadata
>>>>>>>>>> returned from the txn-coordinator can always be considered
>> the
>>>>>>>>>> source-of-truth, can't we just bindly use it to update the
>>>> cache?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <matthias@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am +1 on this :)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to link to
>>>>>>>>>>>> the KIP:
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
>>>>>>>>>>>>
>>>>>>>>>>>> -Jason
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <jason@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a proposal to improve the transactional/idempotent
>>>>>>>>>>>>> producer's handling of the UNKNOWN_PRODUCER error, which is
>>>>>>>>>>>>> the result of losing producer state following segment
>>>>>>>>>>>>> removal. The current behavior is both complex and limited.
>>>>>>>>>>>>> Please take a look and let me know what you think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
>>>>>>>>>>>>> initial draft.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jason
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
> 


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
@Adam:

As a workaround, you can increase the repartition topic config
`segment.bytes` to get larger segments. Since the error is the result of
producer state being dropped when old segments are deleted, larger
segments are deleted less frequently, so the producer state is retained
longer. This should mitigate the issue.
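
For example, with Kafka Streams the setting can be applied to all internal
(repartition and changelog) topics via `StreamsConfig.topicPrefix()`. A
minimal sketch; the application id, bootstrap servers, and the 256 MB
value are placeholders you would adjust for your workload:

  import java.util.Properties;
  import org.apache.kafka.common.config.TopicConfig;
  import org.apache.kafka.streams.StreamsConfig;

  Properties props = new Properties();
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
  // The "topic." prefix forwards the value to internal topics as
  // "segment.bytes" when Streams creates them:
  props.put(StreamsConfig.topicPrefix(TopicConfig.SEGMENT_BYTES_CONFIG),
      Integer.toString(256 * 1024 * 1024));

For repartition topics that already exist, the same config can be changed
with the kafka-configs.sh tool (--alter --entity-type topics --add-config
segment.bytes=...).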


-Matthias


