Posted to dev@kafka.apache.org by Jason Gustafson <ja...@confluent.io> on 2018/09/04 16:55:34 UTC

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Bump. Thanks to Magnus for noticing that I forgot to link to the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
.

-Jason

On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <ja...@confluent.io> wrote:

> Hi All,
>
> I have a proposal to improve the transactional/idempotent producer's
> handling of the UNKNOWN_PRODUCER error, which is the result of losing
> producer state following segment removal. The current behavior is both
> complex and limited. Please take a look and let me know what you think.
>
> Thanks in advance to Matthias Sax for feedback on the initial draft.
>
> -Jason
>

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
Makes sense, thanks!

On Tue, Aug 27, 2019 at 10:38 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Guozhang,
>
> 1. I think there are still some retriable errors that could affect the
> transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.
> 2. Yes, this is right. The only fatal error is when the producer has been
> fenced by another instance.
>
> Thanks,
> Jason
>
> On Mon, Aug 26, 2019 at 6:05 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hi Jason,
> >
> > I've made another pass on the wiki page and it reads much better now. One
> > more clarification about the "Simplified error handling" section:
> >
> > 1. There will be no "retriable error" from the broker side regarding any
> > send requests and txn requests (to txn coordinators). All errors would
> > cause the corresponding txn to eventually be aborted.
> > 2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
> > producer entering the ABORTABLE_ERROR state, but only the current txn to
> be
> > aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer to
> > enter the FATAL_ERROR state, plus it would cause all future txns to be
> > aborted.
> >
> > Is that right?
> >
> >
> > Guozhang
> >
> >
> > On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <ma...@confluent.io>
> > wrote:
> > >
> > > Thanks Jason!
> > >
> > > LGTM.
> > >
> > > On 8/21/19 3:07 PM, Jason Gustafson wrote:
> > > > Hi Matthias,
> > > >
> > > > Thanks, I appreciate the thorough review. I've revised the section
> > > > to make the logic clearer. I think you have it right except for (1).
> > > > We only generate a new PID if the epoch cannot be incremented without
> > > > overflow.
> > > >
> > > > -Jason
> > > >
> > > > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <
> matthias@confluent.io
> > >
> > > > wrote:
> > > >
> > > >> Thanks for the KIP. I just have some clarification questions to make
> > > >> sure I understand the proposal correctly:
> > > >>
> > > >> 1) "Safe Epoch Incrementing"
> > > >>
> > > >>> When the coordinator receives a new InitProducerId request, we
> > > >>> will use the following logic to update the epoch:
> > > >>>
> > > >>> 1. No epoch is provided: the current epoch will be bumped and the
> > > >>> last epoch will be set to -1.
> > > >>> 2. Epoch and producerId are provided, and the provided producerId
> > > >>> matches the current producerId or the provided producerId matches
> > > >>> the previous producerId and the provided epoch is exhausted:
> > > >>>       a. Provided epoch matches current epoch: the last epoch will
> > > >>> be set to the current epoch, and the current epoch will be bumped.
> > > >>>       b. Provided epoch matches last epoch: the current epoch will
> > > >>> be returned
> > > >>>       c. Else: return INVALID_PRODUCER_EPOCH
> > > >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
> > > >>
> > > >> Case (1) would be for a new producer. Hence, should we state that
> > > >> "no PID" is provided (instead of "no epoch" is provided?). That
> > > >> might be clearer and it implies that there is no epoch anyway.
> > > >>
> > > >> Case (2) would be for a producer that received an
> > > >> `UNKNOWN_PRODUCER_ID` error and tries to re-initialize itself.
> > > >>
> > > >> Case (2a) implies that the producer sends its first request and is
> > > >> not fenced. Case (2b) implies that the producer re-tries to
> > > >> re-initialize itself, i.e., its first request to re-initialize did
> > > >> not get a response but was processed by the transaction coordinator.
> > > >> Case (2c) implies that a producer was fenced (similar to case 3,
> > > >> though I am not sure what case 3 actually covers?)
> > > >>
> > > >> Please let me know if my understanding is correct.
> > > >>
> > > >> What is still unclear to me is, why case (2 -- or is it only 2b?)
> > > >> requires that the "provided epoch is exhausted"?
> > > >>
> > > >> For case 2b:
> > > >>
> > > >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> > > >> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding
> > > >> PID/epoch pair. The TC processes the request and creates a new
> > > >> PID=101 with new epoch=0; however, the response to the producer is
> > > >> lost. The TC still stores `currentPid=101`, `currentEpoch=0` and
> > > >> `previousPid=100`, `previousEpoch=MAX_EPOCH`. When the producer
> > > >> re-tries, the sent PID/epoch still matches the previous PID/epoch
> > > >> pair and hence the TC knows it's a retry?
> > > >>
> > > >> If this reasoning is correct, should the logic be as follows:
> > > >>
> > > >> 1. No PID is provided: create a new PID with epoch=0 and set the
> > > >> last epoch to -1.
> > > >> 2. Epoch and producerId are provided
> > > >>    a) the provided producerId/epoch matches the current
> > > >> producerId/epoch:
> > > >>       i) if the epoch is not exhausted, bump the epoch
> > > >>       ii) if the epoch is exhausted, create a new PID with epoch=0
> > > >>    b) the provided producerId/epoch matches the previous
> > > >> producerId/epoch: respond with current PID/epoch
> > > >>    c) Otherwise, return INVALID_PRODUCER_EPOCH
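A toy model of the restructured rules above (illustrative names and PID allocation; a sketch of the proposal in this message, not the actual coordinator code):

```python
MAX_EPOCH = 32767  # the epoch is a 16-bit value in the protocol (Short.MAX_VALUE)

class ToyCoordinator:
    """Toy model of the proposed InitProducerId handling (illustrative only)."""

    def __init__(self):
        self.next_pid = 100   # arbitrary starting PID for this sketch
        self.current = None   # (pid, epoch) currently registered
        self.previous = None  # (pid, epoch) before the last bump

    def _new_pid(self):
        pid, self.next_pid = self.next_pid, self.next_pid + 1
        return pid

    def init_producer_id(self, pid=None, epoch=None):
        if pid is None:
            # 1. No PID provided: brand-new producer, fresh PID with epoch 0.
            self.previous = None
            self.current = (self._new_pid(), 0)
            return self.current
        if (pid, epoch) == self.current:
            # 2a. Bump the epoch, or roll to a new PID when it would overflow.
            self.previous = self.current
            self.current = ((pid, epoch + 1) if epoch < MAX_EPOCH
                            else (self._new_pid(), 0))
            return self.current
        if (pid, epoch) == self.previous:
            # 2b. A retry whose earlier response was lost: return the current
            # assignment again instead of bumping twice.
            return self.current
        # 2c. Anything else looks like a zombie: fence it.
        raise RuntimeError("INVALID_PRODUCER_EPOCH")
```

Walking the lost-response scenario above through this model: the coordinator remembers both the current and previous PID/epoch pairs after a bump, so a retry carrying the previous pair is answered with the current assignment rather than fenced.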
> > > >>
> > > >>
> > > >>
> > > >> -Matthias
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 4/4/19 3:47 PM, Jason Gustafson wrote:
> > > >>> Hi Everyone,
> > > >>>
> > > >>> Sorry for the long delay on this KIP. I have updated it to include
> > the
> > > >>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If
> there
> > are
> > > >> no
> > > >>> further comments, I will plan to start a vote early next week.
> > > >>>
> > > >>> Thanks!
> > > >>> Jason
> > > >>>
> > > >>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <
> > adam.bellemare@gmail.com
> > > >>>
> > > >>> wrote:
> > > >>>
> > > >>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
> > > >>>>
> > > >>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <
> > > >> adam.bellemare@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hi John
> > > >>>>>
> > > >>>>> What is the status of this KIP?
> > > >>>>>
> > > >>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID"
> error
> > on
> > > >>>>> 2.1.1 for a multitude of our internal topics, and I suspect that
> a
> > > >> proper
> > > >>>>> fix is needed.
> > > >>>>>
> > > >>>>> Adam
> > > >>>>>
> > > >>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wangguoz@gmail.com
> >
> > > >> wrote:
> > > >>>>>
> > > >>>>>> Thanks Jason. The proposed solution sounds good to me.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Guozhang
> > > >>>>>>
> > > >>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <
> > jason@confluent.io>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hey Guozhang,
> > > >>>>>>>
> > > >>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING
> > error
> > > >>>>>>> occurs following expiration of the producerId. It's possible
> that
> > > >>>>>> another
> > > >>>>>>> producerId has been installed in its place following expiration
> > (if
> > > >>>>>> another
> > > >>>>>>> producer instance has become active), or the mapping is empty.
> We
> > can
> > > >>>>>>> safely retry the InitProducerId with the logic in this KIP in
> > order
> > > >> to
> > > >>>>>>> detect which case it is. So I'd suggest something like this:
> > > >>>>>>>
> > > >>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer
> > > >>>>>>> can send InitProducerId using the current producerId and epoch.
> > > >>>>>>> 2. If no mapping exists, the coordinator can generate a new
> > > >>>>>>> producerId and return it. If a transaction is in progress on
> > > >>>>>>> the client, it will have to be aborted, but the producer can
> > > >>>>>>> continue afterwards.
> > > >>>>>>> 3. Otherwise if a different producerId has been assigned, then
> > > >>>>>>> we can return INVALID_PRODUCER_ID_MAPPING. To simplify error
> > > >>>>>>> handling, we can probably raise this as ProducerFencedException
> > > >>>>>>> since that is effectively what has happened. Ideally this is
> > > >>>>>>> the only fatal case that users have to handle.
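The three steps could be sketched as a single coordinator-side decision (a toy model with illustrative names, not the actual coordinator implementation):

```python
class ProducerFenced(Exception):
    """Stand-in for ProducerFencedException in this sketch."""

def retry_init_producer_id(mapping, pid, epoch, allocate_pid):
    """Decide the outcome of an InitProducerId retry sent after the client
    saw INVALID_PRODUCER_ID_MAPPING. `mapping` is the coordinator's current
    (pid, epoch) for this transactional.id, or None if it has expired."""
    if mapping is None:
        # Step 2: the mapping expired with no successor; hand out a fresh
        # PID. The client aborts any in-flight transaction, then continues.
        return (allocate_pid(), 0)
    if mapping[0] != pid:
        # Step 3: a different producerId took over, so surface this as
        # fencing -- the one fatal case applications have to handle.
        raise ProducerFenced()
    # Same PID still mapped: the earlier error was transient; return the
    # current assignment so the producer can proceed.
    return mapping
```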
> > > >>>>>>>
> > > >>>>>>> I'll give it a little more thought and update the KIP.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> Jason
> > > >>>>>>>
> > > >>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <
> wangguoz@gmail.com
> > >
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> You're right about the dangling txn since it will actually
> block
> > > >>>>>>>> read-committed consumers from proceeding at all. I'd agree
> that
> > > >>>> since
> > > >>>>>>> this
> > > >>>>>>>> is a very rare case, we can consider fixing it not via
> > broker-side
> > > >>>>>> logic
> > > >>>>>>>> but via tooling in a future work.
> > > >>>>>>>>
> > > >>>>>>>> I've also discovered some related error handling logic inside
> > > >>>> producer
> > > >>>>>>> that
> > > >>>>>>>> may be addressed together with this KIP (since it is mostly
> for
> > > >>>>>> internal
> > > >>>>>>>> implementations the wiki itself does not need to be modified):
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>
> >
> >
> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> > > >>>>>>>>
> > > >>>>>>>> Guozhang
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <
> > jason@confluent.io
> > > >>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hey Guozhang,
> > > >>>>>>>>>
> > > >>>>>>>>> To clarify, the broker does not actually use the ApiVersion
> API
> > > >>>> for
> > > >>>>>>>>> inter-broker communications. The use of an API and its
> > > >>>> corresponding
> > > >>>>>>>>> version is controlled by `inter.broker.protocol.version`.
> > > >>>>>>>>>
> > > >>>>>>>>> Nevertheless, it sounds like we're on the same page about
> > removing
> > > >>>>>>>>> DescribeTransactionState. The impact of a dangling
> transaction
> > is
> > > >>>> a
> > > >>>>>>>> little
> > > >>>>>>>>> worse than what you describe though. Consumers with the
> > > >>>>>> read_committed
> > > >>>>>>>>> isolation level will be stuck. Still, I think we agree that
> > this
> > > >>>>>> case
> > > >>>>>>>>> should be rare and we can reconsider for future work. Rather
> > than
> > > >>>>>>>>> preventing dangling transactions, perhaps we should consider
> > > >>>> options
> > > >>>>>>>> which
> > > >>>>>>>>> allows us to detect them and recover. Anyway, this needs more
> > > >>>>>> thought.
> > > >>>>>>> I
> > > >>>>>>>>> will update the KIP.
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Jason
> > > >>>>>>>>>
> > > >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <
> > wangguoz@gmail.com
> > > >>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> 0. My original question is about the implementation details
> > > >>>>>>>>>> primarily, since currently the handling logic of the
> > > >>>>>>>>>> APIVersionResponse is simply "use the highest supported
> > > >>>>>>>>>> version of the corresponding request", but if the returned
> > > >>>>>>>>>> response from APIVersionRequest says "I don't even know about
> > > >>>>>>>>>> the DescribeTransactionStateRequest at all", then we need
> > > >>>>>>>>>> additional fallback logic. Currently this logic is embedded
> > > >>>>>>>>>> in NetworkClient, which is shared by all clients, so I'd like
> > > >>>>>>>>>> to avoid making it more complicated.
> > > >>>>>>>>>>
> > > >>>>>>>>>> As for the general issue that a broker does not recognize a
> > > >>>>>> producer
> > > >>>>>>>> with
> > > >>>>>>>>>> sequence number 0, here's my thinking: as you mentioned in
> the
> > > >>>>>> wiki,
> > > >>>>>>>> this
> > > >>>>>>>>>> is only a concern for transactional producer since for
> > > >>>> idempotent
> > > >>>>>>>>> producer
> > > >>>>>>>>>> it can just bump the epoch and go. For transactional
> producer,
> > > >>>>>> even
> > > >>>>>>> if
> > > >>>>>>>>> the
> > > >>>>>>>>>> producer request from a fenced producer gets accepted, its
> > > >>>>>>> transaction
> > > >>>>>>>>> will
> > > >>>>>>>>>> never be committed and hence messages not exposed to
> > > >>>>>> read-committed
> > > >>>>>>>>>> consumers as well. The drawback is though, 1)
> read-uncommitted
> > > >>>>>>>> consumers
> > > >>>>>>>>>> will still read those messages, 2) unnecessary storage for
> > those
> > > >>>>>>> fenced
> > > >>>>>>>>>> produce messages, but in practice should not accumulate to a
> > > >>>> large
> > > >>>>>>>> amount
> > > >>>>>>>>>> since producer should soon try to commit and be told it is
> > > >>>> fenced
> > > >>>>>> and
> > > >>>>>>>>> then
> > > >>>>>>>>>> stop, 3) there will be no markers for those transactional
> > > >>>> messages
> > > >>>>>>>> ever.
> > > >>>>>>>>>> Looking at the list and thinking about the likelihood it may
> > > >>>>>> happen
> > > >>>>>>>>>> assuming we retain the producer up to
> transactional.id.timeout
> > > >>>>>>> (default
> > > >>>>>>>>> is
> > > >>>>>>>>>> 7 days), I feel comfortable leaving it as is.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Guozhang
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
> > > >>>>>> jason@confluent.io>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hey Guozhang,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the comments. Responses below:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 0. The new API is used between brokers, so we govern its
> > usage
> > > >>>>>>> using
> > > >>>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
> > > >>>>>>> upgraded,
> > > >>>>>>>>> we
> > > >>>>>>>>>>> will just fallback to the old logic, which is to accept the
> > > >>>>>> write.
> > > >>>>>>>> This
> > > >>>>>>>>>> is
> > > >>>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
> > > >>>> Does
> > > >>>>>>> that
> > > >>>>>>>>>> seem
> > > >>>>>>>>>>> reasonable?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it
> > > >>>>>> over, I
> > > >>>>>>>> am
> > > >>>>>>>>>>> doubting how crucial this API is. It is attempting to
> protect
> > > >>>> a
> > > >>>>>>> write
> > > >>>>>>>>>> from
> > > >>>>>>>>>>> a zombie which has just reset its sequence number after
> that
> > > >>>>>>> producer
> > > >>>>>>>>> had
> > > >>>>>>>>>>> had its state cleaned up. However, one of the other
> > > >>>>>> improvements in
> > > >>>>>>>>> this
> > > >>>>>>>>>>> KIP is to maintain producer state beyond its retention in
> the
> > > >>>>>> log.
> > > >>>>>>> I
> > > >>>>>>>>>> think
> > > >>>>>>>>>>> that makes this case sufficiently unlikely that we can
> leave
> > > >>>> it
> > > >>>>>> for
> > > >>>>>>>>>> future
> > > >>>>>>>>>>> work. I am not 100% sure this is the only scenario where
> > > >>>>>>> transaction
> > > >>>>>>>>>> state
> > > >>>>>>>>>>> and log state can diverge anyway, so it would be better to
> > > >>>>>> consider
> > > >>>>>>>>> this
> > > >>>>>>>>>>> problem more generally. What do you think?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1. Thanks, from memory, the term changed after the first
> > > >>>>>> iteration.
> > > >>>>>>>>> I'll
> > > >>>>>>>>>>> make a pass and try to clarify usage.
> > > >>>>>>>>>>> 2. I was attempting to handle some edge cases since this
> > check
> > > >>>>>>> would
> > > >>>>>>>> be
> > > >>>>>>>>>>> asynchronous. In any case, if we drop this validation as
> > > >>>>>> suggested
> > > >>>>>>>>> above,
> > > >>>>>>>>>>> then we can ignore this.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> -Jason
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
> > > >>>>>> wangguoz@gmail.com>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hello Jason, thanks for the great write-up.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> 0. One question about the migration plan: "The new
> > > >>>>>>>>> GetTransactionState
> > > >>>>>>>>>>> API
> > > >>>>>>>>>>>> and the new version of the transaction state message will
> > > >>>> not
> > > >>>>>> be
> > > >>>>>>>> used
> > > >>>>>>>>>>> until
> > > >>>>>>>>>>>> the inter-broker version supports it." I'm not so clear
> > > >>>> about
> > > >>>>>> the
> > > >>>>>>>>>>>> implementation details here: say a broker is on the newer
> > > >>>>>> version
> > > >>>>>>>> and
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>> txn-coordinator is still on an older version. Today the
> > > >>>>>>>>>>>> APIVersionsRequest can only help upgrade / downgrade the
> > > >>>>>>>>>>>> request version, but cannot forbid sending one altogether.
> > > >>>>>>>>>>>> Are you suggesting we add additional logic on
> > > >>>> the
> > > >>>>>>>> broker
> > > >>>>>>>>>>> side
> > > >>>>>>>>>>>> to handle the case of "not sending the request"? If yes my
> > > >>>>>>> concern
> > > >>>>>>>> is
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>> this will be some tech-debt code that will live long
> before
> > > >>>>>> being
> > > >>>>>>>>>>> removed.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Some additional minor comments:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring
> to
> > > >>>>>> the
> > > >>>>>>>> same
> > > >>>>>>>>>>> thing
> > > >>>>>>>>>>>> in your wiki.
> > > >>>>>>>>>>>> 2. "The broker must verify after receiving the response
> that
> > > >>>>>> the
> > > >>>>>>>>>> producer
> > > >>>>>>>>>>>> state is still unknown.": not sure why we have to
> validate?
> > > >>>> If
> > > >>>>>>> the
> > > >>>>>>>>>>> metadata
> > > >>>>>>>>>>>> returned from the txn-coordinator can always be considered
> > > >>>> the
> > > >>>>>>>>>>>> source-of-truth, can't we just blindly use it to update the
> > > >>>>>> cache?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Guozhang
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> > > >>>>>>>>> matthias@confluent.io>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> I am +1 on this :)
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> -Matthias
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > >>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
> > > >>>> link
> > > >>>>>> to
> > > >>>>>>>> the
> > > >>>>>>>>>> KIP:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > >>>>>>>>>>>>>> .
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> -Jason
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> > > >>>>>>>>>> jason@confluent.io
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Hi All,
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I have a proposal to improve the
> > > >>>> transactional/idempotent
> > > >>>>>>>>>> producer's
> > > >>>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
> > > >>>>>> result
> > > >>>>>>> of
> > > >>>>>>>>>>> losing
> > > >>>>>>>>>>>>>>> producer state following segment removal. The current
> > > >>>>>>> behavior
> > > >>>>>>>>> is
> > > >>>>>>>>>>> both
> > > >>>>>>>>>>>>>>> complex and limited. Please take a look and let me know
> > > >>>>>> what
> > > >>>>>>>> you
> > > >>>>>>>>>>>> think.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
> > > >>>>>>> initial
> > > >>>>>>>>>> draft.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> -Jason
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> --
> > > >>>>>>>>>>>> -- Guozhang
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> -- Guozhang
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> --
> > > >>>>>>>> -- Guozhang
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> -- Guozhang
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Guozhang,

1. I think there are still some retriable errors that could affect the
transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.
2. Yes, this is right. The only fatal error is when the producer has been
fenced by another instance.

Thanks,
Jason
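Taken together with the quoted summary, the client-side reaction can be sketched as a three-way classification (the state names and error sets below are illustrative examples drawn from this thread, not an exhaustive or authoritative list):

```python
# Illustrative classification of broker error codes for the transactional
# producer, per the discussion above; the sets are examples, not exhaustive.
RETRIABLE = {"COORDINATOR_LOAD_IN_PROGRESS"}  # retry in place
ABORTABLE = {"UNKNOWN_PRODUCER_ID", "INVALID_PRODUCER_ID_MAPPING"}
FATAL = {"INVALID_PRODUCER_EPOCH"}            # fenced by another instance

def classify(error: str) -> str:
    """Map a broker error code to the producer's reaction."""
    if error in RETRIABLE:
        return "RETRY"           # the current transaction can still commit
    if error in ABORTABLE:
        return "ABORT_TXN"       # abort only the current transaction
    if error in FATAL:
        return "CLOSE_PRODUCER"  # all future transactions would fail
    return "ABORT_TXN"           # default to failing just the current txn
```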

On Mon, Aug 26, 2019 at 6:05 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hi Jason,
>
> I've made another pass on the wiki page and it reads much better now. One
> more clarification about the "Simplified error handling" section:
>
> 1. There will be no "retriable error" from the broker side regarding any
> send requests and txn requests (to txn coordinators). All errors would
> cause the corresponding txn to eventually be aborted.
> 2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
> producer entering the ABORTABLE_ERROR state, but only the current txn to be
> aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer to
> enter the FATAL_ERROR state, plus it would cause all future txns to be
> aborted.
>
> Is that right?
>
>
> Guozhang
>
>
> On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <ma...@confluent.io>
> wrote:
> >
> > Thanks Jason!
> >
> > LGTM.
> >
> > On 8/21/19 3:07 PM, Jason Gustafson wrote:
> > > Hi Matthias,
> > >
> > > Thanks, I appreciate the thorough review. I've revised the section to
> make
> > > the logic clearer. I think you have it right except for the 1). We only
> > > generate a new PID if the epoch cannot be incremented without overflow.
> > >
> > > -Jason
> > >
> > > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <matthias@confluent.io
> >
> > > wrote:
> > >
> > >> Thanks for the KIP. I just have some clarification questions to make
> > >> sure I understand the proposal correctly:
> > >>
> > >> 1) "Safe Epoch Incrementing"
> > >>
> > >>> When the coordinator receives a new InitProducerId request, we will
> use
> > >> the following logic to update the epoch:
> > >>>
> > >>> 1. No epoch is provided: the current epoch will be bumped and the
> last
> > >> epoch will be set to -1.
> > >>> 2. Epoch and producerId are provided, and the provided producerId
> > >> matches the current producerId or the provided producerId matches the
> > >> previous producerId and the provided epoch is exhausted:
> > >>>       a. Provided epoch matches current epoch: the last epoch will be
> > >> set to the current epoch, and the current epoch will be bumped .
> > >>>       b. Provided epoch matches last epoch: the current epoch will be
> > >> returned
> > >>>       c. Else: return INVALID_PRODUCER_EPOCH
> > >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
> > >>
> > >> Case (1) would be for a new producer. Hence, should we state that "no
> > >> PID" is provided (instead of "no epoch" is provided?). That might be
> > >> clearer and it implies that there is no epoch anyway.
> > >>
> > >> Case (2) would be for a producer that received `UNKNOWN_PRODUCER_ID`
> > >> error and tries to re-initialize itself.
> > >>
> > >> Case (2a) implies that the producer send its first request and is not
> > >> fenced. Case (2b) implies that the producer re-tries to re-initialize
> > >> itself, ie, it first request to re-initilize did not get a respond but
> > >> was processed by the transaction coordinator. Case (2c) implies that a
> > >> producer was fenced (similar case 3, even if I am not sure what case 3
> > >> actually would be?)
> > >>
> > >> Please let me know if my understanding is correct.
> > >>
> > >> What is still unclear to me is, why case (2 -- or is it only 2b?)
> > >> requires that the "provide epoch is exhausted"?
> > >>
> > >> For case 2b:
> > >>
> > >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> > >> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding
> > >> PID/epoch pair. The TC processes the request and creates a new PID=101
> > >> with new epoch=0, however, the respond to the producer is lost. The TC
> > >> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
> > >> `previoudEpoch=MAX_EPOCH`. When the producer re-tries, the sent
> > >> PID/epoch still matches the previous PID/epoch pair and hence the TC
> > >> know it's a retry?
> > >>
> > >> If this reasoning is correct, should the logic be as follows:
> > >>
> > >> 1. No PID is provided: create a new PID with epoch=0 and set the last
> > >> epoch to -1.
> > >> 2. Epoch and producerId are provided
> > >>    a) the provided producerId/epoch matches the current
> producerId/epoch:
> > >>       i) if the epoch is not exhausted, bump the epoch
> > >>       ii) if the epoch is exhausted, create a new PID with epoch=0
> > >>    b) the provided producerId/epoch matches the previous
> > >> producerId/epoch: respond with current PID/epoch
> > >>    c) Otherwise, return INVALID_PRODUCER_EPOCH
> > >>
> > >>
> > >>
> > >> -Matthias
> > >>
> > >>
> > >>
> > >>
> > >> On 4/4/19 3:47 PM, Jason Gustafson wrote:
> > >>> Hi Everyone,
> > >>>
> > >>> Sorry for the long delay on this KIP. I have updated it to include
> the
> > >>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there
> are
> > >> no
> > >>> further comments, I will plan to start a vote early next week.
> > >>>
> > >>> Thanks!
> > >>> Jason
> > >>>
> > >>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <
> adam.bellemare@gmail.com
> > >>>
> > >>> wrote:
> > >>>
> > >>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
> > >>>>
> > >>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <
> > >> adam.bellemare@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi John
> > >>>>>
> > >>>>> What is the status of this KIP?
> > >>>>>
> > >>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error
> on
> > >>>>> 2.1.1 for a multitude of our internal topics, and I suspect that a
> > >> proper
> > >>>>> fix is needed.
> > >>>>>
> > >>>>> Adam
> > >>>>>
> > >>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com>
> > >> wrote:
> > >>>>>
> > >>>>>> Thanks Jason. The proposed solution sounds good to me.
> > >>>>>>
> > >>>>>>
> > >>>>>> Guozhang
> > >>>>>>
> > >>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <
> jason@confluent.io>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hey Guozhang,
> > >>>>>>>
> > >>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING
> error
> > >>>>>>> occurs following expiration of the producerId. It's possible that
> > >>>>>> another
> > >>>>>>> producerId has been installed in its place following expiration
> (if
> > >>>>>> another
> > >>>>>>> producer instance has become active), or the mapping is empty. We
> can
> > >>>>>>> safely retry the InitProducerId with the logic in this KIP in
> order
> > >> to
> > >>>>>>> detect which case it is. So I'd suggest something like this:
> > >>>>>>>
> > >>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can
> send
> > >>>>>>> InitProducerId using the current producerId and epoch.
> > >>>>>>> 2. If no mapping exists, the coordinator can generate a new
> > >> producerId
> > >>>>>> and
> > >>>>>>> return it. If a transaction is in progress on the client, it will
> > >> have
> > >>>>>> to
> > >>>>>>> be aborted, but the producer can continue afterwards.
> > >>>>>>> 3. Otherwise if a different producerId has been assigned, then we
> can
> > >>>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling,
> we
> > >> can
> > >>>>>>> probably raise this as ProducerFencedException since that is
> > >>>> effectively
> > >>>>>>> what has happened. Ideally this is the only fatal case that users
> > >> have
> > >>>>>> to
> > >>>>>>> handle.
> > >>>>>>>
> > >>>>>>> I'll give it a little more thought and update the KIP.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Jason
> > >>>>>>>
> > >>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wangguoz@gmail.com
> >
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> You're right about the dangling txn since it will actually block
> > >>>>>>>> read-committed consumers from proceeding at all. I'd agree that
> > >>>> since
> > >>>>>>> this
> > >>>>>>>> is a very rare case, we can consider fixing it not via
> broker-side
> > >>>>>> logic
> > >>>>>>>> but via tooling in a future work.
> > >>>>>>>>
> > >>>>>>>> I've also discovered some related error handling logic inside
> > >>>> producer
> > >>>>>>> that
> > >>>>>>>> may be addressed together with this KIP (since it is mostly for
> > >>>>>> internal
> > >>>>>>>> implementations the wiki itself does not need to be modified):
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
>
> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> > >>>>>>>>
> > >>>>>>>> Guozhang
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <
> jason@confluent.io
> > >>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hey Guozhang,
> > >>>>>>>>>
> > >>>>>>>>> To clarify, the broker does not actually use the ApiVersion API
> > >>>> for
> > >>>>>>>>> inter-broker communications. The use of an API and its
> > >>>> corresponding
> > >>>>>>>>> version is controlled by `inter.broker.protocol.version`.
> > >>>>>>>>>
> > >>>>>>>>> Nevertheless, it sounds like we're on the same page about
> removing
> > >>>>>>>>> DescribeTransactionState. The impact of a dangling transaction
> is
> > >>>> a
> > >>>>>>>> little
> > >>>>>>>>> worse than what you describe though. Consumers with the
> > >>>>>> read_committed
> > >>>>>>>>> isolation level will be stuck. Still, I think we agree that
> this
> > >>>>>> case
> > >>>>>>>>> should be rare and we can reconsider for future work. Rather
> than
> > >>>>>>>>> preventing dangling transactions, perhaps we should consider
> > >>>> options
> > >>>>>>>> which
> > >>>>>>>>> allows us to detect them and recover. Anyway, this needs more
> > >>>>>> thought.
> > >>>>>>> I
> > >>>>>>>>> will update the KIP.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Jason
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <
> wangguoz@gmail.com
> > >>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> 0. My original question is about the implementation details
> > >>>>>>> primarily,
> > >>>>>>>>>> since current the handling logic of the APIVersionResponse is
> > >>>>>> simply
> > >>>>>>>> "use
> > >>>>>>>>>> the highest supported version of the corresponding request",
> but
> > >>>>>> if
> > >>>>>>> the
> > >>>>>>>>>> returned response from APIVersionRequest says "I don't even
> know
> > >>>>>>> about
> > >>>>>>>>> the
> > >>>>>>>>>> DescribeTransactionStateRequest at all", then we need
> additional
> > >>>>>>> logic
> > >>>>>>>>> for
> > >>>>>>>>>> the fallback logic. Currently this logic is embedded in
> > >>>>>>>> NetworkClient
> > >>>>>>>>>> which is shared by all clients, so I'd like to avoid making
> this
> > >>>>>>> logic
> > >>>>>>>>> more
> > >>>>>>>>>> complicated.
> > >>>>>>>>>>
> > >>>>>>>>>> As for the general issue that a broker does not recognize a
> > >>>>>> producer
> > >>>>>>>> with
> > >>>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
> > >>>>>> wiki,
> > >>>>>>>> this
> > >>>>>>>>>> is only a concern for transactional producer since for
> > >>>> idempotent
> > >>>>>>>>> producer
> > >>>>>>>>>> it can just bump the epoch and go. For transactional producer,
> > >>>>>> even
> > >>>>>>> if
> > >>>>>>>>> the
> > >>>>>>>>>> producer request from a fenced producer gets accepted, its
> > >>>>>>> transaction
> > >>>>>>>>> will
> > >>>>>>>>>> never be committed and hence messages not exposed to
> > >>>>>> read-committed
> > >>>>>>>>>> consumers as well. The drawback is though, 1) read-uncommitted
> > >>>>>>>> consumers
> > >>>>>>>>>> will still read those messages, 2) unnecessary storage for
> those
> > >>>>>>> fenced
> > >>>>>>>>>> produce messages, but in practice should not accumulate to a
> > >>>> large
> > >>>>>>>> amount
> > >>>>>>>>>> since producer should soon try to commit and be told it is
> > >>>> fenced
> > >>>>>> and
> > >>>>>>>>> then
> > >>>>>>>>>> stop, 3) there will be no markers for those transactional
> > >>>> messages
> > >>>>>>>> ever.
> > >>>>>>>>>> Looking at the list and thinking about the likelihood it may
> > >>>>>> happen
> > >>>>>>>>>> assuming we retain the producer up to transactional.id.timeout
> > >>>>>>> (default
> > >>>>>>>>> is
> > >>>>>>>>>> 7 days), I feel comfortable leaving it as is.
> > >>>>>>>>>>
> > >>>>>>>>>> Guozhang
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
> > >>>>>> jason@confluent.io>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hey Guozhang,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the comments. Responses below:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 0. The new API is used between brokers, so we govern its
> usage
> > >>>>>>> using
> > >>>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
> > >>>>>>> upgraded,
> > >>>>>>>>> we
> > >>>>>>>>>>> will just fallback to the old logic, which is to accept the
> > >>>>>> write.
> > >>>>>>>> This
> > >>>>>>>>>> is
> > >>>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
> > >>>> Does
> > >>>>>>> that
> > >>>>>>>>>> seem
> > >>>>>>>>>>> reasonable?
> > >>>>>>>>>>>
> > >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it
> > >>>>>> over, I
> > >>>>>>>> am
> > >>>>>>>>>>> doubting how crucial this API is. It is attempting to protect
> > >>>> a
> > >>>>>>> write
> > >>>>>>>>>> from
> > >>>>>>>>>>> a zombie which has just reset its sequence number after that
> > >>>>>>> producer
> > >>>>>>>>> had
> > >>>>>>>>>>> had its state cleaned up. However, one of the other
> > >>>>>> improvements in
> > >>>>>>>>> this
> > >>>>>>>>>>> KIP is to maintain producer state beyond its retention in the
> > >>>>>> log.
> > >>>>>>> I
> > >>>>>>>>>> think
> > >>>>>>>>>>> that makes this case sufficiently unlikely that we can leave
> > >>>> it
> > >>>>>> for
> > >>>>>>>>>> future
> > >>>>>>>>>>> work. I am not 100% sure this is the only scenario where
> > >>>>>>> transaction
> > >>>>>>>>>> state
> > >>>>>>>>>>> and log state can diverge anyway, so it would be better to
> > >>>>>> consider
> > >>>>>>>>> this
> > >>>>>>>>>>> problem more generally. What do you think?
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Thanks, from memory, the term changed after the first
> > >>>>>> iteration.
> > >>>>>>>>> I'll
> > >>>>>>>>>>> make a pass and try to clarify usage.
> > >>>>>>>>>>> 2. I was attempting to handle some edge cases since this
> check
> > >>>>>>> would
> > >>>>>>>> be
> > >>>>>>>>>>> asynchronous. In any case, if we drop this validation as
> > >>>>>> suggested
> > >>>>>>>>> above,
> > >>>>>>>>>>> then we can ignore this.
> > >>>>>>>>>>>
> > >>>>>>>>>>> -Jason
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
> > >>>>>> wangguoz@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hello Jason, thanks for the great write-up.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 0. One question about the migration plan: "The new
> > >>>>>>>>> GetTransactionState
> > >>>>>>>>>>> API
> > >>>>>>>>>>>> and the new version of the transaction state message will
> > >>>> not
> > >>>>>> be
> > >>>>>>>> used
> > >>>>>>>>>>> until
> > >>>>>>>>>>>> the inter-broker version supports it." I'm not so clear
> > >>>> about
> > >>>>>> the
> > >>>>>>>>>>>> implementation details here: say a broker is on the newer
> > >>>>>> version
> > >>>>>>>> and
> > >>>>>>>>>> the
> > >>>>>>>>>>>> txn-coordinator is still on older version. Today the
> > >>>>>>>>> APIVersionsRequest
> > >>>>>>>>>>> can
> > >>>>>>>>>>>> only help upgrade / downgrade the request version, but not
> > >>>>>>>> forbid
> > >>>>>>>>>>>> sending any. Are you suggesting we add additional logic on
> > >>>> the
> > >>>>>>>> broker
> > >>>>>>>>>>> side
> > >>>>>>>>>>>> to handle the case of "not sending the request"? If yes my
> > >>>>>>> concern
> > >>>>>>>> is
> > >>>>>>>>>>> that
> > >>>>>>>>>>>> this will be some tech-debt code that will live long before
> > >>>>>> being
> > >>>>>>>>>>> removed.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Some additional minor comments:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
> > >>>>>> the
> > >>>>>>>> same
> > >>>>>>>>>>> thing
> > >>>>>>>>>>>> in your wiki.
> > >>>>>>>>>>>> 2. "The broker must verify after receiving the response that
> > >>>>>> the
> > >>>>>>>>>> producer
> > >>>>>>>>>>>> state is still unknown.": not sure why we have to validate?
> > >>>> If
> > >>>>>>> the
> > >>>>>>>>>>> metadata
> > >>>>>>>>>>>> returned from the txn-coordinator can always be considered
> > >>>> the
> > >>>>>>>>>>>> source-of-truth, can't we just blindly use it to update the
> > >>>>>> cache?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Guozhang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> > >>>>>>>>> matthias@confluent.io>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> I am +1 on this :)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > >>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
> > >>>> link
> > >>>>>> to
> > >>>>>>>> the
> > >>>>>>>>>> KIP:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > >>>>>>>>>>>>>> .
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Jason
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> > >>>>>>>>>> jason@confluent.io
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi All,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I have a proposal to improve the
> > >>>> transactional/idempotent
> > >>>>>>>>>> producer's
> > >>>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
> > >>>>>> result
> > >>>>>>> of
> > >>>>>>>>>>> losing
> > >>>>>>>>>>>>>>> producer state following segment removal. The current
> > >>>>>>> behavior
> > >>>>>>>>> is
> > >>>>>>>>>>> both
> > >>>>>>>>>>>>>>> complex and limited. Please take a look and let me know
> > >>>>>> what
> > >>>>>>>> you
> > >>>>>>>>>>>> think.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
> > >>>>>>> initial
> > >>>>>>>>>> draft.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> -Jason
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> -- Guozhang
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> -- Guozhang
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> -- Guozhang
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Jason,

I've made another pass on the wiki page and it reads much better now. One
more clarification about the "Simplified error handling" section:

1. There will be no "retriable error" from the broker side regarding any
send requests and txn requests (to txn coordinators). All errors would
cause the corresponding txn to eventually be aborted.
2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
producer to enter the ABORTABLE_ERROR state, with only the current txn
being aborted; others (INVALID_PRODUCER_EPOCH) would cause the producer to
enter the FATAL_ERROR state, and all future txns would be aborted as well.

Is that right?
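To make my reading of (1) and (2) concrete, here is a minimal sketch; this is
illustrative Python only, not the actual client code, and the class and state
names beyond the KIP's error/state terminology are my own assumptions:

```python
# Illustrative sketch of the producer-side error classification as I
# understand it from the KIP. The error and state names follow the
# KIP's terminology; the structure itself is an assumption.

ABORTABLE_ERRORS = {"UNKNOWN_PRODUCER", "INVALID_PID_MAPPING"}
FATAL_ERRORS = {"INVALID_PRODUCER_EPOCH"}  # the producer was fenced

class TransactionalProducerStates:
    def __init__(self):
        self.state = "READY"

    def handle_error(self, error):
        if error in ABORTABLE_ERRORS:
            # Only the current txn must be aborted; the producer can
            # continue with the next transaction afterwards.
            self.state = "ABORTABLE_ERROR"
        elif error in FATAL_ERRORS:
            # Fenced: the current txn and all future txns fail.
            self.state = "FATAL_ERROR"
        return self.state
```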


Guozhang


On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax <ma...@confluent.io>
wrote:
>
> Thanks Jason!
>
> LGTM.
>
> On 8/21/19 3:07 PM, Jason Gustafson wrote:
> > Hi Matthias,
> >
> > Thanks, I appreciate the thorough review. I've revised the section to
make
> > the logic clearer. I think you have it right except for (1). We only
> > generate a new PID if the epoch cannot be incremented without overflow.
> >
> > -Jason
> >
> > On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <ma...@confluent.io>
> > wrote:
> >
> >> Thanks for the KIP. I just have some clarification questions to make
> >> sure I understand the proposal correctly:
> >>
> >> 1) "Safe Epoch Incrementing"
> >>
> >>> When the coordinator receives a new InitProducerId request, we will
use
> >> the following logic to update the epoch:
> >>>
> >>> 1. No epoch is provided: the current epoch will be bumped and the last
> >> epoch will be set to -1.
> >>> 2. Epoch and producerId are provided, and the provided producerId
> >> matches the current producerId or the provided producerId matches the
> >> previous producerId and the provided epoch is exhausted:
> >>>       a. Provided epoch matches current epoch: the last epoch will be
> >> set to the current epoch, and the current epoch will be bumped.
> >>>       b. Provided epoch matches last epoch: the current epoch will be
> >> returned
> >>>       c. Else: return INVALID_PRODUCER_EPOCH
> >>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
> >>
> >> Case (1) would be for a new producer. Hence, should we state that "no
> >> PID" is provided (instead of "no epoch" is provided?). That might be
> >> clearer and it implies that there is no epoch anyway.
> >>
> >> Case (2) would be for a producer that received `UNKNOWN_PRODUCER_ID`
> >> error and tries to re-initialize itself.
> >>
> >> Case (2a) implies that the producer sent its first request and is not
> >> fenced. Case (2b) implies that the producer re-tries to re-initialize
> >> itself, i.e., its first request to re-initialize did not get a response
> >> but was processed by the transaction coordinator. Case (2c) implies that
> >> a producer was fenced (similar to case 3, though I am not sure what case
> >> 3 actually would be?)
> >>
> >> Please let me know if my understanding is correct.
> >>
> >> What is still unclear to me is why case (2 -- or is it only 2b?)
> >> requires that the "provided epoch is exhausted"?
> >>
> >> For case 2b:
> >>
> >> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> >> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding
> >> PID/epoch pair. The TC processes the request and creates a new PID=101
> >> with new epoch=0; however, the response to the producer is lost. The TC
> >> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
> >> `previousEpoch=MAX_EPOCH`. When the producer re-tries, the sent
> >> PID/epoch still matches the previous PID/epoch pair and hence the TC
> >> knows it's a retry?
> >>
> >> If this reasoning is correct, should the logic be as follows:
> >>
> >> 1. No PID is provided: create a new PID with epoch=0 and set the last
> >> epoch to -1.
> >> 2. Epoch and producerId are provided
> >>    a) the provided producerId/epoch matches the current
producerId/epoch:
> >>       i) if the epoch is not exhausted, bump the epoch
> >>       ii) if the epoch is exhausted, create a new PID with epoch=0
> >>    b) the provided producerId/epoch matches the previous
> >> producerId/epoch: respond with current PID/epoch
> >>    c) Otherwise, return INVALID_PRODUCER_EPOCH
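The proposed logic above is easy to sanity-check with a little sketch. This
is illustrative Python only (ProducerIdState, MAX_EPOCH, and the pid
numbering are made-up names, not the transaction coordinator code), and it
covers the lost-response retry from the PID=100/MAX_EPOCH scenario above:

```python
# Illustrative sketch of the proposed coordinator-side InitProducerId
# handling. All names here are assumptions for illustration only.

MAX_EPOCH = 32767  # the producer epoch is a 16-bit value on the wire

class ProducerIdState:
    def __init__(self, first_pid=100):
        self.current_pid, self.current_epoch = None, -1
        self.prev_pid, self.prev_epoch = None, -1
        self._next_pid = first_pid

    def _assign_new_pid(self):
        # Remember the old identity so a retry (case 2b) still matches.
        self.prev_pid, self.prev_epoch = self.current_pid, self.current_epoch
        self.current_pid, self.current_epoch = self._next_pid, 0
        self._next_pid += 1
        return self.current_pid, self.current_epoch

    def init_producer_id(self, pid=None, epoch=None):
        if pid is None:
            # Case 1: brand-new producer, no PID provided.
            return self._assign_new_pid()
        if (pid, epoch) == (self.current_pid, self.current_epoch):
            if self.current_epoch < MAX_EPOCH:
                # Case 2a-i: bump the epoch, remembering the old one.
                self.prev_pid, self.prev_epoch = pid, epoch
                self.current_epoch += 1
            else:
                # Case 2a-ii: epoch exhausted, fall back to a new PID.
                self._assign_new_pid()
            return self.current_pid, self.current_epoch
        if (pid, epoch) == (self.prev_pid, self.prev_epoch):
            # Case 2b: a retry whose earlier response was lost; return
            # the already-updated identity.
            return self.current_pid, self.current_epoch
        # Case 2c / 3: anything else means the producer was fenced.
        raise RuntimeError("INVALID_PRODUCER_EPOCH")
```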
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >>
> >>
> >> On 4/4/19 3:47 PM, Jason Gustafson wrote:
> >>> Hi Everyone,
> >>>
> >>> Sorry for the long delay on this KIP. I have updated it to include the
> >>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there
are
> >> no
> >>> further comments, I will plan to start a vote early next week.
> >>>
> >>> Thanks!
> >>> Jason
> >>>
> >>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <
adam.bellemare@gmail.com
> >>>
> >>> wrote:
> >>>
> >>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
> >>>>
> >>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <
> >> adam.bellemare@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi John
> >>>>>
> >>>>> What is the status of this KIP?
> >>>>>
> >>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error
on
> >>>>> 2.1.1 for a multitude of our internal topics, and I suspect that a
> >> proper
> >>>>> fix is needed.
> >>>>>
> >>>>> Adam
> >>>>>
> >>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> Thanks Jason. The proposed solution sounds good to me.
> >>>>>>
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hey Guozhang,
> >>>>>>>
> >>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING
error
> >>>>>>> occurs following expiration of the producerId. It's possible that
> >>>>>> another
> >>>>>>> producerId has been installed in its place following expiration
(if
> >>>>>> another
> >>>>>>> producer instance has become active), or the mapping is empty. We
can
> >>>>>>> safely retry the InitProducerId with the logic in this KIP in
order
> >> to
> >>>>>>> detect which case it is. So I'd suggest something like this:
> >>>>>>>
> >>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can
send
> >>>>>>> InitProducerId using the current producerId and epoch.
> >>>>>>> 2. If no mapping exists, the coordinator can generate a new
> >> producerId
> >>>>>> and
> >>>>>>> return it. If a transaction is in progress on the client, it will
> >> have
> >>>>>> to
> >>>>>>> be aborted, but the producer can continue afterwards.
> >>>>>>> 3. Otherwise if a different producerId has been assigned, then we
can
> >>>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we
> >> can
> >>>>>>> probably raise this as ProducerFencedException since that is
> >>>> effectively
> >>>>>>> what has happened. Ideally this is the only fatal case that users
> >> have
> >>>>>> to
> >>>>>>> handle.
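To make sure I follow the three steps suggested above, here is an
illustrative sketch of the client-side handling. This is Python pseudocode
only; the response shape, the Producer stub, and the function names are my
assumptions, not the real producer internals:

```python
# Illustrative sketch of retrying InitProducerId after receiving
# INVALID_PRODUCER_ID_MAPPING. Everything here is a made-up stand-in
# for the actual client machinery.
from dataclasses import dataclass

@dataclass
class InitPidResponse:
    pid: int
    epoch: int
    error: str = None

class ProducerFencedException(Exception):
    pass

class Producer:
    def __init__(self, pid, epoch):
        self.pid, self.epoch = pid, epoch
        self.aborted_current_txn = False

    def abort_current_transaction(self):
        self.aborted_current_txn = True

def handle_invalid_pid_mapping(producer, send_init_pid):
    # Step 1: retry InitProducerId with the current producerId/epoch.
    resp = send_init_pid(producer.pid, producer.epoch)
    if resp.error is None:
        # Step 2: the mapping was empty, so the coordinator generated
        # a new producerId. Abort any in-progress transaction, then
        # continue with the new identity.
        producer.abort_current_transaction()
        producer.pid, producer.epoch = resp.pid, resp.epoch
        return producer
    # Step 3: a different producerId took the slot; surface it as
    # fencing, the one fatal case the application must handle.
    raise ProducerFencedException(resp.error)
```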
> >>>>>>>
> >>>>>>> I'll give it a little more thought and update the KIP.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Jason
> >>>>>>>
> >>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> You're right about the dangling txn since it will actually block
> >>>>>>>> read-committed consumers from proceeding at all. I'd agree that
> >>>> since
> >>>>>>> this
> >>>>>>>> is a very rare case, we can consider fixing it not via
broker-side
> >>>>>> logic
> >>>>>>>> but via tooling in a future work.
> >>>>>>>>
> >>>>>>>> I've also discovered some related error handling logic inside
> >>>> producer
> >>>>>>> that
> >>>>>>>> may be addressed together with this KIP (since it is mostly for
> >>>>>> internal
> >>>>>>>> implementations the wiki itself does not need to be modified):
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> >>>>>>>>
> >>>>>>>> Guozhang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <
jason@confluent.io
> >>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Guozhang,
> >>>>>>>>>
> >>>>>>>>> To clarify, the broker does not actually use the ApiVersion API
> >>>> for
> >>>>>>>>> inter-broker communications. The use of an API and its
> >>>> corresponding
> >>>>>>>>> version is controlled by `inter.broker.protocol.version`.
> >>>>>>>>>
> >>>>>>>>> Nevertheless, it sounds like we're on the same page about
removing
> >>>>>>>>> DescribeTransactionState. The impact of a dangling transaction
is
> >>>> a
> >>>>>>>> little
> >>>>>>>>> worse than what you describe though. Consumers with the
> >>>>>> read_committed
> >>>>>>>>> isolation level will be stuck. Still, I think we agree that this
> >>>>>> case
> >>>>>>>>> should be rare and we can reconsider for future work. Rather
than
> >>>>>>>>> preventing dangling transactions, perhaps we should consider
> >>>> options
> >>>>>>>> which
> >>>>>>>>> allows us to detect them and recover. Anyway, this needs more
> >>>>>> thought.
> >>>>>>> I
> >>>>>>>>> will update the KIP.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jason
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <
wangguoz@gmail.com
> >>>>>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> 0. My original question is about the implementation details
> >>>>>>> primarily,
> >>>>>>>>>> since currently the handling logic of the APIVersionResponse is
> >>>>>> simply
> >>>>>>>> "use
> >>>>>>>>>> the highest supported version of the corresponding request",
but
> >>>>>> if
> >>>>>>> the
> >>>>>>>>>> returned response from APIVersionRequest says "I don't even
know
> >>>>>>> about
> >>>>>>>>> the
> >>>>>>>>>> DescribeTransactionStateRequest at all", then we need
additional
> >>>>>>> logic
> >>>>>>>>> for
> >>>>>>>>>> the fallback logic. Currently this logic is embedded in
> >>>>>>>> NetworkClient
> >>>>>>>>>> which is shared by all clients, so I'd like to avoid making
this
> >>>>>>> logic
> >>>>>>>>> more
> >>>>>>>>>> complicated.
> >>>>>>>>>>
> >>>>>>>>>> As for the general issue that a broker does not recognize a
> >>>>>> producer
> >>>>>>>> with
> >>>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
> >>>>>> wiki,
> >>>>>>>> this
> >>>>>>>>>> is only a concern for transactional producer since for
> >>>> idempotent
> >>>>>>>>> producer
> >>>>>>>>>> it can just bump the epoch and go. For transactional producer,
> >>>>>> even
> >>>>>>> if
> >>>>>>>>> the
> >>>>>>>>>> producer request from a fenced producer gets accepted, its
> >>>>>>> transaction
> >>>>>>>>> will
> >>>>>>>>>> never be committed and hence messages not exposed to
> >>>>>> read-committed
> >>>>>>>>>> consumers as well. The drawback is though, 1) read-uncommitted
> >>>>>>>> consumers
> >>>>>>>>>> will still read those messages, 2) unnecessary storage for
those
> >>>>>>> fenced
> >>>>>>>>>> produce messages, but in practice should not accumulate to a
> >>>> large
> >>>>>>>> amount
> >>>>>>>>>> since producer should soon try to commit and be told it is
> >>>> fenced
> >>>>>> and
> >>>>>>>>> then
> >>>>>>>>>> stop, 3) there will be no markers for those transactional
> >>>> messages
> >>>>>>>> ever.
> >>>>>>>>>> Looking at the list and thinking about the likelihood it may
> >>>>>> happen
> >>>>>>>>>> assuming we retain the producer up to transactional.id.timeout
> >>>>>>> (default
> >>>>>>>>> is
> >>>>>>>>>> 7 days), I feel comfortable leaving it as is.
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
> >>>>>> jason@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the comments. Responses below:
> >>>>>>>>>>>
> >>>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
> >>>>>>> using
> >>>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
> >>>>>>> upgraded,
> >>>>>>>>> we
> >>>>>>>>>>> will just fallback to the old logic, which is to accept the
> >>>>>> write.
> >>>>>>>> This
> >>>>>>>>>> is
> >>>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
> >>>> Does
> >>>>>>> that
> >>>>>>>>>> seem
> >>>>>>>>>>> reasonable?
> >>>>>>>>>>>
> >>>>>>>>>>> To tell the truth, after digging this KIP up and reading it
> >>>>>> over, I
> >>>>>>>> am
> >>>>>>>>>>> doubting how crucial this API is. It is attempting to protect
> >>>> a
> >>>>>>> write
> >>>>>>>>>> from
> >>>>>>>>>>> a zombie which has just reset its sequence number after that
> >>>>>>> producer
> >>>>>>>>> had
> >>>>>>>>>>> had its state cleaned up. However, one of the other
> >>>>>> improvements in
> >>>>>>>>> this
> >>>>>>>>>>> KIP is to maintain producer state beyond its retention in the
> >>>>>> log.
> >>>>>>> I
> >>>>>>>>>> think
> >>>>>>>>>>> that makes this case sufficiently unlikely that we can leave
> >>>> it
> >>>>>> for
> >>>>>>>>>> future
> >>>>>>>>>>> work. I am not 100% sure this is the only scenario where
> >>>>>>> transaction
> >>>>>>>>>> state
> >>>>>>>>>>> and log state can diverge anyway, so it would be better to
> >>>>>> consider
> >>>>>>>>> this
> >>>>>>>>>>> problem more generally. What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Thanks, from memory, the term changed after the first
> >>>>>> iteration.
> >>>>>>>>> I'll
> >>>>>>>>>>> make a pass and try to clarify usage.
> >>>>>>>>>>> 2. I was attempting to handle some edge cases since this check
> >>>>>>> would
> >>>>>>>> be
> >>>>>>>>>>> asynchronous. In any case, if we drop this validation as
> >>>>>> suggested
> >>>>>>>>> above,
> >>>>>>>>>>> then we can ignore this.
> >>>>>>>>>>>
> >>>>>>>>>>> -Jason
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
> >>>>>> wangguoz@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello Jason, thanks for the great write-up.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0. One question about the migration plan: "The new
> >>>>>>>>> GetTransactionState
> >>>>>>>>>>> API
> >>>>>>>>>>>> and the new version of the transaction state message will
> >>>> not
> >>>>>> be
> >>>>>>>> used
> >>>>>>>>>>> until
> >>>>>>>>>>>> the inter-broker version supports it." I'm not so clear
> >>>> about
> >>>>>> the
> >>>>>>>>>>>> implementation details here: say a broker is on the newer
> >>>>>> version
> >>>>>>>> and
> >>>>>>>>>> the
> >>>>>>>>>>>> txn-coordinator is still on older version. Today the
> >>>>>>>>> APIVersionsRequest
> >>>>>>>>>>> can
> >>>>>>>>>>>> only help upgrade / downgrade the request version, but not
> >>>>>>>> forbid
> >>>>>>>>>>>> sending any. Are you suggesting we add additional logic on
> >>>> the
> >>>>>>>> broker
> >>>>>>>>>>> side
> >>>>>>>>>>>> to handle the case of "not sending the request"? If yes my
> >>>>>>> concern
> >>>>>>>> is
> >>>>>>>>>>> that
> >>>>>>>>>>>> this will be some tech-debt code that will live long before
> >>>>>> being
> >>>>>>>>>>> removed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Some additional minor comments:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
> >>>>>> the
> >>>>>>>> same
> >>>>>>>>>>> thing
> >>>>>>>>>>>> in your wiki.
> >>>>>>>>>>>> 2. "The broker must verify after receiving the response that
> >>>>>> the
> >>>>>>>>>> producer
> >>>>>>>>>>>> state is still unknown.": not sure why we have to validate?
> >>>> If
> >>>>>>> the
> >>>>>>>>>>> metadata
> >>>>>>>>>>>> returned from the txn-coordinator can always be considered
> >>>> the
> >>>>>>>>>>>> source-of-truth, can't we just blindly use it to update the
> >>>>>> cache?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> >>>>>>>>> matthias@confluent.io>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I am +1 on this :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
> >>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
> >>>> link
> >>>>>> to
> >>>>>>>> the
> >>>>>>>>>> KIP:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> >>>>>>>>>>>>>> .
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Jason
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> >>>>>>>>>> jason@confluent.io
> >>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I have a proposal to improve the
> >>>> transactional/idempotent
> >>>>>>>>>> producer's
> >>>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
> >>>>>> result
> >>>>>>> of
> >>>>>>>>>>> losing
> >>>>>>>>>>>>>>> producer state following segment removal. The current
> >>>>>>> behavior
> >>>>>>>>> is
> >>>>>>>>>>> both
> >>>>>>>>>>>>>>> complex and limited. Please take a look and let me know
> >>>>>> what
> >>>>>>>> you
> >>>>>>>>>>>> think.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
> >>>>>>> initial
> >>>>>>>>>> draft.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Jason
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -- Guozhang
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>


--
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks Jason!

LGTM.

On 8/21/19 3:07 PM, Jason Gustafson wrote:
> Hi Matthias,
> 
> Thanks, I appreciate the thorough review. I've revised the section to make
> the logic clearer. I think you have it right except for (1). We only
> generate a new PID if the epoch cannot be incremented without overflow.
> 
> -Jason
> 
> On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> Thanks for the KIP. I just have some clarification questions to make
>> sure I understand the proposal correctly:
>>
>> 1) "Safe Epoch Incrementing"
>>
>>> When the coordinator receives a new InitProducerId request, we will use
>> the following logic to update the epoch:
>>>
>>> 1. No epoch is provided: the current epoch will be bumped and the last
>> epoch will be set to -1.
>>> 2. Epoch and producerId are provided, and the provided producerId
>> matches the current producerId or the provided producerId matches the
>> previous producerId and the provided epoch is exhausted:
>>>       a. Provided epoch matches current epoch: the last epoch will be
>> set to the current epoch, and the current epoch will be bumped.
>>>       b. Provided epoch matches last epoch: the current epoch will be
>> returned
>>>       c. Else: return INVALID_PRODUCER_EPOCH
>>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
>>
>> Case (1) would be for a new producer. Hence, should we state that "no
>> PID" is provided (instead of "no epoch" is provided?). That might be
>> clearer and it implies that there is no epoch anyway.
>>
>> Case (2) would be for a producer that received `UNKNOWN_PRODUCER_ID`
>> error and tries to re-initialize itself.
>>
>> Case (2a) implies that the producer sent its first request and is not
>> fenced. Case (2b) implies that the producer re-tries to re-initialize
>> itself, i.e., its first request to re-initialize did not get a response
>> but was processed by the transaction coordinator. Case (2c) implies that
>> a producer was fenced (similar to case 3, though I am not sure what case
>> 3 actually would be?)
>>
>> Please let me know if my understanding is correct.
>>
>> What is still unclear to me is why case (2 -- or is it only 2b?)
>> requires that the "provided epoch is exhausted"?
>>
>> For case 2b:
>>
>> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
>> `UNKNOWN_PRODUCER_ID`. It sends initPidRequest with the corresponding
>> PID/epoch pair. The TC processes the request and creates a new PID=101
>> with new epoch=0; however, the response to the producer is lost. The TC
>> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
>> `previousEpoch=MAX_EPOCH`. When the producer re-tries, the sent
>> PID/epoch still matches the previous PID/epoch pair and hence the TC
>> knows it's a retry?
>>
>> If this reasoning is correct, should the logic be as follows:
>>
>> 1. No PID is provided: create a new PID with epoch=0 and set the last
>> epoch to -1.
>> 2. Epoch and producerId are provided
>>    a) the provided producerId/epoch matches the current producerId/epoch:
>>       i) if the epoch is not exhausted, bump the epoch
>>       ii) if the epoch is exhausted, create a new PID with epoch=0
>>    b) the provided producerId/epoch matches the previous
>> producerId/epoch: respond with current PID/epoch
>>    c) Otherwise, return INVALID_PRODUCER_EPOCH
>>
>>
>>
>> -Matthias
>>
>>
>>
>>
>> On 4/4/19 3:47 PM, Jason Gustafson wrote:
>>> Hi Everyone,
>>>
>>> Sorry for the long delay on this KIP. I have updated it to include the
>>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are
>> no
>>> further comments, I will plan to start a vote early next week.
>>>
>>> Thanks!
>>> Jason
>>>
>>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <adam.bellemare@gmail.com
>>>
>>> wrote:
>>>
>>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>>>>
>>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <
>> adam.bellemare@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi John
>>>>>
>>>>> What is the status of this KIP?
>>>>>
>>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
>>>>> 2.1.1 for a multitude of our internal topics, and I suspect that a
>> proper
>>>>> fix is needed.
>>>>>
>>>>> Adam
>>>>>
>>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com>
>> wrote:
>>>>>
>>>>>> Thanks Jason. The proposed solution sounds good to me.
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
>>>>>>> occurs following expiration of the producerId. It's possible that
>>>>>> another
>>>>>>> producerId has been installed in its place following expiration (if
>>>>>> another
>>>>>>> producer instance has become active), or the mapping is empty. We can
>>>>>>> safely retry the InitProducerId with the logic in this KIP in order
>> to
>>>>>>> detect which case it is. So I'd suggest something like this:
>>>>>>>
>>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
>>>>>>> InitProducerId using the current producerId and epoch.
>>>>>>> 2. If no mapping exists, the coordinator can generate a new
>> producerId
>>>>>> and
>>>>>>> return it. If a transaction is in progress on the client, it will
>> have
>>>>>> to
>>>>>>> be aborted, but the producer can continue afterwards.
>>>>>>> 3. Otherwise if a different producerId has been assigned, then we can
>>>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we
>> can
>>>>>>> probably raise this as ProducerFencedException since that is
>>>> effectively
>>>>>>> what has happened. Ideally this is the only fatal case that users
>> have
>>>>>> to
>>>>>>> handle.
>>>>>>>
>>>>>>> I'll give it a little more thought and update the KIP.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jason
>>>>>>>
>>>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>> You're right about the dangling txn since it will actually block
>>>>>>>> read-committed consumers from proceeding at all. I'd agree that
>>>> since
>>>>>>> this
>>>>>>>> is a very rare case, we can consider fixing it not via broker-side
>>>>>> logic
>>>>>>>> but via tooling in a future work.
>>>>>>>>
>>>>>>>> I've also discovered some related error handling logic inside
>>>> producer
>>>>>>> that
>>>>>>>> may be addressed together with this KIP (since it is mostly for
>>>>>> internal
>>>>>>>> implementations the wiki itself does not need to be modified):
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <jason@confluent.io
>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang,
>>>>>>>>>
>>>>>>>>> To clarify, the broker does not actually use the ApiVersion API
>>>> for
>>>>>>>>> inter-broker communications. The use of an API and its
>>>> corresponding
>>>>>>>>> version is controlled by `inter.broker.protocol.version`.
>>>>>>>>>
>>>>>>>>> Nevertheless, it sounds like we're on the same page about removing
>>>>>>>>> DescribeTransactionState. The impact of a dangling transaction is
>>>> a
>>>>>>>> little
>>>>>>>>> worse than what you describe though. Consumers with the
>>>>>> read_committed
>>>>>>>>> isolation level will be stuck. Still, I think we agree that this
>>>>>> case
>>>>>>>>> should be rare and we can reconsider for future work. Rather than
>>>>>>>>> preventing dangling transactions, perhaps we should consider
>>>> options
>>>>>>>> which
>>>>>>>>> allows us to detect them and recover. Anyway, this needs more
>>>>>> thought.
>>>>>>> I
>>>>>>>>> will update the KIP.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jason
>>>>>>>>>
>>>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangguoz@gmail.com
>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> 0. My original question is about the implementation details
>>>>>>> primarily,
>>>>>>>>>> since currently the handling logic of the APIVersionResponse is
>>>>>> simply
>>>>>>>> "use
>>>>>>>>>> the highest supported version of the corresponding request", but
>>>>>> if
>>>>>>> the
>>>>>>>>>> returned response from APIVersionRequest says "I don't even know
>>>>>>> about
>>>>>>>>> the
>>>>>>>>>> DescribeTransactionStateRequest at all", then we need additional
>>>>>>> logic
>>>>>>>>> for
>>>>>>>>>> the falling back logic. Currently this logic is embedded in
>>>>>>>> NetworkClient
>>>>>>>>>> which is shared by all clients, so I'd like to avoid making this
>>>>>>> logic
>>>>>>>>> more
>>>>>>>>>> complicated.
>>>>>>>>>>
>>>>>>>>>> As for the general issue that a broker does not recognize a
>>>>>> producer
>>>>>>>> with
>>>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
>>>>>> wiki,
>>>>>>>> this
>>>>>>>>>> is only a concern for transactional producer since for
>>>> idempotent
>>>>>>>>> producer
>>>>>>>>>> it can just bump the epoch and go. For transactional producer,
>>>>>> even
>>>>>>> if
>>>>>>>>> the
>>>>>>>>>> producer request from a fenced producer gets accepted, its
>>>>>>> transaction
>>>>>>>>> will
>>>>>>>>>> never be committed and hence messages not exposed to
>>>>>> read-committed
>>>>>>>>>> consumers as well. The drawback is though, 1) read-uncommitted
>>>>>>>> consumers
>>>>>>>>>> will still read those messages, 2) unnecessary storage for those
>>>>>>> fenced
>>>>>>>>>> produce messages, but in practice should not accumulate to a
>>>> large
>>>>>>>> amount
>>>>>>>>>> since producer should soon try to commit and be told it is
>>>> fenced
>>>>>> and
>>>>>>>>> then
>>>>>>>>>> stop, 3) there will be no markers for those transactional
>>>> messages
>>>>>>>> ever.
>>>>>>>>>> Looking at the list and thinking about the likelihood it may
>>>>>> happen
>>>>>>>>>> assuming we retain the producer up to transactional.id.timeout
>>>>>>> (default
>>>>>>>>> is
>>>>>>>>>> 7 days), I feel comfortable leaving it as is.
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
>>>>>> jason@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the comments. Responses below:
>>>>>>>>>>>
>>>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
>>>>>>> using
>>>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
>>>>>>> upgraded,
>>>>>>>>> we
>>>>>>>>>>> will just fallback to the old logic, which is to accept the
>>>>>> write.
>>>>>>>> This
>>>>>>>>>> is
>>>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
>>>> Does
>>>>>>> that
>>>>>>>>>> seem
>>>>>>>>>>> reasonable?
>>>>>>>>>>>
>>>>>>>>>>> To tell the truth, after digging this KIP up and reading it
>>>>>> over, I
>>>>>>>> am
>>>>>>>>>>> doubting how crucial this API is. It is attempting to protect
>>>> a
>>>>>>> write
>>>>>>>>>> from
>>>>>>>>>>> a zombie which has just reset its sequence number after that
>>>>>>> producer
>>>>>>>>> had
>>>>>>>>>>> had its state cleaned up. However, one of the other
>>>>>> improvements in
>>>>>>>>> this
>>>>>>>>>>> KIP is to maintain producer state beyond its retention in the
>>>>>> log.
>>>>>>> I
>>>>>>>>>> think
>>>>>>>>>>> that makes this case sufficiently unlikely that we can leave
>>>> it
>>>>>> for
>>>>>>>>>> future
>>>>>>>>>>> work. I am not 100% sure this is the only scenario where
>>>>>>> transaction
>>>>>>>>>> state
>>>>>>>>>>> and log state can diverge anyway, so it would be better to
>>>>>> consider
>>>>>>>>> this
>>>>>>>>>>> problem more generally. What do you think?
>>>>>>>>>>>
>>>>>>>>>>> 1. Thanks, from memory, the term changed after the first
>>>>>> iteration.
>>>>>>>>> I'll
>>>>>>>>>>> make a pass and try to clarify usage.
>>>>>>>>>>> 2. I was attempting to handle some edge cases since this check
>>>>>>> would
>>>>>>>> be
>>>>>>>>>>> asynchronous. In any case, if we drop this validation as
>>>>>> suggested
>>>>>>>>> above,
>>>>>>>>>>> then we can ignore this.
>>>>>>>>>>>
>>>>>>>>>>> -Jason
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
>>>>>> wangguoz@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Jason, thanks for the great write-up.
>>>>>>>>>>>>
>>>>>>>>>>>> 0. One question about the migration plan: "The new
>>>>>>>>> GetTransactionState
>>>>>>>>>>> API
>>>>>>>>>>>> and the new version of the transaction state message will
>>>> not
>>>>>> be
>>>>>>>> used
>>>>>>>>>>> until
>>>>>>>>>>>> the inter-broker version supports it." I'm not so clear
>>>> about
>>>>>> the
>>>>>>>>>>>> implementation details here: say a broker is on the newer
>>>>>> version
>>>>>>>> and
>>>>>>>>>> the
>>>>>>>>>>>> txn-coordinator is still on older version. Today the
>>>>>>>>> APIVersionsRequest
>>>>>>>>>>> can
>>>>>>>>>>>> only help upgrade / downgrade the request version, but not
>>>>>>>> forbidding
>>>>>>>>>>>> sending any. Are you suggesting we add additional logic on
>>>> the
>>>>>>>> broker
>>>>>>>>>>> side
>>>>>>>>>>>> to handle the case of "not sending the request"? If yes my
>>>>>>> concern
>>>>>>>> is
>>>>>>>>>>> that
>>>>>>>>>>>> this will be some tech-debt code that will live long before
>>>>>> being
>>>>>>>>>>> removed.
>>>>>>>>>>>>
>>>>>>>>>>>> Some additional minor comments:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
>>>>>> the
>>>>>>>> same
>>>>>>>>>>> thing
>>>>>>>>>>>> in your wiki.
>>>>>>>>>>>> 2. "The broker must verify after receiving the response that
>>>>>> the
>>>>>>>>>> producer
>>>>>>>>>>>> state is still unknown.": not sure why we have to validate?
>>>> If
>>>>>>> the
>>>>>>>>>>> metadata
>>>>>>>>>>>> returned from the txn-coordinator can always be considered
>>>> the
>>>>>>>>>>>> source-of-truth, can't we just blindly use it to update the
>>>>>> cache?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am +1 on this :)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
>>>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
>>>> link
>>>>>> to
>>>>>>>> the
>>>>>>>>>> KIP:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jason
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
>>>>>>>>>> jason@confluent.io
>>>>>>>>>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a proposal to improve the
>>>> transactional/idempotent
>>>>>>>>>> producer's
>>>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
>>>>>> result
>>>>>>> of
>>>>>>>>>>> losing
>>>>>>>>>>>>>>> producer state following segment removal. The current
>>>>>>> behavior
>>>>>>>>> is
>>>>>>>>>>> both
>>>>>>>>>>>>>>> complex and limited. Please take a look and let me know
>>>>>> what
>>>>>>>> you
>>>>>>>>>>>> think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
>>>>>>> initial
>>>>>>>>>> draft.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jason
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>
>>
>>
> 


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Matthias,

Thanks, I appreciate the thorough review. I've revised the section to make
the logic clearer. I think you have it right except for (1). We only
generate a new PID if the epoch cannot be incremented without overflow.

-Jason
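For concreteness, the three-step handling of INVALID_PRODUCER_ID_MAPPING
proposed earlier in this thread might look roughly like this on the client
side. This is only an illustrative sketch, not the real producer internals:
`reinitialize_after_mapping_loss`, `InitPidResponse`, and the coordinator
stub are invented names.

```python
class ProducerFencedException(Exception):
    pass

class InitPidResponse:
    """Toy stand-in for an InitProducerId response (error, pid, epoch)."""
    def __init__(self, error, pid=None, epoch=None):
        self.error, self.pid, self.epoch = error, pid, epoch

def reinitialize_after_mapping_loss(coordinator, pid, epoch):
    # Step 1: resend InitProducerId with the current PID and epoch so the
    # coordinator can tell an expired mapping apart from a zombie producer.
    resp = coordinator.init_producer_id(pid, epoch)
    if resp.error == "NONE":
        # Step 2: the mapping was empty, so the coordinator issued a fresh
        # producerId. Any transaction in progress must be aborted, but the
        # producer can continue with the new identity afterwards.
        return resp.pid, resp.epoch
    # Step 3: a different producerId now owns the mapping; surface it as
    # fencing, ideally the only fatal case applications have to handle.
    raise ProducerFencedException(resp.error)
```

The point of the sketch is only the branch structure: one retriable path
(fresh PID, abort and continue) and one fatal path (fenced).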

On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax <ma...@confluent.io>
wrote:

> Thanks for the KIP. I just have some clarification questions to make
> sure I understand the proposal correctly:
>
> 1) "Safe Epoch Incrementing"
>
> > When the coordinator receives a new InitProducerId request, we will use
> the following logic to update the epoch:
> >
> > 1. No epoch is provided: the current epoch will be bumped and the last
> epoch will be set to -1.
> > 2. Epoch and producerId are provided, and the provided producerId
> matches the current producerId or the provided producerId matches the
> previous producerId and the provided epoch is exhausted:
> >       a. Provided epoch matches current epoch: the last epoch will be
> set to the current epoch, and the current epoch will be bumped .
> >       b. Provided epoch matches last epoch: the current epoch will be
> returned
> >       c. Else: return INVALID_PRODUCER_EPOCH
> > 3. Otherwise, return INVALID_PRODUCER_EPOCH
>
> Case (1) would be for a new producer. Hence, should we state that "no
> PID" is provided (instead of "no epoch" is provided?). That might be
> clearer and it implies that there is no epoch anyway.
>
> Case (2) would be for a producer that received an `UNKNOWN_PRODUCER_ID`
> error and tries to re-initialize itself.
>
> Case (2a) implies that the producer sent its first request and is not
> fenced. Case (2b) implies that the producer is retrying to re-initialize
> itself, i.e., its first request to re-initialize did not get a response
> but was processed by the transaction coordinator. Case (2c) implies that
> a producer was fenced (similar to case 3, even if I am not sure what
> case 3 actually would be?)
>
> Please let me know if my understanding is correct.
>
> What is still unclear to me is why case (2 -- or is it only 2b?)
> requires that the "provided epoch is exhausted"?
>
> For case 2b:
>
> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
> `UNKNOWN_PRODUCER_ID`. It sends an initPidRequest with the corresponding
> PID/epoch pair. The TC processes the request and creates a new PID=101
> with new epoch=0; however, the response to the producer is lost. The TC
> still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
> `previousEpoch=MAX_EPOCH`. When the producer retries, the sent
> PID/epoch still matches the previous PID/epoch pair, and hence the TC
> knows it's a retry?
>
> If this reasoning is correct, should the logic be as follows:
>
> 1. No PID is provided: create a new PID with epoch=0 and set the last
> epoch to -1.
> 2. Epoch and producerId are provided
>    a) the provided producerId/epoch matches the current producerId/epoch:
>       i) if the epoch is not exhausted, bump the epoch
>       ii) if the epoch is exhausted, create a new PID with epoch=0
>    b) the provided producerId/epoch matches the previous
> producerId/epoch: respond with current PID/epoch
>    c) Otherwise, return INVALID_PRODUCER_EPOCH
>
>
>
> -Matthias
>
>
>
>
> On 4/4/19 3:47 PM, Jason Gustafson wrote:
> > Hi Everyone,
> >
> > Sorry for the long delay on this KIP. I have updated it to include the
> > handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are
> no
> > further comments, I will plan to start a vote early next week.
> >
> > Thanks!
> > Jason
> >
> > On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <adam.bellemare@gmail.com
> >
> > wrote:
> >
> >> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
> >>
> >> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <
> adam.bellemare@gmail.com>
> >> wrote:
> >>
> >>> Hi John
> >>>
> >>> What is the status of this KIP?
> >>>
> >>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
> >>> 2.1.1 for a multitude of our internal topics, and I suspect that a
> proper
> >>> fix is needed.
> >>>
> >>> Adam
> >>>
> >>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >>>
> >>>> Thanks Jason. The proposed solution sounds good to me.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
> >>>> wrote:
> >>>>
> >>>>> Hey Guozhang,
> >>>>>
> >>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
> >>>>> occurs following expiration of the producerId. It's possible that
> >>>> another
> >>>>> producerId has been installed in its place following expiration (if
> >>>> another
> >>>>> producer instance has become active), or the mapping is empty. We can
> >>>>> safely retry the InitProducerId with the logic in this KIP in order
> to
> >>>>> detect which case it is. So I'd suggest something like this:
> >>>>>
> >>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
> >>>>> InitProducerId using the current producerId and epoch.
> >>>>> 2. If no mapping exists, the coordinator can generate a new
> producerId
> >>>> and
> >>>>> return it. If a transaction is in progress on the client, it will
> have
> >>>> to
> >>>>> be aborted, but the producer can continue afterwards.
> >>>>> 3. Otherwise if a different producerId has been assigned, then we can
> >>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we
> can
> >>>>> probably raise this as ProducerFencedException since that is
> >> effectively
> >>>>> what has happened. Ideally this is the only fatal case that users
> have
> >>>> to
> >>>>> handle.
> >>>>>
> >>>>> I'll give it a little more thought and update the KIP.
> >>>>>
> >>>>> Thanks,
> >>>>> Jason
> >>>>>
> >>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> You're right about the dangling txn since it will actually block
> >>>>>> read-committed consumers from proceeding at all. I'd agree that
> >> since
> >>>>> this
> >>>>>> is a very rare case, we can consider fixing it not via broker-side
> >>>> logic
> >>>>>> but via tooling in a future work.
> >>>>>>
> >>>>>> I've also discovered some related error handling logic inside
> >> producer
> >>>>> that
> >>>>>> may be addressed together with this KIP (since it is mostly for
> >>>> internal
> >>>>>> implementations the wiki itself does not need to be modified):
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <jason@confluent.io
> >>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hey Guozhang,
> >>>>>>>
> >>>>>>> To clarify, the broker does not actually use the ApiVersion API
> >> for
> >>>>>>> inter-broker communications. The use of an API and its
> >> corresponding
> >>>>>>> version is controlled by `inter.broker.protocol.version`.
> >>>>>>>
> >>>>>>> Nevertheless, it sounds like we're on the same page about removing
> >>>>>>> DescribeTransactionState. The impact of a dangling transaction is
> >> a
> >>>>>> little
> >>>>>>> worse than what you describe though. Consumers with the
> >>>> read_committed
> >>>>>>> isolation level will be stuck. Still, I think we agree that this
> >>>> case
> >>>>>>> should be rare and we can reconsider for future work. Rather than
> >>>>>>> preventing dangling transactions, perhaps we should consider
> >> options
> >>>>>> which
> >>>>>>> allows us to detect them and recover. Anyway, this needs more
> >>>> thought.
> >>>>> I
> >>>>>>> will update the KIP.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jason
> >>>>>>>
> >>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangguoz@gmail.com
> >>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> 0. My original question is about the implementation details
> >>>>> primarily,
> >>>>>>>> since currently the handling logic of the APIVersionResponse is
> >>>> simply
> >>>>>> "use
> >>>>>>>> the highest supported version of the corresponding request", but
> >>>> if
> >>>>> the
> >>>>>>>> returned response from APIVersionRequest says "I don't even know
> >>>>> about
> >>>>>>> the
> >>>>>>>> DescribeTransactionStateRequest at all", then we need additional
> >>>>> logic
> >>>>>>> for
> >>>>>>>> the falling back logic. Currently this logic is embedded in
> >>>>>> NetworkClient
> >>>>>>>> which is shared by all clients, so I'd like to avoid making this
> >>>>> logic
> >>>>>>> more
> >>>>>>>> complicated.
> >>>>>>>>
> >>>>>>>> As for the general issue that a broker does not recognize a
> >>>> producer
> >>>>>> with
> >>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
> >>>> wiki,
> >>>>>> this
> >>>>>>>> is only a concern for transactional producer since for
> >> idempotent
> >>>>>>> producer
> >>>>>>>> it can just bump the epoch and go. For transactional producer,
> >>>> even
> >>>>> if
> >>>>>>> the
> >>>>>>>> producer request from a fenced producer gets accepted, its
> >>>>> transaction
> >>>>>>> will
> >>>>>>>> never be committed and hence messages not exposed to
> >>>> read-committed
> >>>>>>>> consumers as well. The drawback is though, 1) read-uncommitted
> >>>>>> consumers
> >>>>>>>> will still read those messages, 2) unnecessary storage for those
> >>>>> fenced
> >>>>>>>> produce messages, but in practice should not accumulate to a
> >> large
> >>>>>> amount
> >>>>>>>> since producer should soon try to commit and be told it is
> >> fenced
> >>>> and
> >>>>>>> then
> >>>>>>>> stop, 3) there will be no markers for those transactional
> >> messages
> >>>>>> ever.
> >>>>>>>> Looking at the list and thinking about the likelihood it may
> >>>> happen
> >>>>>>>> assuming we retain the producer up to transactional.id.timeout
> >>>>> (default
> >>>>>>> is
> >>>>>>>> 7 days), I feel comfortable leaving it as is.
> >>>>>>>>
> >>>>>>>> Guozhang
> >>>>>>>>
> >>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
> >>>> jason@confluent.io>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Guozhang,
> >>>>>>>>>
> >>>>>>>>> Thanks for the comments. Responses below:
> >>>>>>>>>
> >>>>>>>>> 0. The new API is used between brokers, so we govern its usage
> >>>>> using
> >>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
> >>>>> upgraded,
> >>>>>>> we
> >>>>>>>>> will just fallback to the old logic, which is to accept the
> >>>> write.
> >>>>>> This
> >>>>>>>> is
> >>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
> >> Does
> >>>>> that
> >>>>>>>> seem
> >>>>>>>>> reasonable?
> >>>>>>>>>
> >>>>>>>>> To tell the truth, after digging this KIP up and reading it
> >>>> over, I
> >>>>>> am
> >>>>>>>>> doubting how crucial this API is. It is attempting to protect
> >> a
> >>>>> write
> >>>>>>>> from
> >>>>>>>>> a zombie which has just reset its sequence number after that
> >>>>> producer
> >>>>>>> had
> >>>>>>>>> had its state cleaned up. However, one of the other
> >>>> improvements in
> >>>>>>> this
> >>>>>>>>> KIP is to maintain producer state beyond its retention in the
> >>>> log.
> >>>>> I
> >>>>>>>> think
> >>>>>>>>> that makes this case sufficiently unlikely that we can leave
> >> it
> >>>> for
> >>>>>>>> future
> >>>>>>>>> work. I am not 100% sure this is the only scenario where
> >>>>> transaction
> >>>>>>>> state
> >>>>>>>>> and log state can diverge anyway, so it would be better to
> >>>> consider
> >>>>>>> this
> >>>>>>>>> problem more generally. What do you think?
> >>>>>>>>>
> >>>>>>>>> 1. Thanks, from memory, the term changed after the first
> >>>> iteration.
> >>>>>>> I'll
> >>>>>>>>> make a pass and try to clarify usage.
> >>>>>>>>> 2. I was attempting to handle some edge cases since this check
> >>>>> would
> >>>>>> be
> >>>>>>>>> asynchronous. In any case, if we drop this validation as
> >>>> suggested
> >>>>>>> above,
> >>>>>>>>> then we can ignore this.
> >>>>>>>>>
> >>>>>>>>> -Jason
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
> >>>> wangguoz@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello Jason, thanks for the great write-up.
> >>>>>>>>>>
> >>>>>>>>>> 0. One question about the migration plan: "The new
> >>>>>>> GetTransactionState
> >>>>>>>>> API
> >>>>>>>>>> and the new version of the transaction state message will
> >> not
> >>>> be
> >>>>>> used
> >>>>>>>>> until
> >>>>>>>>>> the inter-broker version supports it." I'm not so clear
> >> about
> >>>> the
> >>>>>>>>>> implementation details here: say a broker is on the newer
> >>>> version
> >>>>>> and
> >>>>>>>> the
> >>>>>>>>>> txn-coordinator is still on older version. Today the
> >>>>>>> APIVersionsRequest
> >>>>>>>>> can
> >>>>>>>>>> only help upgrade / downgrade the request version, but not
> >>>>>> forbidding
> >>>>>>>>>> sending any. Are you suggesting we add additional logic on
> >> the
> >>>>>> broker
> >>>>>>>>> side
> >>>>>>>>>> to handle the case of "not sending the request"? If yes my
> >>>>> concern
> >>>>>> is
> >>>>>>>>> that
> >>>>>>>>>> this will be some tech-debt code that will live long before
> >>>> being
> >>>>>>>>> removed.
> >>>>>>>>>>
> >>>>>>>>>> Some additional minor comments:
> >>>>>>>>>>
> >>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
> >>>> the
> >>>>>> same
> >>>>>>>>> thing
> >>>>>>>>>> in your wiki.
> >>>>>>>>>> 2. "The broker must verify after receiving the response that
> >>>> the
> >>>>>>>> producer
> >>>>>>>>>> state is still unknown.": not sure why we have to validate?
> >> If
> >>>>> the
> >>>>>>>>> metadata
> >>>>>>>>>> returned from the txn-coordinator can always be considered
> >> the
> >>>>>>>>>> source-of-truth, can't we just blindly use it to update the
> >>>> cache?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> >>>>>>> matthias@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I am +1 on this :)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -Matthias
> >>>>>>>>>>>
> >>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
> >>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
> >> link
> >>>> to
> >>>>>> the
> >>>>>>>> KIP:
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> >>>>>>>>>>>> .
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Jason
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> >>>>>>>> jason@confluent.io
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have a proposal to improve the
> >> transactional/idempotent
> >>>>>>>> producer's
> >>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
> >>>> result
> >>>>> of
> >>>>>>>>> losing
> >>>>>>>>>>>>> producer state following segment removal. The current
> >>>>> behavior
> >>>>>>> is
> >>>>>>>>> both
> >>>>>>>>>>>>> complex and limited. Please take a look and let me know
> >>>> what
> >>>>>> you
> >>>>>>>>>> think.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
> >>>>> initial
> >>>>>>>> draft.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Jason
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -- Guozhang
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>>
> >>>
> >>
> >
>
>

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for the KIP. I just have some clarification questions to make
sure I understand the proposal correctly:

1) "Safe Epoch Incrementing"

> When the coordinator receives a new InitProducerId request, we will use the following logic to update the epoch:
> 
> 1. No epoch is provided: the current epoch will be bumped and the last epoch will be set to -1.
> 2. Epoch and producerId are provided, and the provided producerId matches the current producerId or the provided producerId matches the previous producerId and the provided epoch is exhausted:
> 	a. Provided epoch matches current epoch: the last epoch will be set to the current epoch, and the current epoch will be bumped .
> 	b. Provided epoch matches last epoch: the current epoch will be returned
> 	c. Else: return INVALID_PRODUCER_EPOCH
> 3. Otherwise, return INVALID_PRODUCER_EPOCH

Case (1) would be for a new producer. Hence, should we state that "no
PID" is provided (instead of "no epoch" is provided?). That might be
clearer and it implies that there is no epoch anyway.

Case (2) would be for a producer that received an `UNKNOWN_PRODUCER_ID`
error and tries to re-initialize itself.

Case (2a) implies that the producer sent its first request and is not
fenced. Case (2b) implies that the producer is retrying to re-initialize
itself, i.e., its first request to re-initialize did not get a response
but was processed by the transaction coordinator. Case (2c) implies that
a producer was fenced (similar to case 3, even if I am not sure what
case 3 actually would be?)

Please let me know if my understanding is correct.

What is still unclear to me is why case (2 -- or is it only 2b?)
requires that the "provided epoch is exhausted"?

For case 2b:

Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
`UNKNOWN_PRODUCER_ID`. It sends an initPidRequest with the corresponding
PID/epoch pair. The TC processes the request and creates a new PID=101
with new epoch=0; however, the response to the producer is lost. The TC
still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`,
`previousEpoch=MAX_EPOCH`. When the producer retries, the sent
PID/epoch still matches the previous PID/epoch pair, and hence the TC
knows it's a retry?

If this reasoning is correct, should the logic be as follows:

1. No PID is provided: create a new PID with epoch=0 and set the last
epoch to -1.
2. Epoch and producerId are provided
   a) the provided producerId/epoch matches the current producerId/epoch:
      i) if the epoch is not exhausted, bump the epoch
      ii) if the epoch is exhausted, create a new PID with epoch=0
   b) the provided producerId/epoch matches the previous
producerId/epoch: respond with current PID/epoch
   c) Otherwise, return INVALID_PRODUCER_EPOCH
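If that reading is right, the revised rules above could be sketched roughly
as follows. This is a hypothetical Python sketch of the proposal only, not
Kafka's actual coordinator code: `PidState`, the PID allocator, and the use
of `MAX_EPOCH` for the 16-bit epoch field are all illustrative assumptions.

```python
import itertools

MAX_EPOCH = 32767  # the producer epoch is a 16-bit field

class ProducerFenced(Exception):
    pass

class PidState:
    """Coordinator-side state for one transactional.id (names are made up)."""
    def __init__(self, pid_allocator):
        self._alloc = pid_allocator                   # yields fresh producer ids
        self.pid, self.epoch = None, -1               # current PID/epoch
        self.prev_pid, self.prev_epoch = None, -1     # last PID/epoch handed out

    def init_producer_id(self, pid=None, epoch=None):
        if pid is None:                               # (1) brand-new producer
            self.prev_pid, self.prev_epoch = None, -1
            self.pid, self.epoch = next(self._alloc), 0
        elif (pid, epoch) == (self.pid, self.epoch):  # (2a) the live producer
            self.prev_pid, self.prev_epoch = self.pid, self.epoch
            if self.epoch < MAX_EPOCH:                # (2a-i) room left: bump it
                self.epoch += 1
            else:                                     # (2a-ii) exhausted: new PID
                self.pid, self.epoch = next(self._alloc), 0
        elif (pid, epoch) == (self.prev_pid, self.prev_epoch):
            pass                                      # (2b) retry of a lost response
        else:                                         # (2c)/(3) stale producer
            raise ProducerFenced("INVALID_PRODUCER_EPOCH")
        return self.pid, self.epoch

state = PidState(itertools.count(100))
assert state.init_producer_id() == (100, 0)          # fresh producer
state.epoch = MAX_EPOCH                              # fast-forward to overflow
assert state.init_producer_id(100, MAX_EPOCH) == (101, 0)  # rotate to a new PID
assert state.init_producer_id(100, MAX_EPOCH) == (101, 0)  # retry is idempotent
```

A real implementation would also persist this state in the transaction log,
but the branch structure mirrors cases (1), (2a-i), (2a-ii), (2b), and (2c)
above, including the lost-response retry scenario described earlier.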



-Matthias




On 4/4/19 3:47 PM, Jason Gustafson wrote:
> Hi Everyone,
> 
> Sorry for the long delay on this KIP. I have updated it to include the
> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no
> further comments, I will plan to start a vote early next week.
> 
> Thanks!
> Jason
> 
> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <ad...@gmail.com>
> wrote:
> 
>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>>
>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <ad...@gmail.com>
>> wrote:
>>
>>> Hi John
>>>
>>> What is the status of this KIP?
>>>
>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
>>> 2.1.1 for a multitude of our internal topics, and I suspect that a proper
>>> fix is needed.
>>>
>>> Adam
>>>
>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com> wrote:
>>>
>>>> Thanks Jason. The proposed solution sounds good to me.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>
>>>>> Hey Guozhang,
>>>>>
>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
>>>>> occurs following expiration of the producerId. It's possible that
>>>> another
>>>>> producerId has been installed in its place following expiration (if
>>>> another
>>>>> producer instance has become active), or the mapping is empty. We can
>>>>> safely retry the InitProducerId with the logic in this KIP in order to
>>>>> detect which case it is. So I'd suggest something like this:
>>>>>
>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
>>>>> InitProducerId using the current producerId and epoch.
>>>>> 2. If no mapping exists, the coordinator can generate a new producerId
>>>> and
>>>>> return it. If a transaction is in progress on the client, it will have
>>>> to
>>>>> be aborted, but the producer can continue afterwards.
>>>>> 3. Otherwise if a different producerId has been assigned, then we can
>>>>> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
>>>>> probably raise this as ProducerFencedException since that is
>> effectively
>>>>> what has happened. Ideally this is the only fatal case that users have
>>>> to
>>>>> handle.
>>>>>
>>>>> I'll give it a little more thought and update the KIP.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>>
>>>>> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> You're right about the dangling txn since it will actually block
>>>>>> read-committed consumers from proceeding at all. I'd agree that
>> since
>>>>> this
>>>>>> is a very rare case, we can consider fixing it not via broker-side
>>>> logic
>>>>>> but via tooling in a future work.
>>>>>>
>>>>>> I've also discovered some related error handling logic inside
>> producer
>>>>> that
>>>>>> may be addressed together with this KIP (since it is mostly for
>>>> internal
>>>>>> implementations the wiki itself does not need to be modified):
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <jason@confluent.io
>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> To clarify, the broker does not actually use the ApiVersion API
>> for
>>>>>>> inter-broker communications. The use of an API and its
>> corresponding
>>>>>>> version is controlled by `inter.broker.protocol.version`.
>>>>>>>
>>>>>>> Nevertheless, it sounds like we're on the same page about removing
>>>>>>> DescribeTransactionState. The impact of a dangling transaction is
>> a
>>>>>> little
>>>>>>> worse than what you describe though. Consumers with the
>>>> read_committed
>>>>>>> isolation level will be stuck. Still, I think we agree that this
>>>> case
>>>>>>> should be rare and we can reconsider for future work. Rather than
>>>>>>> preventing dangling transactions, perhaps we should consider
>> options
>>>>>> which
>>>>>>> allow us to detect them and recover. Anyway, this needs more
>>>> thought.
>>>>> I
>>>>>>> will update the KIP.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jason
>>>>>>>
>>>>>>> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wangguoz@gmail.com
>>>
>>>>>> wrote:
>>>>>>>
>>>>>>>> 0. My original question is about the implementation details
>>>>>>>> primarily, since currently the handling logic for the
>>>>>>>> ApiVersionsResponse is simply "use the highest supported version of
>>>>>>>> the corresponding request", but if the response to an
>>>>>>>> ApiVersionsRequest says "I don't even know about the
>>>>>>>> DescribeTransactionStateRequest at all", then we need additional
>>>>>>>> fallback logic. Currently this logic is embedded in NetworkClient,
>>>>>>>> which is shared by all clients, so I'd like to avoid making it more
>>>>>>>> complicated.
>>>>>>>>
>>>>>>>> As for the general issue that a broker does not recognize a
>>>> producer
>>>>>> with
>>>>>>>> sequence number 0, here's my thinking: as you mentioned in the
>>>> wiki,
>>>>>> this
>>>>>>>> is only a concern for transactional producer since for
>> idempotent
>>>>>>> producer
>>>>>>>> it can just bump the epoch and go. For a transactional producer,
>>>>>>>> even if a produce request from a fenced producer gets accepted, its
>>>>>>>> transaction will never be committed, and hence its messages are
>>>>>>>> never exposed to read-committed
>>>>>>>> consumers either. The drawbacks are, though: 1) read-uncommitted
>>>>>> consumers
>>>>>>>> will still read those messages, 2) unnecessary storage for those
>>>>> fenced
>>>>>>>> produce messages, but in practice should not accumulate to a
>> large
>>>>>> amount
>>>>>>>> since producer should soon try to commit and be told it is
>> fenced
>>>> and
>>>>>>> then
>>>>>>>> stop, 3) there will be no markers for those transactional
>> messages
>>>>>> ever.
>>>>>>>> Looking at the list and thinking about the likelihood it may
>>>> happen
>>>>>>>> assuming we retain the producer up to transactional.id.timeout
>>>>> (default
>>>>>>> is
>>>>>>>> 7 days), I feel comfortable leaving it as is.
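The visibility difference behind drawback 1) can be shown with a toy model. This is purely illustrative and is not the broker's actual fetch path:

```python
# Toy model: records from a transaction that is never committed (so no
# marker is ever written) reach read_uncommitted consumers but are
# filtered out for read_committed ones.

def visible(records, committed_txns, isolation):
    """records: list of (txn_id, payload); committed_txns: set of txn ids."""
    if isolation == "read_uncommitted":
        return [payload for _, payload in records]
    return [payload for txn, payload in records if txn in committed_txns]

log = [("txn-fenced", "m1"), ("txn-ok", "m2")]
committed = {"txn-ok"}  # txn-fenced never commits
assert visible(log, committed, "read_uncommitted") == ["m1", "m2"]
assert visible(log, committed, "read_committed") == ["m2"]
```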
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
>>>> jason@confluent.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang,
>>>>>>>>>
>>>>>>>>> Thanks for the comments. Responses below:
>>>>>>>>>
>>>>>>>>> 0. The new API is used between brokers, so we govern its usage
>>>>> using
>>>>>>>>> `inter.broker.protocol.version`. If the other broker hasn't
>>>>> upgraded,
>>>>>>> we
>>>>>>>>> will just fallback to the old logic, which is to accept the
>>>> write.
>>>>>> This
>>>>>>>> is
>>>>>>>>> similar to how we introduced the OffsetsForLeaderEpoch API.
>> Does
>>>>> that
>>>>>>>> seem
>>>>>>>>> reasonable?
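The gating described above can be sketched as a simple version check. The threshold version below is illustrative, not the actual IBP at which KIP-360 shipped:

```python
# Sketch of gating a new inter-broker API on inter.broker.protocol.version
# (IBP): the new request is only used once the cluster-wide IBP guarantees
# every broker understands it; otherwise we fall back to the old behavior
# (accepting the write).

def use_new_api(ibp, min_ibp="2.3"):
    """Return True when the configured IBP is at least min_ibp."""
    def parse(v):
        return tuple(int(p) for p in v.split("."))
    return parse(ibp) >= parse(min_ibp)

assert use_new_api("2.4")       # send the new request
assert not use_new_api("2.2")   # fall back: accept the write as before
```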
>>>>>>>>>
>>>>>>>>> To tell the truth, after digging this KIP up and reading it
>>>> over, I
>>>>>> am
>>>>>>>>> doubting how crucial this API is. It is attempting to protect
>> a
>>>>> write
>>>>>>>> from
>>>>>>>>> a zombie which has just reset its sequence number after that
>>>>> producer
>>>>>>> had
>>>>>>>>> had its state cleaned up. However, one of the other
>>>> improvements in
>>>>>>> this
>>>>>>>>> KIP is to maintain producer state beyond its retention in the
>>>> log.
>>>>> I
>>>>>>>> think
>>>>>>>>> that makes this case sufficiently unlikely that we can leave
>> it
>>>> for
>>>>>>>> future
>>>>>>>>> work. I am not 100% sure this is the only scenario where
>>>>> transaction
>>>>>>>> state
>>>>>>>>> and log state can diverge anyway, so it would be better to
>>>> consider
>>>>>>> this
>>>>>>>>> problem more generally. What do you think?
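The idea of maintaining producer state beyond log retention can be sketched as a cache keyed by producerId with its own time-based expiration, decoupled from segment deletion. This is a hypothetical simplification of what the broker actually does:

```python
# Sketch: keep (pid -> epoch/sequence) entries in a cache with time-based
# expiration instead of dropping them whenever the log segments that
# produced them are deleted.

class ProducerStateCache:
    def __init__(self, expiration_ms):
        self.expiration_ms = expiration_ms
        self._state = {}  # pid -> (epoch, last_seq, last_update_ms)

    def update(self, pid, epoch, seq, now_ms):
        self._state[pid] = (epoch, seq, now_ms)

    def lookup(self, pid, now_ms):
        entry = self._state.get(pid)
        if entry is None:
            return None
        epoch, seq, ts = entry
        if now_ms - ts > self.expiration_ms:  # expired by time, not by log
            del self._state[pid]
            return None
        return (epoch, seq)

cache = ProducerStateCache(expiration_ms=7 * 24 * 3600 * 1000)  # ~7 days
cache.update(pid=42, epoch=0, seq=10, now_ms=0)
# The entry survives even if the originating segment has been deleted:
assert cache.lookup(42, now_ms=1000) == (0, 10)
```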
>>>>>>>>>
>>>>>>>>> 1. Thanks, from memory, the term changed after the first
>>>> iteration.
>>>>>>> I'll
>>>>>>>>> make a pass and try to clarify usage.
>>>>>>>>> 2. I was attempting to handle some edge cases since this check
>>>>> would
>>>>>> be
>>>>>>>>> asynchronous. In any case, if we drop this validation as
>>>> suggested
>>>>>>> above,
>>>>>>>>> then we can ignore this.
>>>>>>>>>
>>>>>>>>> -Jason
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
>>>> wangguoz@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Jason, thanks for the great write-up.
>>>>>>>>>>
>>>>>>>>>> 0. One question about the migration plan: "The new
>>>>>>> GetTransactionState
>>>>>>>>> API
>>>>>>>>>> and the new version of the transaction state message will
>> not
>>>> be
>>>>>> used
>>>>>>>>> until
>>>>>>>>>> the inter-broker version supports it." I'm not so clear
>> about
>>>> the
>>>>>>>>>> implementation details here: say a broker is on the newer
>>>> version
>>>>>> and
>>>>>>>> the
>>>>>>>>>> txn-coordinator is still on an older version. Today the
>>>>>>>>>> ApiVersionsRequest can only help upgrade / downgrade the request
>>>>>>>>>> version, but not forbid sending one at all. Are you suggesting we
>>>>>>>>>> add additional logic on the broker side to handle the case of
>>>>>>>>>> "not sending the request"? If yes, my concern is that this will be
>>>>>>>>>> tech-debt code that lives a long time before being removed.
>>>>>>>>>>
>>>>>>>>>> Some additional minor comments:
>>>>>>>>>>
>>>>>>>>>> 1. "last epoch" and "instance epoch" seem to be referring to
>>>> the
>>>>>> same
>>>>>>>>> thing
>>>>>>>>>> in your wiki.
>>>>>>>>>> 2. "The broker must verify after receiving the response that
>>>> the
>>>>>>>> producer
>>>>>>>>>> state is still unknown.": not sure why we have to validate?
>> If
>>>>> the
>>>>>>>>> metadata
>>>>>>>>>> returned from the txn-coordinator can always be considered
>> the
>>>>>>>>>> source-of-truth, can't we just blindly use it to update the
>>>> cache?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
>>>>>>> matthias@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am +1 on this :)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 9/4/18 9:55 AM, Jason Gustafson wrote:
>>>>>>>>>>>> Bump. Thanks to Magnus for noticing that I forgot to
>> link
>>>> to
>>>>>> the
>>>>>>>> KIP:
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> -Jason
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
>>>>>>>> jason@confluent.io
>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a proposal to improve the
>> transactional/idempotent
>>>>>>>> producer's
>>>>>>>>>>>>> handling of the UNKNOWN_PRODUCER error, which is the
>>>> result
>>>>> of
>>>>>>>>> losing
>>>>>>>>>>>>> producer state following segment removal. The current
>>>>> behavior
>>>>>>> is
>>>>>>>>> both
>>>>>>>>>>>>> complex and limited. Please take a look and let me know
>>>> what
>>>>>> you
>>>>>>>>>> think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance to Matthias Sax for feedback on the
>>>>> initial
>>>>>>>> draft.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jason
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
> 


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
@Adam:

As a workaround, you can increase the repartition topic config
`segment.bytes` to use a larger segment size. This should mitigate the
issue.
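One way to apply such an override in a Kafka Streams application is via the `topic.` config prefix, which Streams passes to the internal (repartition and changelog) topics it creates. A sketch, where the app id and segment size are examples only:

```python
# Sketch of the override as it would appear in a Streams configuration;
# configs prefixed with "topic." are applied to internal topics.
streams_config = {
    "application.id": "my-app",                      # hypothetical
    "bootstrap.servers": "localhost:9092",
    "topic.segment.bytes": str(512 * 1024 * 1024),   # 512 MiB segments
}
assert streams_config["topic.segment.bytes"] == "536870912"
```

In the Java Streams API, the equivalent would be along the lines of `props.put(StreamsConfig.topicPrefix("segment.bytes"), ...)`.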


-Matthias



On 4/4/19 3:47 PM, Jason Gustafson wrote:
> Hi Everyone,
> 
> Sorry for the long delay on this KIP. I have updated it to include the
> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no
> further comments, I will plan to start a vote early next week.
> 
> Thanks!
> Jason
> 
> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare <ad...@gmail.com>
> wrote:
> 
>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>>
>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <ad...@gmail.com>
>> wrote:
>>
>>> Hi John
>>>
>>> What is the status of this KIP?
>>>
>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
>>> 2.1.1 for a multitude of our internal topics, and I suspect that a proper
>>> fix is needed.
>>>
>>> Adam
>>>
>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com> wrote:
>>>
>>>> Thanks Jason. The proposed solution sounds good to me.
>>>>
>>>>
>>>> Guozhang
>>>>


Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Everyone,

Sorry for the long delay on this KIP. I have updated it to include the
handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no
further comments, I will plan to start a vote early next week.

Thanks!
Jason


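[Editor's sketch] The cache-verification question quoted above ("the broker must verify after receiving the response that the producer state is still unknown") comes down to a compare-then-apply pattern: the coordinator lookup is asynchronous, so by the time the response arrives, a concurrent write may already have repopulated the broker's producer-state cache. The following Python model is illustrative only; all names are hypothetical and this is not actual Kafka code.

```python
# Toy model of a broker-side producer state cache that re-checks local
# state before applying an asynchronously fetched coordinator response.

class ProducerStateCache:
    def __init__(self):
        self._state = {}  # producerId -> (epoch, lastSequence)

    def lookup(self, producer_id):
        return self._state.get(producer_id)

    def record_write(self, producer_id, epoch, sequence):
        # A produce request handled while the coordinator lookup was in
        # flight may repopulate the cache with fresher state.
        self._state[producer_id] = (epoch, sequence)

    def apply_coordinator_response(self, producer_id, epoch, last_sequence):
        # Apply the coordinator's state only if the producer is *still*
        # unknown locally; blindly overwriting could move state backwards.
        if self.lookup(producer_id) is None:
            self._state[producer_id] = (epoch, last_sequence)
            return True   # response applied
        return False      # local state wins; response discarded
```

The `False` branch is exactly the race the verification step guards against: without it, a stale coordinator response could clobber state written by a more recent produce request.
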
Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Adam Bellemare <ad...@gmail.com>.
Ach - Sorry. I meant Jason. I had just read a John Roesler email.

On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <ad...@gmail.com>
wrote:

> Hi John
>
> What is the status of this KIP?
>
> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
> 2.1.1 for a multitude of our internal topics, and I suspect that a proper
> fix is needed.
>
> Adam
>
> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com> wrote:
>
>> Thanks Jason. The proposed solution sounds good to me.
>>
>>
>> Guozhang
>>
>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>
>> > Hey Guozhang,
>> >
>> > Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
>> > occurs following expiration of the producerId. It's possible that
>> another
>> > producerId has been installed in its place following expiration (if
>> another
>> > producer instance has become active), or the mapping is empty. We can
>> > safely retry the InitProducerId with the logic in this KIP in order to
>> > detect which case it is. So I'd suggest something like this:
>> >
>> > 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
>> > InitProducerId using the current producerId and epoch.
>> > 2. If no mapping exists, the coordinator can generate a new producerId
>> and
>> > return it. If a transaction is in progress on the client, it will have
>> to
>> > be aborted, but the producer can continue afterwards.
>> > 3. Otherwise if a different producerId has been assigned, then we can
>> > return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
>> > probably raise this as ProducerFencedException since that is effectively
>> > what has happened. Ideally this is the only fatal case that users have
>> to
>> > handle.
>> >
>> > I'll give it a little more thought and update the KIP.
>> >
>> > Thanks,
>> > Jason
>> >
>> > On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com>
>> wrote:
>> >
>> > > You're right about the dangling txn since it will actually block
>> > > read-committed consumers from proceeding at all. I'd agree that since
>> > this
>> > > is a very rare case, we can consider fixing it not via broker-side
>> logic
>> > > but via tooling in a future work.
>> > >
>> > > I've also discovered some related error handling logic inside producer
>> > that
>> > > may be addressed together with this KIP (since it is mostly for
>> internal
>> > > implementations the wiki itself does not need to be modified):
>> > >
>> > >
>> > >
>> >
>> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
>> > >
>> > > Guozhang
>> > >
>> > >
>> > >
>> > > On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io>
>> > > wrote:
>> > >
>> > > > Hey Guozhang,
>> > > >
>> > > > To clarify, the broker does not actually use the ApiVersion API for
>> > > > inter-broker communications. The use of an API and its corresponding
>> > > > version is controlled by `inter.broker.protocol.version`.
>> > > >
>> > > > Nevertheless, it sounds like we're on the same page about removing
>> > > > DescribeTransactionState. The impact of a dangling transaction is a
>> > > little
>> > > > worse than what you describe though. Consumers with the
>> read_committed
>> > > > isolation level will be stuck. Still, I think we agree that this
>> case
>> > > > should be rare and we can reconsider for future work. Rather than
>> > > > preventing dangling transactions, perhaps we should consider options
>> > > which
>> > > > allows us to detect them and recover. Anyway, this needs more
>> thought.
>> > I
>> > > > will update the KIP.
>> > > >
>> > > > Best,
>> > > > Jason
>> > > >
>> > > > On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wa...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > 0. My original question is about the implementation details
>> > primarily,
>> > > > > since currently the handling logic of the APIVersionResponse is
>> simply
>> > > "use
>> > > > > the highest supported version of the corresponding request", but
>> if
>> > the
>> > > > > returned response from APIVersionRequest says "I don't even know
>> > about
>> > > > the
>> > > > > DescribeTransactionStateRequest at all", then we need additional
>> > logic
>> > > > for
>> > > > > the fallback logic. Currently this logic is embedded in
>> > > NetworkClient
>> > > > > which is shared by all clients, so I'd like to avoid making this
>> > logic
>> > > > more
>> > > > > complicated.
>> > > > >
>> > > > > As for the general issue that a broker does not recognize a
>> producer
>> > > with
>> > > > > sequence number 0, here's my thinking: as you mentioned in the
>> wiki,
>> > > this
>> > > > > is only a concern for transactional producer since for idempotent
>> > > > producer
>> > > > > it can just bump the epoch and go. For transactional producer,
>> even
>> > if
>> > > > the
>> > > > > producer request from a fenced producer gets accepted, its
>> > transaction
>> > > > will
>> > > > > never be committed and hence messages not exposed to
>> read-committed
>> > > > > consumers as well. The drawback is though, 1) read-uncommitted
>> > > consumers
>> > > > > will still read those messages, 2) unnecessary storage for those
>> > fenced
>> > > > > produce messages, but in practice should not accumulate to a large
>> > > amount
>> > > > > since producer should soon try to commit and be told it is fenced
>> and
>> > > > then
>> > > > > stop, 3) there will be no markers for those transactional messages
>> > > ever.
>> > > > > Looking at the list and thinking about the likelihood it may
>> happen
>> > > > > assuming we retain the producer up to transactional.id.timeout
>> > (default
>> > > > is
>> > > > > 7 days), I feel comfortable leaving it as is.
>> > > > >
>> > > > > Guozhang
>> > > > >
>> > > > > On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
>> jason@confluent.io>
>> > > > > wrote:
>> > > > >
>> > > > > > Hey Guozhang,
>> > > > > >
>> > > > > > Thanks for the comments. Responses below:
>> > > > > >
>> > > > > > 0. The new API is used between brokers, so we govern its usage
>> > using
>> > > > > > `inter.broker.protocol.version`. If the other broker hasn't
>> > upgraded,
>> > > > we
>> > > > > > will just fall back to the old logic, which is to accept the
>> write.
>> > > This
>> > > > > is
>> > > > > > similar to how we introduced the OffsetsForLeaderEpoch API. Does
>> > that
>> > > > > seem
>> > > > > > reasonable?
>> > > > > >
>> > > > > > To tell the truth, after digging this KIP up and reading it
>> over, I
>> > > am
>> > > > > > doubting how crucial this API is. It is attempting to protect a
>> > write
>> > > > > from
>> > > > > > a zombie which has just reset its sequence number after that
>> > producer
>> > > > had
>> > > > > > had its state cleaned up. However, one of the other
>> improvements in
>> > > > this
>> > > > > > KIP is to maintain producer state beyond its retention in the
>> log.
>> > I
>> > > > > think
>> > > > > > that makes this case sufficiently unlikely that we can leave it
>> for
>> > > > > future
>> > > > > > work. I am not 100% sure this is the only scenario where
>> > transaction
>> > > > > state
>> > > > > > and log state can diverge anyway, so it would be better to
>> consider
>> > > > this
>> > > > > > problem more generally. What do you think?
>> > > > > >
>> > > > > > 1. Thanks, from memory, the term changed after the first
>> iteration.
>> > > > I'll
>> > > > > > make a pass and try to clarify usage.
>> > > > > > 2. I was attempting to handle some edge cases since this check
>> > would
>> > > be
>> > > > > > asynchronous. In any case, if we drop this validation as
>> suggested
>> > > > above,
>> > > > > > then we can ignore this.
>> > > > > >
>> > > > > > -Jason
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
>> wangguoz@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hello Jason, thanks for the great write-up.
>> > > > > > >
>> > > > > > > 0. One question about the migration plan: "The new
>> > > > GetTransactionState
>> > > > > > API
>> > > > > > > and the new version of the transaction state message will not
>> be
>> > > used
>> > > > > > until
>> > > > > > > the inter-broker version supports it." I'm not so clear about
>> the
>> > > > > > > implementation details here: say a broker is on the newer
>> version
>> > > and
>> > > > > the
>> > > > > > > txn-coordinator is still on older version. Today the
>> > > > APIVersionsRequest
>> > > > > > can
>> > > > > > > only help upgrade / downgrade the request version, but not
>> > > forbid
>> > > > > > > sending any. Are you suggesting we add additional logic on the
>> > > broker
>> > > > > > side
>> > > > > > > to handle the case of "not sending the request"? If yes my
>> > concern
>> > > is
>> > > > > > that
>> > > > > > > this will be some tech-debt code that will live long before
>> being
>> > > > > > removed.
>> > > > > > >
>> > > > > > > Some additional minor comments:
>> > > > > > >
>> > > > > > > 1. "last epoch" and "instance epoch" seem to be referring to
>> the
>> > > same
>> > > > > > thing
>> > > > > > > in your wiki.
>> > > > > > > 2. "The broker must verify after receiving the response that
>> the
>> > > > > producer
>> > > > > > > state is still unknown.": not sure why we have to validate? If
>> > the
>> > > > > > metadata
>> > > > > > > returned from the txn-coordinator can always be considered the
>> > > > > > > source-of-truth, can't we just blindly use it to update the
>> cache?
>> > > > > > >
>> > > > > > >
>> > > > > > > Guozhang
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
>> > > > matthias@confluent.io>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > I am +1 on this :)
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > -Matthias
>> > > > > > > >
>> > > > > > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
>> > > > > > > > > Bump. Thanks to Magnus for noticing that I forgot to link
>> to
>> > > the
>> > > > > KIP:
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
>> > > > > > > > > .
>> > > > > > > > >
>> > > > > > > > > -Jason
>> > > > > > > > >
>> > > > > > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
>> > > > > jason@confluent.io
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > >> Hi All,
>> > > > > > > > >>
>> > > > > > > > >> I have a proposal to improve the transactional/idempotent
>> > > > > producer's
>> > > > > > > > >> handling of the UNKNOWN_PRODUCER error, which is the
>> result
>> > of
>> > > > > > losing
>> > > > > > > > >> producer state following segment removal. The current
>> > behavior
>> > > > is
>> > > > > > both
>> > > > > > > > >> complex and limited. Please take a look and let me know
>> what
>> > > you
>> > > > > > > think.
>> > > > > > > > >>
>> > > > > > > > >> Thanks in advance to Matthias Sax for feedback on the
>> > initial
>> > > > > draft.
>> > > > > > > > >>
>> > > > > > > > >> -Jason
>> > > > > > > > >>
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > -- Guozhang
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > -- Guozhang
>> > > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > -- Guozhang
>> > >
>> >
>>
>>
>> --
>> -- Guozhang
>>
>

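[Editor's sketch] Jason's three-step InitProducerId proposal quoted in the message above (retry with the current producerId and epoch after INVALID_PRODUCER_ID_MAPPING; issue a fresh producerId if the mapping expired; raise a fencing error if another instance took over) can be modeled roughly as follows. This is an illustrative Python sketch of the coordinator-side decision only; the names and structure are hypothetical, not Kafka's actual TransactionCoordinator code.

```python
# Rough model of the coordinator handling an InitProducerId request that
# carries the producer's current (producerId) after the mapping expired.

PRODUCER_FENCED = "ProducerFencedException"

def handle_init_producer_id(mappings, next_pid, txn_id, current_pid):
    """mappings: transactional.id -> currently installed producerId.
    Returns (producer_id, error)."""
    installed = mappings.get(txn_id)
    if installed is None:
        # Step 2: no mapping exists (it expired); issue a fresh producerId.
        # Any transaction in progress on the client must be aborted first,
        # but the producer can continue afterwards with the new id.
        mappings[txn_id] = next_pid
        return next_pid, None
    if installed != current_pid:
        # Step 3: another producer instance was installed after expiration;
        # surface this as a fatal fencing error to the caller.
        return None, PRODUCER_FENCED
    # Mapping still points at this producer: safe to continue as-is.
    return current_pid, None
```

Under this model, fencing is the only fatal outcome the application has to handle, which is the simplification the proposal aims for.
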
Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Adam Bellemare <ad...@gmail.com>.
Hi John

What is the status of this KIP?

My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
2.1.1 for a multitude of our internal topics, and I suspect that a proper
fix is needed.

Adam

On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang <wa...@gmail.com> wrote:

> Thanks Jason. The proposed solution sounds good to me.
>
>
> Guozhang
>
> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Guozhang,
> >
> > Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
> > occurs following expiration of the producerId. It's possible that another
> > producerId has been installed in its place following expiration (if
> another
> > producer instance has become active), or the mapping is empty. We can
> > safely retry the InitProducerId with the logic in this KIP in order to
> > detect which case it is. So I'd suggest something like this:
> >
> > 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
> > InitProducerId using the current producerId and epoch.
> > 2. If no mapping exists, the coordinator can generate a new producerId
> and
> > return it. If a transaction is in progress on the client, it will have to
> > be aborted, but the producer can continue afterwards.
> > 3. Otherwise if a different producerId has been assigned, then we can
> > return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
> > probably raise this as ProducerFencedException since that is effectively
> > what has happened. Ideally this is the only fatal case that users have to
> > handle.
> >
> > I'll give it a little more thought and update the KIP.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com> wrote:
> >
> > > You're right about the dangling txn since it will actually block
> > > read-committed consumers from proceeding at all. I'd agree that since
> > this
> > > is a very rare case, we can consider fixing it not via broker-side
> logic
> > > but via tooling in a future work.
> > >
> > > I've also discovered some related error handling logic inside producer
> > that
> > > may be addressed together with this KIP (since it is mostly for
> internal
> > > implementations the wiki itself does not need to be modified):
> > >
> > >
> > >
> >
> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > To clarify, the broker does not actually use the ApiVersion API for
> > > > inter-broker communications. The use of an API and its corresponding
> > > > version is controlled by `inter.broker.protocol.version`.
> > > >
> > > > Nevertheless, it sounds like we're on the same page about removing
> > > > DescribeTransactionState. The impact of a dangling transaction is a
> > > little
> > > > worse than what you describe though. Consumers with the
> read_committed
> > > > isolation level will be stuck. Still, I think we agree that this case
> > > > should be rare and we can reconsider for future work. Rather than
> > > > preventing dangling transactions, perhaps we should consider options
> > > which
> > > > allows us to detect them and recover. Anyway, this needs more
> thought.
> > I
> > > > will update the KIP.
> > > >
> > > > Best,
> > > > Jason
> > > >
> > > > On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > 0. My original question is about the implementation details
> > primarily,
> > > > > since currently the handling logic of the APIVersionResponse is
> simply
> > > "use
> > > > > the highest supported version of the corresponding request", but if
> > the
> > > > > returned response from APIVersionRequest says "I don't even know
> > about
> > > > the
> > > > > DescribeTransactionStateRequest at all", then we need additional
> > logic
> > > > for
> > > > > the fallback logic. Currently this logic is embedded in
> > > NetworkClient
> > > > > which is shared by all clients, so I'd like to avoid making this
> > logic
> > > > more
> > > > > complicated.
> > > > >
> > > > > As for the general issue that a broker does not recognize a
> producer
> > > with
> > > > > sequence number 0, here's my thinking: as you mentioned in the
> wiki,
> > > this
> > > > > is only a concern for transactional producer since for idempotent
> > > > producer
> > > > > it can just bump the epoch and go. For transactional producer, even
> > if
> > > > the
> > > > > producer request from a fenced producer gets accepted, its
> > transaction
> > > > will
> > > > > never be committed and hence messages not exposed to read-committed
> > > > > consumers as well. The drawback is though, 1) read-uncommitted
> > > consumers
> > > > > will still read those messages, 2) unnecessary storage for those
> > fenced
> > > > > produce messages, but in practice should not accumulate to a large
> > > amount
> > > > > since producer should soon try to commit and be told it is fenced
> and
> > > > then
> > > > > stop, 3) there will be no markers for those transactional messages
> > > ever.
> > > > > Looking at the list and thinking about the likelihood it may happen
> > > > > assuming we retain the producer up to transactional.id.timeout
> > (default
> > > > is
> > > > > 7 days), I feel comfortable leaving it as is.
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hey Guozhang,
> > > > > >
> > > > > > Thanks for the comments. Responses below:
> > > > > >
> > > > > > 0. The new API is used between brokers, so we govern its usage
> > using
> > > > > > `inter.broker.protocol.version`. If the other broker hasn't
> > upgraded,
> > > > we
> > > > > > will just fall back to the old logic, which is to accept the
> write.
> > > This
> > > > > is
> > > > > > similar to how we introduced the OffsetsForLeaderEpoch API. Does
> > that
> > > > > seem
> > > > > > reasonable?
> > > > > >
> > > > > > To tell the truth, after digging this KIP up and reading it
> over, I
> > > am
> > > > > > doubting how crucial this API is. It is attempting to protect a
> > write
> > > > > from
> > > > > > a zombie which has just reset its sequence number after that
> > producer
> > > > had
> > > > > > had its state cleaned up. However, one of the other improvements
> in
> > > > this
> > > > > > KIP is to maintain producer state beyond its retention in the
> log.
> > I
> > > > > think
> > > > > > that makes this case sufficiently unlikely that we can leave it
> for
> > > > > future
> > > > > > work. I am not 100% sure this is the only scenario where
> > transaction
> > > > > state
> > > > > > and log state can diverge anyway, so it would be better to
> consider
> > > > this
> > > > > > problem more generally. What do you think?
> > > > > >
> > > > > > 1. Thanks, from memory, the term changed after the first
> iteration.
> > > > I'll
> > > > > > make a pass and try to clarify usage.
> > > > > > 2. I was attempting to handle some edge cases since this check
> > would
> > > be
> > > > > > asynchronous. In any case, if we drop this validation as
> suggested
> > > > above,
> > > > > > then we can ignore this.
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hello Jason, thanks for the great write-up.
> > > > > > >
> > > > > > > 0. One question about the migration plan: "The new
> > > > GetTransactionState
> > > > > > API
> > > > > > > and the new version of the transaction state message will not
> be
> > > used
> > > > > > until
> > > > > > > the inter-broker version supports it." I'm not so clear about
> the
> > > > > > > implementation details here: say a broker is on the newer
> version
> > > and
> > > > > the
> > > > > > > txn-coordinator is still on older version. Today the
> > > > APIVersionsRequest
> > > > > > can
> > > > > > > only help upgrade / downgrade the request version, but not
> > > forbid
> > > > > > > sending any. Are you suggesting we add additional logic on the
> > > broker
> > > > > > side
> > > > > > > to handle the case of "not sending the request"? If yes my
> > concern
> > > is
> > > > > > that
> > > > > > > this will be some tech-debt code that will live long before
> being
> > > > > > removed.
> > > > > > >
> > > > > > > Some additional minor comments:
> > > > > > >
> > > > > > > 1. "last epoch" and "instance epoch" seem to be referring to
> the
> > > same
> > > > > > thing
> > > > > > > in your wiki.
> > > > > > > 2. "The broker must verify after receiving the response that
> the
> > > > > producer
> > > > > > > state is still unknown.": not sure why we have to validate? If
> > the
> > > > > > metadata
> > > > > > > returned from the txn-coordinator can always be considered the
> > > > > > > source-of-truth, can't we just blindly use it to update the
> cache?
> > > > > > >
> > > > > > >
> > > > > > > Guozhang
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> > > > matthias@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am +1 on this :)
> > > > > > > >
> > > > > > > >
> > > > > > > > -Matthias
> > > > > > > >
> > > > > > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > > > > > > > Bump. Thanks to Magnus for noticing that I forgot to link
> to
> > > the
> > > > > KIP:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > > > > > > > .
> > > > > > > > >
> > > > > > > > > -Jason
> > > > > > > > >
> > > > > > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> > > > > jason@confluent.io
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hi All,
> > > > > > > > >>
> > > > > > > > >> I have a proposal to improve the transactional/idempotent
> > > > > producer's
> > > > > > > > >> handling of the UNKNOWN_PRODUCER error, which is the
> result
> > of
> > > > > > losing
> > > > > > > > >> producer state following segment removal. The current
> > behavior
> > > > is
> > > > > > both
> > > > > > > > >> complex and limited. Please take a look and let me know
> what
> > > you
> > > > > > > think.
> > > > > > > > >>
> > > > > > > > >> Thanks in advance to Matthias Sax for feedback on the
> > initial
> > > > > draft.
> > > > > > > > >>
> > > > > > > > >> -Jason
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

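[Editor's sketch] The observation quoted in the message above that an idempotent (non-transactional) producer hitting UNKNOWN_PRODUCER "can just bump the epoch and go" can be illustrated with a small model: the broker has lost the producer's state (e.g. after segment removal), so the client bumps its epoch, resets its sequence to 0, and retries. This Python sketch is illustrative only; the names are hypothetical and it is not the actual client implementation.

```python
# Toy model of idempotent-producer sequence/epoch bookkeeping on the client.

UNKNOWN_PRODUCER = "UNKNOWN_PRODUCER_ID"

class IdempotentProducerModel:
    def __init__(self):
        self.epoch = 0
        self.sequence = 0  # next sequence to assign

    def on_send_response(self, error):
        if error == UNKNOWN_PRODUCER:
            # Broker lost our state: bump the epoch, restart the sequence
            # from 0, and retry the batch under the new epoch.
            self.epoch += 1
            self.sequence = 0
            return "retry"
        # Successful append: advance the sequence as usual.
        self.sequence += 1
        return "ok"
```

The transactional case is harder precisely because this local reset is not enough; there the coordinator must arbitrate, which is what the KIP's InitProducerId changes address.
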
Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Jason. The proposed solution sounds good to me.


Guozhang

On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
> occurs following expiration of the producerId. It's possible that another
> producerId has been installed in its place following expiration (if another
> producer instance has become active), or the mapping is empty. We can
> safely retry the InitProducerId with the logic in this KIP in order to
> detect which case it is. So I'd suggest something like this:
>
> 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
> InitProducerId using the current producerId and epoch.
> 2. If no mapping exists, the coordinator can generate a new producerId and
> return it. If a transaction is in progress on the client, it will have to
> be aborted, but the producer can continue afterwards.
> 3. Otherwise if a different producerId has been assigned, then we can
> return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
> probably raise this as ProducerFencedException since that is effectively
> what has happened. Ideally this is the only fatal case that users have to
> handle.
>
> I'll give it a little more thought and update the KIP.
>
> Thanks,
> Jason
>
> On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > You're right about the dangling txn since it will actually block
> > read-committed consumers from proceeding at all. I'd agree that since
> this
> > is a very rare case, we can consider fixing it not via broker-side logic
> > but via tooling in a future work.
> >
> > I've also discovered some related error handling logic inside producer
> that
> > may be addressed together with this KIP (since it is mostly for internal
> > implementations the wiki itself does not need to be modified):
> >
> >
> >
> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> >
> > Guozhang
> >
> >
> >
> > On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > To clarify, the broker does not actually use the ApiVersion API for
> > > inter-broker communications. The use of an API and its corresponding
> > > version is controlled by `inter.broker.protocol.version`.
> > >
> > > Nevertheless, it sounds like we're on the same page about removing
> > > DescribeTransactionState. The impact of a dangling transaction is a
> > little
> > > worse than what you describe though. Consumers with the read_committed
> > > isolation level will be stuck. Still, I think we agree that this case
> > > should be rare and we can reconsider for future work. Rather than
> > > preventing dangling transactions, perhaps we should consider options
> > which
> > > allows us to detect them and recover. Anyway, this needs more thought.
> I
> > > will update the KIP.
> > >
> > > Best,
> > > Jason
> > >
> > > On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > 0. My original question is about the implementation details
> primarily,
> > > > since currently the handling logic of the APIVersionResponse is simply
> > "use
> > > > the highest supported version of the corresponding request", but if
> the
> > > > returned response from APIVersionRequest says "I don't even know
> about
> > > the
> > > > DescribeTransactionStateRequest at all", then we need additional
> logic
> > > for
> > > > the fallback logic. Currently this logic is embedded in
> > NetworkClient
> > > > which is shared by all clients, so I'd like to avoid making this
> logic
> > > more
> > > > complicated.
> > > >
> > > > As for the general issue that a broker does not recognize a producer
> > with
> > > > sequence number 0, here's my thinking: as you mentioned in the wiki,
> > this
> > > > is only a concern for transactional producer since for idempotent
> > > producer
> > > > it can just bump the epoch and go. For transactional producer, even
> if
> > > the
> > > > producer request from a fenced producer gets accepted, its
> transaction
> > > will
> > > > never be committed and hence messages not exposed to read-committed
> > > > consumers as well. The drawback is though, 1) read-uncommitted
> > consumers
> > > > will still read those messages, 2) unnecessary storage for those
> fenced
> > > > produce messages, but in practice should not accumulate to a large
> > amount
> > > > since producer should soon try to commit and be told it is fenced and
> > > then
> > > > stop, 3) there will be no markers for those transactional messages
> > ever.
> > > > Looking at the list and thinking about the likelihood it may happen
> > > > assuming we retain the producer up to transactional.id.timeout
> (default
> > > is
> > > > 7 days), I feel comfortable leaving it as is.
> > > >
> > > > Guozhang
> > > >
> > > > On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Guozhang,
> > > > >
> > > > > Thanks for the comments. Responses below:
> > > > >
> > > > > 0. The new API is used between brokers, so we govern its usage
> using
> > > > > `inter.broker.protocol.version`. If the other broker hasn't
> upgraded,
> > > we
> > > > > will just fall back to the old logic, which is to accept the write.
> > This
> > > > is
> > > > > similar to how we introduced the OffsetsForLeaderEpoch API. Does
> that
> > > > seem
> > > > > reasonable?
> > > > >
> > > > > To tell the truth, after digging this KIP up and reading it over, I
> > am
> > > > > doubting how crucial this API is. It is attempting to protect a
> write
> > > > from
> > > > > a zombie which has just reset its sequence number after that
> producer
> > > had
> > > > > had its state cleaned up. However, one of the other improvements in
> > > this
> > > > > KIP is to maintain producer state beyond its retention in the log.
> I
> > > > think
> > > > > that makes this case sufficiently unlikely that we can leave it for
> > > > future
> > > > > work. I am not 100% sure this is the only scenario where
> transaction
> > > > state
> > > > > and log state can diverge anyway, so it would be better to consider
> > > this
> > > > > problem more generally. What do you think?
> > > > >
> > > > > 1. Thanks, from memory, the term changed after the first iteration.
> > > I'll
> > > > > make a pass and try to clarify usage.
> > > > > 2. I was attempting to handle some edge cases since this check
> would
> > be
> > > > > asynchronous. In any case, if we drop this validation as suggested
> > > above,
> > > > > then we can ignore this.
> > > > >
> > > > > -Jason
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hello Jason, thanks for the great write-up.
> > > > > >
> > > > > > 0. One question about the migration plan: "The new
> > > GetTransactionState
> > > > > API
> > > > > > and the new version of the transaction state message will not be
> > used
> > > > > until
> > > > > > the inter-broker version supports it." I'm not so clear about the
> > > > > > implementation details here: say a broker is on the newer version
> > and
> > > > the
> > > > > > txn-coordinator is still on older version. Today the
> > > APIVersionsRequest
> > > > > can
> > > > > > only help upgrade / downgrade the request version, but not
> > forbid
> > > > > > sending any. Are you suggesting we add additional logic on the
> > broker
> > > > > side
> > > > > > to handle the case of "not sending the request"? If yes my
> concern
> > is
> > > > > that
> > > > > > this will be some tech-debt code that will live long before being
> > > > > removed.
> > > > > >
> > > > > > Some additional minor comments:
> > > > > >
> > > > > > 1. "last epoch" and "instance epoch" seem to be referring to the
> > same
> > > > > thing
> > > > > > in your wiki.
> > > > > > 2. "The broker must verify after receiving the response that the
> > > > producer
> > > > > > state is still unknown.": not sure why we have to validate? If
> the
> > > > > metadata
> > > > > > returned from the txn-coordinator can always be considered the
> > > > > > source-of-truth, can't we just blindly use it to update the cache?
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> > > matthias@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > I am +1 on this :)
> > > > > > >
> > > > > > >
> > > > > > > -Matthias
> > > > > > >
> > > > > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > > > > > > Bump. Thanks to Magnus for noticing that I forgot to link to
> > the
> > > > KIP:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > > > > > > .
> > > > > > > >
> > > > > > > > -Jason
> > > > > > > >
> > > > > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> > > > jason@confluent.io
> > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi All,
> > > > > > > >>
> > > > > > > >> I have a proposal to improve the transactional/idempotent
> > > > producer's
> > > > > > > >> handling of the UNKNOWN_PRODUCER error, which is the result
> of
> > > > > losing
> > > > > > > >> producer state following segment removal. The current
> behavior
> > > is
> > > > > both
> > > > > > > >> complex and limited. Please take a look and let me know what
> > you
> > > > > > think.
> > > > > > > >>
> > > > > > > >> Thanks in advance to Matthias Sax for feedback on the
> initial
> > > > draft.
> > > > > > > >>
> > > > > > > >> -Jason
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Guozhang,

Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
occurs following expiration of the producerId. It's possible that another
producerId has been installed in its place following expiration (if another
producer instance has become active), or the mapping is empty. We can
safely retry the InitProducerId with the logic in this KIP in order to
detect which case it is. So I'd suggest something like this:

1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send
InitProducerId using the current producerId and epoch.
2. If no mapping exists, the coordinator can generate a new producerId and
return it. If a transaction is in progress on the client, it will have to
be aborted, but the producer can continue afterwards.
3. Otherwise if a different producerId has been assigned, then we can
return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can
probably raise this as ProducerFencedException since that is effectively
what has happened. Ideally this is the only fatal case that users have to
handle.
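
The three cases above can be sketched as coordinator-side logic (all names here are illustrative, not Kafka's actual internal API):

```java
import java.util.Map;

public class InitProducerIdSketch {
    enum Outcome { NEW_PRODUCER_ID, RETAIN_PRODUCER_ID, FENCED }

    // mapping: transactionalId -> currently installed producerId
    static Outcome handleRetry(Map<String, Long> mapping,
                               String transactionalId,
                               long claimedProducerId) {
        Long current = mapping.get(transactionalId);
        if (current == null) {
            // Case 2: no mapping exists; the coordinator generates a new
            // producerId (any in-progress transaction on the client must
            // be aborted before the producer continues)
            return Outcome.NEW_PRODUCER_ID;
        } else if (current == claimedProducerId) {
            // The claimed producerId is still installed; safe to continue
            return Outcome.RETAIN_PRODUCER_ID;
        } else {
            // Case 3: a different producerId was installed after expiration,
            // so the retrying producer is effectively fenced
            return Outcome.FENCED;
        }
    }
}
```

A producer seeing FENCED here would surface it as ProducerFencedException, per step 3.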

I'll give it a little more thought and update the KIP.

Thanks,
Jason

On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang <wa...@gmail.com> wrote:

> You're right about the dangling txn since it will actually block
> read-committed consumers from proceeding at all. I'd agree that since this
> is a very rare case, we can consider fixing it not via broker-side logic
> but via tooling in a future work.
>
> I've also discovered some related error handling logic inside producer that
> may be addressed together with this KIP (since it is mostly for internal
> implementations the wiki itself does not need to be modified):
>
>
> https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
>
> Guozhang
>
>
>
> On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Guozhang,
> >
> > To clarify, the broker does not actually use the ApiVersion API for
> > inter-broker communications. The use of an API and its corresponding
> > version is controlled by `inter.broker.protocol.version`.
> >
> > Nevertheless, it sounds like we're on the same page about removing
> > DescribeTransactionState. The impact of a dangling transaction is a
> little
> > worse than what you describe though. Consumers with the read_committed
> > isolation level will be stuck. Still, I think we agree that this case
> > should be rare and we can reconsider for future work. Rather than
> > preventing dangling transactions, perhaps we should consider options
> which
> > allows us to detect them and recover. Anyway, this needs more thought. I
> > will update the KIP.
> >
> > Best,
> > Jason
> >
> > On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > 0. My original question is about the implementation details primarily,
> > > since current the handling logic of the APIVersionResponse is simply
> "use
> > > the highest supported version of the corresponding request", but if the
> > > returned response from APIVersionRequest says "I don't even know about
> > the
> > > DescribeTransactionStateRequest at all", then we need additional logic
> > for
> > > the falling back logic. Currently this logic is embedded in
> NetworkClient
> > > which is shared by all clients, so I'd like to avoid making this logic
> > more
> > > complicated.
> > >
> > > As for the general issue that a broker does not recognize a producer
> with
> > > sequence number 0, here's my thinking: as you mentioned in the wiki,
> this
> > > is only a concern for transactional producer since for idempotent
> > producer
> > > it can just bump the epoch and go. For transactional producer, even if
> > the
> > > producer request from a fenced producer gets accepted, its transaction
> > will
> > > never be committed and hence messages not exposed to read-committed
> > > consumers as well. The drawback is though, 1) read-uncommitted
> consumers
> > > will still read those messages, 2) unnecessary storage for those fenced
> > > produce messages, but in practice should not accumulate to a large
> amount
> > > since producer should soon try to commit and be told it is fenced and
> > then
> > > stop, 3) there will be no markers for those transactional messages
> ever.
> > > Looking at the list and thinking about the likelihood it may happen
> > > assuming we retain the producer up to transactional.id.timeout (default
> > is
> > > 7 days), I feel comfortable leaving it as is.
> > >
> > > Guozhang
> > >
> > > On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > Thanks for the comments. Responses below:
> > > >
> > > > 0. The new API is used between brokers, so we govern its usage using
> > > > `inter.broker.protocol.version`. If the other broker hasn't upgraded,
> > we
> > > > will just fallback to the old logic, which is to accept the write.
> This
> > > is
> > > > similar to how we introduced the OffsetsForLeaderEpoch API. Does that
> > > seem
> > > > reasonable?
> > > >
> > > > To tell the truth, after digging this KIP up and reading it over, I
> am
> > > > doubting how crucial this API is. It is attempting to protect a write
> > > from
> > > > a zombie which has just reset its sequence number after that producer
> > had
> > > > had its state cleaned up. However, one of the other improvements in
> > this
> > > > KIP is to maintain producer state beyond its retention in the log. I
> > > think
> > > > that makes this case sufficiently unlikely that we can leave it for
> > > future
> > > > work. I am not 100% sure this is the only scenario where transaction
> > > state
> > > > and log state can diverge anyway, so it would be better to consider
> > this
> > > > problem more generally. What do you think?
> > > >
> > > > 1. Thanks, from memory, the term changed after the first iteration.
> > I'll
> > > > make a pass and try to clarify usage.
> > > > 2. I was attempting to handle some edge cases since this check would
> be
> > > > asynchronous. In any case, if we drop this validation as suggested
> > above,
> > > > then we can ignore this.
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Hello Jason, thanks for the great write-up.
> > > > >
> > > > > 0. One question about the migration plan: "The new
> > GetTransactionState
> > > > API
> > > > > and the new version of the transaction state message will not be
> used
> > > > until
> > > > > the inter-broker version supports it." I'm not so clear about the
> > > > > implementation details here: say a broker is on the newer version
> and
> > > the
> > > > > txn-coordinator is still on older version. Today the
> > APIVersionsRequest
> > > > can
> > > > > only help upgrade / downgrade the request version, but not
> forbidding
> > > > > sending any. Are you suggesting we add additional logic on the
> broker
> > > > side
> > > > > to handle the case of "not sending the request"? If yes my concern
> is
> > > > that
> > > > > this will be some tech-debt code that will live long before being
> > > > removed.
> > > > >
> > > > > Some additional minor comments:
> > > > >
> > > > > 1. "last epoch" and "instance epoch" seem to be referring to the
> same
> > > > thing
> > > > > in your wiki.
> > > > > 2. "The broker must verify after receiving the response that the
> > > producer
> > > > > state is still unknown.": not sure why we have to validate? If the
> > > > metadata
> > > > > returned from the txn-coordinator can always be considered the
> > > > > source-of-truth, can't we just blindly use it to update the cache?
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> > matthias@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > I am +1 on this :)
> > > > > >
> > > > > >
> > > > > > -Matthias
> > > > > >
> > > > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > > > > > Bump. Thanks to Magnus for noticing that I forgot to link to
> the
> > > KIP:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > > > > > .
> > > > > > >
> > > > > > > -Jason
> > > > > > >
> > > > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> > > jason@confluent.io
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi All,
> > > > > > >>
> > > > > > >> I have a proposal to improve the transactional/idempotent
> > > producer's
> > > > > > >> handling of the UNKNOWN_PRODUCER error, which is the result of
> > > > losing
> > > > > > >> producer state following segment removal. The current behavior
> > is
> > > > both
> > > > > > >> complex and limited. Please take a look and let me know what
> you
> > > > > think.
> > > > > > >>
> > > > > > >> Thanks in advance to Matthias Sax for feedback on the initial
> > > draft.
> > > > > > >>
> > > > > > >> -Jason
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
You're right about the dangling txn since it will actually block
read-committed consumers from proceeding at all. I'd agree that since this
is a very rare case, we can consider fixing it not via broker-side logic
but via tooling in future work.

I've also discovered some related error handling logic inside the producer
that may be addressed together with this KIP (since it mostly concerns
internal implementation, the wiki itself does not need to be modified):

https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181

Guozhang



On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> To clarify, the broker does not actually use the ApiVersion API for
> inter-broker communications. The use of an API and its corresponding
> version is controlled by `inter.broker.protocol.version`.
>
> Nevertheless, it sounds like we're on the same page about removing
> DescribeTransactionState. The impact of a dangling transaction is a little
> worse than what you describe though. Consumers with the read_committed
> isolation level will be stuck. Still, I think we agree that this case
> should be rare and we can reconsider for future work. Rather than
> preventing dangling transactions, perhaps we should consider options which
> allows us to detect them and recover. Anyway, this needs more thought. I
> will update the KIP.
>
> Best,
> Jason
>
> On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > 0. My original question is about the implementation details primarily,
> > since current the handling logic of the APIVersionResponse is simply "use
> > the highest supported version of the corresponding request", but if the
> > returned response from APIVersionRequest says "I don't even know about
> the
> > DescribeTransactionStateRequest at all", then we need additional logic
> for
> > the falling back logic. Currently this logic is embedded in NetworkClient
> > which is shared by all clients, so I'd like to avoid making this logic
> more
> > complicated.
> >
> > As for the general issue that a broker does not recognize a producer with
> > sequence number 0, here's my thinking: as you mentioned in the wiki, this
> > is only a concern for transactional producer since for idempotent
> producer
> > it can just bump the epoch and go. For transactional producer, even if
> the
> > producer request from a fenced producer gets accepted, its transaction
> will
> > never be committed and hence messages not exposed to read-committed
> > consumers as well. The drawback is though, 1) read-uncommitted consumers
> > will still read those messages, 2) unnecessary storage for those fenced
> > produce messages, but in practice should not accumulate to a large amount
> > since producer should soon try to commit and be told it is fenced and
> then
> > stop, 3) there will be no markers for those transactional messages ever.
> > Looking at the list and thinking about the likelihood it may happen
> > assuming we retain the producer up to transactional.id.timeout (default
> is
> > 7 days), I feel comfortable leaving it as is.
> >
> > Guozhang
> >
> > On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > Thanks for the comments. Responses below:
> > >
> > > 0. The new API is used between brokers, so we govern its usage using
> > > `inter.broker.protocol.version`. If the other broker hasn't upgraded,
> we
> > > will just fallback to the old logic, which is to accept the write. This
> > is
> > > similar to how we introduced the OffsetsForLeaderEpoch API. Does that
> > seem
> > > reasonable?
> > >
> > > To tell the truth, after digging this KIP up and reading it over, I am
> > > doubting how crucial this API is. It is attempting to protect a write
> > from
> > > a zombie which has just reset its sequence number after that producer
> had
> > > had its state cleaned up. However, one of the other improvements in
> this
> > > KIP is to maintain producer state beyond its retention in the log. I
> > think
> > > that makes this case sufficiently unlikely that we can leave it for
> > future
> > > work. I am not 100% sure this is the only scenario where transaction
> > state
> > > and log state can diverge anyway, so it would be better to consider
> this
> > > problem more generally. What do you think?
> > >
> > > 1. Thanks, from memory, the term changed after the first iteration.
> I'll
> > > make a pass and try to clarify usage.
> > > 2. I was attempting to handle some edge cases since this check would be
> > > asynchronous. In any case, if we drop this validation as suggested
> above,
> > > then we can ignore this.
> > >
> > > -Jason
> > >
> > >
> > >
> > > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hello Jason, thanks for the great write-up.
> > > >
> > > > 0. One question about the migration plan: "The new
> GetTransactionState
> > > API
> > > > and the new version of the transaction state message will not be used
> > > until
> > > > the inter-broker version supports it." I'm not so clear about the
> > > > implementation details here: say a broker is on the newer version and
> > the
> > > > txn-coordinator is still on older version. Today the
> APIVersionsRequest
> > > can
> > > > only help upgrade / downgrade the request version, but not forbidding
> > > > sending any. Are you suggesting we add additional logic on the broker
> > > side
> > > > to handle the case of "not sending the request"? If yes my concern is
> > > that
> > > > this will be some tech-debt code that will live long before being
> > > removed.
> > > >
> > > > Some additional minor comments:
> > > >
> > > > 1. "last epoch" and "instance epoch" seem to be referring to the same
> > > thing
> > > > in your wiki.
> > > > 2. "The broker must verify after receiving the response that the
> > producer
> > > > state is still unknown.": not sure why we have to validate? If the
> > > metadata
> > > > returned from the txn-coordinator can always be considered the
> > > > source-of-truth, can't we just blindly use it to update the cache?
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > >
> > > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <
> matthias@confluent.io>
> > > > wrote:
> > > >
> > > > > I am +1 on this :)
> > > > >
> > > > >
> > > > > -Matthias
> > > > >
> > > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > > > > Bump. Thanks to Magnus for noticing that I forgot to link to the
> > KIP:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > > > > .
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> > jason@confluent.io
> > > >
> > > > > wrote:
> > > > > >
> > > > > >> Hi All,
> > > > > >>
> > > > > >> I have a proposal to improve the transactional/idempotent
> > producer's
> > > > > >> handling of the UNKNOWN_PRODUCER error, which is the result of
> > > losing
> > > > > >> producer state following segment removal. The current behavior
> is
> > > both
> > > > > >> complex and limited. Please take a look and let me know what you
> > > > think.
> > > > > >>
> > > > > >> Thanks in advance to Matthias Sax for feedback on the initial
> > draft.
> > > > > >>
> > > > > >> -Jason
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Guozhang,

To clarify, the broker does not actually use the ApiVersion API for
inter-broker communications. The use of an API and its corresponding
version is controlled by `inter.broker.protocol.version`.
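
As a rough sketch of that gating (the version comparison and the "2.1" cutoff are illustrative assumptions, not the broker's actual code):

```java
public class InterBrokerGateSketch {
    // Compare dotted version strings numerically, e.g. "2.1.1" >= "2.1"
    static boolean atLeast(String current, String required) {
        String[] a = current.split("\\.");
        String[] b = required.split("\\.");
        for (int i = 0; i < Math.max(a.length, b.length); i++) {
            int x = i < a.length ? Integer.parseInt(a[i]) : 0;
            int y = i < b.length ? Integer.parseInt(b[i]) : 0;
            if (x != y) return x > y;
        }
        return true; // versions are equal
    }

    // The new inter-broker API is used only once the configured
    // inter.broker.protocol.version is high enough; otherwise fall back to
    // the old behavior (accepting the write). "2.1" is assumed here.
    static boolean useNewApi(String interBrokerProtocolVersion) {
        return atLeast(interBrokerProtocolVersion, "2.1");
    }
}
```

The point is that the decision comes from static broker configuration, not from per-connection ApiVersions negotiation.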

Nevertheless, it sounds like we're on the same page about removing
DescribeTransactionState. The impact of a dangling transaction is a little
worse than what you describe though. Consumers with the read_committed
isolation level will be stuck. Still, I think we agree that this case
should be rare and we can reconsider for future work. Rather than
preventing dangling transactions, perhaps we should consider options which
allow us to detect them and recover. Anyway, this needs more thought. I
will update the KIP.

Best,
Jason

On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang <wa...@gmail.com> wrote:

> 0. My original question is about the implementation details primarily,
> since current the handling logic of the APIVersionResponse is simply "use
> the highest supported version of the corresponding request", but if the
> returned response from APIVersionRequest says "I don't even know about the
> DescribeTransactionStateRequest at all", then we need additional logic for
> the falling back logic. Currently this logic is embedded in NetworkClient
> which is shared by all clients, so I'd like to avoid making this logic more
> complicated.
>
> As for the general issue that a broker does not recognize a producer with
> sequence number 0, here's my thinking: as you mentioned in the wiki, this
> is only a concern for transactional producer since for idempotent producer
> it can just bump the epoch and go. For transactional producer, even if the
> producer request from a fenced producer gets accepted, its transaction will
> never be committed and hence messages not exposed to read-committed
> consumers as well. The drawback is though, 1) read-uncommitted consumers
> will still read those messages, 2) unnecessary storage for those fenced
> produce messages, but in practice should not accumulate to a large amount
> since producer should soon try to commit and be told it is fenced and then
> stop, 3) there will be no markers for those transactional messages ever.
> Looking at the list and thinking about the likelihood it may happen
> assuming we retain the producer up to transactional.id.timeout (default is
> 7 days), I feel comfortable leaving it as is.
>
> Guozhang
>
> On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Guozhang,
> >
> > Thanks for the comments. Responses below:
> >
> > 0. The new API is used between brokers, so we govern its usage using
> > `inter.broker.protocol.version`. If the other broker hasn't upgraded, we
> > will just fallback to the old logic, which is to accept the write. This
> is
> > similar to how we introduced the OffsetsForLeaderEpoch API. Does that
> seem
> > reasonable?
> >
> > To tell the truth, after digging this KIP up and reading it over, I am
> > doubting how crucial this API is. It is attempting to protect a write
> from
> > a zombie which has just reset its sequence number after that producer had
> > had its state cleaned up. However, one of the other improvements in this
> > KIP is to maintain producer state beyond its retention in the log. I
> think
> > that makes this case sufficiently unlikely that we can leave it for
> future
> > work. I am not 100% sure this is the only scenario where transaction
> state
> > and log state can diverge anyway, so it would be better to consider this
> > problem more generally. What do you think?
> >
> > 1. Thanks, from memory, the term changed after the first iteration. I'll
> > make a pass and try to clarify usage.
> > 2. I was attempting to handle some edge cases since this check would be
> > asynchronous. In any case, if we drop this validation as suggested above,
> > then we can ignore this.
> >
> > -Jason
> >
> >
> >
> > On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Hello Jason, thanks for the great write-up.
> > >
> > > 0. One question about the migration plan: "The new GetTransactionState
> > API
> > > and the new version of the transaction state message will not be used
> > until
> > > the inter-broker version supports it." I'm not so clear about the
> > > implementation details here: say a broker is on the newer version and
> the
> > > txn-coordinator is still on older version. Today the APIVersionsRequest
> > can
> > > only help upgrade / downgrade the request version, but not forbidding
> > > sending any. Are you suggesting we add additional logic on the broker
> > side
> > > to handle the case of "not sending the request"? If yes my concern is
> > that
> > > this will be some tech-debt code that will live long before being
> > removed.
> > >
> > > Some additional minor comments:
> > >
> > > 1. "last epoch" and "instance epoch" seem to be referring to the same
> > thing
> > > in your wiki.
> > > 2. "The broker must verify after receiving the response that the
> producer
> > > state is still unknown.": not sure why we have to validate? If the
> > metadata
> > > returned from the txn-coordinator can always be considered the
> > > source-of-truth, can't we just blindly use it to update the cache?
> > >
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <ma...@confluent.io>
> > > wrote:
> > >
> > > > I am +1 on this :)
> > > >
> > > >
> > > > -Matthias
> > > >
> > > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > > > Bump. Thanks to Magnus for noticing that I forgot to link to the
> KIP:
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > > > .
> > > > >
> > > > > -Jason
> > > > >
> > > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <
> jason@confluent.io
> > >
> > > > wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> I have a proposal to improve the transactional/idempotent
> producer's
> > > > >> handling of the UNKNOWN_PRODUCER error, which is the result of
> > losing
> > > > >> producer state following segment removal. The current behavior is
> > both
> > > > >> complex and limited. Please take a look and let me know what you
> > > think.
> > > > >>
> > > > >> Thanks in advance to Matthias Sax for feedback on the initial
> draft.
> > > > >>
> > > > >> -Jason
> > > > >>
> > > > >
> > > >
> > > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
0. My original question is about the implementation details primarily,
since the current handling logic for the APIVersionsResponse is simply "use
the highest supported version of the corresponding request"; if the
returned response from an ApiVersionsRequest says "I don't even know about
the DescribeTransactionStateRequest at all", then we need additional
fallback logic. Currently this logic is embedded in NetworkClient, which is
shared by all clients, so I'd like to avoid making it more complicated.
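
The gap being described can be sketched like this (hypothetical shapes; the real ApiVersions handling lives inside NetworkClient):

```java
import java.util.Map;

public class ApiVersionsFallbackSketch {
    // advertised: apiKey -> {minVersion, maxVersion} from the broker's
    // ApiVersionsResponse. Today's client logic only picks a version within
    // the advertised range; it assumes the key is present at all.
    static short versionToUse(Map<Short, short[]> advertised, short apiKey, short clientMax) {
        short[] range = advertised.get(apiKey);
        if (range == null) {
            // The broker does not know this request type at all -- this is
            // the case that would require new fallback logic in the shared
            // client code, which is what Guozhang wants to avoid.
            return -1;
        }
        return (short) Math.min(range[1], clientMax);
    }
}
```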

As for the general issue that a broker does not recognize a producer with
sequence number 0, here's my thinking: as you mentioned in the wiki, this
is only a concern for the transactional producer, since an idempotent
producer can just bump the epoch and go. For the transactional producer,
even if a produce request from a fenced producer gets accepted, its
transaction will never be committed, and hence its messages are not exposed
to read-committed consumers. The drawbacks, though, are: 1) read-uncommitted
consumers will still read those messages; 2) unnecessary storage for those
fenced produce messages, though in practice they should not accumulate to a
large amount since the producer should soon try to commit, be told it is
fenced, and then stop; 3) there will be no markers for those transactional
messages ever. Looking at this list and thinking about the likelihood it
may happen, assuming we retain the producer's state up to
transactional.id.timeout (default is 7 days), I feel comfortable leaving it
as is.
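
Why the fenced producer's data stays invisible to read-committed consumers can be sketched with a toy read_committed filter (a drastic simplification of the consumer's actual LSO and aborted-transaction handling):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReadCommittedSketch {
    static final class Rec {
        final long producerId;
        final String value;
        Rec(long producerId, String value) { this.producerId = producerId; this.value = value; }
    }

    // markers: producerId -> "COMMIT" or "ABORT"; absent means no marker yet.
    // A fenced producer's transaction never receives a COMMIT marker, so its
    // records are never delivered in read_committed mode -- though they still
    // occupy log space and are visible to read_uncommitted consumers.
    static List<String> readCommitted(List<Rec> log, Map<Long, String> markers) {
        List<String> out = new ArrayList<>();
        for (Rec r : log) {
            if ("COMMIT".equals(markers.get(r.producerId))) {
                out.add(r.value);
            }
        }
        return out;
    }
}
```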

Guozhang

On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> Thanks for the comments. Responses below:
>
> 0. The new API is used between brokers, so we govern its usage using
> `inter.broker.protocol.version`. If the other broker hasn't upgraded, we
> will just fallback to the old logic, which is to accept the write. This is
> similar to how we introduced the OffsetsForLeaderEpoch API. Does that seem
> reasonable?
>
> To tell the truth, after digging this KIP up and reading it over, I am
> doubting how crucial this API is. It is attempting to protect a write from
> a zombie which has just reset its sequence number after that producer had
> had its state cleaned up. However, one of the other improvements in this
> KIP is to maintain producer state beyond its retention in the log. I think
> that makes this case sufficiently unlikely that we can leave it for future
> work. I am not 100% sure this is the only scenario where transaction state
> and log state can diverge anyway, so it would be better to consider this
> problem more generally. What do you think?
>
> 1. Thanks, from memory, the term changed after the first iteration. I'll
> make a pass and try to clarify usage.
> 2. I was attempting to handle some edge cases since this check would be
> asynchronous. In any case, if we drop this validation as suggested above,
> then we can ignore this.
>
> -Jason
>
>
>
> On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Jason, thanks for the great write-up.
> >
> > 0. One question about the migration plan: "The new GetTransactionState
> API
> > and the new version of the transaction state message will not be used
> until
> > the inter-broker version supports it." I'm not so clear about the
> > implementation details here: say a broker is on the newer version and the
> > txn-coordinator is still on older version. Today the APIVersionsRequest
> can
> > only help upgrade / downgrade the request version, but not forbidding
> > sending any. Are you suggesting we add additional logic on the broker
> side
> > to handle the case of "not sending the request"? If yes my concern is
> that
> > this will be some tech-debt code that will live long before being
> removed.
> >
> > Some additional minor comments:
> >
> > 1. "last epoch" and "instance epoch" seem to be referring to the same
> thing
> > in your wiki.
> > 2. "The broker must verify after receiving the response that the producer
> > state is still unknown.": not sure why we have to validate? If the
> metadata
> > returned from the txn-coordinator can always be considered the
> > source-of-truth, can't we just blindly use it to update the cache?
> >
> >
> > Guozhang
> >
> >
> >
> > On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax <ma...@confluent.io>
> > wrote:
> >
> > > I am +1 on this :)
> > >
> > >
> > > -Matthias
> > >
> > > On 9/4/18 9:55 AM, Jason Gustafson wrote:
> > > > Bump. Thanks to Magnus for noticing that I forgot to link to the KIP:
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> > > > .
> > > >
> > > > -Jason
> > > >
> > > > On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <jason@confluent.io
> >
> > > wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I have a proposal to improve the transactional/idempotent producer's
> > > >> handling of the UNKNOWN_PRODUCER error, which is the result of
> losing
> > > >> producer state following segment removal. The current behavior is
> both
> > > >> complex and limited. Please take a look and let me know what you
> > think.
> > > >>
> > > >> Thanks in advance to Matthias Sax for feedback on the initial draft.
> > > >>
> > > >> -Jason
> > > >>
> > > >
> > >
> > >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Guozhang,

Thanks for the comments. Responses below:

0. The new API is used between brokers, so we govern its usage using
`inter.broker.protocol.version`. If the other broker hasn't upgraded, we
will just fallback to the old logic, which is to accept the write. This is
similar to how we introduced the OffsetsForLeaderEpoch API. Does that seem
reasonable?

To tell the truth, after digging this KIP up and reading it over, I am
doubting how crucial this API is. It is attempting to protect a write from
a zombie which has just reset its sequence number after that producer had
had its state cleaned up. However, one of the other improvements in this
KIP is to maintain producer state beyond its retention in the log. I think
that makes this case sufficiently unlikely that we can leave it for future
work. I am not 100% sure this is the only scenario where transaction state
and log state can diverge anyway, so it would be better to consider this
problem more generally. What do you think?
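
The improvement referred to here, keeping producer state past log truncation, can be sketched like this (illustrative names; not the broker's actual ProducerStateManager API):

```java
import java.util.HashMap;
import java.util.Map;

public class ProducerStateRetentionSketch {
    static final class Entry {
        final int lastSequence;
        final long lastTimestampMs;
        Entry(int lastSequence, long lastTimestampMs) {
            this.lastSequence = lastSequence;
            this.lastTimestampMs = lastTimestampMs;
        }
    }

    private final Map<Long, Entry> byProducerId = new HashMap<>();
    private final long expirationMs;

    ProducerStateRetentionSketch(long expirationMs) { this.expirationMs = expirationMs; }

    void record(long producerId, int sequence, long nowMs) {
        byProducerId.put(producerId, new Entry(sequence, nowMs));
    }

    // Key idea of the KIP: entries are expired by time, not dropped as soon
    // as the segment holding the producer's last batch is deleted, which is
    // what makes the zombie-reset scenario much less likely.
    void expire(long nowMs) {
        byProducerId.values().removeIf(e -> nowMs - e.lastTimestampMs > expirationMs);
    }

    boolean known(long producerId) { return byProducerId.containsKey(producerId); }
}
```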

1. Thanks, from memory, the term changed after the first iteration. I'll
make a pass and try to clarify usage.
2. I was attempting to handle some edge cases since this check would be
asynchronous. In any case, if we drop this validation as suggested above,
then we can ignore this.
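To illustrate the edge case that check was guarding against: the coordinator lookup is asynchronous, so fresher producer state can land in the cache while the response is in flight, and the response should not clobber it. A minimal sketch of that pattern, using hypothetical names (the real broker's producer state cache looks nothing like this):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: apply the coordinator's answer only if no newer
// producer state arrived while the async lookup was in flight.
public class ProducerStateCache {
    final ConcurrentHashMap<Long, String> stateByProducerId = new ConcurrentHashMap<>();

    String applyCoordinatorResponse(long producerId, String fetchedState) {
        // putIfAbsent keeps any entry written concurrently (e.g. by a
        // produce request) rather than blindly overwriting it.
        String existing = stateByProducerId.putIfAbsent(producerId, fetchedState);
        return existing != null ? existing : fetchedState;
    }
}
```

With the validation dropped as suggested, none of this machinery is needed.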

-Jason




Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Jason, thanks for the great write-up.

0. One question about the migration plan: "The new GetTransactionState API
and the new version of the transaction state message will not be used until
the inter-broker version supports it." I'm not so clear about the
implementation details here: say a broker is on the newer version and the
txn-coordinator is still on an older version. Today the APIVersionsRequest can
only help upgrade / downgrade the request version, but not forbid
sending it at all. Are you suggesting we add additional logic on the broker side
to handle the case of "not sending the request"? If yes my concern is that
this will be some tech-debt code that will live long before being removed.

Some additional minor comments:

1. "last epoch" and "instance epoch" seem to be referring to the same thing
in your wiki.
2. "The broker must verify after receiving the response that the producer
state is still unknown.": not sure why we have to validate? If the metadata
returned from the txn-coordinator can always be considered the
source-of-truth, can't we just blindly use it to update the cache?


Guozhang




-- 
-- Guozhang

Re: [DISCUSS] KIP-360: Improve handling of unknown producer

Posted by "Matthias J. Sax" <ma...@confluent.io>.
I am +1 on this :)


-Matthias

On 9/4/18 9:55 AM, Jason Gustafson wrote:
> Bump. Thanks to Magnus for noticing that I forgot to link to the KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
> .
> 
> -Jason
> 
> On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson <ja...@confluent.io> wrote:
> 
>> Hi All,
>>
>> I have a proposal to improve the transactional/idempotent producer's
>> handling of the UNKNOWN_PRODUCER error, which is the result of losing
>> producer state following segment removal. The current behavior is both
>> complex and limited. Please take a look and let me know what you think.
>>
>> Thanks in advance to Matthias Sax for feedback on the initial draft.
>>
>> -Jason
>>
>